A neutral, worked-example reference for what a production chatbot on Amazon Bedrock really costs per month in 2026. We fix a set of assumptions — users, messages per user, tokens per message, the system prompt, conversation history, and RAG context — then do the token math out loud, build a cost matrix across three model tiers (Amazon Nova Lite, Claude Haiku, Claude Sonnet) and three volumes (low / medium / high), show exactly how much prompt caching and history truncation cut the bill, and name the places cost quietly hides. Then: how AWS credits cover all of it, so the build costs you $0.
Before any table, it helps to see that the entire monthly bill for a Bedrock chatbot reduces to one short equation. Everything else on this page is just plugging realistic numbers into it — and understanding which input quietly gets large.
A chatbot on Amazon Bedrock is billed the same way as any text model: per token (a token ≈ ¾ of a word in English, so 1,000 tokens ≈ 750 words), metered separately for input (everything you send the model) and output (everything it generates). The monthly cost is therefore:
monthly cost = messages/month × (input tokens/message × input rate + output tokens/message × output rate)
That is it. The model’s published rates set the two multipliers; your product sets the two token counts and the volume. The reason Bedrock chatbot bills are mis-estimated is almost never the rate — it is the input tokens per message. People picture a user typing a 20-token question and forget that the request actually sent to the model on that turn also contains the system prompt, the conversation history accumulated so far, and any retrieved context (RAG). The 20-token question can ride on top of 3,000 tokens of baggage, and you pay input price for all of it, on every single turn.
Output tokens, by contrast, are usually modest for a chatbot (a few hundred tokens per reply) but are priced at a multiple of the input rate for the same model: 4–5× at the rates used on this page. So a chatbot’s bill is a tug-of-war: lots of cheap-rate input tokens (inflated by history and context) versus fewer expensive-rate output tokens. Which side dominates depends entirely on how much context you carry, which is why the levers later in this page (caching, truncation, retrieval tuning) are all about controlling input.
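As a minimal sketch, the equation above can be written as a small Python function. The name `monthly_cost` and the example values (500,000 messages, a 350-token reply, rates of $0.25 / $1.25 per 1M tokens) are illustrative assumptions rather than quoted prices; the aim is only to show how much the context riding along with a 20-token question moves the result.

```python
def monthly_cost(messages, input_tokens, output_tokens, input_rate, output_rate):
    """Monthly cost in dollars. Rates are dollars per 1M tokens."""
    per_message = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    return messages * per_message


# Same 20-token question, with and without 3,000 tokens of context riding along.
# All values below are placeholders chosen for illustration.
bare = monthly_cost(500_000, 20, 350, 0.25, 1.25)
loaded = monthly_cost(500_000, 3_020, 350, 0.25, 1.25)
print(f"bare question:  ${bare:,.2f}/month")
print(f"with context:   ${loaded:,.2f}/month")
```

With these placeholder numbers the bare question comes to about $221 a month and the context-laden one to about $596, even though the user typed the same 20 tokens.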
Caveat, stated once and meant throughout: every dollar figure here is representative as of 2026, chosen to show relative cost and the shape of a bill. Foundation-model prices change often as providers compete. Confirm current rates on the official AWS Bedrock pricing page, and use the sibling amazon-bedrock-pricing-calculator page to plug in your own numbers.
The biggest cost driver is not the model rate; it is input tokens per message. Every turn re-sends the system prompt + full conversation history + RAG context, so a short question can carry thousands of billed input tokens. Control that number and you control the bill.
To compute a real number we have to commit to assumptions. These are deliberately middle-of-the-road for a customer-facing or internal support assistant with light RAG. Swap in your own values — the method is what matters.
A "message" below means one user turn and the model’s reply to it (one request/response round-trip). The token counts are the per-turn totals actually sent to and returned by the model, not just what the user typed.
~3,500 input tokens (700 system + 1,500 history + 1,200 RAG + 100 question) and ~350 output tokens per message. Adjust any of the four input components, or the output length, and the whole bill moves. The two you can most easily shrink are the system prompt (via caching) and history (via truncation).
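One way to keep these assumptions explicit is to hold them in a small structure and derive the totals from it. The sketch below is illustrative only: the `TurnBudget` class and its field names are hypothetical, and the default values are the baseline numbers from this page.

```python
from dataclasses import dataclass


@dataclass
class TurnBudget:
    """Per-message token assumptions (defaults are this page's baseline)."""
    system_prompt: int = 700
    history: int = 1_500
    rag_context: int = 1_200
    question: int = 100
    output: int = 350

    def input_parts(self) -> dict:
        return {
            "system_prompt": self.system_prompt,
            "history": self.history,
            "rag_context": self.rag_context,
            "question": self.question,
        }

    @property
    def input_tokens(self) -> int:
        return sum(self.input_parts().values())

    def input_share(self) -> dict:
        """Each input component as a percentage of total input tokens."""
        total = self.input_tokens
        return {name: round(100 * n / total, 1) for name, n in self.input_parts().items()}


budget = TurnBudget()
print(budget.input_tokens)   # 3500
print(budget.input_share())
```

On the baseline, the user's actual question is under 3% of the billed input; history and RAG context together account for more than three quarters of it.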
Cost scales linearly with message count, so we define three round volume tiers and carry them through the rest of the page. Pick the row closest to your reality.
These are total assistant messages per month (user turns answered), not unique users. A useful mental conversion: monthly messages ≈ active users × conversations per user × turns per conversation. The three tiers below correspond loosely to an early beta, a live small-to-mid product, and a popular consumer or large-internal deployment.
| Tier | Messages / month | Rough shape | Input tokens / mo (@3,500) | Output tokens / mo (@350) |
|---|---|---|---|---|
| Low | 50,000 | ~1,700/day (beta or internal tool) | 175M | 17.5M |
| Medium | 500,000 | ~17,000/day (live product) | 1,750M (1.75B) | 175M |
| High | 5,000,000 | ~167,000/day (consumer scale) | 17,500M (17.5B) | 1,750M (1.75B) |
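The conversion from product usage to messages, and from messages to monthly token totals, can be sketched as follows. The function names are hypothetical, and the 2,500 users × 4 conversations × 5 turns example is an invented combination that happens to land on the Low tier; substitute your own figures.

```python
def monthly_messages(active_users, conversations_per_user, turns_per_conversation):
    """Monthly messages ~= active users x conversations per user x turns per conversation."""
    return active_users * conversations_per_user * turns_per_conversation


def monthly_tokens(messages, input_per_message=3_500, output_per_message=350):
    """Return (input tokens, output tokens) per month at the page's baseline sizes."""
    return messages * input_per_message, messages * output_per_message


TIERS = {"Low": 50_000, "Medium": 500_000, "High": 5_000_000}

for tier, messages in TIERS.items():
    tokens_in, tokens_out = monthly_tokens(messages)
    print(f"{tier:<7}{messages:>10,} msgs  {tokens_in / 1e6:>8,.0f}M in  {tokens_out / 1e6:>8,.1f}M out")

# Hypothetical usage pattern that lands on the Low tier.
print(monthly_messages(2_500, 4, 5))  # 50000
```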
This is the table most people come for: the modeled monthly Bedrock bill for the assistant defined above, across three model tiers and three traffic volumes, on plain on-demand pricing with no caching yet. It shows both the absolute numbers and the ~54× spread that model choice alone creates.
The three models span the practical range for a chatbot: Amazon Nova Lite (cheap, fast, multimodal; Amazon’s value tier), Claude Haiku (fast, capable, popular for support bots), and Claude Sonnet (the all-round workhorse for quality-sensitive assistants). Representative 2026 on-demand rates per 1M tokens: Nova Lite $0.06 in / $0.24 out; Claude Haiku $0.25 in / $1.25 out; Claude Sonnet $3.00 in / $15.00 out. Each cell below is (input tokens/mo × input rate) + (output tokens/mo × output rate), rounded.
Read the table two ways. Down a column: cost scales straight-line with volume (10× the messages, 10× the bill). Across a row: switching from Nova Lite to Sonnet multiplies the same workload by roughly 54× (~$0.29 versus ~$15.75 per 1K messages), the single largest cost decision you make. The middle option, Haiku, is the common sweet spot: meaningfully more capable than the value tier, at about one-twelfth of Sonnet’s cost.
| Model (in / out per 1M) | Low (50K msg) | Medium (500K msg) | High (5M msg) | Cost / 1K messages |
|---|---|---|---|---|
| Amazon Nova Lite ($0.06 / $0.24) | ~$15 | ~$147 | ~$1,470 | ~$0.29 |
| Claude Haiku ($0.25 / $1.25) | ~$66 | ~$656 | ~$6,560 | ~$1.31 |
| Claude Sonnet ($3.00 / $15.00) | ~$788 | ~$7,875 | ~$78,750 | ~$15.75 |
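The matrix can be regenerated from the stated rates and token assumptions with a few lines of Python. This is a sketch for checking or re-running the arithmetic with your own numbers; the names `RATES`, `VOLUMES`, `cell` and `cost_matrix` are illustrative, and the rates are the representative figures quoted on this page, not live AWS prices.

```python
# Representative rates from this page, in dollars per 1M tokens: (input, output).
RATES = {
    "Amazon Nova Lite": (0.06, 0.24),
    "Claude Haiku": (0.25, 1.25),
    "Claude Sonnet": (3.00, 15.00),
}
VOLUMES = {"Low": 50_000, "Medium": 500_000, "High": 5_000_000}
INPUT_TOKENS, OUTPUT_TOKENS = 3_500, 350  # per-message baseline from this page


def cell(messages, rate_in, rate_out):
    """(input tokens/mo x input rate) + (output tokens/mo x output rate), in dollars."""
    tokens_in = messages * INPUT_TOKENS
    tokens_out = messages * OUTPUT_TOKENS
    return (tokens_in * rate_in + tokens_out * rate_out) / 1_000_000


def cost_matrix():
    """Monthly cost for every model at every volume tier."""
    return {
        model: {tier: cell(messages, *rates) for tier, messages in VOLUMES.items()}
        for model, rates in RATES.items()
    }


for model, row in cost_matrix().items():
    cells = "  ".join(f"{tier}: ${cost:>10,.2f}" for tier, cost in row.items())
    per_1k = cell(1_000, *RATES[model])
    print(f"{model:<18}{cells}  per 1K msgs: ${per_1k:.2f}")
```

Changing `INPUT_TOKENS`, `OUTPUT_TOKENS` or an entry in `RATES` re-prices every cell at once, which is usually quicker than editing the table by hand.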
The numbers above re-pay full input price for the system prompt and RAG context on every single turn. Prompt caching attacks exactly that waste — and on a chatbot, where the system prompt is identical across millions of calls, it is the highest-leverage change you can make without touching the model.
Bedrock prompt caching lets you mark a stable prefix of the prompt (typically the system prompt, and often the retrieved context) so that after the first request writes it to cache, subsequent requests read it back at a steep discount instead of paying full input price again. Representative behavior in 2026: cache reads cost roughly 10% of the normal input rate, with a cache-write surcharge (~25% above input) on the call that populates it. Cached prefixes expire after a period of inactivity, so the write recurs whenever a prefix goes cold. On a workload where the same prefix repeats across thousands of calls, the write cost amortizes to near zero and you effectively pay ~10% for that portion of the input.
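To see how quickly the write surcharge amortizes, the sketch below compares a cached prefix with an uncached one, in units of one full-price send. It assumes the representative multipliers above (1.25× to write, 0.10× to read) and, as a simplification, that the cache entry stays warm after the first call; `prefix_cost` is a hypothetical helper, not a Bedrock API.

```python
def prefix_cost(uses, cached, write_multiplier=1.25, read_multiplier=0.10):
    """Cost of sending one prefix `uses` times, in units of one full-price send.

    Simplifying assumption: the cache entry stays warm after the first call,
    so there is exactly one write followed by (uses - 1) reads.
    """
    if uses <= 0:
        return 0.0
    if not cached:
        return float(uses)
    return write_multiplier + read_multiplier * (uses - 1)


for uses in (1, 2, 10, 1_000):
    ratio = prefix_cost(uses, cached=True) / prefix_cost(uses, cached=False)
    print(f"{uses:>5} uses: cached prefix costs {ratio:.1%} of uncached")
```

Under these assumptions caching costs more on a prefix used once (125%), is already cheaper on the second use (67.5%), and approaches the 10% read rate as reuse grows.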
Apply it to our assumptions. Of the 3,500 input tokens per message, the 700-token system prompt is perfectly cacheable (identical every turn) and much of the 1,200-token RAG context is cacheable when the same documents are reused across a session or across users. Even caching just the system prompt, the per-message input drops from 3,500 to roughly 2,870 effective tokens (the 700 now billed at ~10%); cache the common RAG context too and effective input can fall toward ~1,800–2,000 tokens — a 40–50% cut to the input side of the bill. Because input dominates a context-heavy chatbot, that flows straight to the bottom line.
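The effective-token arithmetic in the paragraph above can be reproduced with a short helper. This is a sketch under the stated assumptions (cached components billed at 10% of the input rate, write cost fully amortized, every cached component hit on every turn); `effective_input_tokens` and the component keys are illustrative names.

```python
BASELINE = {"system": 700, "history": 1_500, "rag": 1_200, "question": 100}


def effective_input_tokens(components, cached, read_multiplier=0.10):
    """Input tokens per message after weighting cached components at the read rate."""
    return sum(
        tokens * (read_multiplier if name in cached else 1.0)
        for name, tokens in components.items()
    )


full = sum(BASELINE.values())
system_only = effective_input_tokens(BASELINE, cached={"system"})
system_and_rag = effective_input_tokens(BASELINE, cached={"system", "rag"})

print(f"system only:  {system_only:,.0f} tokens ({1 - system_only / full:.0%} off input)")
print(f"system + RAG: {system_and_rag:,.0f} tokens ({1 - system_and_rag / full:.0%} off input)")
```

The second figure is a best case: it treats all 1,200 RAG tokens as cache hits on every turn, which is why the text quotes a range rather than a single number.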
The table below re-prices the medium volume tier (500K messages) with system-prompt + RAG caching applied, holding output unchanged and assuming an effective input of ~1,900 tokens per message (the midpoint of the range above). The percentage saving is largest on Nova Lite, where input is the largest share of the bill; Haiku and Sonnet share the same 1:5 input-to-output rate ratio, so they save the same percentage, but the dollar saving is largest on Sonnet.
| Model | No caching | With caching (system + RAG) | Saving | Why |
|---|---|---|---|---|
| Amazon Nova Lite | ~$147 | ~$99 | ~33% | Input is ~71% of its bill, the largest share of the three |
| Claude Haiku | ~$656 | ~$456 | ~30% | Input is ~67% of its bill |
| Claude Sonnet | ~$7,875 | ~$5,475 | ~30% | Same input share as Haiku; largest absolute saving (~$2.4K/mo) |
If a chunk of your prompt is identical across many calls (system prompt, tool definitions, a reference doc), cache it. On a chatbot the system prompt alone repeats on 100% of turns — caching it is close to free money. It only helps for repeated prefixes; a fully unique prompt gains nothing.
The sneakiest line in a chatbot bill is conversation history. Because each turn re-sends the transcript so far, the input grows with every exchange — and if you naively send the full history, cost rises super-linearly across a long conversation. This is the single most common reason a Bedrock chatbot bill comes in higher than the back-of-envelope estimate.
Consider one conversation with no truncation. Turn 1 sends the system prompt + question. Turn 2 sends system prompt + turn-1 question + turn-1 answer + turn-2 question. Turn 10 re-sends everything from turns 1–9 plus the new question. The history portion of input therefore grows roughly linearly per turn, which means the cumulative input tokens across an N-turn conversation grow with N². A 20-turn conversation does not cost 20× a single turn — it costs far more, because the later turns are each carrying the entire transcript.
A worked illustration: assume each turn adds ~450 tokens to the transcript (a ~100-token question + ~350-token answer). With no truncation, turn k re-sends the k−1 turns before it, so a 20-turn conversation accumulates history input of 450 × (0+1+2+…+19) = 450 × 190 = 85,500 tokens, on top of system prompt and RAG on every turn. With a truncation window that keeps only the last 6 turns (a cap of 6 × 450 = 2,700 history tokens), turns 1–7 carry 0 to 6 prior turns (21 in total) and turns 8–20 carry 6 each (78 in total), giving 450 × 99 = 44,550 tokens, or about 2,230 per turn on average. That is a ~48% reduction in the history component, with minimal quality loss for most assistants because distant turns rarely matter. The gap widens as conversations lengthen, because the full-history total grows with N² while the windowed total grows only with N.
This is why our base assumptions used a truncated steady-state history of ~1,500 tokens rather than an unbounded transcript (at ~450 tokens per turn, that sits between a 3-turn and a 4-turn window). If you skip truncation, every long power-user conversation quietly multiplies your input bill. The standard fixes, in order of simplicity: a fixed-window truncation (keep the last K turns), summarized history (periodically replace old turns with a short running summary, a small extra output cost that pays for itself), and combining either with caching so even the retained history is discounted on repeat.
| Strategy | History tokens (cumulative, 20 turns) | Relative | Quality impact | Notes |
|---|---|---|---|---|
| Full history (no truncation) | 85,500 | 100% | Best continuity | Grows with N²; punishes long chats |
| Last-6-turns window | 44,550 | ~52% | Negligible for most bots | Simplest effective fix |
| Rolling summary + last 3 turns | ~24,300 plus the summary's tokens | ~28% plus summary | Slight; keeps long-range facts | Adds small summarization output cost |
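The cumulative history totals can be checked, or re-run for other conversation lengths and window sizes, with the sketch below. It treats every retained turn as a full 450 tokens and counts history only (no system prompt, RAG context or summary tokens); `cumulative_history_tokens` is an illustrative name.

```python
def cumulative_history_tokens(turns, tokens_per_turn=450, window=None):
    """Total history tokens sent across a conversation of `turns` turns.

    Turn k re-sends the (k - 1) turns before it. With `window` set, at most
    that many prior turns are kept. Summary tokens are not modeled.
    """
    total = 0
    for turn in range(1, turns + 1):
        prior_turns = turn - 1
        if window is not None:
            prior_turns = min(prior_turns, window)
        total += prior_turns * tokens_per_turn
    return total


# Full history: doubling the conversation length roughly quadruples the total.
for n in (10, 20, 40):
    print(f"{n:>2} turns, full history: {cumulative_history_tokens(n):>8,}")

# Fixed windows over the same 20-turn conversation.
for k in (6, 3):
    print(f"20 turns, last-{k} window: {cumulative_history_tokens(20, window=k):>8,}")
```

The 20-turn full-history total reproduces the 85,500 figure; the windowed totals depend on the window size and on how many tokens each retained turn really holds, so measure your own transcripts before relying on them.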
Everything above prices a Bedrock chatbot if you pay AWS directly. For most startups and many companies the relevant number is different — because AWS will frequently fund the build with credits, and every line in the matrix draws those credits down before it ever touches your card.
Bedrock usage is fully credit-eligible — inference, embeddings, fine-tuning, Knowledge Bases, the lot — and credits apply automatically against your AWS bill until exhausted. The relevant pools: AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups); a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed precisely at proving out a GenAI use case like a chatbot; and the competitive Generative AI Accelerator (awards up to $1M for a small cohort of AI-first startups). Put concretely: a $10K–$50K POC pool covers years of the low-volume Haiku bot, or many months of the medium tier — long enough to find product-market fit before a single dollar leaves your runway.
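The "years" and "many months" claims are simple division, sketched below with the rounded Haiku figures from the cost matrix (~$66 and ~$656 per month). The helper `months_of_runway` and its `expiry_months` parameter are hypothetical; credit programs can attach expiry terms, so pass a cap if yours does.

```python
def months_of_runway(credit_dollars, monthly_cost_dollars, expiry_months=None):
    """Months a credit pool lasts at a flat monthly spend.

    `expiry_months` is an optional, hypothetical cap for credits that lapse
    after a fixed period; leave it as None to ignore expiry.
    """
    months = credit_dollars / monthly_cost_dollars
    if expiry_months is not None:
        months = min(months, expiry_months)
    return months


for label, monthly in (("Haiku, low volume", 66), ("Haiku, medium volume", 656)):
    for pool in (10_000, 50_000):
        months = months_of_runway(pool, monthly)
        print(f"{label:<22} ${pool:>6,} pool: {months:6.1f} months ({months / 12:.1f} years)")
```

This assumes flat usage; a bot whose traffic grows month over month will draw the pool down faster.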
The practical catch is that most of these pools are partner-filed: they are requested through the AWS Partner Network (the ACE program), not a public self-serve form. That is why teams route through an AWS partner rather than applying alone — and it is the gap CloudRoute fills. CloudRoute matches you to the right credit pool for your stage and to a vetted AWS DevOps/ML partner who both files the credit application and helps build the chatbot itself — the RAG pipeline, the prompt caching, the history-truncation logic, the tiered model router that keeps the bill down. The customer pays $0: AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice.
Combined with the levers on this page, the picture for a startup is: build the chatbot on the cheapest model that clears your quality bar, cache the system prompt and shared context, truncate history, and draw the whole thing down against a $25K–$100K credit pool while usage (and ideally revenue) scales. Related: see the pages on AWS credits for generative-AI startups and Bedrock POC funding for the full credit mechanics.
One consolidated view: the modeled monthly Bedrock bill for the assistant defined on this page, across all three model tiers and all three volumes, on-demand with no caching. This is the scannable summary; apply caching (~40–50% off input when both the system prompt and shared RAG context are cached) and history truncation on top for your real number. Figures are representative 2026 illustrations, not quotes.
| Model (in / out per 1M) | Low · 50K msg | Medium · 500K msg | High · 5M msg | Cost / 1K msg | Best fit |
|---|---|---|---|---|---|
| Amazon Nova Lite ($0.06 / $0.24) | ~$15 | ~$147 | ~$1,470 | ~$0.29 | Cost-sensitive, high volume, simple turns |
| Claude Haiku ($0.25 / $1.25) | ~$66 | ~$656 | ~$6,560 | ~$1.31 | The common sweet spot — capable + cheap |
| Claude Sonnet ($3.00 / $15.00) | ~$788 | ~$7,875 | ~$78,750 | ~$15.75 | Quality-critical assistants; route only hard turns here |
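The "route only hard turns here" advice can be priced as a weighted average of the per-1K-message costs in the table. The sketch below assumes an illustrative 80/20 split between Haiku and Sonnet and ignores any cost of the routing step itself; `blended_cost_per_1k` is a hypothetical helper.

```python
HAIKU_PER_1K = 1.3125   # dollars per 1K messages at this page's baseline
SONNET_PER_1K = 15.75


def blended_cost_per_1k(cheap_share, cheap_cost, expensive_cost):
    """Weighted cost per 1K messages when `cheap_share` of turns go to the cheap model."""
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    return cheap_share * cheap_cost + (1.0 - cheap_share) * expensive_cost


# Illustrative split: 80% of turns to Haiku, 20% to Sonnet.
blended = blended_cost_per_1k(0.80, HAIKU_PER_1K, SONNET_PER_1K)
saving = 1 - blended / SONNET_PER_1K
print(f"${blended:.2f} per 1K messages, {saving:.0%} below all-Sonnet")
```

At that split the blended cost is about $4.20 per 1K messages. Most of what remains is the Sonnet share, so the fraction of turns routed to the expensive model matters more than the cheap model's rate.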
Situation: The team had prototyped on Claude Sonnet for every turn, on-demand, sending the full conversation history and 8 RAG chunks per message. Their modeled bill at launch volume was roughly $2.6K/month and rising — and the system prompt plus uncapped history meant long support threads were the worst offenders. They wanted both to bring the number down and to avoid paying any of it from a runway earmarked for hiring.
What CloudRoute did: CloudRoute matched them in under 24 hours to an EU-based AWS partner with GenAI cost-engineering experience. The partner (1) introduced a tiered router — Claude Haiku for the ~80% routine support turns, Sonnet only for the genuinely complex ones; (2) turned on prompt caching for the fixed system prompt and the most-reused doc chunks; (3) cut retrieval from 8 chunks to a reranked top-3 and added a last-6-turns history window; and (4) filed a Bedrock POC credit application plus an Activate Portfolio application to fund the launch.
Outcome: Modeled inference cost fell from ~$2.6K to ~$520/month through model-routing, caching, retrieval tuning, and history truncation — and even that was fully covered by the approved credits, so the team paid $0 during the build and early launch. CloudRoute’s commission was paid by the partner from AWS engagement funding, not by the customer.
cost cut: ~$2.6K → ~$520/mo modeled · levers: routing + caching + rerank + truncation · credits: POC + Activate · out-of-pocket: $0
Whatever your Bedrock chatbot would cost — $15/month or $78K/month — AWS credits can cover it. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS partner who files the application and builds a cost-tuned bot (caching, truncation, tiered routing). Customer pays $0.