Bedrock bills are dominated by tokens, and tokens are dominated by a handful of design decisions most teams make once and never revisit. This guide walks through every cost lever — model right-sizing, intelligent routing, prompt caching, batch inference, provisioned throughput, output-token reduction, RAG vs long context, distillation, and embeddings — each with the underlying mechanism, the typical savings, and where it ranks. We close with how AWS credits zero the bill entirely during the build phase.
You cannot optimize what you cannot decompose. Almost every dollar on a Bedrock invoice traces back to one formula, and understanding it tells you exactly which levers matter and which are noise.
On-demand Bedrock pricing is per-token, billed separately for input and output, and the rates differ by model. The core formula for a single request is: cost = (input_tokens × input_rate) + (output_tokens × output_rate). Multiply by request volume and you have your bill. Two facts about this formula drive nearly all of FinOps-for-Bedrock.
Fact one: output tokens are far more expensive than input tokens. Across the frontier model families on Bedrock the output rate is typically 4× to 5× the input rate. A model priced at $3 per million input tokens commonly charges $15 per million output tokens. That asymmetry means a request that reads 4,000 tokens of context and writes a 1,000-token answer can have its cost split roughly evenly between input and output despite the 4:1 token ratio — and a chatty model that pads answers is quietly one of the most expensive habits in your stack.
Fact two: not all tokens are billed equally once you add features. Prompt caching reads are billed at a steep discount to normal input tokens. Batch inference discounts the entire request by half. Provisioned Throughput stops billing per-token altogether and bills per-hour for reserved model capacity. Each of these is a different way of changing one of the two terms in the formula, which is why the levers below are organized around "make the rate smaller" versus "make the token count smaller."
There is a third, smaller line item worth naming: embeddings. Retrieval-augmented generation (RAG) requires embedding both your documents (once, at ingestion) and every incoming query (at request time). Embedding models are cheap per token relative to generation, but at scale — millions of documents re-embedded on every schema change, or high query volume — embeddings become a real line item that has its own optimization story (covered in Section IX).
The practical takeaway: before you tune anything, pull your Bedrock usage by model and by input-versus-output tokens from Cost Explorer or CloudWatch. Most teams discover that 70–90% of spend sits in one or two models and that output tokens are a larger share than they assumed. That distribution tells you which of the nine levers below will move your bill and which are rounding errors for your specific workload.
Before optimizing, compute your output-to-input token ratio and your spend concentration by model. If output tokens are >40% of spend, prioritize output discipline and model right-sizing. If a single frontier model is >70% of spend, intelligent routing is your highest-impact move. The data decides the order — not the guide.
The single largest source of waste in production GenAI is sending every request to a frontier model when most requests would be answered identically by a model that costs a fraction as much. Right-sizing fixes the static case; intelligent routing fixes the dynamic one.
The price spread between model tiers on Bedrock is enormous — often more than 10× per token between the smallest and largest model in the same family, and larger still across families. A small model (Claude Haiku-class, Nova Lite/Micro-class, Llama 8B-class) can be one to two orders of magnitude cheaper than a frontier model (Claude Opus-class, Nova Premier-class). When a classification, extraction, routing, or short-answer task runs on a frontier model, you are typically paying 10–30× more than necessary for output that a small model produces just as well.
Right-sizing is the static version: for each distinct task in your application, find the cheapest model that clears your quality bar, and pin that task to it. The discipline is to evaluate per task, not per application — a single product often has a "draft the marketing email" task that genuinely needs a frontier model alongside a "categorize this support ticket" task that a small model nails. Pinning both to the frontier model because it is simpler is the most common and most expensive default in the field.
Intelligent routing is the dynamic version: at request time, a lightweight router decides which model should handle this specific input based on predicted difficulty, then escalates to a larger model only when needed. Amazon Bedrock Intelligent Prompt Routing implements this natively — it routes each prompt to the best-fit model within a family and, in AWS's published benchmarks, can reduce cost by up to roughly 30% without a meaningful quality drop on mixed traffic. Teams that build their own routing layer (a cheap classifier in front, escalation on low confidence) frequently report larger savings because they can route across families and tune the escalation threshold to their own quality tolerance.
The two techniques compose. Right-size every task first so your baseline is sane, then add routing on the tasks where input difficulty varies enough that a single model is either over-provisioned for the easy cases or under-provisioned for the hard ones. On a workload dominated by a frontier model, this lever alone routinely removes 50–80% of spend on the routed slice — which is why it sits at the top of the impact ranking.
Routing has a quality floor. Tasks involving multi-step reasoning, long-context synthesis, code generation in unfamiliar stacks, or anything customer-facing where a wrong answer is expensive should default to the capable model and only route down after you have evals proving the small model holds quality. The failure mode is silent: a small model produces plausible-but-wrong output, your eval suite does not catch it, and you have traded a small bill for a large quality regression.
The correct sequence is: build a representative eval set, measure the small model against it, and only ship the cheaper route if it clears your bar. Optimization without an eval harness is guessing — and on reasoning-heavy tasks, guessing usually costs more in rework than it saves on tokens.
Most production prompts are mostly static: a long system prompt, a tool schema, a few-shot block, or a large document that is identical across many requests. Prompt caching lets Bedrock charge full price for those tokens once, then bill subsequent reads at a steep discount.
The mechanism: you mark a stable prefix of your prompt as cacheable. On the first request, Bedrock processes and caches it — you pay a normal (sometimes slightly elevated) write rate for those tokens. On every subsequent request that reuses the same prefix within the cache TTL, those tokens are billed as cache reads at roughly a 90% discount to the normal input rate. The cache is keyed on the exact prefix, so anything that changes per request (the user's actual question) must come after the cached block.
This is transformative for a specific and very common shape: a multi-turn chat assistant or an agent with a large, fixed system prompt and tool definitions. Without caching, you re-send and re-pay for that entire preamble on every single turn. With caching, you pay for it once per session and then read it at a tenth of the price for the rest of the conversation. For a 4,000-token system prompt over a 20-turn conversation, caching can cut the input-token bill for that conversation by well over 80%.
Two design rules determine whether caching actually pays off. First, order your prompt static-to-dynamic: system prompt and schemas first (cacheable), retrieved context next (cacheable if reused), user input last (never cached). Second, respect the TTL and the cache-hit economics: the cache has a limited lifetime, so caching only saves money when the prefix is genuinely reused within that window. A prompt that is unique per request gains nothing from caching and pays the small write premium for no benefit.
Prompt caching is the highest-leverage lever that requires no model change and no quality tradeoff — you get identical output at a fraction of the input cost. The only reason it sits second rather than first is that its savings are bounded by how much of your spend is input tokens on reused prefixes; on output-heavy or low-reuse workloads, right-sizing moves more.
Caching is keyed per model. If your router sends the same conversation to different models on different turns, you fragment the cache and lose hits. Pin a session to one model for the duration of the conversation, or scope caching to the system prompt that every model in the family shares. Designing the two levers together is worth more than either alone.
A large fraction of GenAI work is not interactive: nightly summarization, bulk classification, dataset enrichment, embeddings backfills, evaluation runs. For all of it, Bedrock batch inference charges exactly half the on-demand token rate.
The trade is latency for price. You submit a batch job — a file of many requests — to Bedrock, and instead of synchronous millisecond responses you get results back asynchronously, typically within a window measured in hours. In exchange, every input and output token in the job is billed at 50% of the on-demand rate. There is no quality difference whatsoever; it is the identical model producing identical output, only scheduled rather than real-time.
The reason this lever is underused is organizational, not technical: teams build everything through the synchronous real-time API because that is what they prototyped with, and never revisit which of those calls actually need to be real-time. The audit is simple — for each call site, ask "does a human or a live request block on this response?" If the answer is no (a cron job, a queue worker, a data pipeline, an offline eval), it is a batch candidate, and moving it halves that line item with zero downside.
Batch composes cleanly with the levers above. You can batch requests to a right-sized small model, and you can batch requests that use a cached prefix. The discounts apply to different terms of the cost formula, so a job that is batched (50% off the rate), routed to a small model (cheaper rate), and uses output discipline (fewer tokens) can land at a tiny fraction of the naive cost. For any workload with a meaningful offline component, batch is among the easiest 50% you will ever save.
Because output tokens cost 4–5× input tokens, generating fewer of them is pure margin. Yet most teams never constrain output length, and many actively inflate it with prompts that invite the model to ramble.
There are four reliable techniques, in rough order of impact. First, cap output explicitly with max-tokens and with prompt instructions that demand brevity ("answer in one sentence," "return only the JSON," "no preamble"). A model told to "explain your reasoning" on a task that only needs the answer can easily 5× its output token count — and you pay for every word of that reasoning at the premium output rate.
Second, prefer structured output over prose. When the consumer of the response is code, ask for JSON or a terse schema rather than a narrative the model wraps around the data. Structured responses are shorter and have no conversational filler. Third, strip the "Certainly! Here is..." padding with a system instruction; that boilerplate is small per request but compounds across millions of calls. Fourth, avoid asking for content you will discard — if you only use the first item of a list, do not ask for ten.
Output discipline has a subtle interaction with reasoning models. Models that "think" before answering generate reasoning tokens that are billed as output. On genuinely hard tasks that thinking earns its cost in correctness; on easy tasks it is pure waste. This is the same right-sizing logic applied within a single model: reserve extended reasoning for the inputs that need it, and disable or minimize it for the inputs that do not.
The savings here are workload-dependent but frequently large: teams that audit and constrain output routinely cut 20–40% off output spend, and on output-heavy workloads (generation, summarization, agents) that can be the largest single line item. The lever ranks below the top three only because it requires per-task prompt work rather than a single configuration flip — but on the right workload it outranks all of them.
Long-context models tempt you to stuff an entire knowledge base into every prompt. That is the most expensive possible way to give a model information, because you pay the full input rate for every token of context on every single request.
Consider the two architectures for "answer questions over a 200-page document set." Long context: paste the relevant documents into the prompt each time. If that is 50,000 tokens of context and you serve 100,000 queries, you pay for five billion input tokens — the same documents, re-read five billion times. RAG (retrieval-augmented generation): embed the documents once at ingestion, then at query time retrieve only the handful of passages relevant to the specific question and put just those in the prompt. Now each query carries perhaps 2,000 tokens of context instead of 50,000 — a 25× reduction in input tokens on the dominant term of your bill.
The cost case for RAG at scale is overwhelming, and it is the default for any high-volume application over a large or growing corpus. The input-token savings typically dwarf the added cost of the vector store and the per-query embedding. RAG also improves quality on large corpora by focusing the model on relevant passages rather than burying the answer in noise, and it keeps you under context-window limits that long-context stuffing would blow through.
Long context is not always wrong, though, and the honest tradeoff matters. For low-volume tasks, for documents that must be reasoned over holistically rather than retrieved in pieces (a contract where clause interactions matter), or during early prototyping where the engineering cost of a retrieval pipeline is not yet justified, long context is the pragmatic choice. The decision rule is volume and reuse: the more queries you serve over the same corpus, the more RAG wins, because RAG amortizes the ingestion cost across all future queries while long context re-pays the full cost every time.
Two refinements compound the RAG savings. Cache the retrieved context when the same passages recur across requests (caching plus RAG), and right-size the embedding model and the chunk size so you are not over-retrieving. Over-retrieval — pulling 20 chunks when 3 would answer the question — quietly reinflates the input-token bill that RAG was supposed to deflate.
On-demand billing is per-token and perfect for variable or low traffic. Provisioned Throughput is the opposite model: you reserve dedicated model capacity for an hourly (or monthly/annual committed) fee, and token volume within that capacity no longer adds to the bill. The entire question is whether you cross the break-even.
Provisioned Throughput (PT) reserves model units that deliver a guaranteed throughput. You pay for the reservation whether you use it or not, which means PT only saves money when your sustained utilization is high enough that the hourly reservation cost is less than what the same tokens would cost on-demand. Below that utilization, PT is more expensive than on-demand because you are paying for idle capacity. Above it, PT is cheaper and also gives you predictable latency and reserved capacity that is not subject to on-demand throttling.
Computing the break-even is the whole exercise. Take the on-demand cost of your actual token volume over a period, and compare it to the cost of the provisioned capacity needed to serve that volume over the same period. If on-demand > provisioned, PT wins; if not, stay on-demand. In practice PT begins to pay off only for high-volume, steady, predictable workloads — a production endpoint running near capacity around the clock — and committed monthly or annual PT terms lower the hourly rate further in exchange for the commitment, which improves the break-even for workloads you are certain will persist.
The common mistake is buying PT for prestige or "just in case" before the volume justifies it, then paying for idle units. The correct sequence is to run on-demand first, measure sustained utilization, and only convert to PT once the data shows you are reliably above break-even. For spiky or growing-but-uncertain workloads, on-demand (optionally with batch for the offline portion) is almost always the cheaper and more flexible choice. PT is a lever you grow into, not one you start with.
Provisioned Throughput is capacity reservation, not a discount program. It is distinct from broader AWS commitment discounts and from credits. During a credit-funded build phase, on-demand plus credits is usually the right call — you keep flexibility and the credits absorb the spend. Convert to PT only once steady-state volume makes the break-even unambiguous.
Two more levers are situational rather than universal, but where they apply they are among the most powerful: model distillation collapses a frontier-model task onto a small model, and embeddings tuning attacks the one cost line RAG introduces.
Model distillation trains a small, cheap model to imitate a large, expensive one on your specific task. You use the frontier model to generate high-quality training examples (or you collect your own production traffic), then fine-tune a small model on those examples until it matches the teacher's quality on your narrow distribution. Bedrock Model Distillation automates much of this workflow. The payoff is structural: after a one-time training cost, you run inference on a model that can be 10× cheaper and faster than the teacher, permanently, for that task.
The economics of distillation are a volume calculation. There is an upfront cost (generating training data, the fine-tuning run) and an ongoing saving (the per-request gap between teacher and student). Distillation pays off when your request volume is high enough that the accumulated per-request savings exceed the upfront cost — typically a high-volume, stable, well-defined task where the small model can be made reliably good. For low-volume or rapidly-changing tasks, the upfront cost may never amortize, and routing to a small model (Lever 1) gets you most of the benefit with none of the training investment.
Embeddings cost is the line item RAG adds, and it has its own optimizations. Embedding models are cheap per token, but cost accrues in two places: ingestion (embedding the whole corpus, repeated on every re-index) and query time (embedding every incoming query). The levers: choose a right-sized embedding model and dimensionality rather than the largest available; avoid needless re-embedding by only re-indexing changed documents rather than the entire corpus on every update; cache embeddings for frequently-repeated queries; and batch the ingestion embedding job to take the 50% batch discount. At very large corpus sizes or query volumes these add up, and a full re-embed of millions of documents on a schema tweak is a surprise bill teams hit exactly once before they fix their indexing.
Both levers reward depth over breadth: they will not touch most of your bill, but on the specific high-volume task or the specific large corpus they apply to, they can be the difference between a viable unit economic and an unviable one.
The levers stack, but they do not all deserve the same priority on day one. Sequencing them by impact-per-effort gets you most of the savings fast, then refines.
A pragmatic sequence for a team with an existing Bedrock bill:
The compounding is the point. Each lever attacks a different term of the cost formula, so their effects multiply rather than add. A request that is routed to a small model, reads a cached prefix, runs in a batch job, and returns terse structured output can cost a single-digit percentage of the naive "frontier model, on-demand, full context, verbose prose" baseline. The teams with the lowest Bedrock unit costs are not using one clever trick — they have quietly applied five of these and made it their default architecture.
The nine levers side by side — mechanism, the term of the cost formula each attacks, typical savings, quality tradeoff, and when it applies. Savings ranges are directional and workload-dependent; your own usage data (Section I) sets the real order for your stack.
| Lever | Mechanism | Typical savings | Quality tradeoff | Best when |
|---|---|---|---|---|
| 1. Right-sizing + routing | Cheaper model rate; route easy traffic down | 50–80% on routed slice | None if eval-gated; risk if not | Frontier model dominates spend |
| 2. Prompt caching | Reused input tokens billed ~90% off | Up to ~90% on cached prefix | None (identical output) | Large static system prompt / reused context |
| 3. Batch inference | Flat 50% off rate for async jobs | 50% on batched volume | None (latency only) | Any non-interactive workload |
| 4. Output discipline | Generate fewer (premium) output tokens | 20–40% on output spend | None if scoped correctly | Output-heavy / verbose workloads |
| 5. RAG over long context | Retrieve relevant passages, not whole corpus | 10–25× fewer input tokens at scale | Often improves quality | High query volume over large corpus |
| 6. Provisioned Throughput | Reserve capacity; tokens stop billing per-unit | Net save only above break-even | None (improves latency) | High, steady, predictable volume |
| 7. Model distillation | Small model imitates frontier on your task | ~10× per-request after amortization | Bounded to task distribution | High-volume, stable, narrow task |
| 8. Embeddings tuning | Right-size model, avoid re-embed, batch ingest | Cuts the RAG embedding line item | None | Large corpus / high query volume |
| 9. AWS credits | AWS funds the spend (Activate / POC / GenAI) | Up to 100% during build phase | None | Build / pre-revenue / migration phase |
Situation: Every support conversation hit a frontier model on-demand, re-sent a 5K-token system prompt and tool schema on every turn, and pasted whole help-center articles into the context. Nightly ticket-classification and a knowledge-base re-embed also ran synchronously through the real-time API. The bill was growing faster than revenue and the team had no eval harness to safely route traffic down.
What CloudRoute did: Routed within 24 hours to an EU AWS partner with a Bedrock FinOps and GenAI track record. The partner built an eval set, then: pinned ticket-classification to a small model with intelligent routing on the chat tier, marked the system prompt + tool schema as cached, moved classification and the nightly re-embed to batch (50%), switched help-center context from full-article stuffing to RAG retrieval, and added max-token caps with structured output on the agent responses.
Outcome: Per-conversation cost fell roughly 80% versus the on-demand baseline; the monthly run-rate dropped from ~$9K toward ~$1.8K at the same traffic — before credits. The partner then filed a Bedrock POC + Activate application; approved credits covered the remaining spend, taking the effective bill to $0 through the build phase. CloudRoute's commission was paid by the partner from AWS engagement funding — the customer paid $0.
engagement window: 5 weeks · founder time: ~7 hours · run-rate cut: ~80% · effective bill during build: $0
CloudRoute routes you to a vetted AWS partner who re-architects your Bedrock workload for cost (routing, caching, batch, RAG) and files the credit application that can take the bill to $0. Customer pays $0 — AWS funds the engagement.