A complete, neutral reference for what Amazon Bedrock actually costs in 2026: how token pricing works, a per-model input/output price table (Claude, Llama, Mistral, Amazon Nova, Titan, Cohere), the four ways to pay — on-demand, Batch (~50% off), Provisioned Throughput, and prompt caching — embeddings and fine-tuning costs, three worked monthly examples, the levers that cut your bill, and how AWS credits make all of it $0 to build.
Bedrock pricing has a reputation for being confusing, but the core is simple once you separate two things: what you are billed for (tokens, mostly) and how you choose to pay for it (one of four modes). Get those two straight and every line on a Bedrock bill becomes legible.
The fundamental unit is the token. A token is a chunk of text, roughly ¾ of a word in English, so 1,000 tokens ≈ 750 words. Every request to a text model is metered in two directions: input tokens (everything you send: your prompt, the system instruction, any conversation history or retrieved documents) and output tokens (everything the model generates back). You are billed separately for each, almost always at a published rate per 1,000 tokens. Some pages quote per 1 million tokens instead; divide a per-million price by 1,000 to get the per-1,000 price (for example, $3.00 per 1M is $0.003 per 1K).
Output tokens are typically priced 3–5× higher than input tokens for the same model, because generation is the expensive part. (A few models in the price table on this page have a narrower gap, closer to 1.3–2×.) This matters for cost design: a workload that reads a lot and writes a little (classification, extraction, routing) is cheap; one that writes a lot from a short prompt (long-form generation) is dominated by output cost.
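The billing arithmetic above can be sketched in a few lines. This is a minimal illustration, not an AWS API: the function names `per_million_to_per_thousand` and `request_cost` are hypothetical, the token counts are made up, and the rates are the representative Claude Haiku figures used elsewhere on this page.

```python
def per_million_to_per_thousand(rate_per_million: float) -> float:
    """Convert a price quoted per 1M tokens into a price per 1K tokens."""
    return rate_per_million / 1000


def request_cost(input_tokens: int, output_tokens: int,
                 input_rate_per_1k: float, output_rate_per_1k: float) -> float:
    """Dollar cost of one request, metered separately in each direction."""
    input_cost = input_tokens / 1000 * input_rate_per_1k
    output_cost = output_tokens / 1000 * output_rate_per_1k
    return input_cost + output_cost


# Representative Haiku-class rates per 1K tokens (illustrative, not a quote).
HAIKU_IN, HAIKU_OUT = 0.00025, 0.00125

# Read-heavy request: 3,000 tokens in, 50 out (for example, classification).
read_heavy = request_cost(3000, 50, HAIKU_IN, HAIKU_OUT)
# Write-heavy request: 100 tokens in, 2,000 out (long-form generation).
write_heavy = request_cost(100, 2000, HAIKU_IN, HAIKU_OUT)

print(f"read-heavy:  ${read_heavy:.6f}")
print(f"write-heavy: ${write_heavy:.6f}")
```

Even though the read-heavy request moves more tokens in total, the write-heavy one costs roughly three times as much, because nearly all of its tokens are billed at the output rate.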
The single biggest driver of cost is which model you pick. Bedrock hosts a ladder of models from many providers, and prices span more than two orders of magnitude from the smallest to the largest. The discipline that separates a cheap Bedrock bill from an expensive one is matching the model to the task — using a small, fast model where it is good enough and reserving frontier models for the genuinely hard requests. Many production systems route requests across a tier of models for exactly this reason.
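Routing across a tier of models can be sketched as below. Everything here is an assumption for illustration: the tier names are labels rather than Bedrock model IDs, the task labels and the `pick_tier` rule are hypothetical, and the rates are representative per-1K figures in the Nova Lite, Claude Haiku, and Claude Sonnet range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    input_rate_per_1k: float
    output_rate_per_1k: float


# Representative rates; names are labels, not real model identifiers.
SMALL = Tier("small", 0.00006, 0.00024)
MID = Tier("mid", 0.00025, 0.00125)
LARGE = Tier("large", 0.003, 0.015)

# Hypothetical task labels; a real system would classify requests itself.
EASY_TASKS = {"classification", "extraction", "routing"}
HARD_TASKS = {"multi_step_reasoning", "long_form_generation"}


def pick_tier(task: str) -> Tier:
    """Send easy tasks to the cheapest tier and hard ones to the largest."""
    if task in EASY_TASKS:
        return SMALL
    if task in HARD_TASKS:
        return LARGE
    return MID


def request_cost(tier: Tier, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request on the given tier."""
    return (input_tokens / 1000 * tier.input_rate_per_1k
            + output_tokens / 1000 * tier.output_rate_per_1k)


# Made-up traffic mix: 8 easy requests, 2 hard ones, 1,000 in / 300 out each.
tasks = ["classification"] * 8 + ["multi_step_reasoning"] * 2
routed_total = sum(request_cost(pick_tier(t), 1000, 300) for t in tasks)
all_large_total = sum(request_cost(LARGE, 1000, 300) for _ in tasks)

print(f"routed:    ${routed_total:.5f}")
print(f"all large: ${all_large_total:.5f}")
```

The design choice is that the router only decides which rate applies; quality checks and fallbacks to a larger tier are left out of this sketch.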
Not everything is token-priced. Image and video generation is billed per image or per second of video. Embeddings are billed per input token (output is a vector, not charged). Fine-tuning a custom model has a one-time training charge plus an ongoing cost to store and serve it. And the four pricing modes (next section) change the per-token rate or replace it with a capacity charge. Everything else — Knowledge Bases, Guardrails, Agents — is generally billed on top of the underlying model tokens it consumes, plus any supporting AWS services (vector store, storage, etc.).
Important caveat, stated once and meant throughout: the dollar figures on this page are representative as of 2026 and are meant to illustrate relative cost and the shape of a bill. Foundation-model prices change frequently as providers compete. Always confirm current rates on the official AWS Bedrock pricing page before budgeting, and see the sibling page amazon-bedrock-pricing-calculator to model your own numbers.
(1) Which model? — sets the per-token rate (cheap small models vs costly frontier models, a 100×+ range). (2) Which pricing mode? — on-demand, Batch (~50% off), Provisioned Throughput, or prompt caching. Get these two right and Bedrock cost is predictable.
This is the table most people come for: representative 2026 on-demand text-model prices across the major providers on Bedrock, expressed per 1,000 input and output tokens. Use it to rank models by cost and to sanity-check a budget — not as an audited price sheet.
Rows are ordered roughly cheapest to most expensive. The "per 1M tokens" columns are included because providers increasingly quote prices that way — they are simply the per-1K figure × 1,000. Where a model is sold mainly per-million, we show the equivalent per-1K so the table stays comparable.
| Model | Input / 1K | Output / 1K | Input / 1M | Output / 1M | Typical use |
|---|---|---|---|---|---|
| Amazon Nova Micro | $0.000035 | $0.00014 | $0.035 | $0.14 | High-volume, simple, cheapest |
| Amazon Nova Lite | $0.00006 | $0.00024 | $0.06 | $0.24 | Fast multimodal, low cost |
| Claude Haiku | $0.00025 | $0.00125 | $0.25 | $1.25 | Fast, cheap, high-throughput |
| Llama (small, ~8B) | $0.00022 | $0.00072 | $0.22 | $0.72 | Open-weight, low cost |
| Mistral (small) | $0.0002 | $0.0006 | $0.20 | $0.60 | Open-weight, efficient |
| Amazon Nova Pro | $0.0008 | $0.0032 | $0.80 | $3.20 | Balanced multimodal |
| Cohere Command | $0.001 | $0.002 | $1.00 | $2.00 | RAG / enterprise text |
| Claude Sonnet | $0.003 | $0.015 | $3.00 | $15.00 | Best all-round workhorse |
| Llama (large, ~70B+) | $0.00265 | $0.0035 | $2.65 | $3.50 | Open-weight, capable |
| Amazon Nova Premier | $0.0025 | $0.0125 | $2.50 | $12.50 | Amazon's most capable |
| Claude Opus-class | $0.015 | $0.075 | $15.00 | $75.00 | Hardest reasoning tasks |
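One way to use the table for ranking is to collapse each row into a single blended price for an assumed traffic mix. The sketch below copies the per-1M figures from the table; the 75% input share and the function name `blended_per_1m` are assumptions chosen for illustration.

```python
# (input, output) dollars per 1M tokens, copied from the table above.
PRICES_PER_1M = {
    "Amazon Nova Micro": (0.035, 0.14),
    "Amazon Nova Lite": (0.06, 0.24),
    "Claude Haiku": (0.25, 1.25),
    "Llama (small, ~8B)": (0.22, 0.72),
    "Mistral (small)": (0.20, 0.60),
    "Amazon Nova Pro": (0.80, 3.20),
    "Cohere Command": (1.00, 2.00),
    "Claude Sonnet": (3.00, 15.00),
    "Llama (large, ~70B+)": (2.65, 3.50),
    "Amazon Nova Premier": (2.50, 12.50),
    "Claude Opus-class": (15.00, 75.00),
}


def blended_per_1m(prices: tuple[float, float], input_share: float = 0.75) -> float:
    """Average price per 1M tokens when input_share of the tokens are input."""
    input_rate, output_rate = prices
    return input_rate * input_share + output_rate * (1 - input_share)


# Rank models from cheapest to most expensive for the assumed mix.
ranked = sorted(PRICES_PER_1M, key=lambda m: blended_per_1m(PRICES_PER_1M[m]))

for model in ranked:
    input_rate, output_rate = PRICES_PER_1M[model]
    blended = blended_per_1m((input_rate, output_rate))
    ratio = output_rate / input_rate
    print(f"{model:<22} blended ${blended:>7.3f}/1M  output/input {ratio:.1f}x")

cheapest = blended_per_1m(PRICES_PER_1M[ranked[0]])
priciest = blended_per_1m(PRICES_PER_1M[ranked[-1]])
print(f"spread: {priciest / cheapest:.0f}x")
```

Changing `input_share` reorders the middle of the ranking, which is a reminder that the cheapest model depends on whether a workload is read-heavy or write-heavy.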
The same model can cost very different amounts depending on how you buy capacity. Once the model is chosen, picking the right mode for each workload is the largest remaining lever on a Bedrock bill, and for a fixed model it can roughly halve the cost.
There are four pricing modes. Most teams use On-Demand for everything at first, then move specific workloads onto Batch, Provisioned, or caching as patterns emerge. They are not mutually exclusive — a single product can use all four for different paths.
On-Demand is the default. You call the model, pay the per-token rate from the table above, and commit to nothing. Capacity is shared and subject to per-account throughput limits. Best for: prototypes, variable or low traffic, and anything where you do not yet know your volume. Downsides: it is the highest per-token rate, and during spikes you can hit throttling (which cross-region inference helps smooth).
With Batch, you submit a large set of requests as a single job (typically a file in S3) and Bedrock processes them in the background, returning results when done. In exchange for giving up real-time responses, you pay roughly half the on-demand rate. Best for: bulk summarization, classification, enrichment, embedding a large corpus, offline evaluation, or any other high-volume job that is not latency-sensitive. The ~50% saving is the single easiest cost win for batch-shaped work.
With Provisioned Throughput, you reserve dedicated model capacity (measured in "model units") and pay a flat hourly rate regardless of how many tokens you push through it. The rate can be paid hourly with no commitment, or at a lower hourly price with a 1- or 6-month commitment. This decouples cost from per-token pricing and guarantees throughput and latency. Best for: steady, high, predictable volume where on-demand throttling is a risk or where per-token math at scale exceeds the reserved rate. It is also required for serving most custom (fine-tuned) models. Downside: you pay for the reserved capacity whether or not you use it, so it is wasteful for spiky or low traffic.
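The point at which a flat hourly charge overtakes per-token pricing can be estimated with a simple break-even calculation. This is a sketch under stated assumptions: the $40 hourly rate is a hypothetical placeholder, not an AWS price, the $6 blended per-1M rate is illustrative, and the calculation ignores the throughput ceiling of a model unit.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year divided by 12


def provisioned_monthly_cost(hourly_rate_per_unit: float,
                             model_units: int = 1,
                             hours: float = HOURS_PER_MONTH) -> float:
    """Flat monthly charge for reserved capacity, used or not."""
    return hourly_rate_per_unit * model_units * hours


def breakeven_tokens_per_month(hourly_rate_per_unit: float,
                               blended_rate_per_1m: float,
                               model_units: int = 1) -> float:
    """Monthly token volume above which the flat charge beats on-demand."""
    flat = provisioned_monthly_cost(hourly_rate_per_unit, model_units)
    return flat / blended_rate_per_1m * 1_000_000


# Hypothetical inputs: $40/hour per model unit, $6 blended per 1M tokens.
flat_cost = provisioned_monthly_cost(40.0)
breakeven = breakeven_tokens_per_month(40.0, 6.0)

print(f"flat monthly charge: ${flat_cost:,.0f}")
print(f"break-even volume:   {breakeven / 1e9:.2f}B tokens/month")
```

With these placeholder numbers the break-even sits in the billions of tokens per month, which is consistent with Provisioned Throughput being wasteful for spiky or low traffic.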
Prompt caching applies when many requests share a large common prefix: a long system prompt, a big instruction set, a reference document, or few-shot examples. Bedrock caches that prefix so subsequent requests do not pay full input price for it again. Cached input tokens are billed at a steep discount versus normal input tokens (with a smaller charge to write the cache). Best for: chatbots with a long fixed system prompt, RAG where the same context is reused, or agents with large tool definitions. On the right workload it can cut input cost by a large fraction; see the dedicated amazon-bedrock-prompt-caching page for the mechanics.
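The effect of caching on the input side of the bill can be sketched as follows. The multipliers are assumptions for illustration (cache reads at 10% of the input rate, cache writes at 125%), as are the function name, the token counts, and the number of cache writes; actual discounts and cache lifetimes vary by model.

```python
def input_cost_with_cache(requests: int, prefix_tokens: int, unique_tokens: int,
                          input_rate_per_1m: float, cache_writes: int,
                          read_multiplier: float = 0.10,
                          write_multiplier: float = 1.25) -> float:
    """Monthly input cost when a shared prefix is cached.

    cache_writes is how many requests had to (re)write the prefix to the
    cache, for example after it expired; all other requests read it.
    """
    rate = input_rate_per_1m / 1_000_000  # dollars per token
    write_cost = cache_writes * prefix_tokens * rate * write_multiplier
    read_cost = (requests - cache_writes) * prefix_tokens * rate * read_multiplier
    unique_cost = requests * unique_tokens * rate  # never cached
    return write_cost + read_cost + unique_cost


def input_cost_without_cache(requests: int, prefix_tokens: int,
                             unique_tokens: int, input_rate_per_1m: float) -> float:
    """Monthly input cost when every request pays full price for the prefix."""
    return requests * (prefix_tokens + unique_tokens) * input_rate_per_1m / 1_000_000


# Made-up workload: 50,000 requests, 600-token shared prefix, 200 unique tokens,
# $0.25 per 1M input tokens, prefix rewritten to the cache 1,000 times.
uncached = input_cost_without_cache(50_000, 600, 200, 0.25)
cached = input_cost_with_cache(50_000, 600, 200, 0.25, cache_writes=1_000)

print(f"without caching: ${uncached:.2f}")
print(f"with caching:    ${cached:.2f}")
```

Only the shared prefix benefits, so the saving shrinks as the unique part of each request grows relative to the prefix.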
| Mode | How you pay | Relative cost | Latency | Best for | Watch out for |
|---|---|---|---|---|---|
| On-Demand | Per token, no commit | Baseline (highest/token) | Real-time | Prototypes, variable traffic | Throttling at spikes |
| Batch | Per token, async job | ~50% of on-demand | Minutes–hours | Bulk jobs, not time-sensitive | Not for interactive use |
| Provisioned Throughput | Hourly per model unit | Flat — wins at high steady volume | Real-time, guaranteed | Steady high volume, custom models | Paid even when idle |
| Prompt caching | Discounted cached input | Big cut on repeated context | Real-time | Long shared prompts / RAG | Only helps with repeated prefixes |
Beyond plain inference, three cost categories surprise teams because they are priced differently from chat tokens: embeddings (the engine of RAG and search), fine-tuning (training a custom model), and the ongoing cost of hosting that custom model. Each is worth understanding before you commit to an architecture.
Embedding models (Amazon Titan Text Embeddings, Cohere Embed) turn text into vectors for semantic search and the retrieval half of every RAG system. They are billed per input token only — the output vector is not charged — at very low rates (representative: roughly $0.00002–$0.0001 per 1K tokens, i.e. cents per million). Individually trivial. The catch is volume: embedding a large document corpus, then re-embedding it whenever content changes, can process hundreds of millions of tokens. Embedding is a classic Batch candidate, and you also pay for wherever the vectors live (a vector store / database).
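The volume effect is easy to estimate. The sketch below uses the representative rate range quoted above; the function names, the 100M-token corpus, and the 20% monthly change rate are hypothetical, and vector store costs are not modeled.

```python
def initial_embedding_cost(corpus_tokens: int, rate_per_1k: float) -> float:
    """One-time cost of embedding the whole corpus (input tokens only)."""
    return corpus_tokens / 1000 * rate_per_1k


def monthly_reembedding_cost(corpus_tokens: int, rate_per_1k: float,
                             changed_fraction: float) -> float:
    """Recurring cost of re-embedding the share of the corpus that changed."""
    return initial_embedding_cost(corpus_tokens, rate_per_1k) * changed_fraction


CORPUS_TOKENS = 100_000_000          # hypothetical corpus size
LOW_RATE, HIGH_RATE = 0.00002, 0.0001  # representative $ per 1K tokens

low = initial_embedding_cost(CORPUS_TOKENS, LOW_RATE)
high = initial_embedding_cost(CORPUS_TOKENS, HIGH_RATE)
churn = monthly_reembedding_cost(CORPUS_TOKENS, HIGH_RATE, changed_fraction=0.2)

print(f"initial embedding: ${low:.2f} to ${high:.2f}")
print(f"monthly re-embed at 20% churn (high rate): ${churn:.2f}")
```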
Fine-tuning adapts a base model to your data or style. On Bedrock you pay a training cost based on the volume of training data processed (commonly priced per 1,000 training tokens, multiplied by the number of epochs). For typical datasets this is a one-time charge in the tens to low-hundreds of dollars; large datasets cost more. Related techniques like model distillation (training a smaller, cheaper model to mimic a larger one) have their own training cost but can dramatically cut ongoing inference cost — a real lever when a use case is high-volume and narrow.
A fine-tuned model is not free to keep available. Serving a custom model on Bedrock generally requires Provisioned Throughput, which is an hourly charge that runs continuously while the model is deployed — independent of how many requests you send. This is the line item that most often surprises teams: the fine-tuning was cheap, but a custom model sitting on reserved capacity 24/7 can cost more per month than the inference it serves. The honest guidance: only fine-tune-and-host when the volume and quality gains clearly justify a standing hourly cost; otherwise prompt engineering or RAG on a base model is usually cheaper overall.
Fine-tuning is a small one-time charge; hosting the resulting custom model on Provisioned Throughput is an ongoing hourly cost that accrues whether or not the model is used. Budget for the hosting, not just the training.
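The gap between the training charge and the hosting charge can be put side by side. Every number below is a hypothetical placeholder rather than an AWS price: a $0.008 per 1K training-token rate, a 5M-token dataset trained for 3 epochs, and a $20 hourly hosting rate for one model unit.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year divided by 12


def training_cost(training_tokens: int, epochs: int, rate_per_1k: float) -> float:
    """One-time fine-tuning charge: tokens processed across all epochs."""
    return training_tokens / 1000 * epochs * rate_per_1k


def hosting_cost_per_month(hourly_rate: float, model_units: int = 1) -> float:
    """Standing charge for keeping the custom model deployed around the clock."""
    return hourly_rate * model_units * HOURS_PER_MONTH


# Hypothetical inputs, chosen only to show the shape of the comparison.
one_time = training_cost(5_000_000, epochs=3, rate_per_1k=0.008)
monthly = hosting_cost_per_month(hourly_rate=20.0)

print(f"training (one-time): ${one_time:,.2f}")
print(f"hosting (per month): ${monthly:,.2f}")
print(f"hosting for a year vs training: {monthly * 12 / one_time:,.0f}x")
```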
Abstract per-token rates are hard to feel. Here are three concrete, representative monthly estimates for common workloads, using the table above. They are illustrative — your mileage varies with prompt length, model choice, and mode — but they show the shape and the order of magnitude.
Example A — a simple support chatbot (Claude Haiku, on-demand). Say 50,000 conversations a month, each averaging 800 input tokens (system prompt + history + question) and 400 output tokens. That is 40M input tokens and 20M output tokens. At Haiku's representative rates ($0.25 / $1.25 per 1M), input ≈ $10, output ≈ $25 → ≈ $35/month. Add prompt caching on the fixed system prompt and the input portion drops further. A genuinely cheap production feature.
Example B — a RAG knowledge assistant (Claude Sonnet, on-demand + embeddings). Say 20,000 questions a month, each pulling ~3,000 tokens of retrieved context plus a 500-token question (3,500 input) and producing a 600-token answer. That is 70M input and 12M output tokens. At Sonnet's representative rates ($3 / $15 per 1M), input ≈ $210, output ≈ $180 → ≈ $390/month for inference. Add a one-time corpus embedding of, say, 100M tokens with Titan (≈ $2–$10) plus the vector store. Prompt caching on the repeated instructions and any shared context can cut the input bill substantially. The lesson: in RAG, the retrieved context dominates input cost, so retrieval tuning is a cost lever, not just a quality lever.
Example C — a nightly batch summarization job (Nova Lite, Batch). Say 500,000 documents a month, each ~1,500 input tokens summarized to ~200 output tokens. That is 750M input and 100M output tokens. At Nova Lite's representative on-demand rates ($0.06 / $0.24 per 1M) that would be input ≈ $45, output ≈ $24 → ≈ $69; run it via Batch at ~50% off and it is ≈ $35/month for half a million documents. This is why high-volume, non-interactive work belongs on a cheap model and the Batch path.
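The three estimates above can be reproduced with one small function, which also makes it easy to swap in your own volumes. The rates are the representative per-1M figures from the price table; the function name `monthly_cost` and the `discount` parameter are illustrative, and the Batch discount is taken as exactly 50%.

```python
# (input, output) dollars per 1M tokens, representative figures from the table.
RATES_PER_1M = {
    "Claude Haiku": (0.25, 1.25),
    "Claude Sonnet": (3.00, 15.00),
    "Amazon Nova Lite": (0.06, 0.24),
}


def monthly_cost(model: str, requests: int, input_tokens: int,
                 output_tokens: int, discount: float = 0.0) -> float:
    """Monthly inference cost for a fixed per-request token profile."""
    input_rate, output_rate = RATES_PER_1M[model]
    input_cost = requests * input_tokens / 1_000_000 * input_rate
    output_cost = requests * output_tokens / 1_000_000 * output_rate
    return (input_cost + output_cost) * (1 - discount)


# Example A: support chatbot, 50,000 conversations, 800 in / 400 out.
example_a = monthly_cost("Claude Haiku", 50_000, 800, 400)
# Example B: RAG assistant, 20,000 questions, 3,500 in / 600 out.
example_b = monthly_cost("Claude Sonnet", 20_000, 3_500, 600)
# Example C: batch summarization, 500,000 documents, 1,500 in / 200 out.
example_c_on_demand = monthly_cost("Amazon Nova Lite", 500_000, 1_500, 200)
example_c_batch = monthly_cost("Amazon Nova Lite", 500_000, 1_500, 200, discount=0.5)

print(f"A (Haiku, on-demand):     ${example_a:.2f}")
print(f"B (Sonnet, on-demand):    ${example_b:.2f}")
print(f"C (Nova Lite, on-demand): ${example_c_on_demand:.2f}")
print(f"C (Nova Lite, Batch):     ${example_c_batch:.2f}")
```

Prompt caching, embeddings, and the vector store are left out here, so Examples A and B are upper bounds on input cost and lower bounds on the total bill.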
Across all three, two patterns repeat: (1) model choice and mode move the number more than anything else — the same workload can be 10× cheaper on the right model; and (2) these are small numbers at prototype and early-production scale, which is precisely why so many teams build the whole thing on AWS credits and pay $0 out of pocket while they prove the workload out.
If a Bedrock bill is too high, the fix is almost always one of a short list of levers. In rough order of impact, here is what actually moves the number — useful whether you are optimizing your own spend or stretching a pool of AWS credits further.
The largest lever of all for a startup is not paying for it at all during the build. AWS credits cover Bedrock inference, fine-tuning, and the supporting services — so cost optimization becomes "make the credits last," not "protect the runway." That changes how aggressively you can experiment.
Everything above prices what Bedrock costs if you pay AWS directly. For most startups and many companies, the relevant number is different — because AWS will frequently fund the build with credits, and Bedrock spend draws those credits down before it ever touches your card.
AWS runs several credit programs precisely to put generative-AI workloads on AWS, and Bedrock usage is fully credit-eligible. The relevant pools: AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups); a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed specifically at proving out a GenAI use case; and the competitive Generative AI Accelerator (credit awards up to $1M for a small cohort of AI-first startups). Credits apply automatically against your AWS bill — including Bedrock inference, fine-tuning, embeddings, and the supporting services — until exhausted.
The practical mechanic is that most of these pools are partner-filed: they are requested through the AWS Partner Network (the ACE program), not a public self-serve form. That is why teams typically route through an AWS partner rather than applying alone — and it is the gap CloudRoute fills. CloudRoute matches you to the right credit pool for your stage and to a vetted AWS DevOps/ML partner who both files the credit application and helps build the Bedrock workload (the RAG pipeline, the agent, the cost-tuned model routing). The customer pays $0 — AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice.
Put together with the cost levers above, the picture for a startup is: build aggressively on Bedrock, draw down a $25K–$100K credit pool while you find product-market fit, and only start paying real money once usage — and ideally revenue — has scaled past the credits. Related: see the cross-cluster pages on AWS credits for generative-AI startups and Bedrock POC funding for the full credit mechanics.
To make the mode choice concrete, here is the same illustrative high-volume workload (a Claude Sonnet-class model, ~80M tokens/month) priced four ways. It shows how far the mode decision alone can move the bill for a fixed model. Figures are representative 2026 illustrations, not quotes.
| Pricing mode | How billed | Relative monthly cost | Latency | When it wins | Commitment |
|---|---|---|---|---|---|
| On-Demand | Per token | Baseline (100%) | Real-time | Variable / unknown volume | None |
| On-Demand + prompt caching | Per token, cached prefix discounted | ~40–70% (if context repeats) | Real-time | Long shared system prompt / RAG | None |
| Batch | Per token, async | ~50% | Minutes–hours | Non-interactive bulk jobs | None |
| Provisioned Throughput | Hourly per model unit | Flat — beats on-demand only at high steady volume | Real-time, guaranteed | Steady high volume; custom models | 1–6 month for best rate |
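The relative costs in the table can be approximated with a short calculation. The inputs are assumptions, not figures from this page: the 80M tokens are split 72M input and 8M output, 80% of input is a cached prefix read at 10% of the input rate (cache writes ignored), Batch is exactly 50% off, and Provisioned Throughput uses a hypothetical $40 hourly rate for one model unit.

```python
INPUT_TOKENS_M, OUTPUT_TOKENS_M = 72, 8  # millions of tokens per month (assumed split)
INPUT_RATE, OUTPUT_RATE = 3.00, 15.00    # representative Sonnet-class $ per 1M
CACHED_SHARE, CACHE_READ_MULTIPLIER = 0.80, 0.10  # assumed caching profile
HOURLY_RATE, HOURS_PER_MONTH = 40.0, 730          # hypothetical provisioned rate

input_cost = INPUT_TOKENS_M * INPUT_RATE
output_cost = OUTPUT_TOKENS_M * OUTPUT_RATE

on_demand = input_cost + output_cost
batch = on_demand * 0.5
# Cached share of input pays the discounted read rate; the rest pays full price.
cached_input = input_cost * ((1 - CACHED_SHARE) + CACHED_SHARE * CACHE_READ_MULTIPLIER)
with_caching = cached_input + output_cost
provisioned = HOURLY_RATE * HOURS_PER_MONTH  # flat, independent of token volume

costs = {
    "On-Demand": on_demand,
    "On-Demand + prompt caching": with_caching,
    "Batch": batch,
    "Provisioned Throughput": provisioned,
}

for mode, cost in costs.items():
    print(f"{mode:<28} ${cost:>10,.2f}  {cost / on_demand:>7.0%} of on-demand")
```

Under these assumptions caching and Batch both land near half of the on-demand bill, while the flat provisioned charge is far higher at this volume and only pays off at much larger steady throughput.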
Situation: The team had modeled their Bedrock costs and were staring at roughly $4K/month at launch traffic, climbing with growth — on a frontier model for every request, on-demand, with a long fixed system prompt re-sent on every call. They wanted both to bring that number down and to avoid paying any of it out of a runway they needed for hiring.
What CloudRoute did: CloudRoute matched them in under 24 hours to a UK AWS partner with GenAI cost-engineering experience. The partner (1) introduced a tiered model router — Nova Lite / Claude Haiku for the easy 80% of requests, Sonnet only for the hard ones; (2) turned on prompt caching for the shared system prompt; (3) moved their nightly content-enrichment job to Batch; and (4) filed a Bedrock POC credit application plus an Activate Portfolio application to fund the whole launch.
Outcome: Modeled inference cost fell from ~$4K to ~$1.1K/month through model-routing, caching, and Batch — and even that was fully covered by the approved credits, so the team paid $0 during the build and early launch. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.
cost cut: ~$4K → ~$1.1K/mo modeled · credits secured: POC + Activate · out-of-pocket during build: $0
Whatever your Bedrock bill would be, AWS credits can cover it. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS partner to build and cost-tune the workload. Customer pays $0.