for AWS partners →Make Bedrock $0 with AWS credits →

amazon bedrock pricing · every model · 2026

Amazon Bedrock pricing — every model, every input.

A complete, neutral reference for what Amazon Bedrock actually costs in 2026: how token pricing works, a per-model input/output price table (Claude, Llama, Mistral, Amazon Nova, Titan, Cohere), the four ways to pay — on-demand, Batch (~50% off), Provisioned Throughput, and prompt caching — embeddings and fine-tuning costs, three worked monthly examples, the levers that cut your bill, and how AWS credits make all of it $0 to build.

Make Bedrock $0 with AWS credits →→ jump to the per-model price table

pricing model

per 1K tokens

Batch discount

~50%

upfront commitment

none (on-demand)

cost with credits

TL;DR

Bedrock is billed per token on the on-demand path — you pay one rate per 1,000 input tokens and a higher rate per 1,000 output tokens, and the rate depends entirely on which model you call. Cheap, fast models (Amazon Nova Micro, Claude Haiku) cost cents per million tokens; top frontier models (Claude Opus-class) cost dollars per million.
Four ways to pay, and the choice is the biggest cost lever: On-Demand (no commitment), Batch (~50% cheaper, async), Provisioned Throughput (reserved capacity for steady high volume), and prompt caching (cuts the cost of repeated context — big system prompts, shared documents). Fine-tuning adds a training cost plus an ongoing custom-model hosting cost.
A prototype costs single-digit dollars; production at scale runs to thousands. That gap is exactly what AWS credits cover — Activate (up to $100K), a Bedrock/GenAI POC pool ($10K–$50K), and the GenAI Accelerator (up to $1M). These are largely partner-filed; CloudRoute routes you to the credit pool and a vetted AWS partner so the build costs you $0.

the model

IHow Amazon Bedrock pricing works

Bedrock pricing has a reputation for being confusing, but the core is simple once you separate two things: what you are billed for (tokens, mostly) and how you choose to pay for it (one of four modes). Get those two straight and every line on a Bedrock bill becomes legible.

The fundamental unit is the token. A token is a chunk of text — roughly ¾ of a word in English, so 1,000 tokens ≈ 750 words. Every request to a text model is metered in two directions: input tokens (everything you send — your prompt, the system instruction, any conversation history or retrieved documents) and output tokens (everything the model generates back). You are billed separately for each, almost always at a published rate per 1,000 tokens (some pages quote per-million; divide by 1,000 to convert).

Output tokens are typically priced 3–5× higher than input tokens for the same model, because generation is the expensive part. This matters for cost design: a workload that reads a lot and writes a little (classification, extraction, routing) is cheap; one that writes a lot from a short prompt (long-form generation) is dominated by output cost.

The single biggest driver of cost is which model you pick. Bedrock hosts a ladder of models from many providers, and prices span more than two orders of magnitude from the smallest to the largest. The discipline that separates a cheap Bedrock bill from an expensive one is matching the model to the task — using a small, fast model where it is good enough and reserving frontier models for the genuinely hard requests. Many production systems route requests across a tier of models for exactly this reason.

Not everything is token-priced. Image and video generation is billed per image or per second of video. Embeddings are billed per input token (output is a vector, not charged). Fine-tuning a custom model has a one-time training charge plus an ongoing cost to store and serve it. And the four pricing modes (next section) change the per-token rate or replace it with a capacity charge. Everything else — Knowledge Bases, Guardrails, Agents — is generally billed on top of the underlying model tokens it consumes, plus any supporting AWS services (vector store, storage, etc.).

Important caveat, stated once and meant throughout: the dollar figures on this page are representative as of 2026 to illustrate relative cost and the shape of a bill. Foundation-model prices change frequently as providers compete. Always confirm current rates on the official AWS Bedrock pricing page before budgeting — and see the amazon-bedrock-pricing-calculator sibling to model your own numbers.

the two questions that determine your bill

(1) Which model? — sets the per-token rate (cheap small models vs costly frontier models, a 100×+ range). (2) Which pricing mode? — on-demand, Batch (~50% off), Provisioned Throughput, or prompt caching. Get these two right and Bedrock cost is predictable.

the price table

IIPer-model pricing — input and output per 1K tokens

This is the table most people come for: representative 2026 on-demand text-model prices across the major providers on Bedrock, expressed per 1,000 input and output tokens. Use it to rank models by cost and to sanity-check a budget — not as an audited price sheet.

Rows are ordered roughly cheapest to most expensive. The "per 1M tokens" columns are included because providers increasingly quote prices that way — they are simply the per-1K figure × 1,000. Where a model is sold mainly per-million, we show the equivalent per-1K so the table stays comparable.

representative on-demand bedrock text-model pricing · per 1K and per 1M tokens · 2026

Model	Input / 1K	Output / 1K	Input / 1M	Output / 1M	Typical use
Amazon Nova Micro	$0.000035	$0.00014	$0.035	$0.14	High-volume, simple, cheapest
Amazon Nova Lite	$0.00006	$0.00024	$0.06	$0.24	Fast multimodal, low cost
Claude Haiku	$0.00025	$0.00125	$0.25	$1.25	Fast, cheap, high-throughput
Llama (small, ~8B)	$0.00022	$0.00072	$0.22	$0.72	Open-weight, low cost
Mistral (small)	$0.0002	$0.0006	$0.20	$0.60	Open-weight, efficient
Amazon Nova Pro	$0.0008	$0.0032	$0.80	$3.20	Balanced multimodal
Cohere Command	$0.001	$0.002	$1.00	$2.00	RAG / enterprise text
Claude Sonnet	$0.003	$0.015	$3.00	$15.00	Best all-round workhorse
Llama (large, ~70B+)	$0.00265	$0.0035	$2.65	$3.50	Open-weight, capable
Amazon Nova Premier	$0.0025	$0.0125	$2.50	$12.50	Amazon's most capable
Claude Opus-class	$0.015	$0.075	$15.00	$75.00	Hardest reasoning tasks

Representative 2026 figures for relative comparison only — confirm current rates on the AWS Bedrock pricing page. Output is typically 3–5× input. Image/video (Nova Canvas/Reel, Stability) are per-image/per-second, not per token (see §V). Prices vary by region and exclude prompt-caching and Batch discounts.

the four ways to pay

IIIOn-Demand vs Batch vs Provisioned Throughput vs prompt caching

The same model can cost very different amounts depending on how you buy capacity. Choosing the right mode for each workload is the largest controllable lever on a Bedrock bill — often a bigger swing than switching models.

There are four pricing modes. Most teams use On-Demand for everything at first, then move specific workloads onto Batch, Provisioned, or caching as patterns emerge. They are not mutually exclusive — a single product can use all four for different paths.

On-Demand — pay per request, no commitment

The default. You call the model, you pay the per-token rate from the table above, you commit to nothing. Capacity is shared and subject to per-account throughput limits. Best for: prototypes, variable or low traffic, and anything where you do not yet know your volume. Downsides: it is the highest per-token rate, and during spikes you can hit throttling (which cross-region inference helps smooth).

Batch — ~50% cheaper, asynchronous

You submit a large set of requests as a single job (typically a file in S3) and Bedrock processes them in the background, returning results when done. In exchange for giving up real-time responses, you pay roughly half the on-demand rate. Best for: bulk summarization, classification, enrichment, embedding a large corpus, offline evaluation — any high-volume job that is not latency-sensitive. The ~50% saving is the single easiest cost win for batch-shaped work.

Provisioned Throughput — reserved capacity, hourly

You reserve dedicated model capacity (measured in "model units") for a committed term (hourly, or cheaper with a 1- or 6-month commitment) and pay a flat hourly rate regardless of how many tokens you push through it. This decouples cost from per-token pricing and guarantees throughput and latency. Best for: steady, high, predictable volume where on-demand throttling is a risk or where per-token math at scale exceeds the reserved rate. It is also required for serving most custom (fine-tuned) models. Downside: you pay for the reserved capacity whether or not you use it, so it is wasteful for spiky or low traffic.

Prompt caching — stop re-paying for repeated context

When many requests share a large common prefix — a long system prompt, a big instruction set, a reference document, or few-shot examples — prompt caching lets Bedrock cache that prefix so subsequent requests do not pay full input price for it again. Cached input tokens are billed at a steep discount versus normal input tokens (with a smaller charge to write the cache). Best for: chatbots with a long fixed system prompt, RAG where the same context is reused, or agents with large tool definitions. On the right workload it can cut input cost by a large fraction — see the dedicated amazon-bedrock-prompt-caching page for the mechanics.

the four bedrock pricing modes compared · 2026

Mode	How you pay	Relative cost	Latency	Best for	Watch out for
On-Demand	Per token, no commit	Baseline (highest/token)	Real-time	Prototypes, variable traffic	Throttling at spikes
Batch	Per token, async job	~50% of on-demand	Minutes–hours	Bulk jobs, not time-sensitive	Not for interactive use
Provisioned Throughput	Hourly per model unit	Flat — wins at high steady volume	Real-time, guaranteed	Steady high volume, custom models	Paid even when idle
Prompt caching	Discounted cached input	Big cut on repeated context	Real-time	Long shared prompts / RAG	Only helps with repeated prefixes

These combine: a product can serve interactive traffic On-Demand with prompt caching, run nightly enrichment via Batch, and put one hot model path on Provisioned Throughput.

embeddings, fine-tuning, hosting

IVEmbeddings, fine-tuning, and custom-model hosting costs

Beyond plain inference, three cost categories surprise teams because they are priced differently from chat tokens: embeddings (the engine of RAG and search), fine-tuning (training a custom model), and the ongoing cost of hosting that custom model. Each is worth understanding before you commit to an architecture.

Embeddings — cheap per token, but volume adds up

Embedding models (Amazon Titan Text Embeddings, Cohere Embed) turn text into vectors for semantic search and the retrieval half of every RAG system. They are billed per input token only — the output vector is not charged — at very low rates (representative: roughly $0.00002–$0.0001 per 1K tokens, i.e. cents per million). Individually trivial. The catch is volume: embedding a large document corpus, then re-embedding it whenever content changes, can process hundreds of millions of tokens. Embedding is a classic Batch candidate, and you also pay for wherever the vectors live (a vector store / database).

Fine-tuning — a one-time training charge

Fine-tuning adapts a base model to your data or style. On Bedrock you pay a training cost based on the volume of training data processed (commonly priced per 1,000 training tokens, multiplied by the number of epochs). For typical datasets this is a one-time charge in the tens to low-hundreds of dollars; large datasets cost more. Related techniques like model distillation (training a smaller, cheaper model to mimic a larger one) have their own training cost but can dramatically cut ongoing inference cost — a real lever when a use case is high-volume and narrow.

Custom-model hosting — the recurring cost people forget

A fine-tuned model is not free to keep available. Serving a custom model on Bedrock generally requires Provisioned Throughput, which is an hourly charge that runs continuously while the model is deployed — independent of how many requests you send. This is the line item that most often surprises teams: the fine-tuning was cheap, but a custom model sitting on reserved capacity 24/7 can cost more per month than the inference it serves. The honest guidance: only fine-tune-and-host when the volume and quality gains clearly justify a standing hourly cost; otherwise prompt engineering or RAG on a base model is usually cheaper overall.

the hidden recurring cost

Fine-tuning is a small one-time charge; hosting the resulting custom model on Provisioned Throughput is an ongoing hourly cost that accrues whether or not the model is used. Budget for the hosting, not just the training.

real numbers

VThree worked monthly-cost examples

Abstract per-token rates are hard to feel. Here are three concrete, representative monthly estimates for common workloads, using the table above. They are illustrative — your mileage varies with prompt length, model choice, and mode — but they show the shape and the order of magnitude.

Example A — a simple support chatbot (Claude Haiku, on-demand). Say 50,000 conversations a month, each averaging 800 input tokens (system prompt + history + question) and 400 output tokens. That is 40M input tokens and 20M output tokens. At Haiku's representative rates ($0.25 / $1.25 per 1M), input ≈ $10, output ≈ $25 → ≈ $35/month. Add prompt caching on the fixed system prompt and the input portion drops further. A genuinely cheap production feature.

Example B — a RAG knowledge assistant (Claude Sonnet, on-demand + embeddings). Say 20,000 questions a month, each pulling ~3,000 tokens of retrieved context plus a 500-token question (3,500 input) and producing a 600-token answer. That is 70M input and 12M output tokens. At Sonnet's representative rates ($3 / $15 per 1M), input ≈ $210, output ≈ $180 → ≈ $390/month for inference. Add a one-time corpus embedding of, say, 100M tokens with Titan (≈ $2–$10) plus the vector store. Prompt caching on the repeated instructions and any shared context can cut the input bill substantially. The lesson: in RAG, the retrieved context dominates input cost, so retrieval tuning is a cost lever, not just a quality lever.

Example C — a nightly batch summarization job (Nova Lite, Batch). Say 500,000 documents a month, each ~1,500 input tokens summarized to ~200 output tokens. That is 750M input and 100M output tokens. At Nova Lite's representative on-demand rates ($0.06 / $0.24 per 1M) that would be input ≈ $45, output ≈ $24 → ≈ $69; run it via Batch at ~50% off and it is ≈ $35/month for half a million documents. This is why high-volume, non-interactive work belongs on a cheap model and the Batch path.

Across all three, two patterns repeat: (1) model choice and mode move the number more than anything else — the same workload can be 10× cheaper on the right model; and (2) these are small numbers at prototype and early-production scale, which is precisely why so many teams build the whole thing on AWS credits and pay $0 out of pocket while they prove the workload out.

cutting the bill

VIThe levers that cut your Bedrock cost

If a Bedrock bill is too high, the fix is almost always one of a short list of levers. In rough order of impact, here is what actually moves the number — useful whether you are optimizing your own spend or stretching a pool of AWS credits further.

Right-size the model — The biggest lever. Route easy requests to a cheap, fast model (Nova Micro/Lite, Haiku) and reserve frontier models (Sonnet, Opus-class, Nova Premier) for genuinely hard ones. A tiered router can cut cost 5–10× with little quality loss.
Move bulk work to Batch — Anything not interactive — enrichment, classification, embedding, evaluation — belongs on the Batch path at ~50% off. Often the single easiest win.
Turn on prompt caching — If requests share a long system prompt, instruction set, or document, cache it. On chatbots and RAG with large fixed context this can slash the input portion of the bill.
Shorten prompts and cap output — Input cost scales with everything you send (including retrieved context and history) and output cost with what you let the model write. Trim retrieved chunks, summarize history, and set sensible max-output limits.
Use Provisioned Throughput only for steady volume — Reserved capacity wins when traffic is high and predictable; it wastes money on spiky or low traffic. Match the mode to the traffic shape.
Reconsider fine-tuning vs RAG — A fine-tuned model carries a standing hourly hosting cost. For many cases, prompt engineering or RAG on a base model gives similar results with no reserved-capacity bill.
Watch the supporting services — Knowledge Bases, vector stores, S3, and logging all add cost around the model. Include them in the budget — the model tokens are sometimes not the largest line.

the meta-lever

The largest lever of all for a startup is not paying for it at all during the build. AWS credits cover Bedrock inference, fine-tuning, and the supporting services — so cost optimization becomes "make the credits last," not "protect the runway." That changes how aggressively you can experiment.

how it becomes $0

VIIHow AWS credits make Bedrock $0 to build

Everything above prices what Bedrock costs if you pay AWS directly. For most startups and many companies, the relevant number is different — because AWS will frequently fund the build with credits, and Bedrock spend draws those credits down before it ever touches your card.

AWS runs several credit programs precisely to put generative-AI workloads on AWS, and Bedrock usage is fully credit-eligible. The relevant pools: AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups); a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed specifically at proving out a GenAI use case; and the competitive Generative AI Accelerator (credit awards up to $1M for a small cohort of AI-first startups). Credits apply automatically against your AWS bill — including Bedrock inference, fine-tuning, embeddings, and the supporting services — until exhausted.

The practical mechanic is that most of these pools are partner-filed: they are requested through the AWS Partner Network (the ACE program), not a public self-serve form. That is why teams typically route through an AWS partner rather than applying alone — and it is the gap CloudRoute fills. CloudRoute matches you to the right credit pool for your stage and to a vetted AWS DevOps/ML partner who both files the credit application and helps build the Bedrock workload (the RAG pipeline, the agent, the cost-tuned model routing). The customer pays $0 — AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice.

Put together with the cost levers above, the picture for a startup is: build aggressively on Bedrock, draw down a $25K–$100K credit pool while you find product-market fit, and only start paying real money once usage — and ideally revenue — has scaled past the credits. Related: see the cross-cluster pages on AWS credits for generative-AI startups and Bedrock POC funding for the full credit mechanics.

cost by pricing mode

Same model, four pricing modes — relative monthly cost

To make the mode choice concrete, here is the same illustrative high-volume workload (a Claude Sonnet-class model, ~80M tokens/month) priced four ways. It shows why the mode decision often matters more than the model decision. Figures are representative 2026 illustrations, not quotes.

Pricing mode	How billed	Relative monthly cost	Latency	When it wins	Commitment
On-Demand	Per token	Baseline (100%)	Real-time	Variable / unknown volume	None
On-Demand + prompt caching	Per token, cached prefix discounted	~40–70% (if context repeats)	Real-time	Long shared system prompt / RAG	None
Batch	Per token, async	~50%	Minutes–hours	Non-interactive bulk jobs	None
Provisioned Throughput	Hourly per model unit	Flat — beats on-demand only at high steady volume	Real-time, guaranteed	Steady high volume; custom models	1–6 month for best rate

Modes combine. A real product often serves interactive traffic On-Demand with prompt caching, runs nightly enrichment on Batch, and reserves Provisioned Throughput only for one always-hot path. See amazon-bedrock-pricing-calculator to model your own mix.

before you pay for a single token

Get AWS credits that cover Bedrock — and a partner to build it (you pay $0)

Get matched in 24h →

a recent match

A Bedrock bill that should have been $4K/month — built on $0 — anonymized

inquiry · Series-A consumer AI app, London

Series-A consumer AI product, 22 people, projecting ~$4K/month of Bedrock inference at launch

Situation: The team had modeled their Bedrock costs and were staring at roughly $4K/month at launch traffic, climbing with growth — on a frontier model for every request, on-demand, with a long fixed system prompt re-sent on every call. They wanted both to bring that number down and to avoid paying any of it out of a runway they needed for hiring.

What CloudRoute did: CloudRoute matched them in under 24 hours to a UK AWS partner with GenAI cost-engineering experience. The partner (1) introduced a tiered model router — Nova Lite / Claude Haiku for the easy 80% of requests, Sonnet only for the hard ones; (2) turned on prompt caching for the shared system prompt; (3) moved their nightly content-enrichment job to Batch; and (4) filed a Bedrock POC credit application plus an Activate Portfolio application to fund the whole launch.

Outcome: Modeled inference cost fell from ~$4K to ~$1.1K/month through model-routing, caching, and Batch — and even that was fully covered by the approved credits, so the team paid $0 during the build and early launch. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.

cost cut: ~$4K → ~$1.1K/mo modeled · credits secured: POC + Activate · out-of-pocket during build: $0

faq

Common questions

How much does Amazon Bedrock cost?

Bedrock is billed per token on the on-demand path — a rate per 1,000 input tokens and a higher rate per 1,000 output tokens, depending on the model. Representative 2026 ranges run from a few cents per million tokens (Amazon Nova Micro, Claude Haiku) to dollars per million for top frontier models (Claude Opus-class). A working prototype usually costs single-digit dollars; production at scale runs to thousands. You can cut cost with Batch (~50% off), prompt caching, and Provisioned Throughput. Always confirm current rates on the AWS Bedrock pricing page.

How is Bedrock pricing calculated — what is a token?

A token is a chunk of text, roughly ¾ of a word in English (1,000 tokens ≈ 750 words). Every request is metered as input tokens (everything you send: prompt, system instruction, history, retrieved context) and output tokens (everything the model generates). You pay separately for each per 1,000 tokens, and output is typically 3–5× the input rate. Multiply your monthly input and output token volume by the model's rates to estimate cost.

Is Bedrock cheaper than calling OpenAI directly?

It depends on the model and the workload — both are token-priced and competitive. The cost story for Bedrock is less about a single headline rate and more about (1) model choice across many providers so you can pick the cheapest model that works, (2) the ~50% Batch discount and prompt caching, and (3) billing on your existing AWS account, where AWS credits can cover the spend entirely. For a startup with credits, Bedrock is frequently effectively $0 while OpenAI is paid. See amazon-bedrock-vs-openai for a full comparison.

What is the difference between on-demand, Batch, and Provisioned Throughput pricing?

On-Demand: pay per token with no commitment — best for prototypes and variable traffic. Batch: submit a large async job for roughly half the on-demand price — best for bulk, non-interactive work. Provisioned Throughput: reserve dedicated capacity for a flat hourly rate regardless of usage — best for steady high volume and required for serving most fine-tuned custom models. They combine within one application, and prompt caching can layer on top of on-demand to discount repeated context.

How much does prompt caching save on Bedrock?

Prompt caching discounts the input cost of a repeated prefix — a long system prompt, a shared document, or large tool/few-shot context — so you do not re-pay full input price for it on every request. Cached input tokens are billed at a steep discount versus normal input tokens (with a small charge to write the cache). On workloads with a large fixed context reused across many calls (chatbots, RAG), it can cut the input portion of the bill by a large fraction. It only helps when context actually repeats. See the amazon-bedrock-prompt-caching page.

What does it cost to fine-tune and host a custom model on Bedrock?

Fine-tuning has a one-time training charge based on the volume of training data (commonly priced per 1,000 training tokens × epochs) — often tens to low-hundreds of dollars for typical datasets. The cost people miss is hosting: serving a custom model generally requires Provisioned Throughput, an ongoing hourly charge that accrues continuously while the model is deployed, whether or not it is used. Budget for the standing hosting cost, and only fine-tune-and-host when volume and quality gains justify it — otherwise RAG on a base model is usually cheaper.

How much do embeddings cost on Bedrock?

Embedding models (Amazon Titan Text Embeddings, Cohere Embed) are billed per input token only — the output vector is not charged — at very low rates (representative: roughly $0.00002–$0.0001 per 1K tokens). Per request it is trivial; the cost comes from volume when you embed and re-embed a large document corpus. Embedding is a good Batch candidate, and you also pay for the vector store where the embeddings live.

Can AWS credits cover Bedrock costs?

Yes — Bedrock inference, fine-tuning, embeddings, and supporting services are all credit-eligible and credits apply automatically against your AWS bill. The relevant pools are AWS Activate (up to $100K), a dedicated Bedrock/GenAI POC pool ($10K–$50K), and the GenAI Accelerator (up to $1M for selected startups). These are largely partner-filed via the AWS Partner Network, which is why teams route through a partner. CloudRoute matches you to the right pool and a vetted AWS partner who files the application and builds the workload — customer pays $0, AWS funds it.

Stop pricing Bedrock — get it funded

Whatever your Bedrock bill would be, AWS credits can cover it. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS partner to build and cost-tune the workload. Customer pays $0.

Get matched in 24h →→ see the AI-team persona detail

matched within< 24h

GenAI credit ceilingup to $1M

cost to you$0