A focused, neutral reference for how Amazon Bedrock charges for tokens in 2026: exactly what counts as a token and how Bedrock meters it, why input and output tokens are priced separately (and why output costs 3–5× more), a big side-by-side per-model table in both per-1K and per-1M for Nova, Claude, Llama, Mistral, Cohere and Titan, how to estimate token counts for your own text before you ship, why output tokens quietly dominate most bills, and how prompt caching and Batch change the per-token math. All figures are representative as of 2026 — confirm current rates on the AWS Bedrock pricing page.
Before any price makes sense you have to be precise about the thing being priced. On Bedrock the billable unit is the token, not the word, the character, or the API call — and the gap between a token and a word is where most budget surprises start.
A token is a sub-word chunk of text produced by the model's tokenizer. Common short words are usually a single token ("the", "cloud", "model"), but longer or rarer words split into several ("tokenization" might be two or three tokens), and whitespace, punctuation, numbers, and code symbols all consume tokens too. A useful working rule for English prose is that 1 token ≈ ¾ of a word, so 1,000 tokens ≈ 750 words and a token is roughly 4 characters on average. Other languages and code tokenize less efficiently — non-English text and source code often use noticeably more tokens per word.
Each model family has its own tokenizer, so the exact same paragraph can be a slightly different number of tokens on Claude versus Llama versus Amazon Nova. The differences are usually small for ordinary English, but they mean a token count is always model-specific — there is no single universal token. When you compare model prices, you are comparing the rate per token and implicitly the efficiency of that model's tokenizer, though the per-token rate dominates the comparison in practice.
Crucially, Bedrock meters tokens in both directions of every request. Input tokens are everything you send into the model on a given call: the user's message, your system prompt or instructions, any conversation history you replay, any documents or retrieved context you stuff into the prompt, and any tool or function definitions. Output tokens are everything the model generates in response. You are billed for each separately, at different rates, and the count resets every request — there is no flat monthly token allowance on the on-demand path; you pay for exactly what flows through.
A few practical consequences fall straight out of this definition. Long system prompts are not free — a 600-word instruction block re-sent on every call is ~800 input tokens billed every single time (which is exactly what prompt caching exists to fix). Conversation history compounds — replaying the whole transcript on each turn means later turns in a long chat cost far more in input than early ones. And retrieved context in RAG is usually the largest input component — pulling 3,000 tokens of documents to answer a 50-token question means the question is a rounding error and the retrieval is the cost.
A token ≈ ¾ of an English word (≈ 4 characters); 1,000 tokens ≈ 750 words. Bedrock bills input tokens (everything you send: prompt + system instruction + history + retrieved context + tool defs) and output tokens (everything the model writes) separately, per request, at different rates.
The single most important thing to internalize about Bedrock token costs is that input and output are billed at different rates, and output is the expensive one. Almost every cost-design decision flows from this asymmetry.
For a given model, the published output rate is typically 3–5× the input rate. On a representative Claude Sonnet-class model that is roughly $3 per million input tokens against $15 per million output — a 5× gap. The reason is mechanical: reading your prompt is a single forward pass the model can process in parallel, but generating output is autoregressive — the model produces one token at a time, each conditioned on all the tokens before it, so output is far more compute-intensive per token. You are paying for that difference.
This asymmetry means two workloads with the same total token count can have wildly different bills depending on the input/output split. A workload that reads a lot and writes a little — classification, extraction, routing, sentiment scoring, "answer yes or no" — is cheap, because the bulk of its tokens are charged at the lower input rate. A workload that writes a lot from a short prompt — drafting articles, generating code, long-form summaries, synthetic data — is dominated by the higher output rate even though it sends very little in.
There is a second-order effect that catches teams out: output tokens are also the ones you control least precisely. You know exactly how long your input is before you send it, but the model decides how much to generate (up to your max-output cap). A prompt that invites a rambling answer can cost several times what a tightly-scoped one does for identical input. This is why "cap your max output tokens" and "ask for terse answers" are real cost levers, not just style preferences — and why streaming a response does not change the price (you are billed for the tokens generated, not for how they are delivered).
The design takeaway is to think in two budgets, not one. Estimate input tokens (which you can measure precisely from your prompt template, history policy, and retrieval size) and output tokens (which you bound with a max-tokens limit and shape with instructions) separately, price each at the model's respective rate, and add them. Treating "tokens" as one undifferentiated number is the most common way teams misjudge a Bedrock bill — usually underestimating it, because they reason about the short prompt they typed and forget the long answer they asked for.
| Dimension | Input tokens | Output tokens |
|---|---|---|
| What it covers | Prompt + system instruction + history + retrieved context + tool defs | Everything the model generates back |
| Relative price | Baseline (lower) | Typically 3–5× the input rate |
| Why | Read in one parallel forward pass | Generated one token at a time (autoregressive) |
| How predictable | Known exactly before you send | Bounded by your max-output cap; model decides actual length |
| Main lever | Trim history, shrink retrieved context, prompt caching | Cap max tokens, ask for terse answers |
| Cheap when | Workload reads a lot, writes a little (classify/extract) | Rarely the cheap side — minimize generated length |
You do not need to call the API to get a usable cost estimate. With two back-of-envelope rules and a simple formula you can size a Bedrock bill from a prompt template and an expected answer length before you write a line of production code.
For a fast first pass on English text, use either of these equivalent approximations: tokens ≈ characters ÷ 4, or tokens ≈ words × 1.33. So a 500-word document is roughly 665 tokens; a 2,000-character chunk is roughly 500 tokens. These are deliberately rough — they run a little high for simple prose and a little low for code, numbers, and non-English text, which tokenize less efficiently — but they are accurate enough to choose a model and set a budget. When you need precision, count tokens exactly with the model family's own tokenizer (the Bedrock Converse API returns the actual input and output token counts in its response metadata, which is the ground truth for any real workload).
The per-request cost formula then has just two terms, one per direction:
A quick worked example to anchor the method. Suppose a support assistant on a Claude Haiku-class model sends, per request, an 800-token input (a fixed system prompt plus a short question plus a little history) and generates a 400-token answer. Per request that is 0.8 × $0.00025 + 0.4 × $0.00125 ≈ $0.0002 + $0.0005 = $0.0007, or about $0.70 per thousand conversations. Two things to notice immediately: even though the input is twice the size of the output, the output contributes more of the cost (the 5× rate gap outweighs the 2× volume gap) — and the absolute number is tiny, which is why prototypes feel free and only scale makes Bedrock a real line item.
tokens ≈ characters ÷ 4 (or words × 1.33). Cost per request = (input tokens ÷ 1K × input rate) + (output tokens ÷ 1K × output rate). Multiply by monthly requests. For exact counts, read the token usage the Converse API returns on every call. To model a full mix, see the amazon-bedrock-pricing-calculator sibling.
This is the reference table itself: representative 2026 on-demand token rates across the major model families on Bedrock — Amazon Nova, Claude, Llama, Mistral, Cohere, and Amazon Titan — shown side by side in both per-1,000 and per-1,000,000 tokens, with input and output broken out so the asymmetry is visible at a glance.
Rows run roughly cheapest to most expensive by output rate (since output usually drives cost). The per-1K and per-1M columns are the same number scaled by 1,000 — both are included because AWS and model providers quote sometimes one, sometimes the other, and having both side by side removes a constant source of confusion. The final column is the output-to-input ratio, which makes the 3–5× rule concrete and shows how much "writing" costs relative to "reading" on each model.
Read the table as a ranking and a sanity-check, not an audited price sheet. Foundation-model prices move frequently as providers compete, vary by region, and exclude the prompt-caching and Batch discounts covered in §VI. The spread is the real lesson: from Amazon Nova Micro to a Claude Opus-class model, the output rate ranges roughly $0.14 to $75 per million — a factor of more than 500×. Picking the smallest model that does the job is, by a wide margin, the biggest token-cost decision you make.
| Model | Provider | Input / 1K | Output / 1K | Input / 1M | Output / 1M | Output:input |
|---|---|---|---|---|---|---|
| Nova Micro | Amazon | $0.000035 | $0.00014 | $0.035 | $0.14 | 4× |
| Nova Lite | Amazon | $0.00006 | $0.00024 | $0.06 | $0.24 | 4× |
| Mistral Small | Mistral | $0.0002 | $0.0006 | $0.20 | $0.60 | 3× |
| Llama (small, ~8B) | Meta | $0.00022 | $0.00072 | $0.22 | $0.72 | ~3.3× |
| Titan Text Lite | Amazon | $0.00015 | $0.0002 | $0.15 | $0.20 | ~1.3× |
| Claude Haiku | Anthropic | $0.00025 | $0.00125 | $0.25 | $1.25 | 5× |
| Titan Text Express | Amazon | $0.0002 | $0.0006 | $0.20 | $0.60 | 3× |
| Cohere Command | Cohere | $0.001 | $0.002 | $1.00 | $2.00 | 2× |
| Llama (large, ~70B+) | Meta | $0.00265 | $0.0035 | $2.65 | $3.50 | ~1.3× |
| Mistral Large | Mistral | $0.002 | $0.006 | $2.00 | $6.00 | 3× |
| Nova Pro | Amazon | $0.0008 | $0.0032 | $0.80 | $3.20 | 4× |
| Nova Premier | Amazon | $0.0025 | $0.0125 | $2.50 | $12.50 | 5× |
| Claude Sonnet | Anthropic | $0.003 | $0.015 | $3.00 | $15.00 | 5× |
| Claude Opus-class | Anthropic | $0.015 | $0.075 | $15.00 | $75.00 | 5× |
Teams instinctively optimize the prompt — the thing they wrote and can see. But on a large fraction of real workloads, the output tokens are where the money actually goes, and they get less attention precisely because the model, not the engineer, produces them.
The arithmetic is straightforward once you combine the two facts from §II. Output is priced 3–5× higher per token, so even when a request generates fewer output tokens than it consumes in input, output can still be the larger cost line. Take the Haiku example from §III: 800 input tokens and 400 output tokens — twice as many input tokens — yet output costs $0.0005 against input's $0.0002, more than double. The rate gap beats the volume gap. Unless your workload is genuinely input-heavy (long retrieved context, short answers), expect output to be a big share or the majority of the bill.
This flips the usual optimization instinct. Trimming the system prompt by 100 tokens saves you 100 input tokens per call; letting the model write a 500-token answer where 150 would do costs you 350 output tokens per call at 3–5× the rate — roughly 10–17× more impact per token. The highest-leverage token optimization on a generation-heavy workload is almost always making the output shorter: cap max-output tokens, instruct the model to be concise, ask for structured/JSON output instead of prose, and avoid prompting patterns that invite preamble ("Sure! Here is a detailed explanation…") when you only need the answer.
There are important exceptions, and naming them keeps the rule honest. RAG and long-context workloads are input-dominated — when you stuff 3,000–10,000 tokens of retrieved documents into every call and get back a 300-token answer, input is the bill and retrieval tuning (fewer, better chunks) plus prompt caching are the levers. Classification and extraction are input-dominated for the same reason — lots of text in, a label or a few fields out. So the real guidance is: identify which side of each workload is the cost, then optimize that side. Generation-heavy → shrink output. Context-heavy → shrink and cache input. The mistake is optimizing the side that is not the cost.
One more reason output deserves attention: it is the harder number to predict and the easier one to let drift. Input is fixed by your template and policies; output length is a behavior of the model that can change when you swap models, tweak a prompt, or upgrade to a more verbose version. Monitoring average output tokens per request over time is one of the most useful Bedrock cost metrics, because a silent increase there — a new model that "explains its reasoning," a prompt change that loosened the format — shows up directly on the bill.
Because output is priced 3–5× input, shortening the answer usually saves more than shortening the prompt — often 10×+ more per token. On generation-heavy workloads, cap max-output tokens and demand concise/structured output first. On RAG and classification (input-dominated), do the opposite: shrink and cache the input. Note: image/video are billed per-image/second and embeddings per input token only — neither follows the input/output split above.
The per-token rates in §IV are the on-demand list price. Two mechanisms change the actual rate you pay without changing the model: prompt caching discounts repeated input tokens, and Batch discounts every token on non-interactive jobs. Both are the difference between a list-price bill and a real one.
These stack with the model-choice lever rather than replacing it, and they target different parts of the bill, so the right move depends on which side dominates (per §V). An input-heavy chatbot benefits most from prompt caching (discounting the repeated context it re-sends). A high-volume offline pipeline benefits most from Batch (halving every token on work that can wait). A single product often uses both: interactive traffic served on-demand with caching on the shared prompt, and nightly bulk jobs pushed through Batch. What neither does is change the fundamental unit — you are still paying per token; you are just paying a lower rate per token. The broader set of cost levers and the full pricing-mode comparison live on the amazon-bedrock-pricing page; this page stays on the token math itself.
Prompt caching attacks the input side specifically. When many requests share a large common prefix — a long fixed system prompt, a big instruction set, a reference document, or few-shot examples — caching lets Bedrock store that prefix so subsequent requests do not pay full input price for it again. Cached input tokens are billed at a steep discount versus normal input tokens (with a smaller one-time charge to write the cache). It changes nothing about output pricing — it is purely an input-token discount — but on workloads where the same context is re-sent thousands of times (chatbots with a long system prompt, RAG with shared instructions, agents with large tool definitions), the input portion of the bill can fall by a large fraction. It only helps when context actually repeats; a workload where every prompt is unique gets no benefit. See the dedicated amazon-bedrock-prompt-caching page for the exact mechanics and cache lifetime.
Batch attacks both sides at once but trades away latency. You submit a large set of requests as a single job (typically a file in S3) and Bedrock processes them in the background, returning results when done. In exchange for giving up real-time responses, both input and output tokens are billed at roughly half the on-demand rate. There is no per-model gymnastics — it is a flat ~50% cut on the same token pricing, applied to bulk, non-interactive work: corpus summarization, classification at scale, enrichment, offline evaluation, embedding a large dataset. For any token-heavy job that does not need an instant answer, Batch is the single easiest way to halve the per-token cost. See the amazon-bedrock-batch-inference page for the job mechanics.
Prompt caching = a steep discount on repeated input tokens (helps input-heavy, repetitive workloads). Batch = ~50% off both input and output for non-interactive jobs. They stack with model choice and with each other across different request paths — pick the one that targets your dominant cost side.
Everything above prices tokens if you pay AWS directly. For most startups and many companies the relevant number is different, because Bedrock token spend is fully credit-eligible — every input and output token, plus embeddings and the supporting services, draws down AWS credits before it touches your card.
AWS runs several credit programs specifically to put generative-AI workloads on AWS, and Bedrock token usage counts against them automatically. The relevant pools: AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups); a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed at proving out a specific GenAI use case; and the competitive Generative AI Accelerator (credit awards up to $1M for a small cohort of AI-first startups). Credits apply against your AWS bill — including every Bedrock input and output token, embeddings, fine-tuning, and the vector store and storage around them — until they run out.
The practical catch is that most of these pools are partner-filed: they are requested through the AWS Partner Network (the ACE program), not a public self-serve form. That is why teams typically route through an AWS partner rather than applying alone — and it is the gap CloudRoute fills. CloudRoute matches you to the right credit pool for your stage and to a vetted AWS DevOps/ML partner who both files the credit application and helps build the workload cost-efficiently (the tiered model routing, the prompt caching, the Batch pipelines described above). The customer pays $0 — AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice.
Put together with the token math on this page, the picture for a startup is simple: estimate your token costs so you understand the shape of the workload, build it cost-efficiently (right-sized model, capped output, caching, Batch), and then have a $25K–$100K credit pool absorb the spend entirely while you find product-market fit — paying real money only once usage, and ideally revenue, has scaled past the credits. Related: the cross-cluster pages on AWS credits for generative-AI startups and Bedrock POC funding cover the credit mechanics in full.
To make the per-token spread concrete, here is one identical illustrative workload — 50M tokens a month split 60/40 input/output (30M input, 20M output), a typical assistant-style mix — priced on six models from the §IV table. It isolates the single biggest token-cost lever: model choice. Figures are representative 2026 illustrations, not quotes.
| Model | Input cost (30M) | Output cost (20M) | Est. monthly total | Output share | vs cheapest |
|---|---|---|---|---|---|
| Nova Micro | $1.05 | $2.80 | ~$3.85 | 73% | 1× (baseline) |
| Nova Lite | $1.80 | $4.80 | ~$6.60 | 73% | ~1.7× |
| Claude Haiku | $7.50 | $25.00 | ~$32.50 | 77% | ~8× |
| Nova Pro | $24.00 | $64.00 | ~$88.00 | 73% | ~23× |
| Claude Sonnet | $90.00 | $300.00 | ~$390.00 | 77% | ~101× |
| Claude Opus-class | $450.00 | $1,500.00 | ~$1,950.00 | 77% | ~506× |
Situation: The product was output-heavy by design — users asked for full drafts, so most of every request's tokens were generation, billed at the frontier model's top output rate. With no max-output discipline and a frontier model on every call, the modeled token bill was climbing fast as usage grew, and the team had no spare runway to absorb it during the seed period.
What CloudRoute did: CloudRoute matched them in under 24 hours to an EU AWS partner with GenAI cost-engineering experience. The partner (1) moved the easy 70% of generations — outlines, short rewrites, tone tweaks — onto Nova Lite / Claude Haiku and reserved the frontier model only for full long-form drafts; (2) set sensible max-output caps and switched to structured output where the UI allowed it, cutting average output tokens per request by roughly a third; (3) cached the large shared style-and-instruction prompt; and (4) filed a Bedrock POC credit application plus an Activate application to fund the launch.
Outcome: Modeled token cost fell by roughly an order of magnitude through model-routing and output discipline before any discount — and the remaining spend was fully covered by the approved credits, so the team paid $0 during the build and early launch. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.
token cost cut: ~10× via routing + output caps · credits secured: POC + Activate · out-of-pocket during build: $0
However your token bill pencils out, AWS credits can cover every input and output token. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS partner to build and cost-tune the workload. Customer pays $0.