A complete, neutral reference for Bedrock prompt caching in 2026: what it is, how cache checkpoints work in the Converse API, which models support it, how long the cache lives, the use cases where it pays off (long system prompts, agents, document Q&A, multi-turn chat), the before/after savings math, the pitfalls that quietly kill your hit rate, and how it stacks with Batch and Provisioned Throughput. Plus how AWS credits make the whole bill $0 to build.
Prompt caching is the single highest-leverage cost optimization available on Bedrock for a large class of real workloads — and it is widely underused because teams do not realize how much of every request they are paying for twice. The idea is simple once you see where the waste is.
Every time you call a text model, you are billed for input tokens — everything you send — and output tokens — everything the model writes back. The hidden problem is that in most production systems, a large fraction of the input is identical on every single request. A customer-support assistant might prepend a 4,000-token system prompt — brand voice, policies, formatting rules, tool descriptions — to every message. A document-Q&A app re-sends the same 20-page contract with each question. An agent re-sends a fat block of tool schemas on every turn. You pay full input price for that repeated block, over and over, even though the model is processing the exact same tokens it just processed a second ago.
Prompt caching removes that waste. When you cache a prompt prefix, Bedrock stores the model's internal computed state for those tokens. On the next request that begins with the same prefix, the model skips re-processing it and reads the cached state instead. You are billed for those cached input tokens at a steep discount — representatively around 90% off the normal input rate — and because the model does not have to re-ingest the prefix, time-to-first-token drops sharply too (representatively up to ~85% lower latency on cache-heavy requests). It is one of the rare optimizations that cuts both cost and latency at the same time.
The mental model worth internalizing: prompt caching turns the static, repeated part of your prompt into a cheap, fast-to-load asset, while you still pay full price only for the dynamic part that genuinely changes per request (the user's actual question, the new turn in a conversation). The more of your prompt is fixed boilerplate, the bigger the win.
It is important to be precise about what it is not. Prompt caching is not response caching — it does not return a stored answer for a repeated question (that is a separate semantic-cache pattern you build yourself). It does not change the model's output or quality at all; the model produces exactly the same response it would without caching. And it is not automatic by default on most paths — you opt in by marking where the cache should start. Get those three things straight and the rest is mechanics.
Prompt caching charges you ~90% less for the repeated part of your prompt (a long system prompt, a shared document, tool definitions, few-shot examples) and makes those requests much faster — because Bedrock reuses the model's already-computed state for tokens it has seen before, instead of re-processing and re-billing them on every call.
Prompt caching is exposed through the Converse API (and InvokeModel) as a cache checkpoint you place in the request. Understanding the prefix model — what gets cached, what does not, and what invalidates it — is the whole game, because the difference between a 90% saving and 0% is whether your cache actually hits.
You enable caching by inserting a cache checkpoint (Bedrock represents it as a cachePoint block) into your request. Conceptually it is a marker that says: "cache everything up to here." Everything before the checkpoint becomes the cached prefix; everything after it is processed fresh on every call. So the rule for laying out a cacheable prompt is blunt: put the stable, repeated content at the very top, place the cache checkpoint right after it, and put the volatile content (the user's message) below the checkpoint.
The cache is keyed on the exact tokens and their exact order from the start of the prompt up to the checkpoint. This is a prefix match, not a fuzzy match — the cached block has to be byte-for-byte identical, starting from the very first token. If you change a single character anywhere inside the cached region — a date in the system prompt, a reordered tool, a whitespace tweak — the prefix no longer matches and you get a cache miss: the request is processed (and billed) at the full input rate, and a fresh cache entry is written.
The first call that establishes a cache entry is a cache write. Writing the cache costs slightly more than a normal input token (representatively a modest premium over the base input rate) because Bedrock has to compute and store the state. Every subsequent call that reuses it is a cache read, billed at the deep discount. This is why caching only pays off when a prefix is reused enough times to amortize the one write — cache something used once and you have spent slightly more, not less.
You can place more than one checkpoint to cache nested layers of context — for example, a stable system prompt as one cached block and a per-session document as a second cached block beneath it — so that changing the session document invalidates only the second block while the system prompt stays cached. Bedrock's API response reports cache usage (tokens read from cache vs. written vs. processed fresh), which is the telemetry you watch to confirm your hit rate is what you think it is.
On a multi-turn conversation, the pattern compounds nicely: you keep the system prompt and the conversation history before the checkpoint and the newest user turn after it. As the conversation grows, more of it falls inside the cached prefix, so each new turn re-pays full price only for the latest exchange rather than the entire transcript. (Exact checkpoint placement and the number of checkpoints supported vary by model — confirm against current AWS docs.)
In a Converse call you build a list of content blocks. To cache, you order them as: (1) the system prompt / instructions, (2) any large shared context (documents, tool definitions, few-shot examples), (3) a cachePoint checkpoint block, then (4) the dynamic user message. On the first request Bedrock writes the cache for blocks 1–2; on later requests with the identical 1–2, it reads them from cache and only processes block 4 fresh.
Nothing else about your integration changes — same model ID, same Converse call, same response parsing. Prompt caching is close to a one-field change for the common case, which is part of why it is such a high-ROI lever: minimal engineering, large recurring saving.
A cache entry is not permanent. It lives for a short rolling time-to-live that is refreshed on each hit — representatively on the order of a few minutes of inactivity before it is evicted (commonly cited around five minutes, though the exact TTL and any longer-lived options vary by model and can change; confirm against current AWS documentation). Crucially, each cache read resets the TTL, so a prefix that is hit continuously stays warm indefinitely; a prefix that goes idle past the window is evicted, and the next request pays a fresh write.
The practical implication is about traffic density. Caching shines when requests sharing a prefix arrive close together — a busy chatbot, a burst of questions against the same document, an agent loop firing many turns in seconds. For sparse, low-frequency traffic where each request is minutes apart, the entry may expire between calls and you pay the write repeatedly with few reads to amortize it. Caching is a throughput optimization first: the higher and more clustered your reuse, the larger the realized saving.
| Token treatment | When it happens | Relative cost vs. normal input | Effect on latency |
|---|---|---|---|
| Uncached input | No caching, or content after the checkpoint | Baseline (1×) | Full processing |
| Cache write | First request establishing the prefix | Small premium over baseline (~1.25×) | Full processing + store |
| Cache read | Later request reusing the identical prefix | Deep discount (~0.1× — about 90% off) | Prefix skipped — much faster |
Prompt caching is a per-model capability, not a blanket Bedrock feature — a model has to expose it. As of 2026 it is available on a growing set of the high-traffic text models, which are exactly the ones where caching matters most.
Support has expanded over time and continues to. As of 2026, prompt caching is available on flagship Anthropic Claude models on Bedrock (the Sonnet and Haiku tiers most teams run in production) and on Amazon Nova text models, with coverage broadening across providers. Because availability and the exact terms (minimum cacheable token counts, number of checkpoints, TTL behavior) differ by model and evolve, treat any specific list as point-in-time and confirm support for your chosen model in the current AWS Bedrock documentation before you design around it.
A practical nuance: most models enforce a minimum cacheable prefix length — caching a 50-token snippet is not worth it and may not be eligible. The feature is built for substantial repeated context (hundreds to thousands of tokens — long system prompts, real documents, big tool schemas), which is exactly the situation where the saving is large. If your repeated prefix is tiny, caching is the wrong tool; if it is a 3,000-token system prompt hit thousands of times a day, it is close to free money.
Caching is also one input among several when you pick a model. The decision is rarely "the model with caching" in isolation — it is "the cheapest model that meets the quality bar, then turn on caching to cut its repeated-context cost further." See the amazon-bedrock-pricing sibling for the full per-model price table and claude-on-amazon-bedrock for the Claude-specific details.
Prompt-caching support, minimum prefix length, checkpoint count, and TTL all vary by model and change as AWS ships updates. Confirm the specifics for your exact model in the current AWS Bedrock docs — this page gives the durable mechanics and representative economics, not a frozen capability matrix.
Prompt caching is not a universal win — it pays off precisely when a large prompt prefix is reused across many requests in a short window. Here are the patterns where it is almost always worth turning on, and why each one fits the prefix-reuse shape.
The unifying test: "Is a large chunk of my prompt identical across many requests that arrive close together?" If yes, cache it. If every request is unique from the first token — one-off prompts, constantly-changing context, sparse traffic — caching has little to grab onto and you should reach for other levers (model right-sizing, Batch, shorter prompts) instead.
Percentages are abstract; dollars are not. Here is a concrete, representative before/after for a common workload, so you can see the shape of the saving and reproduce the calculation for your own numbers.
The workload. A customer-support assistant on a Claude Sonnet-class model. It serves 200,000 requests/month. Every request carries a 4,000-token fixed system prompt (policies, tone, tool descriptions) plus an average 500-token user message, and produces a 400-token answer. So per request: 4,500 input tokens, 400 output tokens.
Before caching (all on-demand). Monthly input = 200,000 × 4,500 = 900M input tokens; output = 200,000 × 400 = 80M output tokens. At Sonnet's representative rates of $3.00 / 1M input and $15.00 / 1M output: input ≈ $2,700, output ≈ $1,200 → ≈ $3,900/month. Notice the input cost dominates, and the 4,000-token system prompt is 89% of every input — re-billed 200,000 times.
After caching the system prompt. The 4,000-token system prompt (800M tokens/month of the input) moves to cache reads at a representative ~90% discount → roughly $0.30 / 1M instead of $3.00 / 1M, costing about $240 instead of $2,400. The remaining input — the 500-token user messages, 100M tokens/month — is still full price at ≈ $300. Add a negligible handful of cache writes (one write per warm prefix, amortized across the month). Output is unchanged at $1,200. New total ≈ $240 + $300 + ~$5 writes + $1,200 → ≈ $1,745/month.
The result. The bill falls from ≈ $3,900 to ≈ $1,745/month — about a 55% total reduction — from a one-field change, with no quality impact and faster responses as a bonus. And look closer at just the input line: it dropped from ~$2,700 to ~$540, an ~80% cut on input. When the fixed prefix is an even larger share of the prompt (a long document, a big toolset) the input cut pushes toward the headline ~90%; the blended total saving depends on how output-heavy the workload is.
Two lessons fall out of the arithmetic. First, caching attacks input, not output — so it helps most on read-heavy, prefix-heavy workloads (assistants, RAG, agents, document Q&A) and barely at all on workloads that send a tiny prompt and generate a lot. Second, the saving scales with how much of your prompt is repeated and how often: the bigger and more-reused the prefix, the closer you get to cutting your input bill by ~90%.
| Line item | Tokens / month | Before (on-demand) | After (cached prefix) | Change |
|---|---|---|---|---|
| System prompt (4K, fixed) | 800M input | ~$2,400 | ~$240 (cache reads) | −90% |
| User messages (500 avg) | 100M input | ~$300 | ~$300 | unchanged |
| Cache writes | ~one per warm prefix | — | ~$5 | new (tiny) |
| Output (400 avg) | 80M output | ~$1,200 | ~$1,200 | unchanged |
| Total | — | ~$3,900 | ~$1,745 | ~−55% |
The gap between the savings on the page and the savings you actually realize is almost always a hit-rate problem. Prompt caching is a prefix match on exact tokens, and a surprising number of everyday coding habits silently break the prefix. Here are the failure modes that matter, in rough order of how often they bite.
Everything in the cached region must be byte-identical and static from the first token to the checkpoint, and the prefix must be reused often enough, close enough together to amortize the one-time write. Break either condition and the saving evaporates — usually silently.
Prompt caching is one of four Bedrock pricing modes, and the modes are not mutually exclusive. Knowing how caching composes with the others lets you cost-tune each path of a product independently rather than picking one global setting.
Recall the four ways to pay on Bedrock: On-Demand (per token, no commitment), Batch (~50% cheaper, asynchronous), Provisioned Throughput (reserved capacity at a flat hourly rate), and prompt caching (a discount on repeated input). Caching is best understood as a modifier on a real-time path rather than a standalone mode — it layers on top of On-Demand to discount the repeated prefix while you still pay normally for everything dynamic.
Caching + On-Demand is the everyday combination: interactive traffic served on-demand, with a long shared system prompt or document cached so each request only pays full price for its dynamic part. This is where most teams start and where most of the saving lives.
Caching vs. Batch is mostly an either/or by traffic shape, not a stack. Batch is for non-interactive, high-volume jobs (bulk classification, embeddings backfill, dataset enrichment, offline evals) and already gives ~50% off the on-demand rate asynchronously. Caching is for interactive, latency-sensitive traffic where a prefix repeats. You generally route a workload to one or the other: real-time chat/agents → on-demand + caching; offline bulk → Batch. Both are ways to stop overpaying, applied to different traffic.
Caching + Provisioned Throughput can complement each other on a high-volume hot path: Provisioned reserves dedicated capacity for guaranteed throughput and latency, and caching reduces how much work each request does within that capacity (so reserved units stretch further). Whether the combination is worthwhile depends on volume and your model's terms — but conceptually they target different things (capacity vs. repeated-token cost) and do not conflict.
The right mental model for a real product: route each path to its best mode. Serve interactive traffic On-Demand with prompt caching on the fixed context; run nightly enrichment and backfills via Batch; reserve Provisioned Throughput only for an always-hot path or a custom model. Caching is the lever you reach for first on anything interactive with a repeated prefix because it is nearly free to add and cuts both cost and latency. See amazon-bedrock-pricing for all four modes side by side and amazon-bedrock-batch-inference for the Batch path in depth.
| Lever | What it cuts | Traffic it fits | Stacks with caching? |
|---|---|---|---|
| Prompt caching | Repeated-input cost + latency | Interactive, repeated prefix | — (this is it) |
| On-Demand | Nothing (baseline) | Variable / interactive | Yes — the default pairing |
| Batch | ~50% off, async | Bulk, non-interactive | No — different traffic, pick one |
| Provisioned Throughput | Caps cost at high steady volume | Steady high volume; custom models | Yes — complementary (capacity vs. tokens) |
| Model right-sizing | Per-token rate (pick cheaper model) | Any | Yes — combine for compounding savings |
Everything above is about shrinking a Bedrock bill you pay AWS directly. For most startups and many companies the more relevant move is to not pay it at all during the build — because AWS will frequently fund generative-AI workloads with credits, and Bedrock spend (cached or not) draws those credits down before it touches your card.
AWS runs several credit programs specifically to put GenAI workloads on AWS, and Bedrock usage — inference, fine-tuning, embeddings, and the supporting services — is fully credit-eligible. The relevant pools: AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups); a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed at proving out a GenAI use case; and the competitive Generative AI Accelerator (credit awards up to $1M for a small cohort of AI-first startups). Credits apply automatically against your AWS bill until exhausted.
The practical mechanic is that most of these pools are partner-filed — requested through the AWS Partner Network (the ACE program), not a public self-serve form. That is why teams route through an AWS partner rather than applying alone, and it is the gap CloudRoute fills. CloudRoute matches you to the right credit pool for your stage and to a vetted AWS DevOps/ML partner who both files the credit application and helps build the workload — including the FinOps work that makes prompt caching, model routing, and Batch actually land. The customer pays $0: AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice.
There is a neat synergy between caching and credits worth naming. A pool of credits is a fixed budget; prompt caching (and the other levers) determine how long that budget lasts. A team that turns on caching, right-sizes its models, and batches its bulk work can stretch a $25K–$100K credit pool across far more experimentation and far more launch traffic than a team paying full on-demand for repeated context. Cost optimization stops being "protect the runway" and becomes "make the credits go further" — which changes how aggressively you can build. Related: see the cross-cluster pages on AWS credits for generative-AI startups and Bedrock POC funding for the full credit mechanics.
To make the lever concrete, here is the representative monthly cost of four typical workloads computed both ways — without caching (all input at full on-demand price) and with the repeated prefix cached at a ~90% read discount. The realized saving tracks how much of each prompt is fixed, repeated context. Figures are representative 2026 illustrations, not quotes.
| Scenario | Fixed prefix | Repeated share of input | Without caching | With caching | Input saving |
|---|---|---|---|---|---|
| Support chatbot (Sonnet) | 4K-token system prompt | High (~85%) | ~$3,900/mo | ~$1,745/mo | ~80% |
| Document Q&A (Sonnet) | 20-page doc per session | Very high (~90%) | ~$2,000/mo | ~$650/mo | ~85% |
| Agent loop (Sonnet) | Large tool schemas + sys prompt | High (~80%) | ~$5,000/mo | ~$2,300/mo | ~78% |
| Few-shot classifier (Haiku) | Fixed example set | Very high (~92%) | ~$220/mo | ~$70/mo | ~88% |
| One-off generation (any) | None — unique each call | None | ~$X/mo | ~$X/mo | ~0% (caching N/A) |
Situation: Their product wrapped every request in a ~6,000-token system prompt — domain rules, formatting, a large tool catalog — re-sent on every call to a Sonnet-class model, on-demand, across a busy multi-turn assistant. The bill was ~$3.8K/month and climbing with usage, and roughly 85% of every input was the same fixed prefix being re-billed thousands of times a day. They wanted to both cut the number and avoid paying it out of a runway earmarked for hiring.
What CloudRoute did: CloudRoute matched them in under 24 hours to a German AWS partner with GenAI cost-engineering experience. The partner (1) restructured the prompt so the static system prompt and tool catalog sat above a cache checkpoint and only the user turn below it; (2) moved a per-request timestamp that had been silently zeroing the hit rate to below the checkpoint; (3) added cache-usage logging so the team could watch its realized hit rate; and (4) filed a Bedrock POC credit application plus an Activate Portfolio application to fund the launch.
Outcome: Realized cache hit rate climbed above 90% on warm traffic; modeled inference cost fell from ~$3.8K to ~$1.15K/month (about a 70% cut), with time-to-first-token down sharply as a bonus — and even that reduced bill was fully covered by the approved credits, so the team paid $0 during the build and early launch. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.
cache hit rate: >90% warm · cost cut: ~$3.8K → ~$1.15K/mo · credits secured: POC + Activate · out-of-pocket: $0
Prompt caching can cut your repeated-input cost up to ~90%. AWS credits can cover what is left. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS partner who builds the caching, model routing, and FinOps. Customer pays $0.