amazon bedrock prompt caching · the cost lever · 2026

Amazon Bedrock prompt caching — cut input cost up to 90%.

A complete, neutral reference for Bedrock prompt caching in 2026: what it is, how cache checkpoints work in the Converse API, which models support it, how long the cache lives, the use cases where it pays off (long system prompts, agents, document Q&A, multi-turn chat), the before/after savings math, the pitfalls that quietly kill your hit rate, and how it stacks with Batch and Provisioned Throughput. Plus how AWS credits make the whole bill $0 to build.

cached-input discount
up to ~90%
latency cut
up to ~85%
code change
~1 field
cost with credits
$0
TL;DR
  • Prompt caching lets Bedrock store the model's internal representation of a repeated prompt prefix — a long system prompt, a shared reference document, big tool definitions, or few-shot examples — so the next request that reuses that exact prefix is not re-charged at the full input rate. On the right workload the cached portion is billed at roughly a 90% discount and time-to-first-token drops dramatically.
  • It is opt-in and prefix-based. You mark a cache checkpoint in the Converse API; everything before the checkpoint is cached, everything after is processed fresh. The cache is keyed on the exact tokens and their order, so the cached block must come first and be byte-identical across requests — change one character near the top and you get a cache miss and pay full price.
  • Caching stacks with the other Bedrock cost levers: it layers on top of On-Demand to discount repeated context, and pairs naturally with model right-sizing. A prototype that uses it costs cents; production at scale still runs to real money — which is exactly what AWS credits cover. CloudRoute routes you to the credit pool (Activate up to $100K, Bedrock/GenAI POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS partner who builds and cost-tunes the workload, so you pay $0.
the concept

IWhat Amazon Bedrock prompt caching actually is

Prompt caching is the single highest-leverage cost optimization available on Bedrock for a large class of real workloads — and it is widely underused because teams do not realize how much of every request they are paying for twice. The idea is simple once you see where the waste is.

Every time you call a text model, you are billed for input tokens — everything you send — and output tokens — everything the model writes back. The hidden problem is that in most production systems, a large fraction of the input is identical on every single request. A customer-support assistant might prepend a 4,000-token system prompt — brand voice, policies, formatting rules, tool descriptions — to every message. A document-Q&A app re-sends the same 20-page contract with each question. An agent re-sends a fat block of tool schemas on every turn. You pay full input price for that repeated block, over and over, even though the model is processing the exact same tokens it just processed a second ago.

Prompt caching removes that waste. When you cache a prompt prefix, Bedrock stores the model's internal computed state for those tokens. On the next request that begins with the same prefix, the model skips re-processing it and reads the cached state instead. You are billed for those cached input tokens at a steep discount — representatively around 90% off the normal input rate — and because the model does not have to re-ingest the prefix, time-to-first-token drops sharply too (representatively up to ~85% lower latency on cache-heavy requests). It is one of the rare optimizations that cuts both cost and latency at the same time.

The mental model worth internalizing: prompt caching turns the static, repeated part of your prompt into a cheap, fast-to-load asset, while you still pay full price only for the dynamic part that genuinely changes per request (the user's actual question, the new turn in a conversation). The more of your prompt is fixed boilerplate, the bigger the win.

It is important to be precise about what it is not. Prompt caching is not response caching — it does not return a stored answer for a repeated question (that is a separate semantic-cache pattern you build yourself). It does not change the model's output or quality at all; the model produces exactly the same response it would without caching. And it is not automatic by default on most paths — you opt in by marking where the cache should start. Get those three things straight and the rest is mechanics.

the one-sentence version

Prompt caching charges you ~90% less for the repeated part of your prompt (a long system prompt, a shared document, tool definitions, few-shot examples) and makes those requests much faster — because Bedrock reuses the model's already-computed state for tokens it has seen before, instead of re-processing and re-billing them on every call.

the mechanics

IIHow it works — cache checkpoints in the Converse API

Prompt caching is exposed through the Converse API (and InvokeModel) as a cache checkpoint you place in the request. Understanding the prefix model — what gets cached, what does not, and what invalidates it — is the whole game, because the difference between a 90% saving and 0% is whether your cache actually hits.

You enable caching by inserting a cache checkpoint (Bedrock represents it as a cachePoint block) into your request. Conceptually it is a marker that says: "cache everything up to here." Everything before the checkpoint becomes the cached prefix; everything after it is processed fresh on every call. So the rule for laying out a cacheable prompt is blunt: put the stable, repeated content at the very top, place the cache checkpoint right after it, and put the volatile content (the user's message) below the checkpoint.

The cache is keyed on the exact tokens and their exact order from the start of the prompt up to the checkpoint. This is a prefix match, not a fuzzy match — the cached block has to be byte-for-byte identical, starting from the very first token. If you change a single character anywhere inside the cached region — a date in the system prompt, a reordered tool, a whitespace tweak — the prefix no longer matches and you get a cache miss: the request is processed (and billed) at the full input rate, and a fresh cache entry is written.

The first call that establishes a cache entry is a cache write. Writing the cache costs slightly more than a normal input token (representatively a modest premium over the base input rate) because Bedrock has to compute and store the state. Every subsequent call that reuses it is a cache read, billed at the deep discount. This is why caching only pays off when a prefix is reused enough times to amortize the one write — cache something used once and you have spent slightly more, not less.

You can place more than one checkpoint to cache nested layers of context — for example, a stable system prompt as one cached block and a per-session document as a second cached block beneath it — so that changing the session document invalidates only the second block while the system prompt stays cached. Bedrock's API response reports cache usage (tokens read from cache vs. written vs. processed fresh), which is the telemetry you watch to confirm your hit rate is what you think it is.

On a multi-turn conversation, the pattern compounds nicely: you keep the system prompt and the conversation history before the checkpoint and the newest user turn after it. As the conversation grows, more of it falls inside the cached prefix, so each new turn re-pays full price only for the latest exchange rather than the entire transcript. (Exact checkpoint placement and the number of checkpoints supported vary by model — confirm against current AWS docs.)

The request shape, in plain terms

In a Converse call you build a list of content blocks. To cache, you order them as: (1) the system prompt / instructions, (2) any large shared context (documents, tool definitions, few-shot examples), (3) a cachePoint checkpoint block, then (4) the dynamic user message. On the first request Bedrock writes the cache for blocks 1–2; on later requests with the identical 1–2, it reads them from cache and only processes block 4 fresh.

Nothing else about your integration changes — same model ID, same Converse call, same response parsing. Prompt caching is close to a one-field change for the common case, which is part of why it is such a high-ROI lever: minimal engineering, large recurring saving.

Cache lifetime / TTL

A cache entry is not permanent. It lives for a short rolling time-to-live that is refreshed on each hit — representatively on the order of a few minutes of inactivity before it is evicted (commonly cited around five minutes, though the exact TTL and any longer-lived options vary by model and can change; confirm against current AWS documentation). Crucially, each cache read resets the TTL, so a prefix that is hit continuously stays warm indefinitely; a prefix that goes idle past the window is evicted, and the next request pays a fresh write.

The practical implication is about traffic density. Caching shines when requests sharing a prefix arrive close together — a busy chatbot, a burst of questions against the same document, an agent loop firing many turns in seconds. For sparse, low-frequency traffic where each request is minutes apart, the entry may expire between calls and you pay the write repeatedly with few reads to amortize it. Caching is a throughput optimization first: the higher and more clustered your reuse, the larger the realized saving.

cache write vs. cache read vs. uncached — representative per-token economics · 2026
Token treatmentWhen it happensRelative cost vs. normal inputEffect on latency
Uncached inputNo caching, or content after the checkpointBaseline (1×)Full processing
Cache writeFirst request establishing the prefixSmall premium over baseline (~1.25×)Full processing + store
Cache readLater request reusing the identical prefixDeep discount (~0.1× — about 90% off)Prefix skipped — much faster
Representative 2026 figures for relative comparison only — confirm current multipliers on the AWS Bedrock pricing page. The economics turn positive once a written prefix is read enough times to outweigh the single write premium; on high-reuse workloads that is immediate.
where it works

IIIWhich models support prompt caching

Prompt caching is a per-model capability, not a blanket Bedrock feature — a model has to expose it. As of 2026 it is available on a growing set of the high-traffic text models, which are exactly the ones where caching matters most.

Support has expanded over time and continues to. As of 2026, prompt caching is available on flagship Anthropic Claude models on Bedrock (the Sonnet and Haiku tiers most teams run in production) and on Amazon Nova text models, with coverage broadening across providers. Because availability and the exact terms (minimum cacheable token counts, number of checkpoints, TTL behavior) differ by model and evolve, treat any specific list as point-in-time and confirm support for your chosen model in the current AWS Bedrock documentation before you design around it.

A practical nuance: most models enforce a minimum cacheable prefix length — caching a 50-token snippet is not worth it and may not be eligible. The feature is built for substantial repeated context (hundreds to thousands of tokens — long system prompts, real documents, big tool schemas), which is exactly the situation where the saving is large. If your repeated prefix is tiny, caching is the wrong tool; if it is a 3,000-token system prompt hit thousands of times a day, it is close to free money.

Caching is also one input among several when you pick a model. The decision is rarely "the model with caching" in isolation — it is "the cheapest model that meets the quality bar, then turn on caching to cut its repeated-context cost further." See the amazon-bedrock-pricing sibling for the full per-model price table and claude-on-amazon-bedrock for the Claude-specific details.

check before you build

Prompt-caching support, minimum prefix length, checkpoint count, and TTL all vary by model and change as AWS ships updates. Confirm the specifics for your exact model in the current AWS Bedrock docs — this page gives the durable mechanics and representative economics, not a frozen capability matrix.

when it pays off

IVThe use cases where prompt caching pays off

Prompt caching is not a universal win — it pays off precisely when a large prompt prefix is reused across many requests in a short window. Here are the patterns where it is almost always worth turning on, and why each one fits the prefix-reuse shape.

The unifying test: "Is a large chunk of my prompt identical across many requests that arrive close together?" If yes, cache it. If every request is unique from the first token — one-off prompts, constantly-changing context, sparse traffic — caching has little to grab onto and you should reach for other levers (model right-sizing, Batch, shorter prompts) instead.

  • Long system prompts — Any assistant with a heavy fixed system prompt — brand voice, policies, formatting rules, safety instructions — that is identical on every request. This is the textbook case: a 3,000–5,000 token system prompt cached once and read on every subsequent call cuts the dominant fixed cost to near zero. The longer and more reused the system prompt, the larger the saving.
  • Agents with large tool definitions — Agentic workloads re-send a big block of tool/function schemas on every turn of the loop, and an agent can fire many turns to complete one task. Caching the tool definitions (and the system prompt) means each turn re-pays full price only for the new reasoning step, not the entire toolset — a large saving on multi-turn agent runs, with the latency cut making the agent feel faster too.
  • Document Q&A — Apps where a user asks multiple questions against the same large document, contract, codebase, or knowledge file. The document is the expensive part of the input and it is identical across that session's questions. Cache the document once, then each question only pays fresh for the (short) question and answer. The more questions per document, the better the amortization.
  • Multi-turn chat — In a sustained conversation, the system prompt plus the growing transcript form a stable prefix that recurs on every turn. Keeping them before the checkpoint and the newest user message after it means each turn re-pays full input price only for the latest exchange instead of re-billing the whole history — the saving grows as the conversation lengthens.
  • Few-shot / in-context examples — Workloads that prepend a fixed set of high-quality few-shot examples to steer the model (classification, extraction, structured output). Those examples are pure repeated boilerplate; cache them and pay full price only for the new item being processed. Pairs especially well with high-volume classification.
  • RAG with stable shared context — Retrieval-augmented generation where a common instruction block or a frequently-retrieved "hot" context is reused across many queries. Caching the shared portion cuts the input bill that retrieved context otherwise dominates. (Note: per-query retrieved chunks that differ every time are not cacheable — only the genuinely shared prefix is.)
the numbers

VThe savings math — a worked before/after example

Percentages are abstract; dollars are not. Here is a concrete, representative before/after for a common workload, so you can see the shape of the saving and reproduce the calculation for your own numbers.

The workload. A customer-support assistant on a Claude Sonnet-class model. It serves 200,000 requests/month. Every request carries a 4,000-token fixed system prompt (policies, tone, tool descriptions) plus an average 500-token user message, and produces a 400-token answer. So per request: 4,500 input tokens, 400 output tokens.

Before caching (all on-demand). Monthly input = 200,000 × 4,500 = 900M input tokens; output = 200,000 × 400 = 80M output tokens. At Sonnet's representative rates of $3.00 / 1M input and $15.00 / 1M output: input ≈ $2,700, output ≈ $1,200≈ $3,900/month. Notice the input cost dominates, and the 4,000-token system prompt is 89% of every input — re-billed 200,000 times.

After caching the system prompt. The 4,000-token system prompt (800M tokens/month of the input) moves to cache reads at a representative ~90% discount → roughly $0.30 / 1M instead of $3.00 / 1M, costing about $240 instead of $2,400. The remaining input — the 500-token user messages, 100M tokens/month — is still full price at ≈ $300. Add a negligible handful of cache writes (one write per warm prefix, amortized across the month). Output is unchanged at $1,200. New total ≈ $240 + $300 + ~$5 writes + $1,200 → ≈ $1,745/month.

The result. The bill falls from ≈ $3,900 to ≈ $1,745/month — about a 55% total reduction — from a one-field change, with no quality impact and faster responses as a bonus. And look closer at just the input line: it dropped from ~$2,700 to ~$540, an ~80% cut on input. When the fixed prefix is an even larger share of the prompt (a long document, a big toolset) the input cut pushes toward the headline ~90%; the blended total saving depends on how output-heavy the workload is.

Two lessons fall out of the arithmetic. First, caching attacks input, not output — so it helps most on read-heavy, prefix-heavy workloads (assistants, RAG, agents, document Q&A) and barely at all on workloads that send a tiny prompt and generate a lot. Second, the saving scales with how much of your prompt is repeated and how often: the bigger and more-reused the prefix, the closer you get to cutting your input bill by ~90%.

support assistant — before vs. after prompt caching · representative 2026 monthly cost
Line itemTokens / monthBefore (on-demand)After (cached prefix)Change
System prompt (4K, fixed)800M input~$2,400~$240 (cache reads)−90%
User messages (500 avg)100M input~$300~$300unchanged
Cache writes~one per warm prefix~$5new (tiny)
Output (400 avg)80M output~$1,200~$1,200unchanged
Total~$3,900~$1,745~−55%
Representative 2026 illustration using Sonnet-class rates ($3 / $15 per 1M) and a ~90% cached-read discount — confirm current rates on the AWS Bedrock pricing page. Input-only saving is ~80% here and approaches ~90% as the fixed prefix grows relative to the dynamic message.
what goes wrong

VIPitfalls — why your cache misses (and how to fix it)

The gap between the savings on the page and the savings you actually realize is almost always a hit-rate problem. Prompt caching is a prefix match on exact tokens, and a surprising number of everyday coding habits silently break the prefix. Here are the failure modes that matter, in rough order of how often they bite.

  • Dynamic content near the top (the #1 killer) — A timestamp, a per-request ID, the user's name, today's date, or a random greeting inserted at the start of the system prompt changes the prefix on every request — so it never matches and you cache-miss 100% of the time while paying the write premium. Fix: move every dynamic value below the cache checkpoint. Cached region = strictly static.
  • Wrong ordering — Caching is prefix-based: only the contiguous block from the very first token to the checkpoint is cached. If you put the volatile user message before the stable system prompt, nothing reusable is at the front and caching does nothing. Fix: stable content first, checkpoint, then dynamic content — always.
  • Non-deterministic serialization — If you build the prompt by serializing objects (JSON tool definitions, a dict of settings) and the key order or whitespace is not stable across runs, the bytes differ and the prefix breaks even though the content is "the same." Fix: serialize the cached region deterministically — fixed key order, fixed formatting.
  • Letting the cache go cold — The TTL is short (representatively a few minutes, refreshed on each hit). Sparse traffic lets the entry expire between requests, so you keep paying writes with few reads to amortize them. Fix: caching is for clustered, high-frequency reuse; for sparse traffic, either accept it is not worth it or batch the work so reuse is dense.
  • Caching a prefix that is too short — Most models enforce a minimum cacheable length; below it, caching is ineligible or simply not worth the write premium. Fix: only cache substantial prefixes (hundreds-to-thousands of tokens). Tiny boilerplate is not worth caching.
  • Caching something used once — A cache write costs slightly more than a normal input token. If a prefix is read zero or one more times, you spent more, not less. Fix: cache only what is genuinely reused many times; one-off prompts should not be cached.
  • Not measuring the hit rate — Teams turn caching on, assume it is working, and never check. The Converse response reports cache-read vs. cache-write vs. fresh tokens. Fix: log those, compute your realized hit rate, and alert if it drops — a deploy that tweaks the system prompt can silently zero out your savings.
the golden rule

Everything in the cached region must be byte-identical and static from the first token to the checkpoint, and the prefix must be reused often enough, close enough together to amortize the one-time write. Break either condition and the saving evaporates — usually silently.

combining levers

VIIHow caching stacks with Batch and Provisioned Throughput

Prompt caching is one of four Bedrock pricing modes, and the modes are not mutually exclusive. Knowing how caching composes with the others lets you cost-tune each path of a product independently rather than picking one global setting.

Recall the four ways to pay on Bedrock: On-Demand (per token, no commitment), Batch (~50% cheaper, asynchronous), Provisioned Throughput (reserved capacity at a flat hourly rate), and prompt caching (a discount on repeated input). Caching is best understood as a modifier on a real-time path rather than a standalone mode — it layers on top of On-Demand to discount the repeated prefix while you still pay normally for everything dynamic.

Caching + On-Demand is the everyday combination: interactive traffic served on-demand, with a long shared system prompt or document cached so each request only pays full price for its dynamic part. This is where most teams start and where most of the saving lives.

Caching vs. Batch is mostly an either/or by traffic shape, not a stack. Batch is for non-interactive, high-volume jobs (bulk classification, embeddings backfill, dataset enrichment, offline evals) and already gives ~50% off the on-demand rate asynchronously. Caching is for interactive, latency-sensitive traffic where a prefix repeats. You generally route a workload to one or the other: real-time chat/agents → on-demand + caching; offline bulk → Batch. Both are ways to stop overpaying, applied to different traffic.

Caching + Provisioned Throughput can complement each other on a high-volume hot path: Provisioned reserves dedicated capacity for guaranteed throughput and latency, and caching reduces how much work each request does within that capacity (so reserved units stretch further). Whether the combination is worthwhile depends on volume and your model's terms — but conceptually they target different things (capacity vs. repeated-token cost) and do not conflict.

The right mental model for a real product: route each path to its best mode. Serve interactive traffic On-Demand with prompt caching on the fixed context; run nightly enrichment and backfills via Batch; reserve Provisioned Throughput only for an always-hot path or a custom model. Caching is the lever you reach for first on anything interactive with a repeated prefix because it is nearly free to add and cuts both cost and latency. See amazon-bedrock-pricing for all four modes side by side and amazon-bedrock-batch-inference for the Batch path in depth.

how prompt caching relates to the other bedrock pricing modes · 2026
LeverWhat it cutsTraffic it fitsStacks with caching?
Prompt cachingRepeated-input cost + latencyInteractive, repeated prefix— (this is it)
On-DemandNothing (baseline)Variable / interactiveYes — the default pairing
Batch~50% off, asyncBulk, non-interactiveNo — different traffic, pick one
Provisioned ThroughputCaps cost at high steady volumeSteady high volume; custom modelsYes — complementary (capacity vs. tokens)
Model right-sizingPer-token rate (pick cheaper model)AnyYes — combine for compounding savings
Caching layers onto a real-time path; Batch is the alternative for offline bulk. Combine caching with model right-sizing for the largest compounding effect on an interactive bill.
how it becomes $0

VIIIHow AWS credits make the whole bill $0 to build

Everything above is about shrinking a Bedrock bill you pay AWS directly. For most startups and many companies the more relevant move is to not pay it at all during the build — because AWS will frequently fund generative-AI workloads with credits, and Bedrock spend (cached or not) draws those credits down before it touches your card.

AWS runs several credit programs specifically to put GenAI workloads on AWS, and Bedrock usage — inference, fine-tuning, embeddings, and the supporting services — is fully credit-eligible. The relevant pools: AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups); a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed at proving out a GenAI use case; and the competitive Generative AI Accelerator (credit awards up to $1M for a small cohort of AI-first startups). Credits apply automatically against your AWS bill until exhausted.

The practical mechanic is that most of these pools are partner-filed — requested through the AWS Partner Network (the ACE program), not a public self-serve form. That is why teams route through an AWS partner rather than applying alone, and it is the gap CloudRoute fills. CloudRoute matches you to the right credit pool for your stage and to a vetted AWS DevOps/ML partner who both files the credit application and helps build the workload — including the FinOps work that makes prompt caching, model routing, and Batch actually land. The customer pays $0: AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice.

There is a neat synergy between caching and credits worth naming. A pool of credits is a fixed budget; prompt caching (and the other levers) determine how long that budget lasts. A team that turns on caching, right-sizes its models, and batches its bulk work can stretch a $25K–$100K credit pool across far more experimentation and far more launch traffic than a team paying full on-demand for repeated context. Cost optimization stops being "protect the runway" and becomes "make the credits go further" — which changes how aggressively you can build. Related: see the cross-cluster pages on AWS credits for generative-AI startups and Bedrock POC funding for the full credit mechanics.

cost with vs. without caching

With caching vs. without — across four common scenarios

To make the lever concrete, here is the representative monthly cost of four typical workloads computed both ways — without caching (all input at full on-demand price) and with the repeated prefix cached at a ~90% read discount. The realized saving tracks how much of each prompt is fixed, repeated context. Figures are representative 2026 illustrations, not quotes.

ScenarioFixed prefixRepeated share of inputWithout cachingWith cachingInput saving
Support chatbot (Sonnet)4K-token system promptHigh (~85%)~$3,900/mo~$1,745/mo~80%
Document Q&A (Sonnet)20-page doc per sessionVery high (~90%)~$2,000/mo~$650/mo~85%
Agent loop (Sonnet)Large tool schemas + sys promptHigh (~80%)~$5,000/mo~$2,300/mo~78%
Few-shot classifier (Haiku)Fixed example setVery high (~92%)~$220/mo~$70/mo~88%
One-off generation (any)None — unique each callNone~$X/mo~$X/mo~0% (caching N/A)
Caching helps in direct proportion to the repeated share of the input; it does nothing for prompts that are unique on every call. All figures representative 2026 illustrations using a ~90% cached-read discount — confirm current rates on the AWS Bedrock pricing page. See amazon-bedrock-pricing-calculator to model your own mix.
before you optimize a single token
Get AWS credits that cover Bedrock — and a partner to build the caching, routing, and FinOps (you pay $0)
Get matched in 24h →
a recent match

A $3.8K/month Bedrock bill cut ~70% with caching — and funded to $0 — anonymized

inquiry · Series-A vertical-AI SaaS, Berlin
Series-A vertical-AI SaaS, 19 people, ~$3.8K/month Bedrock inference on a customer-facing assistant

Situation: Their product wrapped every request in a ~6,000-token system prompt — domain rules, formatting, a large tool catalog — re-sent on every call to a Sonnet-class model, on-demand, across a busy multi-turn assistant. The bill was ~$3.8K/month and climbing with usage, and roughly 85% of every input was the same fixed prefix being re-billed thousands of times a day. They wanted to both cut the number and avoid paying it out of a runway earmarked for hiring.

What CloudRoute did: CloudRoute matched them in under 24 hours to a German AWS partner with GenAI cost-engineering experience. The partner (1) restructured the prompt so the static system prompt and tool catalog sat above a cache checkpoint and only the user turn below it; (2) moved a per-request timestamp that had been silently zeroing the hit rate to below the checkpoint; (3) added cache-usage logging so the team could watch its realized hit rate; and (4) filed a Bedrock POC credit application plus an Activate Portfolio application to fund the launch.

Outcome: Realized cache hit rate climbed above 90% on warm traffic; modeled inference cost fell from ~$3.8K to ~$1.15K/month (about a 70% cut), with time-to-first-token down sharply as a bonus — and even that reduced bill was fully covered by the approved credits, so the team paid $0 during the build and early launch. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.

cache hit rate: >90% warm · cost cut: ~$3.8K → ~$1.15K/mo · credits secured: POC + Activate · out-of-pocket: $0

faq

Common questions

What is prompt caching in Amazon Bedrock?
Prompt caching lets Bedrock store the model's computed state for a repeated prompt prefix — a long system prompt, a shared document, tool definitions, or few-shot examples — so the next request that reuses that exact prefix is not re-charged at the full input rate. Cached input tokens are billed at a steep discount (representatively around 90% off), and because the model skips re-processing the prefix, latency drops too. It is opt-in: you mark a cache checkpoint in the Converse API, and everything before it is cached.
How much does Bedrock prompt caching save?
On the repeated portion of the input, representatively about 90% — cached input tokens are billed at roughly a tenth of the normal input rate (with a small one-time premium to write the cache). The total bill saving depends on how much of your prompt is fixed, repeated context: a workload with a 4,000-token system prompt re-sent on every request can cut its input cost ~80% and its total cost ~55%, while a document-Q&A app reusing a large document across many questions can cut input ~85%+. It does not reduce output cost. Confirm current discount multipliers on the AWS Bedrock pricing page.
How does prompt caching work technically?
You insert a cache checkpoint (a cachePoint block) into a Converse (or InvokeModel) request. Everything before the checkpoint becomes the cached prefix; everything after is processed fresh. The cache is keyed on the exact tokens and their order from the start of the prompt to the checkpoint — a prefix match, not fuzzy — so the cached block must come first and be byte-identical across requests. The first call writes the cache (a small premium); later calls reusing the identical prefix read from it at the deep discount. The API response reports how many tokens were read from cache vs. written vs. processed fresh.
How long does the Bedrock prompt cache last (TTL)?
A cache entry has a short, rolling time-to-live — representatively on the order of a few minutes of inactivity (commonly cited around five minutes), refreshed on every hit, though the exact TTL and any longer-lived options vary by model and can change. Because each read resets the timer, a continuously-hit prefix stays warm indefinitely, while a prefix that goes idle past the window is evicted and the next request pays a fresh write. This makes caching a throughput optimization: it pays off most when requests sharing a prefix arrive close together. Confirm current TTL behavior in the AWS Bedrock documentation.
Which models support prompt caching on Bedrock?
As of 2026, prompt caching is available on flagship Anthropic Claude models on Bedrock (the Sonnet and Haiku tiers most teams run) and on Amazon Nova text models, with coverage broadening across providers. Support, minimum cacheable prefix length, the number of checkpoints, and TTL behavior all vary by model and change as AWS ships updates, so confirm support for your specific model in the current AWS Bedrock documentation before designing around it.
Why is my prompt cache not hitting?
Almost always because the cached region is not byte-identical across requests, or it is not reused densely enough. The most common cause is dynamic content near the top of the prompt — a timestamp, a per-request ID, the user's name, today's date — which changes the prefix every time and forces a 100% cache miss. Other causes: putting the volatile user message before the stable content (wrong ordering), non-deterministic JSON serialization of tool definitions, sparse traffic letting the entry expire between calls, or a prefix below the model's minimum cacheable length. Fix: keep the cached region strictly static and first, serialize deterministically, and confirm your realized hit rate from the API's cache-usage fields.
Does prompt caching work with Batch or Provisioned Throughput?
Prompt caching is best understood as a modifier on a real-time path — it layers onto On-Demand to discount a repeated prefix on interactive traffic. Batch is the alternative for non-interactive bulk jobs (it already gives ~50% off asynchronously), so you typically route a workload to one or the other: interactive chat/agents → on-demand + caching; offline bulk → Batch. Caching can complement Provisioned Throughput on a high-volume hot path (caching reduces the work per request within reserved capacity). Combine caching with model right-sizing for the biggest compounding saving on an interactive bill.
Does prompt caching change the model's output or quality?
No. Prompt caching only changes how the input is billed and how fast the prefix loads — the model produces exactly the same response it would without caching. It is not response caching (it does not return a stored answer for a repeated question; that is a separate semantic-cache pattern you build yourself). It is purely a cost-and-latency optimization on the input side, with no effect on output quality.
Can AWS credits cover Bedrock costs, including cached usage?
Yes — Bedrock inference (cached or not), fine-tuning, embeddings, and supporting services are all credit-eligible, and credits apply automatically against your AWS bill until exhausted. The relevant pools are AWS Activate (up to $100K), a dedicated Bedrock/GenAI POC pool ($10K–$50K), and the GenAI Accelerator (up to $1M for selected startups). These are largely partner-filed via the AWS Partner Network, which is why teams route through a partner. CloudRoute matches you to the right pool and a vetted AWS partner who files the application and builds the workload (caching, routing, FinOps included) — customer pays $0, AWS funds it.

Cut the bill with caching — then make it $0 with credits

Prompt caching can cut your repeated-input cost up to ~90%. AWS credits can cover what is left. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS partner who builds the caching, model routing, and FinOps. Customer pays $0.

cached-input savingup to ~90%
GenAI credit ceilingup to $1M
cost to you$0
Amazon Bedrock prompt caching — cut cost 90% (2026) · CloudRoute