claude on bedrock · production guide 2026

Running Claude on Amazon Bedrock in production — the model math, the API, the cost levers (2026).

Q: Should I use the Converse API or InvokeModel for Claude on Bedrock?

Use Converse for anything production-bound. It gives you one provider-agnostic request shape across the whole model catalog, first-class tool use (function calling), streaming via ConverseStream, system prompts, and clean attachment points for prompt caching and Guardrails. InvokeModel still works and exposes model-specific raw bodies, but it forces you to hand-assemble a different JSON payload per model family and rewrite it whenever you change models — the opposite of the portability you want when you are blending Haiku, Sonnet, and Opus.

Q: How do I choose between Claude Opus, Sonnet, and Haiku in production?

Default to Sonnet, then move down or up on evidence. Build a golden eval set, run it against all three, and let the results decide: push the traffic Haiku answers correctly down to Haiku (it is far cheaper and faster), and escalate only the measured-hard slice to Opus. Putting Opus on everything is the most common and most expensive mistake — you pay frontier prices on traffic that was already being answered correctly. A blended fleet that defaults to Sonnet and routes the extremes typically lands several-fold cheaper than an Opus-everywhere build with negligible quality difference on the easy majority.

Q: How much can prompt caching actually save on Bedrock?

On the cached portion of your input — the stable prefix you reuse, such as a long system prompt, tool definitions, few-shot exemplars, or fixed RAG context — caching can cut the per-token cost on those tokens by roughly 80–90% on repeat reads. The catch is that it only pays off when the prefix is genuinely reused within the cache's short, rolling TTL, so it shines for multi-turn chat, high-frequency calls, and large fixed contexts and does little for one-off calls spread far apart. Put the constant material at the front, the variable material at the end, and read the cache-read/write counts from the usage block to confirm you are getting hits.

Q: When should I use batch inference instead of real-time calls?

Use batch for anything that does not need an immediate answer: overnight document processing, bulk classification, dataset labeling, embedding backfills, periodic summarization, and offline evals. You submit input via S3 and collect output from S3, and Bedrock processes it asynchronously at roughly a 50% discount to on-demand. The only cost is latency — batch jobs complete in minutes to hours, not milliseconds. For most teams it is the single highest-ROI optimization available: almost no engineering effort beyond formatting the job, yet it halves the price on every non-interactive call.

Q: Do I need Bedrock Guardrails if Claude is already a safe model?

Usually yes, because Claude's safety is general and your policy is specific. Guardrails lets you enforce your application's named rules — the topics this product must refuse, the PII categories you must redact or mask, content thresholds for your audience, and contextual-grounding checks that flag RAG answers unsupported by the source material. It gives you a consistent, versioned, auditable policy layer applied to inputs and/or outputs — the difference between "the model tries to be safe" and "this application provably enforces these policies the same way every time." For regulated verticals and RAG products, the grounding and PII controls are typically requirements, not extras.

Q: When is Provisioned Throughput worth it versus on-demand?

Stay on-demand for the vast majority of deployments — it is pay-per-token with zero commitment, and adding a cross-region inference profile gives you a higher throughput ceiling and throttle smoothing for free. Reach for Provisioned Throughput only when you are reliably hitting on-demand throttling limits at sustained high volume, or when a hard p99 latency SLA requires reserved, dedicated capacity. It is billed hourly per model unit (optionally on a 1- or 6-month commitment) whether or not you use it, so it only makes sense when utilization is high and predictable. It is a capacity and latency tool, not a discount lever.

Q: How do I control Claude latency for a chat UI on Bedrock?

Four levers, in order of impact: stream the response with ConverseStream so the first tokens appear in a few hundred milliseconds instead of after the full generation; pick the smallest model in the family that clears your quality bar, since smaller models generate faster; cap maxTokens so the model cannot ramble past what you need; and cache the stable prefix so prompt-processing time shrinks on repeat calls. Most "Claude feels slow" complaints come from an un-streamed UI calling the largest model with an uncapped token budget — all three of which are quick fixes.

Q: Is Claude usage on Bedrock eligible for AWS credits?

Frequently, yes. Early-stage Bedrock inference spend can often be covered by AWS Activate credits, generative-AI program credits, and Bedrock proof-of-concept funding — which is most valuable during the build-and-validate phase, when you are running the most experiments and have the least revenue. The credit path and the optimization discipline in this guide are complementary: credits buy runway to get the architecture right, and the architecture keeps the bill sane after the credits are spent. These larger credit pools are typically partner-filed rather than self-serve, which is the gap CloudRoute closes.

Most teams ship a Claude prototype on Bedrock in an afternoon and then spend three months discovering what production actually requires: the right model per workload, the Converse API done correctly, prompt caching and batch that cut the bill 50–90%, Guardrails that hold, provisioned throughput when on-demand throttles, and evals you trust. This is the practitioner-grade reference for all of it.

caching savings: up to 90%

batch discount: 50%

Claude models on Bedrock: Opus · Sonnet · Haiku

price spread, Opus→Haiku: ~60×

TL;DR

Model selection is the single biggest production lever. Route the cheap, high-volume traffic to Haiku, the everyday reasoning and agentic work to Sonnet, and reserve Opus for the genuinely hard tasks. A blended fleet that defaults to Sonnet and falls back to Haiku usually lands 5–15× cheaper than putting Opus on everything, with no measurable quality loss on the easy traffic.
Build on the Converse API, not the legacy InvokeModel call. Converse gives you one provider-agnostic request shape, native tool use (function calling), streaming via ConverseStream, system prompts, and a clean place to attach prompt-caching checkpoints and Guardrails. It is the API the rest of this guide assumes.
Cost at scale is won with three levers stacked: prompt caching (up to 90% off repeated input tokens, e.g. a large system prompt or RAG context reused across turns), batch inference (a flat ~50% discount for anything non-interactive), and right-sizing the model. Provisioned Throughput and cross-region inference are about latency and capacity, not headline price — reach for them when on-demand throttles or when you need predictable p99.

context

I. Why teams run Claude on Bedrock specifically, and what changes in production

Claude is available through Anthropic's own API, through Amazon Bedrock, and through Google Vertex AI. The model weights are the same. What differs is the operational envelope around them — and in production, the envelope is most of the work.

Running Claude through Bedrock means the model lives inside your AWS account boundary. Requests stay on AWS networking, authentication is IAM (no separate API key to rotate and leak), data is governed by your existing AWS data-processing terms, and inputs and outputs are not used to train the foundation models. For any team already on AWS — which is most teams that have raised institutional funding — this collapses a procurement and security-review problem that would otherwise take weeks. The model becomes "just another AWS service" that your VPC, CloudTrail, billing, and SSO already understand.

The second reason is consolidation. Bedrock exposes Claude alongside Amazon Nova, Meta Llama, Mistral, Cohere, and others behind one API surface. A team that standardizes on the Converse API can swap or blend models — Claude Sonnet for the hard path, a cheaper model for the trivial path — without rewriting the integration. That optionality is worth real money once you are spending five or six figures a month on inference.

What surprises teams is how little the prototype tells you about production. A demo that calls Claude once per user action, synchronously, with a short prompt behaves nothing like a system handling thousands of concurrent requests with 30-page RAG contexts, tool-use loops, streaming UIs, and a p99 SLA. The gap between those two is the subject of this guide: model selection, the API contract, the cost levers, the safety layer, the capacity model, and the eval-and-observability discipline that keeps a Claude deployment honest over time.

One framing to carry through: on Bedrock, almost every production decision reduces to a tradeoff between four quantities — quality, latency, throughput, and cost. No setting maximizes all four. The job is to know which one each workload actually cares about and to spend the others to buy it.

the biggest lever

II. Choosing across the Claude family (Opus, Sonnet, Haiku) by workload

The Claude family on Bedrock spans roughly three capability-and-price tiers. Picking the wrong default is the most expensive mistake teams make, and it is invisible in a prototype because at one request per click the bill is rounding error.

The mental model: Haiku is the fast, cheap workhorse for high-volume, well-bounded tasks. Sonnet is the balanced default for everyday reasoning, agentic loops, coding, and RAG. Opus is the frontier tier you reserve for the genuinely hard problems — deep multi-step reasoning, complex agentic planning, the tasks where a wrong answer is expensive and a smarter model demonstrably wins. The list prices step roughly an order of magnitude per tier, so the routing decision is also the cost decision.

The most important production insight is that most traffic is easy. In a typical mixed application — support, classification, extraction, summarization, routing — the large majority of requests are well within Haiku's or Sonnet's competence, and only a thin slice genuinely benefits from Opus. Teams that default everything to the smartest model are paying frontier prices for trivial work. Teams that default everything to the cheapest model take quality hits on the hard slice and erode trust. The right answer is almost always a blended fleet with explicit routing.

A practical routing pattern that works: default to Sonnet; downshift to Haiku for requests you can cheaply classify as simple (short inputs, high-confidence intents, bulk classification); escalate to Opus only when a task is flagged hard — long-horizon reasoning, low model confidence, or a human-in-the-loop tier. You can implement the router as a tiny Haiku classification call in front of the main call, or with deterministic rules on input characteristics. Either way the savings are large: shifting even half your volume off the top tier typically cuts the model line of the bill several-fold.
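The deterministic-rules variant of that router can be sketched in a few lines. This is a minimal illustration, not a recommended policy: the tier labels, the `Request` fields, and the thresholds are all hypothetical, and in practice the cut-offs should come from your eval results rather than from guesses.

```python
from dataclasses import dataclass

# Hypothetical tier labels; map them to real model or inference-profile IDs in config.
HAIKU, SONNET, OPUS = "haiku", "sonnet", "opus"


@dataclass
class Request:
    text: str
    intent_confidence: float = 0.0  # score from an upstream intent classifier (assumed)
    flagged_hard: bool = False      # set by a rule, a reviewer, or a failed first pass
    bulk: bool = False              # part of a bulk classification job


def route(req: Request, short_chars: int = 500, confident: float = 0.9) -> str:
    """Default to Sonnet; downshift simple requests, escalate flagged-hard ones."""
    if req.flagged_hard:
        return OPUS
    if req.bulk:
        return HAIKU
    if len(req.text) <= short_chars and req.intent_confidence >= confident:
        return HAIKU
    return SONNET
```

Keeping the router as a pure function makes it easy to replay logged traffic through it and see what share of volume each tier would receive before changing anything in production.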

Haiku — the high-volume workhorse

Use it for: classification, routing, intent detection, PII tagging, short extraction, the cheap leg of a two-pass pipeline, and any task where you make many calls and each one is well-bounded.

Why: it is the cheapest and fastest tier, which matters enormously when call volume is high. At Haiku prices you can afford to call the model in places where Opus would be economically absurd — per-row, per-event, per-document-chunk.

Watch for: tasks that quietly need more reasoning than they appear to. The fix is not "always use a bigger model" — it is to measure quality on a representative eval set and escalate only the slice that fails.

Sonnet — the balanced default

Use it for: the everyday reasoning core of most applications — agentic tool-use loops, coding assistance, RAG answering over retrieved context, structured generation, multi-turn conversation with real logic.

Why: it is the price/performance sweet spot. For the large majority of production tasks Sonnet is indistinguishable from the top tier in output quality while costing a fraction. Making Sonnet your default — rather than Opus — is usually the single highest-leverage cost decision in the whole deployment.

Watch for: the temptation to never escalate. Keep an Opus path available and route the genuinely hard cases to it; the point of a default is that it has exits.

Opus — the frontier reserve

Use it for: deep multi-step reasoning, complex agentic planning over long horizons, hard coding and refactoring tasks, research-grade synthesis, and any decision where a wrong answer is costly enough to justify frontier pricing.

Why: on the hard slice the capability gap is real and shows up in measurable task success. The mistake is not using Opus — it is using it on traffic that does not need it.

Watch for: blanket defaults. Opus on 100% of traffic is the canonical way to turn a $2K/month workload into a $40K/month workload while improving end-to-end quality by almost nothing, because most of that traffic was already being answered correctly.

the routing heuristic

Start every workload on Sonnet. Profile a representative sample. Push everything that Haiku answers correctly down to Haiku. Push only the measured-hard slice up to Opus. Re-profile when the model lineup or the traffic mix changes. This single discipline routinely yields a roughly 5× difference in cost against only a roughly 1.1× difference in quality.

the API contract

III. The Converse API, tool use, and streaming: the integration that scales

Bedrock exposes two ways to call a model: the older InvokeModel (raw, model-specific JSON bodies) and the newer Converse API (one unified, provider-agnostic shape). For anything you intend to run in production, build on Converse.

Converse gives you a single request/response contract that is identical across Claude, Nova, Llama, and the rest: a `messages` array of user/assistant turns, an optional `system` prompt, an `inferenceConfig` block (max tokens, temperature, topP, stop sequences), and structured `toolConfig` for function calling. The value is that your application code stops caring which specific model handles a request — you can A/B Sonnet against a cheaper model, or route per-request, without branching your serialization logic. With InvokeModel you would hand-assemble a different JSON body per model family and rewrite it every time you change models.

Tool use (function calling) is first-class in Converse. You declare tools in `toolConfig` with a name, description, and JSON Schema for the input. When Claude decides to call one, the response returns with `stopReason: "tool_use"` and a `toolUse` block holding the tool name and arguments. Your code runs the tool, sends the result back as a `toolResult` content block in a new user turn, and the loop continues until Claude returns a normal text answer. This is the backbone of every agent: the model proposes actions, your code runs them, the results feed back. Keep tool descriptions precise — vague ones are the most common cause of the model calling the wrong tool or hallucinating arguments.
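The loop described above might look roughly like the sketch below. It assumes `client` is a boto3 `bedrock-runtime` client (or anything exposing a compatible `converse()`), and the `handlers` mapping, the `max_rounds` cap, and the choice to return tool output as a `json` content block are illustrative decisions rather than anything the API requires.

```python
def run_tool_loop(client, model_id, messages, tool_config, handlers, max_rounds=5):
    """Drive a Converse tool-use loop until the model stops asking for tools.

    handlers maps a tool name to a Python callable that takes the tool's
    input dict and returns a JSON-serializable dict (an assumption of this sketch).
    """
    for _ in range(max_rounds):
        resp = client.converse(
            modelId=model_id, messages=messages, toolConfig=tool_config
        )
        assistant_msg = resp["output"]["message"]
        messages.append(assistant_msg)  # keep the assistant turn in the history
        if resp["stopReason"] != "tool_use":
            return resp  # normal answer (or another stop reason for the caller)

        results = []
        for block in assistant_msg["content"]:
            if "toolUse" not in block:
                continue
            call = block["toolUse"]
            try:
                output = handlers[call["name"]](call["input"])
                result = {
                    "toolUseId": call["toolUseId"],
                    "content": [{"json": output}],
                    "status": "success",
                }
            except Exception as exc:  # report failures back to the model
                result = {
                    "toolUseId": call["toolUseId"],
                    "content": [{"text": str(exc)}],
                    "status": "error",
                }
            results.append({"toolResult": result})
        # Tool results go back as a new user turn.
        messages.append({"role": "user", "content": results})
    raise RuntimeError("tool loop did not converge within max_rounds")
```

Capping the number of rounds is a cheap guard against an agent that keeps calling tools without ever answering.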

Streaming uses the parallel ConverseStream operation, which returns an event stream of incremental content-block deltas you render as they arrive. For any human-facing chat or assistant UI this is non-negotiable: it takes perceived latency from "several seconds of nothing" to "first words in a few hundred milliseconds," even though total generation time is unchanged. Tool use and streaming compose — streamed responses still surface tool-use blocks — so an agentic, streaming UI runs on one code path.
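As a rough sketch of consuming that event stream: the helper below walks the events and hands each text delta to a callback. The event keys (`contentBlockDelta`, `messageStop`, `metadata`) follow the ConverseStream response shape as I understand it; the function name and the `on_text` callback are hypothetical.

```python
def stream_text(events, on_text=lambda s: print(s, end="", flush=True)):
    """Consume ConverseStream events; return (full_text, stop_reason, usage)."""
    parts, stop_reason, usage = [], None, {}
    for event in events:
        if "contentBlockDelta" in event:
            delta = event["contentBlockDelta"]["delta"]
            if "text" in delta:  # tool-use deltas carry other keys and are skipped here
                on_text(delta["text"])
                parts.append(delta["text"])
        elif "messageStop" in event:
            stop_reason = event["messageStop"]["stopReason"]
        elif "metadata" in event:
            usage = event["metadata"].get("usage", {})
    return "".join(parts), stop_reason, usage


# With a real client (assumed: boto3.client("bedrock-runtime")):
#   resp = client.converse_stream(modelId=MODEL_ID, messages=messages)
#   text, stop_reason, usage = stream_text(resp["stream"])
```

Returning the stop reason and usage alongside the text means the streaming path can feed the same logging and branching code as the non-streaming path.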

Two contract details bite teams in production. First, branch explicitly on `stopReason`: `end_turn` is a complete answer; `max_tokens` means the model was cut off mid-thought (raise the limit or shorten the task); `tool_use` hands control to your tool loop; `guardrail_intervened` means a Guardrail fired; and `stop_sequence` or `content_filtered` mean generation stopped on a configured stop sequence or a content filter. Second, the response carries a `usage` block with `inputTokens`, `outputTokens`, and, when caching is on, cache read/write counts. That block is the ground truth for cost attribution; log it on every call from day one.
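One way to make the `stopReason` branching explicit is a small dispatch function like the sketch below. The action labels it returns (`"answer"`, `"truncated"`, and so on) and the fallback message are hypothetical; the point is only that every stop reason gets a deliberate branch and anything unrecognized is surfaced rather than silently treated as an answer.

```python
def handle_stop_reason(resp, fallback_message="Sorry, I can't help with that request."):
    """Map a Converse response to an (action, payload) pair based on stopReason."""
    reason = resp["stopReason"]
    content = resp["output"]["message"]["content"]
    text = "".join(block["text"] for block in content if "text" in block)

    if reason == "end_turn":
        return "answer", text
    if reason == "max_tokens":
        # Output was cut off: the caller can raise maxTokens or shorten the task.
        return "truncated", text
    if reason == "tool_use":
        return "run_tools", [block["toolUse"] for block in content if "toolUse" in block]
    if reason == "guardrail_intervened":
        return "blocked", fallback_message
    return "unexpected", reason
```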

A minimal Converse call (boto3)

The shape below is the whole contract for a non-streaming call. Note the model is addressed by an inference-profile ID, the unified `messages`/`system`/`inferenceConfig` blocks, and the `usage` field you read back for cost.

    import boto3

    client = boto3.client("bedrock-runtime")

    resp = client.converse(
        modelId="us.anthropic.claude-sonnet-...-v1:0",  # cross-region inference profile
        system=[{"text": SYSTEM_PROMPT}],
        messages=[{"role": "user", "content": [{"text": user_input}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
    )
    text = resp["output"]["message"]["content"][0]["text"]
    usage = resp["usage"]  # inputTokens / outputTokens / cacheRead*

Swap `converse` for `converse_stream` and iterate the event stream to render tokens incrementally. The request body is otherwise identical — which is exactly the portability Converse is designed to give you.

cost at scale

IV. Prompt caching and batch: the two levers that cut the bill 50–90%

After model selection, the two largest cost levers on Bedrock are prompt caching (for repeated input tokens) and batch inference (for non-interactive workloads). They are independent and stackable, and most teams leave both on the table for months.

Prompt caching attacks the most wasteful pattern in LLM serving: re-sending the same large block of input tokens on every request. A long system prompt, a fixed instruction set, a few-shot exemplar block, a tool catalog, or a chunk of RAG context that is constant across a conversation — without caching you pay full input price to process those identical tokens on every call. With caching, you mark a stable prefix with a checkpoint; the first request writes it to the cache, and subsequent requests sharing that prefix read it back at a steep discount on the cached tokens. For workloads with a large fixed context and many calls against it, the input-token bill on the cached portion falls dramatically — frequently 80–90% on that segment. The wins land exactly where it hurts most: big system prompts, multi-turn chat with stable context, and RAG where retrieved passages are reused across follow-up questions.

The mechanics that matter: the cached prefix must be stable and sit at the front of the prompt — put the constant material (system prompt, instructions, tool definitions, fixed context) first and the variable material (the user's question) last, so the prefix is as long as possible. Cache entries are short-lived (a rolling few-minute TTL that refreshes on each hit), so caching pays off for bursty, conversational, or high-frequency access and does little for one-shot calls spread far apart. There is a small write premium the first time a prefix is cached, so it helps when the prefix is reused and is a slight net cost if it never is. Read the cache-read/write counts from the `usage` block to confirm hits — a "cache" that never reports reads is a misconfigured prefix.
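A sketch of what "stable prefix first, checkpoint after it" can look like with Converse follows. It assumes the `cachePoint` content block and the `cacheReadInputTokens` / `cacheWriteInputTokens` usage fields as I understand them from the Bedrock prompt-caching documentation; verify both against your SDK version. The hit-ratio helper also assumes `inputTokens` counts only uncached tokens.

```python
def build_cached_request(model_id, system_prompt, user_input, max_tokens=1024):
    """Keyword arguments for converse() with a cache checkpoint after the system prompt."""
    return {
        "modelId": model_id,
        "system": [
            {"text": system_prompt},             # constant material first
            {"cachePoint": {"type": "default"}}, # everything above this is the cached prefix
        ],
        # Variable material last, so it never invalidates the prefix.
        "messages": [{"role": "user", "content": [{"text": user_input}]}],
        "inferenceConfig": {"maxTokens": max_tokens},
    }


def cache_hit_ratio(usage):
    """Share of input tokens served from cache, computed from a usage block.

    Assumes inputTokens excludes cached tokens (an assumption of this sketch).
    """
    read = usage.get("cacheReadInputTokens", 0)
    written = usage.get("cacheWriteInputTokens", 0)
    fresh = usage.get("inputTokens", 0)
    total = read + written + fresh
    return read / total if total else 0.0
```

A ratio that stays at zero across repeated calls with the same system prompt is the misconfigured-prefix signal described above.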

Batch inference is the other half. For anything that does not need an immediate answer — overnight document processing, bulk classification, dataset labeling, embedding backfills, periodic summarization, offline evals — submit the work as a batch job (input and output in S3) and Bedrock processes it asynchronously at roughly half the on-demand token price. The discount is a flat ~50% with essentially no engineering cost beyond formatting the job, which makes batch the highest-ROI change available for any team with meaningful non-interactive volume. The only constraint is latency: batch is throughput-oriented, so jobs complete in minutes to hours, not milliseconds. The mistake to avoid is running interactive and batchable traffic through the same synchronous path and paying full freight on work that had no deadline.
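"Formatting the job" mostly means writing a JSONL file to S3. The sketch below builds those lines under two assumptions worth checking against the current batch-inference documentation: that each record has the shape `{"recordId": ..., "modelInput": ...}`, and that `modelInput` for Claude is the model-native Messages body rather than the Converse shape. The record-ID scheme and function name are hypothetical.

```python
import json


def to_batch_records(documents, instructions, max_tokens=512):
    """Yield one JSONL line per document for a batch-inference input file."""
    for i, doc in enumerate(documents):
        record = {
            # 11-character alphanumeric ID (assumed format); used to join outputs to inputs.
            "recordId": f"REC{i:08d}",
            "modelInput": {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": max_tokens,
                "system": instructions,
                "messages": [
                    {"role": "user", "content": [{"type": "text", "text": doc}]}
                ],
            },
        }
        yield json.dumps(record)


# Typical next steps (not shown): upload the lines as a .jsonl object to S3, then
# start the job, for example with the boto3 "bedrock" client's
# create_model_invocation_job call, pointing it at the input and output S3 URIs.
```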

Stack the levers. A document-processing pipeline that (a) uses Haiku for the cheap extraction pass, (b) caches the shared instruction-and-schema prefix, and (c) runs as a batch job pays a small fraction of what the naive "Opus, synchronous, full prompt every call" version costs — often an order of magnitude less — for output that is, on that workload, indistinguishable.

order of operations

Apply the cost levers in this order, because each one multiplies the next: (1) right-size the model (Sonnet/Haiku default, Opus reserved) → (2) cache the stable prefix (system prompt, tools, fixed context up front) → (3) batch everything non-interactive (~50% off). Provisioned Throughput is not on this list — it is a capacity/latency tool, not a discount.

the safety layer

V. Guardrails: the policy layer between your users and the model

Amazon Bedrock Guardrails is a configurable safety layer you attach to a Claude deployment to enforce content policy, redact sensitive data, and constrain topics — independently of, and on top of, Claude's own built-in safety behavior.

Claude is already a well-aligned model, so a fair question is why you need Guardrails at all. The answer is that Claude's safety is general and your policy is specific. Guardrails lets you encode your rules — the topics this particular product must refuse, the PII categories you are legally required to redact, the profanity and content thresholds appropriate for your audience, and the boundaries against prompt-injection and off-topic use. It is the difference between "the model tries to be safe" and "this application provably enforces these named policies, the same way every time, with an audit trail."

Functionally, a Guardrail can be applied to the input (what the user sends), the output (what the model returns), or both, and it bundles several policy types: content filters across categories like hate, violence, and sexual content with tunable strength; denied-topics definitions that block whole subject areas you describe in natural language; sensitive-information policies that detect and either block or mask PII (emails, phone numbers, card numbers, and custom regex patterns); word and profanity filters; and contextual-grounding checks that flag responses unsupported by the provided source material — a direct lever against RAG hallucination. When a Guardrail intervenes, the Converse response comes back with `stopReason: guardrail_intervened` and you serve your configured fallback message instead of the model output.

In production, attach Guardrails at the API call (Converse accepts a guardrail identifier and version), version your guardrail configurations so changes are deliberate and reviewable, and decide consciously whether each policy blocks or masks — masking PII preserves a usable response while blocking is appropriate for hard-prohibited topics. The grounding check deserves particular attention for any RAG or knowledge-base application: it is the cleanest built-in defense against the model confidently asserting things the retrieved context does not support. Treat the guardrail config as policy-as-code: it belongs in version control, in review, and in your eval suite, not hand-edited in the console.
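Attaching a Guardrail at the call site can be as small as the sketch below, which adds a `guardrailConfig` block to an existing set of `converse()` keyword arguments. Refusing the `DRAFT` version is a policy choice of this example (to force pinned, reviewable versions), not an API requirement, and the function name is hypothetical.

```python
def with_guardrail(request, guardrail_id, guardrail_version):
    """Return a copy of converse() kwargs with a pinned Guardrail attached."""
    if guardrail_version == "DRAFT":
        # Example policy: production calls must reference a numbered version.
        raise ValueError("pin a numbered guardrail version in production")
    guarded = dict(request)  # shallow copy; the caller's dict is left untouched
    guarded["guardrailConfig"] = {
        "guardrailIdentifier": guardrail_id,
        "guardrailVersion": guardrail_version,
        "trace": "enabled",  # include assessment details in the response for debugging
    }
    return guarded


# Usage (client assumed to be boto3.client("bedrock-runtime")):
#   resp = client.converse(**with_guardrail(base_request, GUARDRAIL_ID, "3"))
```

Reading the identifier and version from configuration that lives in version control keeps the call site aligned with the policy-as-code approach.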

capacity & latency

VI. Latency, throughput, provisioned throughput, and cross-region inference

On-demand Bedrock is the right default for the vast majority of deployments. The capacity tools — cross-region inference and Provisioned Throughput — exist for specific failure modes: throttling under load and the need for predictable latency or reserved capacity.

Start with the latency anatomy. Two numbers govern user experience: time-to-first-token (how long before the first word appears) and tokens-per-second (how fast the rest streams). Streaming via ConverseStream optimizes the first; model choice dominates both, since smaller models in the family generate faster. The practical levers to feel snappier are: stream always for interactive UIs, pick the smallest model that meets quality, cap `maxTokens` so the model cannot ramble, and cache the prefix so the prompt-processing portion of latency shrinks on repeat calls. Most "Claude feels slow" complaints are an un-streamed UI calling the largest model with an uncapped token budget — all three fixable.

Cross-region inference profiles are the first capacity tool and they are nearly free upside. Instead of pinning requests to a single region, you address the model by a regional inference-profile ID (e.g. a `us.` or `eu.` profile) and Bedrock automatically distributes calls across multiple regions in that geography. This raises your effective throughput ceiling and smooths out throttling during bursts, while keeping data within the geography for residency. For most production Claude workloads, calling through a cross-region profile rather than a single-region model ID is simply the better default — more headroom, the same code. (Mind that data is processed in any region within the profile's geography, which is a compliance consideration but rarely a blocker for US/EU profiles.)

Provisioned Throughput is the heavier instrument. On-demand inference shares a regional capacity pool and is subject to per-account throttling (rate limits expressed in requests and tokens per minute); under sustained high load you will see throttling that on-demand alone cannot solve. Provisioned Throughput reserves dedicated model capacity — measured in model units, billed hourly, optionally on a 1- or 6-month commitment — giving you guaranteed throughput and more consistent latency independent of the shared pool. The tradeoff is that you pay for the reserved capacity whether or not you use it, so it only makes economic sense at high, predictable, sustained volume, or when a latency/throughput SLA genuinely requires reserved capacity. The decision rule: stay on-demand (with retries and cross-region) until you are reliably hitting throttling limits or you have a hard p99 SLA, then model whether sustained utilization justifies the hourly reservation.

Regardless of capacity model, build for throttling. On-demand will occasionally return throttling exceptions, so wrap calls in exponential backoff with jitter, set sane client timeouts, and degrade gracefully (queue, retry, or fall back to a smaller model) rather than failing the user request. This retry discipline is what separates a deployment that survives a traffic spike from one that cascades.
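A minimal sketch of that retry discipline: exponential backoff with full jitter around any callable. It detects throttling by reading the error code off a botocore-style exception's `response` attribute, which keeps the helper free of SDK imports. The set of retryable codes, the delay parameters, and the function names are illustrative assumptions.

```python
import random
import time

# Error codes treated as transient in this sketch; adjust to what you observe.
RETRYABLE_CODES = {
    "ThrottlingException",
    "ServiceUnavailableException",
    "ModelTimeoutException",
}


def _error_code(exc):
    """Extract the AWS error code from a botocore-style ClientError, if present."""
    return getattr(exc, "response", {}).get("Error", {}).get("Code")


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(); retry transient errors with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if _error_code(exc) not in RETRYABLE_CODES or last_attempt:
                raise  # not transient, or out of attempts: let the caller degrade
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # full jitter spreads out retry storms


# Usage: call_with_backoff(lambda: client.converse(**request))
```

When the final attempt still fails, the exception propagates so the caller can queue the request or fall back to a smaller model instead of failing outright.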

on-demand vs cross-region vs provisioned throughput — when to use which

| Capacity model | What it gives you | Billing | Reach for it when |
| --- | --- | --- | --- |
| On-demand (single region) | Pay-per-token, zero commitment, simplest | Per input/output token | Default for almost everything; dev, low/medium volume |
| Cross-region inference profile | Higher throughput ceiling, throttle smoothing, same code | Per token (same as on-demand) | Production default: more headroom for free |
| Provisioned Throughput | Reserved capacity, predictable latency/throughput | Hourly per model unit (± 1/6-mo commit) | Sustained high volume or hard latency SLA |
| Batch | ~50% cheaper, async, throughput-oriented | Per token at ~50% discount | Any non-interactive / no-deadline workload |

These are not mutually exclusive: a real deployment often runs interactive traffic on a cross-region profile, offloads bulk work to batch, and reserves Provisioned Throughput only for a latency-critical tier once volume is proven.

keeping it honest

VII. Evals and observability: how you know it still works

A Claude deployment without evals is a deployment you cannot safely change. Without observability it is one you cannot debug or cost-attribute. Both are production requirements, not nice-to-haves — and both are where prototype-grade projects quietly rot.

Start with a golden eval set: a curated collection of representative inputs paired with known-good outputs or graded criteria, covering the easy majority, the hard slice, and the adversarial edges (prompt injection, off-topic, malformed input). This set is what lets you answer the questions that otherwise become arguments — "is Sonnet good enough here, or do we need Opus?", "did changing the system prompt help or hurt?", "is the cheaper model safe to route this traffic to?" You grade outputs with a mix of deterministic checks (exact match, schema validity, regex, did-it-call-the-right-tool) and model-graded scoring (an LLM-as-judge rubric for open-ended quality), and you run the set on every prompt change, model change, and routing change. Bedrock's built-in model-evaluation tooling can host automatic and human-in-the-loop evaluations, but the discipline matters more than the tool: the point is that a quality regression shows up in a number before it shows up in a user complaint.

Evals are also how you make the model-selection decision from Section II rigorously instead of by vibe. Run the same golden set against Haiku, Sonnet, and Opus, look at where the cheaper tiers actually fail, and let the failure pattern define your routing rules. This converts "which model should we use?" from an opinion into a measurement, and it is the only honest way to claim a cheaper model is "good enough."
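Stripped to its core, "run the same golden set against each tier and pick the cheapest one that passes" can be sketched as below. The golden-set format (an input plus a deterministic `check` callable), the pass threshold, and the function names are assumptions for illustration; a real harness would add model-graded scoring and per-slice breakdowns.

```python
def pass_rate(predict, golden_set):
    """Fraction of golden cases whose output passes its deterministic check."""
    passed = sum(1 for case in golden_set if case["check"](predict(case["input"])))
    return passed / len(golden_set)


def cheapest_passing(models, golden_set, threshold=0.95):
    """Return the first model that clears the threshold, or None.

    models is a list of (name, predict_fn) pairs ordered cheapest first;
    each predict_fn wraps a call to that model and returns its text output.
    """
    for name, predict in models:
        if pass_rate(predict, golden_set) >= threshold:
            return name
    return None
```

Running `pass_rate` separately on the easy, hard, and adversarial slices shows where a cheaper tier fails, which is the failure pattern that should define the routing rules.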

Observability is the runtime counterpart. At minimum, log on every call: the model ID, the token counts from the `usage` block (input, output, cache read/write), latency (time-to-first-token and total), `stopReason`, whether a Guardrail fired, and a request ID you can trace. CloudWatch captures Bedrock invocation metrics and you can enable model-invocation logging to land full request/response records in S3 or CloudWatch Logs for audit and debugging. Those token counts are also your cost ledger — per-feature and per-customer cost attribution comes directly from summing `usage` by tag, and without it you will not be able to answer "which feature is driving the bill" when finance asks. Watch error and throttling rates, p50/p95/p99 latency, cache-hit ratio, and Guardrail intervention rate as standing dashboards; each one maps directly to a lever in this guide.
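The per-call log line described above can be assembled directly from the Converse response. The sketch below is one possible shape: the output field names and the `feature` tag are hypothetical, and the cache-token keys follow the usage block as I understand it, so verify them against your SDK version.

```python
import json
import time


def usage_record(resp, model_id, feature, started_at, request_id=None):
    """Build a flat, JSON-serializable log record from a Converse response.

    started_at is a time.monotonic() reading taken just before the call.
    """
    usage = resp.get("usage", {})
    stop_reason = resp.get("stopReason")
    return {
        "model_id": model_id,
        "feature": feature,  # tag used later for per-feature cost attribution
        "request_id": request_id or resp.get("ResponseMetadata", {}).get("RequestId"),
        "stop_reason": stop_reason,
        "guardrail_fired": stop_reason == "guardrail_intervened",
        "input_tokens": usage.get("inputTokens", 0),
        "output_tokens": usage.get("outputTokens", 0),
        "cache_read_tokens": usage.get("cacheReadInputTokens", 0),
        "cache_write_tokens": usage.get("cacheWriteInputTokens", 0),
        "server_latency_ms": resp.get("metrics", {}).get("latencyMs"),
        "wall_ms": round((time.monotonic() - started_at) * 1000),
    }


# Usage:
#   t0 = time.monotonic()
#   resp = client.converse(**request)
#   print(json.dumps(usage_record(resp, request["modelId"], "summarize", t0)))
```

Emitting one structured line per call is enough to build the cost, latency, and Guardrail dashboards by aggregation, without a dedicated tracing system on day one.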

The standing dashboard

Cost: tokens by model, by feature, by customer; cache-hit ratio; batch vs on-demand mix. This is where overspend hides.

Latency: time-to-first-token and total, p50/p95/p99, per model. Regressions here are usually a model or token-budget change.

Reliability: error and throttling rates, retry counts, fallback activations. Spikes signal a capacity problem before users feel it.

Safety/quality: Guardrail intervention rate, grounding-check failures, and the latest golden-eval score. A drift in any of these is your early warning.

the math at scale

VIII. The cost math: how the levers compound at production volume

The point of the previous sections is that the levers multiply, not add. Walking one workload from the naive build to the optimized build makes the magnitude concrete — and shows why two teams running "the same" Claude feature can have bills an order of magnitude apart.

Take a representative pipeline: a million document-processing calls a month, each with a large fixed instruction-and-schema prefix and a variable document body, none of it latency-critical. The naive build runs every call on the top model, synchronously, sending the full prefix every time. The optimized build does three things — routes the workload to the right (smaller) model after an eval confirms quality holds, caches the stable prefix so the repeated input tokens bill at a fraction, and submits the whole thing as a batch job at the ~50% discount.

Each lever lands a multiplier on the relevant token segment: model right-sizing can swing the per-token rate by 5–60× depending on how far down the family you move; prompt caching can take 80–90% off the cached-prefix portion of input tokens, which in a large-fixed-prefix workload is most of the input; batch takes a flat ~50% off everything. Because they apply to different parts of the bill and stack, the combined effect on a prefix-heavy, non-interactive workload is routinely an order of magnitude or more — without changing the output quality on that workload, because the eval is what licensed the model choice in the first place.
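To make the compounding concrete, here is a small calculator with placeholder numbers. Everything in the example call is an assumption chosen for illustration: the per-million-token prices are not current list prices (they merely reflect a roughly 60× spread between tiers), the token counts per call are invented, and the model treats every call as a cache hit, ignores the cache-write premium, and assumes the caching and batch discounts stack on the same tokens as the text describes.

```python
def monthly_cost(calls, prefix_tokens, body_tokens, output_tokens,
                 in_price, out_price, cache_read_discount=0.0, batch_discount=0.0):
    """Estimated monthly cost in dollars; prices are per million tokens.

    cache_read_discount applies only to the fixed prefix; batch_discount
    applies to the whole bill. Both are fractions between 0 and 1.
    """
    input_tokens_billed = prefix_tokens * (1 - cache_read_discount) + body_tokens
    input_cost = calls * input_tokens_billed * in_price / 1_000_000
    output_cost = calls * output_tokens * out_price / 1_000_000
    return (input_cost + output_cost) * (1 - batch_discount)


# Placeholder workload: 1M calls, 4,000-token prefix, 1,000-token body, 300 output tokens.
naive = monthly_cost(1_000_000, 4000, 1000, 300, in_price=15.0, out_price=75.0)
optimized = monthly_cost(1_000_000, 4000, 1000, 300, in_price=0.25, out_price=1.25,
                         cache_read_discount=0.9, batch_discount=0.5)
print(f"naive: ${naive:,.2f}  optimized: ${optimized:,.2f}  ratio: {naive / optimized:.0f}x")
```

Under these placeholder inputs the naive build comes to $97,500 a month and the optimized build to about $362.50. Real ratios will be smaller once cache misses, the write premium, and a less aggressive model downshift are accounted for, which is why the inputs should come from your own logged token counts and current pricing.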

The discipline this implies is simple to state and easy to skip: measure tokens per request and split your traffic into "interactive and quality-critical" versus "batchable and bounded." Spend frontier money only on the first bucket, and only on the slice of it that evals prove needs frontier capability. Push the second bucket through caching and batch on the smallest model that passes. The single most common production cost pathology is the opposite: one synchronous code path, the largest model, and the full prompt every time, applied uniformly to traffic that was 90% trivial. That pathology is what turns a workload that should cost low four figures a month into one that costs five figures.

A final note on credits. Early-stage AWS spend on Bedrock is frequently fundable — AWS Activate credits, generative-AI program credits, and Bedrock proof-of-concept funding can cover a substantial portion of inference cost during the build-and-validate phase, which is exactly when you are running the most experiments and have the least revenue. The optimization discipline above and the funding path are complementary: credits buy you runway to get the architecture right, and the architecture keeps the bill sane once the credits are spent.

pick the tier

Claude family on Bedrock — Haiku vs Sonnet vs Opus for production

List prices step up several-fold from one tier to the next, so this table is as much a cost decision as a capability one. Read it as "what is the cheapest tier that clears the quality bar for this workload?" and let your eval set, not intuition, draw the line.

Dimension	Claude Haiku	Claude Sonnet	Claude Opus
Position	Fast, cheap workhorse	Balanced default	Frontier reserve
Relative token cost	Lowest (≈1×)	Mid (single-digit× Haiku)	Highest (several× Sonnet)
Relative speed	Fastest	Fast	Slowest of the three
Best-fit work	Classification, routing, extraction, bulk	Agents, coding, RAG, everyday reasoning	Hard reasoning, complex planning, research
Share of typical traffic	High-volume simple slice	The default majority	The thin hard slice
Default verdict	Downshift target	Start here	Escalation target only

Exact per-token prices and the precise model versions shift over time — confirm current Bedrock pricing and the available Claude model IDs in your region before committing a routing design. The structural advice (Sonnet default, Haiku down, Opus up-on-evidence) is stable across versions.
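
The "cheapest tier that clears the bar" rule can be sketched in a few lines. The eval scores, relative costs, and the function name below are all made up for illustration; in practice the scores would come from your golden eval set and the costs from current pricing.

```python
def cheapest_passing_tier(eval_scores, relative_cost, quality_bar):
    """Return the lowest-cost tier whose eval score meets the bar, or None."""
    passing = [tier for tier, score in eval_scores.items() if score >= quality_bar]
    if not passing:
        return None
    return min(passing, key=lambda tier: relative_cost[tier])

# Hypothetical eval results and cost multipliers for one workload.
scores = {"haiku": 0.91, "sonnet": 0.96, "opus": 0.97}
costs = {"haiku": 1, "sonnet": 4, "opus": 20}

print(cheapest_passing_tier(scores, costs, quality_bar=0.95))
```

Running this per workload, or per traffic slice, gives each one its own tier instead of one uniform choice.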

a recent match

A Claude-on-Bedrock production build — anonymized

inquiry · series-a vertical-SaaS, support-automation product

Series-A vertical SaaS, 22 engineers, building a customer-support agent on Claude, already on AWS

Situation: Had a working prototype: every support message hit the top Claude model synchronously with a 6,000-token system prompt and full knowledge-base context re-sent on every turn. It worked in the demo and projected to roughly $30K/month at their forecast volume — untenable on a Series-A budget. They also had no Guardrails (a compliance risk for a regulated vertical), no evals, and no cost visibility per customer.

What CloudRoute did: Routed within 24 hours to a vetted AWS partner with a Bedrock + GenAI track record. The partner moved the integration to the Converse API, set Sonnet as the default with a Haiku classifier in front for the trivial intents and an Opus path for escalations, added prompt caching on the stable system-prompt-plus-tools prefix, moved overnight ticket-summarization to a batch job, attached Bedrock Guardrails (PII masking + denied topics + contextual grounding for the RAG answers), and stood up a golden eval set plus CloudWatch dashboards for tokens, latency, and Guardrail rate. The partner also filed for AWS Bedrock POC + Activate credits to fund the build.

Outcome: Projected monthly inference cost fell from ~$30K to under $4K at the same volume — a blended-model fleet, caching on the heavy prefix, and batch on the async work compounding to roughly an 8× reduction with no measured quality drop on the eval set. Guardrails closed the compliance gap; the eval set made every subsequent model/prompt change a measured decision. AWS credits covered the build-and-validate phase, so CloudRoute's commission was paid by the partner from AWS engagement funding — the customer paid $0.

engagement window: 7 weeks · est. monthly cost: $30K → <$4K · quality delta on evals: ~neutral · cost to customer: $0

faq

Common questions

Should I use the Converse API or InvokeModel for Claude on Bedrock?

Use Converse for anything production-bound. It gives you one provider-agnostic request shape across the whole model catalog, first-class tool use (function calling), streaming via ConverseStream, system prompts, and clean attachment points for prompt caching and Guardrails. InvokeModel still works and exposes model-specific raw bodies, but it forces you to hand-assemble a different JSON payload per model family and rewrite it whenever you change models — the opposite of the portability you want when you are blending Haiku, Sonnet, and Opus.
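
A minimal sketch of that portability, assuming boto3's `bedrock-runtime` client: the request is a plain dict, and only `modelId` changes between tiers. The helper name and the model ID placeholders are hypothetical; look up the real IDs available in your region.

```python
def build_converse_request(model_id, system_prompt, user_text, max_tokens=512):
    """Build keyword arguments for a Converse call. The shape stays the same
    across models; only model_id changes."""
    return {
        "modelId": model_id,
        "system": [{"text": system_prompt}],
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
        "inferenceConfig": {"maxTokens": max_tokens},
    }

# Placeholder IDs, not real model identifiers.
fast = build_converse_request("<haiku-model-id>", "You triage tickets.", "Where is my order?")
default = build_converse_request("<sonnet-model-id>", "You triage tickets.", "Where is my order?")

# With boto3 installed and AWS credentials configured, the call would be roughly:
#   client = boto3.client("bedrock-runtime")
#   response = client.converse(**default)
#   text = response["output"]["message"]["content"][0]["text"]
```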

How do I choose between Claude Opus, Sonnet, and Haiku in production?

Default to Sonnet, then move down or up on evidence. Build a golden eval set, run it against all three, and let the results decide: push the traffic Haiku answers correctly down to Haiku (it is far cheaper and faster), and escalate only the measured-hard slice to Opus. Putting Opus on everything is the most common and most expensive mistake — you pay frontier prices on traffic that was already being answered correctly. A blended fleet that defaults to Sonnet and routes the extremes typically lands several-fold cheaper than an Opus-everywhere build with negligible quality difference on the easy majority.

How much can prompt caching actually save on Bedrock?

On the cached portion of your input — the stable prefix you reuse, such as a long system prompt, tool definitions, few-shot exemplars, or fixed RAG context — caching can cut the per-token cost on those tokens by roughly 80–90% on repeat reads. The catch is that it only pays off when the prefix is genuinely reused within the cache's short, rolling TTL, so it shines for multi-turn chat, high-frequency calls, and large fixed contexts and does little for one-off calls spread far apart. Put the constant material at the front, the variable material at the end, and read the cache-read/write counts from the usage block to confirm you are getting hits.
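
A sketch of the "constant first, variable last" layout in Converse terms, assuming the chosen model supports prompt caching and the prefix is long enough to qualify. The `cachePoint` block marks the end of the reusable prefix. The helper names are hypothetical, and the usage keys should be checked against a real response.

```python
def cached_system_blocks(stable_prefix):
    """System blocks with a cache checkpoint placed after the stable prefix."""
    return [{"text": stable_prefix}, {"cachePoint": {"type": "default"}}]

def cache_stats(usage):
    """Pull cache read and write token counts out of a Converse usage block."""
    read = usage.get("cacheReadInputTokens", 0)
    written = usage.get("cacheWriteInputTokens", 0)
    return {"read": read, "written": written, "hit": read > 0}

request = {
    "modelId": "<model-id>",  # placeholder, not a real identifier
    "system": cached_system_blocks("Long instructions, tool notes, exemplars..."),
    # Variable material goes last, after the cached prefix.
    "messages": [{"role": "user", "content": [{"text": "This turn's question"}]}],
}
```

On the first call you would expect a non-zero write count, and on repeats within the TTL a non-zero read count. If reads stay at zero, the prefix is probably changing between calls.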

When should I use batch inference instead of real-time calls?

Use batch for anything that does not need an immediate answer: overnight document processing, bulk classification, dataset labeling, embedding backfills, periodic summarization, and offline evals. You submit input via S3 and collect output from S3, and Bedrock processes it asynchronously at roughly a 50% discount to on-demand. The only cost is latency: batch jobs complete in minutes to hours rather than seconds. For most teams it is the single highest-ROI optimization available, because it takes almost no engineering effort beyond formatting the job yet halves the price on every non-interactive call.
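
"Formatting the job" mostly means writing a JSONL file with one record per line. The sketch below assumes the `recordId` / `modelInput` record layout with an Anthropic messages body; the helper name, record IDs, and token limit are illustrative, and record ID constraints and minimum job sizes should be checked in the Bedrock batch documentation.

```python
import json

def to_batch_jsonl(documents, instructions, max_tokens=1024):
    """Turn {record_id: document_text} into JSONL text for a batch job."""
    lines = []
    for record_id, body in documents.items():
        record = {
            "recordId": record_id,
            "modelInput": {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": max_tokens,
                "system": instructions,
                "messages": [
                    {"role": "user", "content": [{"type": "text", "text": body}]}
                ],
            },
        }
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"

jsonl = to_batch_jsonl(
    {"REC00000001": "doc one", "REC00000002": "doc two"},
    instructions="Extract the invoice fields as JSON.",
)

# Next steps (not run here): upload the file to S3, then submit it with the
# bedrock client's create_model_invocation_job, pointing inputDataConfig and
# outputDataConfig at your S3 URIs.
```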

Do I need Bedrock Guardrails if Claude is already a safe model?

Usually yes, because Claude's safety is general and your policy is specific. Guardrails lets you enforce your application's named rules — the topics this product must refuse, the PII categories you must redact or mask, content thresholds for your audience, and contextual-grounding checks that flag RAG answers unsupported by the source material. It gives you a consistent, versioned, auditable policy layer applied to inputs and/or outputs — the difference between "the model tries to be safe" and "this application provably enforces these policies the same way every time." For regulated verticals and RAG products, the grounding and PII controls are typically requirements, not extras.
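
In Converse terms, attaching that policy layer is one extra field on the request. The sketch below assumes a guardrail has already been created; the identifier and version values are placeholders, and the helper names are hypothetical.

```python
def with_guardrail(request, guardrail_id, guardrail_version):
    """Return a copy of a Converse request with a guardrail attached."""
    guarded = dict(request)
    guarded["guardrailConfig"] = {
        "guardrailIdentifier": guardrail_id,
        "guardrailVersion": guardrail_version,
        "trace": "enabled",  # include trace details for auditing
    }
    return guarded

def guardrail_intervened(response):
    """True when the Converse response was stopped by the guardrail."""
    return response.get("stopReason") == "guardrail_intervened"

base = {
    "modelId": "<model-id>",  # placeholder
    "messages": [{"role": "user", "content": [{"text": "My SSN is ..."}]}],
}
guarded = with_guardrail(base, "<guardrail-id>", "1")
```

Counting `guardrail_intervened` results over time gives you the intervention rate to put on a dashboard.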

When is Provisioned Throughput worth it versus on-demand?

Stay on-demand for the vast majority of deployments — it is pay-per-token with zero commitment, and adding a cross-region inference profile gives you a higher throughput ceiling and throttle smoothing for free. Reach for Provisioned Throughput only when you are reliably hitting on-demand throttling limits at sustained high volume, or when a hard p99 latency SLA requires reserved, dedicated capacity. It is billed hourly per model unit (optionally on a 1- or 6-month commitment) whether or not you use it, so it only makes sense when utilization is high and predictable. It is a capacity and latency tool, not a discount lever.
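
Because the hourly charge accrues whether or not you send traffic, it can help to put the fixed bill next to your current on-demand spend. The hourly rate and monthly spend below are invented for illustration; real model-unit pricing depends on the model and the commitment term.

```python
HOURS_PER_MONTH = 730  # average month, an approximation

def provisioned_vs_on_demand(hourly_rate_per_unit, model_units, on_demand_monthly):
    """Return (fixed provisioned monthly cost, on-demand spend as a share of it)."""
    provisioned_monthly = hourly_rate_per_unit * model_units * HOURS_PER_MONTH
    return provisioned_monthly, on_demand_monthly / provisioned_monthly

# Hypothetical: one model unit at $40/hour versus $8,000/month of on-demand use.
fixed, share = provisioned_vs_on_demand(40.0, 1, 8_000.0)
print(f"provisioned ${fixed:,.0f}/month; on-demand spend is {share:.0%} of that")
```

A share well below 100% suggests the case for reserved capacity has to rest on throttling or a latency SLA rather than on cost.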

How do I control Claude latency for a chat UI on Bedrock?

Four levers, in order of impact: stream the response with ConverseStream so the first tokens appear in a few hundred milliseconds instead of after the full generation; pick the smallest model in the family that clears your quality bar, since smaller models generate faster; cap maxTokens so the model cannot ramble past what you need; and cache the stable prefix so prompt-processing time shrinks on repeat calls. Most "Claude feels slow" complaints come from an un-streamed UI calling the largest model with an uncapped token budget — all three of which are quick fixes.
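
The streaming lever can be sketched as a small consumer for ConverseStream events that also records time-to-first-token. It assumes text arrives in `contentBlockDelta` events; the function name and the injectable `clock` parameter are hypothetical conveniences so the logic can be exercised without a live call.

```python
import time

def consume_stream(events, clock=time.monotonic):
    """Collect streamed text and measure seconds until the first text chunk.

    `events` is any iterable of ConverseStream-style event dicts.
    """
    start = clock()
    first_token_after = None
    parts = []
    for event in events:
        text = event.get("contentBlockDelta", {}).get("delta", {}).get("text")
        if text:
            if first_token_after is None:
                first_token_after = clock() - start
            parts.append(text)  # a real UI would render each chunk here
    return "".join(parts), first_token_after

# With boto3, the events would come from something like:
#   response = client.converse_stream(
#       modelId="<model-id>", messages=[...],
#       inferenceConfig={"maxTokens": 300},  # cap the token budget
#   )
#   text, ttft = consume_stream(response["stream"])
```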

Is Claude usage on Bedrock eligible for AWS credits?

Frequently, yes. Early-stage Bedrock inference spend can often be covered by AWS Activate credits, generative-AI program credits, and Bedrock proof-of-concept funding — which is most valuable during the build-and-validate phase, when you are running the most experiments and have the least revenue. The credit path and the optimization discipline in this guide are complementary: credits buy runway to get the architecture right, and the architecture keeps the bill sane after the credits are spent. These larger credit pools are typically partner-filed rather than self-serve, which is the gap CloudRoute closes.

Ship Claude on Bedrock in production — without the trial-and-error bill

CloudRoute routes you to a vetted AWS partner who does the Converse-API build, model routing, caching, Guardrails, and evals — and files for the AWS credits that often fund it. Customer pays $0. No procurement theater.

matched within: < 24h

typical cost cut: 5–15×

cost to you: $0