Most teams ship a Claude prototype on Bedrock in an afternoon and then spend three months discovering what production actually requires: the right model per workload, the Converse API done correctly, prompt caching and batch that cut the bill 60–90%, Guardrails that hold, provisioned throughput when on-demand throttles, and evals you trust. This is the practitioner-grade reference for all of it.
Claude is available through Anthropic's own API, through Amazon Bedrock, and through Google Vertex AI. The model weights are the same. What differs is the operational envelope around them — and in production, the envelope is most of the work.
Running Claude through Bedrock means the model lives inside your AWS account boundary. Requests stay on AWS networking, authentication is IAM (no separate API key to rotate and leak), data is governed by your existing AWS data-processing terms, and inputs and outputs are not used to train the foundation models. For any team already on AWS — which is most teams that have raised institutional funding — this collapses a procurement and security-review problem that would otherwise take weeks. The model becomes "just another AWS service" that your VPC, CloudTrail, billing, and SSO already understand.
The second reason is consolidation. Bedrock exposes Claude alongside Amazon Nova, Meta Llama, Mistral, Cohere, and others behind one API surface. A team that standardizes on the Converse API can swap or blend models — Claude Sonnet for the hard path, a cheaper model for the trivial path — without rewriting the integration. That optionality is worth real money once you are spending five or six figures a month on inference.
What surprises teams is how little the prototype tells you about production. A demo that calls Claude once per user action, synchronously, with a short prompt behaves nothing like a system handling thousands of concurrent requests with 30-page RAG contexts, tool-use loops, streaming UIs, and a p99 SLA. The gap between those two is the subject of this guide: model selection, the API contract, the cost levers, the safety layer, the capacity model, and the eval-and-observability discipline that keeps a Claude deployment honest over time.
One framing to carry through: on Bedrock, almost every production decision reduces to a tradeoff between four quantities — quality, latency, throughput, and cost. No setting maximizes all four. The job is to know which one each workload actually cares about and to spend the others to buy it.
The Claude family on Bedrock spans roughly three capability-and-price tiers. Picking the wrong default is the most expensive mistake teams make, and it is invisible in a prototype because at one request per click the bill is rounding error.
The mental model: Haiku is the fast, cheap workhorse for high-volume, well-bounded tasks. Sonnet is the balanced default for everyday reasoning, agentic loops, coding, and RAG. Opus is the frontier tier you reserve for the genuinely hard problems: deep multi-step reasoning, complex agentic planning, and tasks where a wrong answer is expensive and a smarter model demonstrably wins. List prices step up several-fold per tier (the exact ratios vary by model generation, so check the current Bedrock price list), which makes the routing decision also the cost decision.
The most important production insight is that most traffic is easy. In a typical mixed application — support, classification, extraction, summarization, routing — the large majority of requests are well within Haiku's or Sonnet's competence, and only a thin slice genuinely benefits from Opus. Teams that default everything to the smartest model are paying frontier prices for trivial work. Teams that default everything to the cheapest model take quality hits on the hard slice and erode trust. The right answer is almost always a blended fleet with explicit routing.
A practical routing pattern that works: default to Sonnet; downshift to Haiku for requests you can cheaply classify as simple (short inputs, high-confidence intents, bulk classification); escalate to Opus only when a task is flagged hard — long-horizon reasoning, low model confidence, or a human-in-the-loop tier. You can implement the router as a tiny Haiku classification call in front of the main call, or with deterministic rules on input characteristics. Either way the savings are large: shifting even half your volume off the top tier typically cuts the model line of the bill several-fold.
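The deterministic-rules variant of that router can be sketched in a few lines. This is a minimal illustration, not a recommended policy: the tier labels, the `Request` fields, the intent names, and the thresholds are all hypothetical and should come from your own eval data.

```python
from dataclasses import dataclass

# Hypothetical tier labels; map them to real Bedrock model or
# inference-profile IDs in your own configuration.
HAIKU, SONNET, OPUS = "haiku", "sonnet", "opus"

# Illustrative intents assumed to be cheap enough for the small tier.
SIMPLE_INTENTS = {"greeting", "order_status", "password_reset"}


@dataclass
class Request:
    text: str
    intent: str = "unknown"         # label from an upstream classifier
    intent_confidence: float = 0.0  # 0.0 to 1.0
    flagged_hard: bool = False      # e.g. long-horizon task or human escalation


def choose_tier(req: Request, short_chars: int = 400, min_conf: float = 0.9) -> str:
    """Default to Sonnet, downshift simple traffic, escalate flagged traffic."""
    if req.flagged_hard:
        return OPUS
    is_short = len(req.text) <= short_chars
    is_confident = req.intent_confidence >= min_conf
    if is_short and is_confident and req.intent in SIMPLE_INTENTS:
        return HAIKU
    return SONNET
```

Keeping the rule in one pure function makes it easy to replay logged traffic through it and see how the tier mix would shift before changing production routing.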
Use Haiku for: classification, routing, intent detection, PII tagging, short extraction, the cheap leg of a two-pass pipeline, and any task where you make many calls and each one is well-bounded.
Why: it is the cheapest and fastest tier, which matters enormously when call volume is high. At Haiku prices you can afford to call the model in places where Opus would be economically absurd — per-row, per-event, per-document-chunk.
Watch for: tasks that quietly need more reasoning than they appear to. The fix is not "always use a bigger model" — it is to measure quality on a representative eval set and escalate only the slice that fails.
Use Sonnet for: the everyday reasoning core of most applications, including agentic tool-use loops, coding assistance, RAG answering over retrieved context, structured generation, and multi-turn conversation with real logic.
Why: it is the price/performance sweet spot. For the large majority of production tasks Sonnet is indistinguishable from the top tier in output quality while costing a fraction. Making Sonnet your default — rather than Opus — is usually the single highest-leverage cost decision in the whole deployment.
Watch for: the temptation to never escalate. Keep an Opus path available and route the genuinely hard cases to it; the point of a default is that it has exits.
Use Opus for: deep multi-step reasoning, complex agentic planning over long horizons, hard coding and refactoring tasks, research-grade synthesis, and any decision where a wrong answer is costly enough to justify frontier pricing.
Why: on the hard slice the capability gap is real and shows up in measurable task success. The mistake is not using Opus — it is using it on traffic that does not need it.
Watch for: blanket defaults. Opus on 100% of traffic is the canonical way to turn a $2K/month workload into a $40K/month workload while improving end-to-end quality by almost nothing, because most of that traffic was already being answered correctly.
Start every workload on Sonnet. Profile a representative sample. Push everything that Haiku answers correctly down to Haiku. Push only the measured-hard slice up to Opus. Re-profile when the model lineup or the traffic mix changes. In practice the gap between following this discipline and skipping it is routinely around 5× in cost against around 1.1× in quality.
Bedrock exposes two ways to call a model: the older InvokeModel (raw, model-specific JSON bodies) and the newer Converse API (one unified, provider-agnostic shape). For anything you intend to run in production, build on Converse.
Converse gives you a single request/response contract that is identical across Claude, Nova, Llama, and the rest: a `messages` array of user/assistant turns, an optional `system` prompt, an `inferenceConfig` block (max tokens, temperature, topP, stop sequences), and structured `toolConfig` for function calling. The value is that your application code stops caring which specific model handles a request — you can A/B Sonnet against a cheaper model, or route per-request, without branching your serialization logic. With InvokeModel you would hand-assemble a different JSON body per model family and rewrite it every time you change models.
Tool use (function calling) is first-class in Converse. You declare tools in `toolConfig` with a name, description, and JSON Schema for the input. When Claude decides to call one, the response returns with `stopReason: "tool_use"` and a `toolUse` block holding the tool name and arguments. Your code runs the tool, sends the result back as a `toolResult` content block in a new user turn, and the loop continues until Claude returns a normal text answer. This is the backbone of every agent: the model proposes actions, your code runs them, the results feed back. Keep tool descriptions precise — vague ones are the most common cause of the model calling the wrong tool or hallucinating arguments.
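The loop described above can be sketched as follows. It assumes `client` is a `boto3.client("bedrock-runtime")` instance (passed in so the function stays testable); the tool name `get_order_status`, its schema, and the `handlers` mapping are hypothetical examples, and the error-handling policy is one choice among several.

```python
# Hypothetical tool declaration in the Converse toolConfig shape.
TOOL_CONFIG = {
    "tools": [
        {
            "toolSpec": {
                "name": "get_order_status",
                "description": "Look up the shipping status of one order by its ID.",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {"order_id": {"type": "string"}},
                        "required": ["order_id"],
                    }
                },
            }
        }
    ]
}


def run_tool_loop(client, model_id, messages, tool_config, handlers, max_turns=5):
    """Call Converse until the model stops requesting tools, then return the last response."""
    messages = list(messages)  # do not mutate the caller's list
    for _ in range(max_turns):
        resp = client.converse(modelId=model_id, messages=messages, toolConfig=tool_config)
        assistant_msg = resp["output"]["message"]
        messages.append(assistant_msg)
        if resp["stopReason"] != "tool_use":
            return resp
        results = []
        for block in assistant_msg["content"]:
            if "toolUse" not in block:
                continue  # skip any text the model emitted alongside the call
            call = block["toolUse"]
            try:
                output = handlers[call["name"]](**call["input"])
                result = {"toolUseId": call["toolUseId"],
                          "content": [{"json": {"result": output}}]}
            except Exception as exc:  # report failures back to the model
                result = {"toolUseId": call["toolUseId"],
                          "content": [{"text": str(exc)}],
                          "status": "error"}
            results.append({"toolResult": result})
        # Tool results go back as content blocks in a new user turn.
        messages.append({"role": "user", "content": results})
    raise RuntimeError("tool loop exceeded max_turns")
```

The `max_turns` cap is a deliberate guard: without it, a model that keeps requesting tools can loop indefinitely and run up cost.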
Streaming uses the parallel ConverseStream operation, which returns an event stream of incremental content-block deltas you render as they arrive. For any human-facing chat or assistant UI this is non-negotiable: it takes perceived latency from "several seconds of nothing" to "first words in a few hundred milliseconds," even though total generation time is unchanged. Tool use and streaming compose — streamed responses still surface tool-use blocks — so an agentic, streaming UI runs on one code path.
Two contract details that bite teams in production. First, branch explicitly on `stopReason`: `end_turn` is a complete answer, `max_tokens` means you truncated the model mid-thought (raise the limit or shorten the task), `tool_use` hands control to your tool loop, `guardrail_intervened` means a Guardrail fired. Second, the response carries a `usage` block with `inputTokens`, `outputTokens`, and — when caching is on — cache read/write counts. That block is the ground truth for cost attribution; log it on every call from day one.
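One way to make both habits hard to skip is a single function that every call site passes its response through. The sketch below is illustrative: the action names are hypothetical, `stop_sequence` is included as an additional stop reason not discussed above, and the cache field names assume the Converse `usage` block reports `cacheReadInputTokens` and `cacheWriteInputTokens`.

```python
# Map each stop reason to what the application should do next.
# The action names on the right are hypothetical labels for this sketch.
STOP_ACTIONS = {
    "end_turn": "return_answer",
    "stop_sequence": "return_answer",
    "max_tokens": "truncated",
    "tool_use": "run_tools",
    "guardrail_intervened": "serve_fallback",
}


def interpret_response(resp: dict) -> dict:
    """Return the next action plus the usage numbers worth logging."""
    stop = resp.get("stopReason")
    usage = resp.get("usage", {})
    return {
        # Anything unrecognised is surfaced rather than treated as success.
        "action": STOP_ACTIONS.get(stop, "unexpected"),
        "stop_reason": stop,
        "input_tokens": usage.get("inputTokens", 0),
        "output_tokens": usage.get("outputTokens", 0),
        "cache_read_tokens": usage.get("cacheReadInputTokens", 0),
        "cache_write_tokens": usage.get("cacheWriteInputTokens", 0),
    }
```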
The shape below is the whole contract for a non-streaming call. Note three things: the model is addressed by an inference-profile ID, the request uses the unified `messages`/`system`/`inferenceConfig` blocks, and the `usage` field is what you read back for cost.
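```python
import boto3

# Bedrock runtime client; region and credentials come from the standard AWS config chain.
client = boto3.client("bedrock-runtime")

resp = client.converse(
    modelId="us.anthropic.claude-sonnet-...-v1:0",  # cross-region inference profile
    system=[{"text": SYSTEM_PROMPT}],  # SYSTEM_PROMPT: your system prompt string
    messages=[{"role": "user", "content": [{"text": user_input}]}],
    inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
)
# Assumes the first content block is text, which holds for a plain answer with no tool use.
text = resp["output"]["message"]["content"][0]["text"]
usage = resp["usage"]  # inputTokens / outputTokens / cacheRead*
```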
Swap `converse` for `converse_stream` and iterate the event stream to render tokens incrementally. The request body is otherwise identical — which is exactly the portability Converse is designed to give you.
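A minimal sketch of consuming that event stream follows. It assumes `client` is a `boto3.client("bedrock-runtime")` instance and that `on_text` is your own callback (hypothetical) that pushes each chunk to the UI. Tool-use deltas are ignored here to keep the example short.

```python
def stream_answer(client, on_text, **request):
    """Render text deltas as they arrive and collect the stop reason and usage."""
    resp = client.converse_stream(**request)
    parts, stop_reason, usage = [], None, {}
    for event in resp["stream"]:
        if "contentBlockDelta" in event:
            delta = event["contentBlockDelta"]["delta"]
            if "text" in delta:  # tool-use deltas carry "toolUse" instead
                on_text(delta["text"])
                parts.append(delta["text"])
        elif "messageStop" in event:
            stop_reason = event["messageStop"]["stopReason"]
        elif "metadata" in event:
            # Token usage arrives in a metadata event near the end of the stream.
            usage = event["metadata"].get("usage", {})
    return {"text": "".join(parts), "stop_reason": stop_reason, "usage": usage}
```

Called as `stream_answer(client, print, modelId=..., messages=...)`, it prints chunks as they arrive and still returns the `usage` numbers you need for cost logging.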
After model selection, the two largest cost levers on Bedrock are prompt caching (for repeated input tokens) and batch inference (for non-interactive workloads). They are independent and stackable, and most teams leave both on the table for months.
Prompt caching attacks the most wasteful pattern in LLM serving: re-sending the same large block of input tokens on every request. A long system prompt, a fixed instruction set, a few-shot exemplar block, a tool catalog, or a chunk of RAG context that is constant across a conversation — without caching you pay full input price to process those identical tokens on every call. With caching, you mark a stable prefix with a checkpoint; the first request writes it to the cache, and subsequent requests sharing that prefix read it back at a steep discount on the cached tokens. For workloads with a large fixed context and many calls against it, the input-token bill on the cached portion falls dramatically — frequently 80–90% on that segment. The wins land exactly where it hurts most: big system prompts, multi-turn chat with stable context, and RAG where retrieved passages are reused across follow-up questions.
The mechanics that matter: the cached prefix must be stable and sit at the front of the prompt — put the constant material (system prompt, instructions, tool definitions, fixed context) first and the variable material (the user's question) last, so the prefix is as long as possible. Cache entries are short-lived (a rolling few-minute TTL that refreshes on each hit), so caching pays off for bursty, conversational, or high-frequency access and does little for one-shot calls spread far apart. There is a small write premium the first time a prefix is cached, so it helps when the prefix is reused and is a slight net cost if it never is. Read the cache-read/write counts from the `usage` block to confirm hits — a "cache" that never reports reads is a misconfigured prefix.
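The prefix-first layout can be sketched as a request builder plus a hit-ratio check. This assumes the Converse `cachePoint` content block for marking checkpoints and the `cacheReadInputTokens` / `cacheWriteInputTokens` usage fields. It also assumes `inputTokens` counts only uncached input, which you should confirm against your own responses. The function names are illustrative.

```python
CACHE_POINT = {"cachePoint": {"type": "default"}}


def build_cached_request(model_id, system_prompt, tools, user_text, max_tokens=1024):
    """Stable material first with cache checkpoints; the variable question last."""
    request = {
        "modelId": model_id,
        # Everything before the checkpoint is eligible to be cached.
        "system": [{"text": system_prompt}, CACHE_POINT],
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
        "inferenceConfig": {"maxTokens": max_tokens},
    }
    if tools:
        # Tool definitions are part of the stable prefix too.
        request["toolConfig"] = {"tools": list(tools) + [CACHE_POINT]}
    return request


def cache_hit_ratio(usage):
    """Share of input tokens served from cache (assumes inputTokens excludes cached tokens)."""
    read = usage.get("cacheReadInputTokens", 0)
    write = usage.get("cacheWriteInputTokens", 0)
    fresh = usage.get("inputTokens", 0)
    total = read + write + fresh
    return read / total if total else 0.0
```

A ratio that stays at zero across repeated calls with the same system prompt is the signal described above: the prefix is not stable, or the checkpoint is in the wrong place.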
Batch inference is the other half. For anything that does not need an immediate answer — overnight document processing, bulk classification, dataset labeling, embedding backfills, periodic summarization, offline evals — submit the work as a batch job (input and output in S3) and Bedrock processes it asynchronously at roughly half the on-demand token price. The discount is a flat ~50% with essentially no engineering cost beyond formatting the job, which makes batch the highest-ROI change available for any team with meaningful non-interactive volume. The only constraint is latency: batch is throughput-oriented, so jobs complete in minutes to hours, not milliseconds. The mistake to avoid is running interactive and batchable traffic through the same synchronous path and paying full freight on work that had no deadline.
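"Formatting the job" mostly means producing a JSONL file with one record per request. The sketch below assumes the batch input format of one JSON object per line with a `recordId` and a `modelInput` holding the model's native request body (for Claude, the Anthropic Messages shape with `anthropic_version`). Verify both against the current Bedrock batch documentation before relying on them. The function names and record IDs are illustrative.

```python
import json


def to_batch_record(record_id, system_prompt, document_text, max_tokens=512):
    """One batch record: an ID for joining outputs plus a native Claude request body."""
    return {
        "recordId": record_id,
        "modelInput": {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "system": system_prompt,
            "messages": [
                {"role": "user", "content": [{"type": "text", "text": document_text}]}
            ],
        },
    }


def to_jsonl(docs, system_prompt):
    """Serialize (record_id, text) pairs to JSONL, one request per line."""
    lines = [json.dumps(to_batch_record(rid, system_prompt, text)) for rid, text in docs]
    return "\n".join(lines) + "\n"
```

From there the flow is to upload the file to S3 and start the job, for example with `create_model_invocation_job` on the `bedrock` control-plane client, pointing its input and output configuration at your S3 URIs and supplying an IAM role that can read and write them.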
Stack the levers. A document-processing pipeline that (a) uses Haiku for the cheap extraction pass, (b) caches the shared instruction-and-schema prefix, and (c) runs as a batch job pays a small fraction of what the naive "Opus, synchronous, full prompt every call" version costs — often an order of magnitude less — for output that is, on that workload, indistinguishable.
Apply the cost levers in this order, because each one multiplies the next: (1) right-size the model (Sonnet/Haiku default, Opus reserved) → (2) cache the stable prefix (system prompt, tools, fixed context up front) → (3) batch everything non-interactive (~50% off). Provisioned Throughput is not on this list — it is a capacity/latency tool, not a discount.
Amazon Bedrock Guardrails is a configurable safety layer you attach to a Claude deployment to enforce content policy, redact sensitive data, and constrain topics — independently of, and on top of, Claude's own built-in safety behavior.
Claude is already a well-aligned model, so a fair question is why you need Guardrails at all. The answer is that Claude's safety is general and your policy is specific. Guardrails lets you encode your rules — the topics this particular product must refuse, the PII categories you are legally required to redact, the profanity and content thresholds appropriate for your audience, and the boundaries against prompt-injection and off-topic use. It is the difference between "the model tries to be safe" and "this application provably enforces these named policies, the same way every time, with an audit trail."
Functionally, a Guardrail can be applied to the input (what the user sends), the output (what the model returns), or both, and it bundles several policy types: content filters across categories like hate, violence, and sexual content with tunable strength; denied-topics definitions that block whole subject areas you describe in natural language; sensitive-information policies that detect and either block or mask PII (emails, phone numbers, card numbers, and custom regex patterns); word and profanity filters; and contextual-grounding checks that flag responses unsupported by the provided source material — a direct lever against RAG hallucination. When a Guardrail intervenes, the Converse response comes back with `stopReason: guardrail_intervened` and you serve your configured fallback message instead of the model output.
In production, attach Guardrails at the API call (Converse accepts a guardrail identifier and version), version your guardrail configurations so changes are deliberate and reviewable, and decide consciously whether each policy blocks or masks — masking PII preserves a usable response while blocking is appropriate for hard-prohibited topics. The grounding check deserves particular attention for any RAG or knowledge-base application: it is the cleanest built-in defense against the model confidently asserting things the retrieved context does not support. Treat the guardrail config as policy-as-code: it belongs in version control, in review, and in your eval suite, not hand-edited in the console.
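Attaching a Guardrail at the call can be sketched as a thin wrapper. It assumes `client` is a `boto3.client("bedrock-runtime")` instance and uses the Converse `guardrailConfig` parameter. The guardrail ID, the version, and the application-side `fallback` string are placeholders you would load from version-controlled configuration.

```python
def converse_with_guardrail(client, request, guardrail_id, guardrail_version, fallback):
    """Call Converse with a pinned guardrail version and branch on intervention."""
    resp = client.converse(
        **request,
        guardrailConfig={
            "guardrailIdentifier": guardrail_id,
            "guardrailVersion": guardrail_version,  # pin a version, not a moving draft
            "trace": "enabled",                     # keep the trace for audit logs
        },
    )
    trace = resp.get("trace", {})
    if resp["stopReason"] == "guardrail_intervened":
        # Serve the fallback instead of whatever the model produced.
        return {"text": fallback, "blocked": True, "trace": trace}
    text = "".join(b.get("text", "") for b in resp["output"]["message"]["content"])
    return {"text": text, "blocked": False, "trace": trace}
```

Returning the `blocked` flag alongside the text lets the caller log intervention rate without re-inspecting the raw response.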
On-demand Bedrock is the right default for the vast majority of deployments. The capacity tools — cross-region inference and Provisioned Throughput — exist for specific failure modes: throttling under load and the need for predictable latency or reserved capacity.
Start with the latency anatomy. Two numbers govern user experience: time-to-first-token (how long before the first word appears) and tokens-per-second (how fast the rest streams). Streaming via ConverseStream optimizes the first; model choice dominates both, since smaller models in the family generate faster. The practical levers to feel snappier are: stream always for interactive UIs, pick the smallest model that meets quality, cap `maxTokens` so the model cannot ramble, and cache the prefix so the prompt-processing portion of latency shrinks on repeat calls. Most "Claude feels slow" complaints are an un-streamed UI calling the largest model with an uncapped token budget — all three fixable.
Cross-region inference profiles are the first capacity tool and they are nearly free upside. Instead of pinning requests to a single region, you address the model by a regional inference-profile ID (e.g. a `us.` or `eu.` profile) and Bedrock automatically distributes calls across multiple regions in that geography. This raises your effective throughput ceiling and smooths out throttling during bursts, while keeping data within the geography for residency. For most production Claude workloads, calling through a cross-region profile rather than a single-region model ID is simply the better default — more headroom, the same code. (Mind that data is processed in any region within the profile's geography, which is a compliance consideration but rarely a blocker for US/EU profiles.)
Provisioned Throughput is the heavier instrument. On-demand inference shares a regional capacity pool and is subject to per-account throttling (rate limits expressed in requests and tokens per minute); under sustained high load you will see throttling that on-demand alone cannot solve. Provisioned Throughput reserves dedicated model capacity — measured in model units, billed hourly, optionally on a 1- or 6-month commitment — giving you guaranteed throughput and more consistent latency independent of the shared pool. The tradeoff is that you pay for the reserved capacity whether or not you use it, so it only makes economic sense at high, predictable, sustained volume, or when a latency/throughput SLA genuinely requires reserved capacity. The decision rule: stay on-demand (with retries and cross-region) until you are reliably hitting throttling limits or you have a hard p99 SLA, then model whether sustained utilization justifies the hourly reservation.
Regardless of capacity model, build for throttling. On-demand will occasionally return throttling exceptions, so wrap calls in exponential backoff with jitter, set sane client timeouts, and degrade gracefully (queue, retry, or fall back to a smaller model) rather than failing the user request. This retry discipline is what separates a deployment that survives a traffic spike from one that cascades.
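A minimal sketch of that retry wrapper, using exponential backoff with full jitter. It assumes throttling surfaces as a botocore-style exception whose `response["Error"]["Code"]` is `ThrottlingException`. The delay values are illustrative, and `sleep` and `rng` are injectable only so the function can be tested without waiting.

```python
import random
import time

THROTTLE_CODES = {"ThrottlingException"}


def is_throttle(exc):
    """True if the exception looks like a botocore ClientError for throttling."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code") in THROTTLE_CODES


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep, rng=random.random):
    """Retry fn() on throttling with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Non-throttling errors and the final attempt propagate unchanged.
            if not is_throttle(exc) or attempt == max_attempts - 1:
                raise
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)  # full jitter: uniform in [0, cap)
```

Usage is `call_with_backoff(lambda: client.converse(**request))`. When the final attempt still throttles, the exception propagates, which is the point at which to queue the work or fall back to a smaller model.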
| Capacity model | What it gives you | Billing | Reach for it when |
|---|---|---|---|
| On-demand (single region) | Pay-per-token, zero commitment, simplest | Per input/output token | Default for almost everything; dev, low/medium volume |
| Cross-region inference profile | Higher throughput ceiling, throttle smoothing, same code | Per token (same as on-demand) | Production default — more headroom for free |
| Provisioned Throughput | Reserved capacity, predictable latency/throughput | Hourly per model unit (± 1/6-mo commit) | Sustained high volume or hard latency SLA |
| Batch | ~50% cheaper, async, throughput-oriented | Per token at ~50% discount | Any non-interactive / no-deadline workload |
A Claude deployment without evals is a deployment you cannot safely change. Without observability it is one you cannot debug or cost-attribute. Both are production requirements, not nice-to-haves — and both are where prototype-grade projects quietly rot.
Start with a golden eval set: a curated collection of representative inputs paired with known-good outputs or graded criteria, covering the easy majority, the hard slice, and the adversarial edges (prompt injection, off-topic, malformed input). This set is what lets you answer the questions that otherwise become arguments — "is Sonnet good enough here, or do we need Opus?", "did changing the system prompt help or hurt?", "is the cheaper model safe to route this traffic to?" You grade outputs with a mix of deterministic checks (exact match, schema validity, regex, did-it-call-the-right-tool) and model-graded scoring (an LLM-as-judge rubric for open-ended quality), and you run the set on every prompt change, model change, and routing change. Bedrock's built-in model-evaluation tooling can host automatic and human-in-the-loop evaluations, but the discipline matters more than the tool: the point is that a quality regression shows up in a number before it shows up in a user complaint.
Evals are also how you make the model-selection decision from Section II rigorously instead of by vibe. Run the same golden set against Haiku, Sonnet, and Opus, look at where the cheaper tiers actually fail, and let the failure pattern define your routing rules. This converts "which model should we use?" from an opinion into a measurement, and it is the only honest way to claim a cheaper model is "good enough."
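The deterministic half of such a harness is small. The sketch below is one possible shape, not a prescribed one: the case format, the check helpers, and the `generate` callable (a function that takes an input and returns the model's text) are all hypothetical, and model-graded scoring is left out.

```python
import json
import re


def check_exact(expected):
    """Pass when the trimmed output equals the expected string."""
    return lambda output: output.strip() == expected


def check_regex(pattern):
    """Pass when the pattern appears anywhere in the output."""
    return lambda output: re.search(pattern, output) is not None


def check_json_keys(required):
    """Pass when the output is a JSON object containing every required key."""
    def _check(output):
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            return False
        return isinstance(data, dict) and set(required) <= set(data)
    return _check


def run_eval(cases, generate):
    """Run every case through generate() and report pass rate and failing IDs."""
    failures = []
    for case in cases:
        output = generate(case["input"])
        if not all(check(output) for check in case["checks"]):
            failures.append(case["id"])
    passed = len(cases) - len(failures)
    return {"pass_rate": passed / len(cases) if cases else 0.0, "failures": failures}
```

Running `run_eval` once per tier with a different `generate` for each gives you the failing case IDs for each model, and those lists are the raw material for routing rules.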
Observability is the runtime counterpart. At minimum, log on every call: the model ID, the token counts from the `usage` block (input, output, cache read/write), latency (time-to-first-token and total), `stopReason`, whether a Guardrail fired, and a request ID you can trace. CloudWatch captures Bedrock invocation metrics and you can enable model-invocation logging to land full request/response records in S3 or CloudWatch Logs for audit and debugging. Those token counts are also your cost ledger — per-feature and per-customer cost attribution comes directly from summing `usage` by tag, and without it you will not be able to answer "which feature is driving the bill" when finance asks. Watch error and throttling rates, p50/p95/p99 latency, cache-hit ratio, and Guardrail intervention rate as standing dashboards; each one maps directly to a lever in this guide.
Cost: tokens by model, by feature, by customer; cache-hit ratio; batch vs on-demand mix. This is where overspend hides.
Latency: time-to-first-token and total, p50/p95/p99, per model. Regressions here are usually a model or token-budget change.
Reliability: error and throttling rates, retry counts, fallback activations. Spikes signal a capacity problem before users feel it.
Safety/quality: Guardrail intervention rate, grounding-check failures, and the latest golden-eval score. A drift in any of these is your early warning.
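The per-call log record and the cost roll-up behind those dashboards can be sketched as below. The record field names and the `feature` / `customer` tags are illustrative. The sketch assumes a non-streaming Converse response that exposes `usage`, `metrics.latencyMs`, and boto3's `ResponseMetadata.RequestId`. Time-to-first-token would have to be measured separately on the streaming path.

```python
from collections import defaultdict


def call_record(resp, model_id, feature, customer):
    """Flatten one Converse response into a loggable, taggable record."""
    usage = resp.get("usage", {})
    stop = resp.get("stopReason")
    return {
        "request_id": resp.get("ResponseMetadata", {}).get("RequestId"),
        "model_id": model_id,
        "feature": feature,      # hypothetical tag for cost attribution
        "customer": customer,    # hypothetical tag for cost attribution
        "stop_reason": stop,
        "guardrail_fired": stop == "guardrail_intervened",
        "latency_ms": resp.get("metrics", {}).get("latencyMs"),
        "input_tokens": usage.get("inputTokens", 0),
        "output_tokens": usage.get("outputTokens", 0),
        "cache_read_tokens": usage.get("cacheReadInputTokens", 0),
        "cache_write_tokens": usage.get("cacheWriteInputTokens", 0),
    }


def tokens_by(records, key):
    """Sum input and output tokens per value of a tag such as 'feature' or 'customer'."""
    totals = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0})
    for rec in records:
        bucket = totals[rec[key]]
        bucket["input_tokens"] += rec["input_tokens"]
        bucket["output_tokens"] += rec["output_tokens"]
    return dict(totals)
```

Multiplying the summed token counts by the current per-token price for each model turns the same records into the per-feature and per-customer cost ledger.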
The point of the previous sections is that the levers multiply, not add. Walking one workload from the naive build to the optimized build makes the magnitude concrete — and shows why two teams running "the same" Claude feature can have bills an order of magnitude apart.
Take a representative pipeline: a million document-processing calls a month, each with a large fixed instruction-and-schema prefix and a variable document body, none of it latency-critical. The naive build runs every call on the top model, synchronously, sending the full prefix every time. The optimized build does three things — routes the workload to the right (smaller) model after an eval confirms quality holds, caches the stable prefix so the repeated input tokens bill at a fraction, and submits the whole thing as a batch job at the ~50% discount.
Each lever lands a multiplier on the relevant token segment: model right-sizing can swing the per-token rate by 5–60× depending on how far down the family you move; prompt caching can take 80–90% off the cached-prefix portion of input tokens, which in a large-fixed-prefix workload is most of the input; batch takes a flat ~50% off everything. Because they apply to different parts of the bill and stack, the combined effect on a prefix-heavy, non-interactive workload is routinely an order of magnitude or more — without changing the output quality on that workload, because the eval is what licensed the model choice in the first place.
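To see how the multipliers combine, here is a small cost model that takes the stacking described above at face value. Every number in it is a hypothetical placeholder (token counts, per-million-token prices, a 90% cached-prefix discount, a 50% batch discount), and it ignores the cache write premium, so treat it as arithmetic scaffolding and not as a price quote.

```python
def monthly_cost(calls, prefix_tokens, body_tokens, output_tokens,
                 in_price, out_price, cached_prefix_discount=0.0, batch_discount=0.0):
    """Monthly cost in dollars; prices are dollars per million tokens."""
    prefix_cost = prefix_tokens * in_price * (1.0 - cached_prefix_discount)
    body_cost = body_tokens * in_price
    output_cost = output_tokens * out_price
    per_call = (prefix_cost + body_cost + output_cost) / 1_000_000
    return calls * per_call * (1.0 - batch_discount)


# Hypothetical workload: 1M calls, 4,000-token fixed prefix, 1,000-token body, 300-token output.
WORKLOAD = dict(calls=1_000_000, prefix_tokens=4_000, body_tokens=1_000, output_tokens=300)

# Naive build: hypothetical top-tier prices, no caching, no batch.
naive = monthly_cost(**WORKLOAD, in_price=15.0, out_price=75.0)

# Optimized build: hypothetical small-tier prices, cached prefix, batch discount.
optimized = monthly_cost(**WORKLOAD, in_price=1.0, out_price=5.0,
                         cached_prefix_discount=0.9, batch_discount=0.5)

reduction = naive / optimized
```

Swapping in your own measured token counts and the current list prices shows which lever dominates for a given workload. Output-heavy workloads, for example, gain little from prefix caching.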
The discipline this implies is simple to state and easy to skip: measure tokens per request and split your traffic into "interactive and quality-critical" versus "batchable and bounded." Spend frontier money only on the first bucket, and only on the slice of it that evals prove needs frontier capability. Push the second bucket through caching and batch on the smallest model that passes. The single most common production cost pathology is the opposite: one synchronous code path, the largest model, and the full prompt every time, applied uniformly to traffic that was 90% trivial. That pathology is what turns a workload that should cost low four figures a month into one that costs five figures.
A final note on credits. Early-stage AWS spend on Bedrock is frequently fundable — AWS Activate credits, generative-AI program credits, and Bedrock proof-of-concept funding can cover a substantial portion of inference cost during the build-and-validate phase, which is exactly when you are running the most experiments and have the least revenue. The optimization discipline above and the funding path are complementary: credits buy you runway to get the architecture right, and the architecture keeps the bill sane once the credits are spent.
List prices step up several-fold per tier, so this table is as much a cost decision as a capability one. Read it as "what is the cheapest tier that clears the quality bar for this workload?" and let your eval set, not intuition, draw the line.
| Dimension | Claude Haiku | Claude Sonnet | Claude Opus |
|---|---|---|---|
| Position | Fast, cheap workhorse | Balanced default | Frontier reserve |
| Relative token cost | Lowest (≈1×) | Mid (several× Haiku) | Highest (several× Sonnet) |
| Relative speed | Fastest | Fast | Slowest of the three |
| Best-fit work | Classification, routing, extraction, bulk | Agents, coding, RAG, everyday reasoning | Hard reasoning, complex planning, research |
| Share of typical traffic | High-volume simple slice | The default majority | The thin hard slice |
| Default verdict | Downshift target | Start here | Escalation target only |
Situation: The customer had a working prototype in which every support message hit the top Claude model synchronously, with a 6,000-token system prompt and the full knowledge-base context re-sent on every turn. It worked in the demo but projected to roughly $30K/month at their forecast volume, which was untenable on a Series-A budget. They also had no Guardrails (a compliance risk for a regulated vertical), no evals, and no cost visibility per customer.
What CloudRoute did: Routed within 24 hours to a vetted AWS partner with a Bedrock + GenAI track record. The partner moved the integration to the Converse API, set Sonnet as the default with a Haiku classifier in front for the trivial intents and an Opus path for escalations, added prompt caching on the stable system-prompt-plus-tools prefix, moved overnight ticket-summarization to a batch job, attached Bedrock Guardrails (PII masking + denied topics + contextual grounding for the RAG answers), and stood up a golden eval set plus CloudWatch dashboards for tokens, latency, and Guardrail rate. The partner also filed for AWS Bedrock POC + Activate credits to fund the build.
Outcome: Projected monthly inference cost fell from ~$30K to under $4K at the same volume — a blended-model fleet, caching on the heavy prefix, and batch on the async work compounding to roughly an 8× reduction with no measured quality drop on the eval set. Guardrails closed the compliance gap; the eval set made every subsequent model/prompt change a measured decision. AWS credits covered the build-and-validate phase, so CloudRoute's commission was paid by the partner from AWS engagement funding — the customer paid $0.
engagement window: 7 weeks · est. monthly cost: $30K → <$4K · quality delta on evals: ~neutral · cost to customer: $0
CloudRoute routes you to a vetted AWS partner who does the Converse-API build, model routing, caching, Guardrails, and evals — and files for the AWS credits that often fund it. Customer pays $0. No procurement theater.