A complete, neutral reference for running Anthropic's Claude Sonnet on Amazon Bedrock in 2026: why Sonnet is the model most production traffic should default to (the quality/speed/cost balance), its model ID and how to enable access, Sonnet's own per-token pricing with the Batch and prompt-caching levers, the exact decision boundary for when to reach up to Opus or down to Haiku, the capabilities that matter on Sonnet (vision, tool use, long context, prompt caching, extended thinking), the production use cases where Sonnet lives, concrete cost tips — and how AWS credits make Sonnet usage $0.
Claude Sonnet is the middle of Anthropic's three-tier Claude family on Amazon Bedrock, sitting between Opus (the most capable, most expensive tier) and Haiku (the fastest, cheapest tier). It is engineered to be the model you reach for first: strong enough for the large majority of production work, fast enough for interactive use, and priced low enough to run at scale.
If you take only one thing from this page: Sonnet is the sensible default. The discipline that governs all Bedrock cost is "match the model to the task," and for most tasks the right match is Sonnet. It clears the quality bar for RAG assistants, customer-support agents, content generation, coding help, document analysis, and the reasoning behind most agents — while costing a fraction of Opus and responding fast enough to sit in a live request path. The two other tiers are the exceptions you reach for deliberately: Opus when a request is genuinely hard, Haiku when a request is genuinely easy and high-volume. Sonnet is the broad middle where most real traffic should live.
Sonnet runs on Bedrock exactly like every other foundation model there — behind Bedrock's single managed Converse API, governed by AWS IAM, callable over private networking with VPC endpoints, encryptable with your own KMS keys, and audited in CloudTrail. Your prompts and responses stay in your AWS account and the region you choose, and your inputs and outputs are not used to train the base model. This page focuses on Sonnet specifically; for the full Claude-on-Bedrock picture — the whole family, the deep "why Bedrock vs the Anthropic API direct" argument, and the shared mechanics — see the claude-on-amazon-bedrock sibling.
The reason Sonnet works so well as a default is that the route-and-escalate pattern is cheap to build on Bedrock. Because switching tiers is a one-line change to the model ID on the same Converse API, you can start every request on Sonnet and either drop the easy ones to Haiku or escalate the hard ones to Opus without rewriting anything. Sonnet anchors that pattern: it is the baseline the cheaper and more expensive tiers are measured against.
One caveat, stated once and meant throughout: exact Claude Sonnet version names, model IDs, regional availability, context-window size, and per-token prices all change as Anthropic ships new Sonnet generations and AWS updates Bedrock. The figures and identifiers here are representative as of 2026 to convey the balance and relative cost. Always confirm the current Sonnet model ID in the Bedrock model catalog and current rates on the AWS Bedrock pricing page before you build or budget.
Opus = deepest reasoning, highest cost — the escalation target for hard problems. Sonnet = the balanced workhorse — your sensible default, where most production traffic belongs. Haiku = fast and cheap — the drop-down target for easy, high-volume work. Switching between them is a one-line model-ID change, so Sonnet-as-default with selective escalation is the standard pattern.
Sonnet earns the "workhorse" label because it is the tier where the three things you actually trade off — answer quality, response latency, and price per token — meet at a point that fits the widest range of production workloads. None of the three is the absolute best in the family, and that is exactly the point.
Think of the Claude family as a single dial from cheap-and-fast to deep-and-expensive. Haiku is at one end, Opus at the other, and Sonnet is the deliberately-chosen middle. What makes the middle valuable is that most real work does not need the extreme of either end. A customer-support answer, a RAG response grounded in retrieved documents, a code suggestion, a document summary — these need solid reasoning and reliable instruction-following, which Sonnet provides, but they rarely need Opus-grade depth, and they usually need more nuance than you would trust to the cheapest tier.
Sonnet delivers strong reasoning, reliable instruction-following, solid coding, and good long-document comprehension. It is not the deepest reasoner in the family — that is Opus — but the gap only matters on genuinely hard requests. For the broad middle of production work, Sonnet's quality clears the bar, and the right move is to spend the Opus premium only where a request actually exceeds Sonnet's reach. The practical test: if Sonnet answers a representative sample of your requests correctly, it is the right default and you escalate the exceptions, rather than paying Opus rates for everything to cover a minority of hard cases.
Sonnet is fast — faster than Opus, slower than Haiku — and crucially fast enough to sit in an interactive, user-facing path (chat, support, coding assistance) without feeling sluggish. With streaming (the converse_stream variant of the Converse API) the first tokens arrive quickly, which is what users actually perceive. Where Haiku's lower latency genuinely matters — very high-throughput pipelines, real-time triage at scale — that is a signal to drop those specific requests to Haiku, not to abandon Sonnet for the rest.
This is the lever that makes Sonnet the default. As a representative 2026 reference, Sonnet sits near $3 per million input tokens and $15 per million output tokens on-demand — roughly 10× cheaper than Opus-class pricing and roughly 10× more expensive than Haiku. Because output is priced several times higher than input across the family, concise outputs save real money on Sonnet just as they do elsewhere. The order-of-magnitude step on each side of Sonnet is the entire economic case for tiered routing: every request you correctly send to Haiku instead of Sonnet, and every request you avoid sending to Opus when Sonnet suffices, compounds across production volume.
Before you can call Sonnet on Bedrock you have to request model access in your account — foundation models are off by default — and you have to invoke Sonnet by its model ID. Both are quick; neither costs anything until you actually call the model.
Enabling access. In the Bedrock console, open Model access, find Claude Sonnet, and request access. For Sonnet this is typically granted effectively immediately (some Claude models prompt for brief use-case details). There is no charge for enabling access — you pay only when you invoke. Access is per-account and per-region, so enable Sonnet in each region you will call from. If you want Bedrock to spread Sonnet traffic across a set of regions for better availability and throughput, use a cross-region inference profile (see the amazon-bedrock-pricing sibling for how that interacts with rates, and the cross-region-inference page in the cluster for the mechanics).
The Sonnet model ID. Sonnet is invoked by a model ID — a string identifying the provider, model, and version, namespaced under Anthropic (of the shape anthropic.claude-…sonnet… with a version suffix). You pass this ID to the Converse API to select Sonnet for a request; moving that same request to Haiku or Opus is just a change of the model-ID string. Because the ID advances with each Sonnet generation, do not hard-code a guessed value — read the current Sonnet ID from the Bedrock model catalog in the console (or list it via the API/CLI) and treat it as configuration rather than a literal in your code. Storing the three tier IDs (Haiku, Sonnet, Opus) as config is what makes route-and-escalate a one-line decision at runtime.
Permissions. The IAM principal making the call needs permission for the Bedrock invoke actions on the Sonnet model ARN (and, if you use a cross-region inference profile, permission on the profile). A least-privilege policy scoped to the specific Sonnet ARN you intend to call is the recommended posture; widen it to include the Haiku and Opus ARNs only if you actually route across tiers. Once access is granted and IAM is in place, you are ready to call Sonnet.
Sonnet on Bedrock is billed per token: a rate per million input tokens (everything you send) and a higher rate per million output tokens (everything Sonnet generates), with output priced several times higher than input. The headline on-demand rate is only the starting point — two levers, Batch and prompt caching, can cut the effective rate substantially.
As a representative 2026 reference, Sonnet's on-demand rate sits near $3 per million input tokens and $15 per million output tokens — output about 5× input, which is why output length is a real cost lever. The table below places Sonnet against the tiers on either side so you can see the order-of-magnitude steps that make tiered routing pay; treat it as relative comparison, not an audited price sheet. Beyond the table sit two levers that change Sonnet's effective cost: Batch — submit non-interactive Sonnet work as an asynchronous job for roughly half the on-demand price — and prompt caching — stop re-paying full input price for a repeated prefix such as a long system prompt, instruction set, reference document, or tool definitions. On a Sonnet chatbot or RAG service with a large fixed context, caching can remove a large fraction of the input bill. For the full pricing model, including how Provisioned Throughput and cross-region inference interact with these, see the amazon-bedrock-pricing sibling.
| Claude tier | Input / 1M | Output / 1M | Relative to Sonnet | Cost position |
|---|---|---|---|---|
| Claude Haiku | $0.25 | $1.25 | ~10× cheaper | Drop-down: high-volume / fast |
| Claude Sonnet | $3.00 | $15.00 | — (the default) | The workhorse baseline |
| Claude Opus-class | $15.00 | $75.00 | ~5× dearer | Escalation: hardest reasoning |
1) Tier discipline — keep Sonnet as the default; drop easy requests to Haiku, escalate only hard ones to Opus. 2) Batch — run non-interactive Sonnet work async for ~50% off. 3) Prompt caching — cache the fixed prefix (system prompt, reference docs, tool defs) so you stop re-paying input price for it. Stack all three and Sonnet's effective rate drops well below its on-demand sticker.
The single most valuable skill in using the Claude family well is knowing the decision boundary around Sonnet: when Sonnet is the right default, when a request is hard enough to justify Opus, and when a request is easy enough to hand to Haiku. Getting this boundary right is what turns the family into a cost-efficient system instead of an expensive default.
Start every request on the assumption that Sonnet is the answer, then look for a specific reason to move off it in either direction. The reasons to move are concrete, not vibes — they are about the nature of the request, not a general preference for "the best model." Below are the boundaries in both directions.
Escalate from Sonnet to Opus when a request needs depth Sonnet cannot reliably reach: deep multi-step reasoning (long chains of dependent logic), complex analysis (synthesizing many sources, reconciling conflicting information), hard coding and refactoring (large, intricate changes across a codebase), research-style synthesis, and high-stakes agentic steps where a wrong action is expensive to undo. The test is empirical: if Sonnet's answers on a class of requests are wrong or shallow often enough to matter, that class is an escalation candidate. The discipline is to escalate the class of hard requests, not to promote everything to Opus — Opus is roughly 5× Sonnet's rate, so blanket use throws money at requests that did not need it. Many systems escalate dynamically: try Sonnet, and only on a low-confidence or failed-validation signal retry the same request on Opus. See the claude-opus-on-amazon-bedrock sibling for where Opus earns its premium.
Drop from Sonnet to Haiku when a request is simple enough that Sonnet is overkill and volume or latency make the savings worth it: classification, routing and triage, data extraction from structured-ish text, short-form generation, real-time chat where speed dominates, the cheap first stage of a tiered router, and bulk processing (especially via Batch). The test here is also empirical: if Haiku clears your quality bar on a class of requests, run that class on Haiku and pocket the ~10× saving. The mistake in this direction is leaving high-volume easy work on Sonnet out of caution — at scale that is the single most common source of avoidable Claude spend. See the claude-haiku-on-amazon-bedrock sibling for the easy-and-fast end of the family.
Everything that is neither genuinely hard nor trivially easy — which is most production work — stays on Sonnet. RAG answers, support replies, content drafts, routine coding help, document analysis, the reasoning inside most agents: this is Sonnet's home. The goal is not to minimize Sonnet usage but to make sure each request is on the right tier; Sonnet ends up carrying the majority precisely because the majority of requests fall in its band.
Sonnet is not just strong text-in/text-out — it carries the capabilities that make modern GenAI applications possible, and several of them double as cost or quality levers. Availability of any specific capability can vary by Sonnet version, so confirm specifics for the exact Sonnet model you enable.
Sonnet can accept images alongside text in a request and reason about them — reading charts and diagrams, extracting fields from screenshots and scanned documents, interpreting photos, and answering questions about visual content. Because Sonnet pairs solid vision with workhorse pricing, it is often the right tier for production document-understanding and visual-QA at volume, where Opus-grade vision would be more than the task needs. This collapses a class of OCR-plus-vision pipelines into a single Converse call.
Sonnet supports tool use: you describe tools (functions, APIs, database queries) and Sonnet decides when to call them and with what arguments, then folds the results into its answer. This is the foundation of agentic systems, and Sonnet's balance of reasoning and cost makes it the usual default reasoning engine behind Bedrock Agents — capable enough to plan and call tools reliably, cheap enough to run an agent loop (which can issue many model calls per task) without the cost spiraling. On Bedrock it is exposed through the Converse API's tool fields.
Sonnet offers a large context window — room for long documents, large chunks of a codebase, extended conversation history, and many retrieved passages in a single request. Long context simplifies RAG and document workflows: you can fit more relevant material in one call rather than over-engineering retrieval. It is also a cost dimension, since input is billed per token — a big context costs more on Sonnet just as on any tier, which is exactly where prompt caching earns its keep on Sonnet.
When many Sonnet requests share a large common prefix — a long system prompt, a fixed instruction set, a reference document, tool definitions — prompt caching lets Bedrock cache that prefix so later requests are not billed full input price for it again. On a Sonnet-powered chatbot or RAG service with a large fixed context, this removes a large fraction of the input bill, and it is one of the most effective ways to lower Sonnet's effective rate. See the amazon-bedrock-pricing sibling for how caching shows up on the bill.
Newer Sonnet generations support extended thinking — an explicit mode in which the model spends additional internal reasoning steps on a hard problem before answering, lifting quality on difficult math, multi-step analysis, and tricky coding. You can typically control the thinking budget. On Sonnet this is a useful middle path: for a request that is borderline-hard, turning on extended thinking can lift Sonnet over the bar without escalating all the way to Opus — though it trades some latency and output cost, so reserve it for the requests that need it rather than enabling it globally.
The clearest way to see why Sonnet is the default is to look at the production workloads that land on it. These are the cases where Sonnet's quality is enough, its latency fits a live path, and its cost is low enough to run at scale — the broad middle of real GenAI applications.
Sonnet is already the cost-efficient default, but a handful of concrete practices compound to lower its effective rate substantially. None require a different model — they are about how you call Sonnet and what you route to it.
These are the levers, in roughly the order of impact for a typical production system. The first is about tier discipline around Sonnet; the rest are about calling Sonnet efficiently once a request is correctly on it.
Everything above prices Sonnet if you pay AWS directly. For most startups and many companies the relevant number is different — AWS will frequently fund the build with credits, and Sonnet usage on Bedrock draws those credits down before it ever touches your card. This is the same tie-in that makes Bedrock beat the direct Anthropic API for a funded team: credits apply to Sonnet on Bedrock; they do not apply to the direct API.
Sonnet inference on Bedrock is ordinary AWS spend, so it is fully credit-eligible and credits apply automatically against your bill until exhausted — covering Sonnet tokens, any Batch and prompt-caching usage, plus the supporting services (Knowledge Bases, vector store, S3, logging) around it. The relevant pools: AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups); a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed at proving out a GenAI use case; and the competitive Generative AI Accelerator (awards up to $1M for a small cohort of AI-first startups). For a Sonnet-based product — a RAG assistant, a support agent, a content engine — these pools comfortably cover inference through the build and early scale.
The practical mechanic is that most of these pools are partner-filed — requested through the AWS Partner Network (the ACE program), not a public self-serve form — which is why teams route through an AWS partner rather than applying alone. That is the gap CloudRoute fills. CloudRoute matches you to the right credit pool for your stage and to a vetted AWS DevOps/ML partner who both files the credit application and helps build the Sonnet workload — the tiered router that keeps Sonnet as the default while dropping to Haiku and escalating to Opus, the RAG pipeline behind Knowledge Bases, the agent with tool use, prompt caching on the fixed context. The customer pays $0 — AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice.
Put together with the routing, caching, and Batch levers above, the picture for a startup is: keep Sonnet as the default, route each request to the tier it actually needs, cache the repeated context, and run the whole thing on a $25K–$100K (or larger) credit pool while you find product-market fit — paying real money only once usage, and ideally revenue, has scaled past the credits. Related: AWS credits for generative-AI startups and Bedrock POC funding for the full credit mechanics.
The core decision in one place, anchored on Sonnet: the three Claude tiers compared on intelligence, speed, cost, and the work each is suited to. Default to Sonnet; reach up to Opus for hard requests and down to Haiku for easy high-volume ones. Representative 2026 figures for relative comparison, not quotes.
| Tier | Intelligence | Speed | Relative cost (input/1M) | Role vs Sonnet | Best for |
|---|---|---|---|---|---|
| Claude Haiku | Good | Fastest | ~$0.25 (~10× cheaper) | Drop down to it | High-volume, latency-sensitive, simple tasks; tier-1 of a router; Batch |
| Claude Sonnet | Strong | Fast | ~$3 (the default) | The baseline | The production default: RAG, agents, support, coding, content, document analysis |
| Claude Opus-class | Deepest | Moderate | ~$15 (~5× dearer) | Escalate up to it | Hardest reasoning, complex analysis, high-stakes agentic steps |
Situation: Their in-product assistant (RAG over customer documents plus a few tools) was already built on Claude, but every request — easy classification, routine Q&A, and the occasional hard analysis alike — was hitting the most capable, most expensive tier. The bill was climbing with usage and paid out of runway. They were already an AWS customer for the rest of the stack and wanted to bring the assistant under AWS billing and stop overpaying per request.
What CloudRoute did: CloudRoute matched them in under 24 hours to a US-East AWS partner with GenAI experience. The partner (1) made <strong>Sonnet the default tier</strong> for the assistant; (2) dropped the easy, high-volume requests (classification, routing, short extractions) to Haiku and reserved Opus for the genuinely hard analysis via a confidence-based escalation; (3) turned on prompt caching for the long fixed system prompt and Batched the overnight document processing; and (4) filed a Bedrock POC credit application plus an Activate Portfolio application to fund the workload.
Outcome: With Sonnet as the default plus selective Haiku/Opus routing, caching, and Batch, the modeled per-request cost dropped substantially versus running everything on the top tier — but the decisive change was that the spend now draws down AWS credits instead of runway, so the team pays $0 during the build and early scale. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.
default tier: Sonnet · routing: Haiku for easy, Opus for hard · levers: caching + Batch · credits secured: POC + Activate · out-of-pocket: $0
The direct Anthropic API bills your card; Claude Sonnet on Bedrock draws down AWS credits — under your existing IAM, VPC, and billing. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS partner who makes Sonnet the default, builds the Haiku/Opus routing, and turns on caching. Customer pays $0.