A neutral reference for running Anthropic's smallest Claude — Haiku — on Amazon Bedrock in 2026: what Haiku is and where it sits below Sonnet and Opus, the model ID and how to enable access, the lowest per-token pricing in the Claude family, the latency profile that makes it feel near-instant, the jobs Haiku is genuinely right for (high-volume classification, routing and triage, simple chat, sub-agents) versus the ones where you should pay up for Sonnet or Opus, and the two levers — prompt caching and Batch — that cut Haiku's already-low cost further. Plus how AWS credits make Haiku usage $0.
Claude Haiku is the fast, low-cost tier of Anthropic's Claude family on Amazon Bedrock. It is the smallest of the three Claude tiers, tuned so that the trade-off it makes — slightly less raw intelligence than its larger siblings in exchange for much lower cost and much lower latency — is the right trade for a very large share of real production traffic.
The Claude lineup on Bedrock is a ladder of three deliberate trade-offs. Opus is the most capable tier, built for the hardest reasoning and complex agentic work where quality dominates and cost is secondary. Sonnet is the balanced workhorse — strong reasoning at a fraction of Opus cost and latency, the sensible default for most real work. Haiku is the fast tier at the bottom of the ladder: the cheapest to run, the quickest to respond, and the one you reach for when a task is high-volume, latency-sensitive, or simple enough that Haiku-grade quality clears the bar. This page is about that bottom rung specifically — the others have their own pages (see the claude-sonnet-on-amazon-bedrock sibling and the flagship claude-on-amazon-bedrock).
The key mental model is that Haiku is not "the weak Claude" — it is the efficient Claude. For a great many tasks the extra intelligence of Sonnet or Opus is simply wasted: classifying a support ticket into one of eight categories, deciding which downstream model should handle a request, pulling three fields out of a form, answering a simple FAQ. On those jobs Haiku produces effectively the same answer as a frontier tier for a fraction of the price and latency — paying for Opus there is like chartering a freight jet to deliver a letter.
Because Haiku is served through the same Amazon Bedrock API as every other model, it inherits Bedrock's whole envelope without extra work: IAM auth, VPC/PrivateLink, KMS, CloudTrail, consolidated billing on your AWS invoice, region-level data residency, and — the part that matters most to a funded startup — AWS-credit eligibility. You build once against Bedrock and choose Haiku per request by setting a model ID.
One caveat, meant throughout: exact Claude version names, model IDs, regional availability, context-window sizes, and per-token prices all change as Anthropic ships new Haiku generations and AWS updates Bedrock. Everything here is representative as of 2026 to convey the tier's structure and relative economics — confirm the current Haiku model ID in the Bedrock model catalog and current rates on the AWS Bedrock pricing page before you build or budget.
Haiku is the cheapest and fastest Claude on Bedrock — the efficient tier, not the weak one. Reach for it on high-volume, latency-sensitive, and simple work, and as tier-1 of a model router. Escalate to Sonnet or Opus only when Haiku misses the quality bar. Switching tiers is a one-line model-ID change.
Before you can call Haiku you have to enable model access in your account — a one-time, no-cost step — and pass the correct model ID on each request. Both are quick, but getting the ID right (rather than guessing it) is what stops a deploy from failing.
Enabling access. In the Bedrock console, open Model access, find the Claude Haiku model you want, and request access. For Haiku this is typically granted effectively immediately. There is no charge for enabling access — you pay only when you actually invoke the model. Access is granted per-account and per-region, so enable Haiku in every region you intend to call from. If you want Bedrock to spread Haiku traffic across regions for better availability and throughput under load — which high-volume Haiku workloads often want — use a cross-region inference profile (see the amazon-bedrock-cross-region-inference sibling).
The model ID. Every model on Bedrock is invoked by a model ID: a string identifying the provider, model, and version, with Claude IDs namespaced under Anthropic (of the shape anthropic.claude-…haiku… with a version suffix). You pass this ID to the API to select Haiku specifically. Because the family shares one API, moving a request from Haiku up to Sonnet or Opus — or the reverse, demoting a request to Haiku to save money — is just a change of the model-ID string. Since IDs advance with each Haiku generation, do not hard-code a guessed value: read the current ID from the Bedrock model catalog in the console, or list it via the API/CLI, and treat it as configuration rather than a literal in your code.
Permissions. The IAM principal needs permission for the relevant Bedrock invoke actions on the Haiku model (and, if you use a cross-region inference profile, permission on the profile) — a least-privilege policy scoped to the specific Haiku model ARNs is the recommended posture. Once access and IAM are in place, calling Haiku is identical to calling any Bedrock chat model through the Converse API; you change only the model ID. The flagship claude-on-amazon-bedrock page shows a minimal Converse snippet that applies unchanged to Haiku.
modelId. You are billed only on invocation.The whole argument for Haiku is economic, so the pricing is the heart of this page. Haiku is billed per token like every Bedrock model — a rate per 1,000 input tokens and a higher rate per 1,000 output tokens — but its rates are the lowest of any Claude tier by a wide margin. The gap between Haiku and the larger tiers is what makes model choice the single biggest cost lever on Bedrock.
The table below gives representative 2026 on-demand rates for Haiku set against Sonnet and Opus-class, shown per 1,000 and per 1,000,000 tokens (the per-million column is the per-1K figure × 1,000; providers increasingly quote per-million). Read it to anchor the relative economics — Haiku versus the tiers you might otherwise default to — not as an audited price sheet. The headline is the spread: Haiku input runs on the order of a quarter of a dollar per million tokens while Opus-class input runs in the tens of dollars, roughly a 60× difference for the same unit of input. Two further levers, covered in section VII and not shown here, push Haiku lower still: prompt caching (stop re-paying input price for a repeated prefix) and Batch (roughly half price for non-interactive async work).
What that spread means in practice: a workload that costs real money on Opus can cost almost nothing on Haiku. If you run one million short classification requests a month, the difference between doing them on Haiku and doing them on a frontier tier is the difference between a rounding error and a line item your finance team asks about. This is why high-volume systems are built Haiku-first and only escalate the requests that truly need more.
| Claude tier | Input / 1K | Output / 1K | Input / 1M | Output / 1M | Cost position |
|---|---|---|---|---|---|
| Claude Haiku | $0.00025 | $0.00125 | $0.25 | $1.25 | Cheapest — this page |
| Claude Sonnet | $0.003 | $0.015 | $3.00 | $15.00 | ~12× Haiku input |
| Claude Opus-class | $0.015 | $0.075 | $15.00 | $75.00 | ~60× Haiku input |
Haiku earns its place on tasks where the work is high-volume, latency-sensitive, or simple enough that its quality clears the bar — which, in most production systems, is a surprisingly large fraction of total traffic. These are the workloads to put on Haiku by default.
The pattern across all of these is the same: the task is well-defined and repetitive, the volume is high, the per-request value is low, and a fast, good-enough answer beats a slow, perfect one. Those are exactly the conditions under which Haiku's cost and latency advantages compound.
A neutral reference has to say where Haiku is the wrong choice. The efficient tier is not the universal tier: on genuinely hard work, reaching for Haiku to save money is a false economy that costs you in errors, retries, and rework. Here is where to escalate.
The deciding question is never "what is cheapest?" in isolation — it is "what is the cheapest tier that clears the quality bar for this task?" For the jobs in section IV that tier is Haiku. For the jobs below it is not, and forcing Haiku onto them is the most common way teams get a bad first impression of small models. Escalate when:
Use the cheapest tier that clears the quality bar. Start on Haiku; if it misses on a class of requests, escalate that class to Sonnet or Opus — not the whole workload. Most systems end up with the bulk on Haiku and a minority escalated, which is exactly the cost-optimal shape.
Cost is half of Haiku's appeal; latency is the other half. Being the smallest Claude, Haiku is also the fastest — it starts responding sooner and finishes sooner than Sonnet or Opus on the same prompt. For interactive and high-throughput workloads, that speed is a feature in its own right.
Two things drive perceived speed: time-to-first-token (how quickly the model starts replying) and tokens-per-second (how fast it streams once started). Haiku leads the family on both, which is why a Haiku-backed chat assistant feels snappy and responsive where a frontier-tier one can feel deliberate. For user-facing chat, that responsiveness often matters more to the experience than a marginal quality difference the user never notices.
Latency also compounds in agentic and pipeline systems. An agent that takes ten internal steps pays the latency of each step in series; running those steps on Haiku rather than a slower tier can be the difference between an agent that responds in a couple of seconds and one that visibly stalls. The same logic applies to any multi-stage pipeline — the faster the per-stage model, the lower the end-to-end latency. Pairing Haiku's speed with streaming (the converse_stream variant) makes interactive responses feel immediate because output appears token-by-token as it is generated.
For non-interactive bulk work, latency per request matters less — there Batch is the right tool, trading immediacy for roughly half the price (section VII). The clean split is: use Haiku's low latency where a human or another service is waiting; use Batch where nothing is waiting and throughput-per-dollar is all that counts.
Haiku is already the cheapest Claude, but two Bedrock levers stack on top of its low rate and routinely cut the effective cost of a high-volume Haiku workload by a large fraction again. Both apply to Haiku exactly as they do to the larger tiers.
Most high-volume Haiku workloads send the same large prefix on every request — a long system prompt, a fixed instruction set with examples, a reference document, or a set of tool definitions — and then a small amount of per-request input. Prompt caching lets Bedrock cache that shared prefix so subsequent requests are not billed full input price for it again. Because the cached prefix is often the bulk of the input tokens on short Haiku calls, caching can cut the input portion of the bill dramatically on classification, routing, and chat workloads with a fixed framing. It is one of the highest-return changes you can make to a Haiku system; the amazon-bedrock-prompt-caching sibling covers the mechanics.
When the work does not need an immediate answer — labelling a backlog, enriching a dataset, processing overnight queues, large-scale extraction — submit it as a Batch job rather than calling the model request-by-request. Batch runs the work asynchronously at roughly half the on-demand price. Stacked on Haiku's already-low rate, Batch makes bulk processing extraordinarily cheap: the lowest-cost tier at half price. The trade is latency — Batch is for throughput, not interactivity — so use it precisely where nothing is waiting on the result. See the amazon-bedrock-batch-inference sibling.
The levers compose. A backlog-labelling job can run on Haiku + Batch + prompt caching simultaneously: the cheapest model, at half price, not re-paying for the shared instruction prefix on every record. For interactive Haiku chat, you cannot use Batch (someone is waiting) but you can and should use prompt caching on the fixed system prompt. The discipline is to apply each lever wherever it fits — and on a high-volume Haiku workload, both usually fit somewhere.
Everything above prices Haiku if you pay AWS directly. For most startups and many companies the relevant number is different, because AWS will frequently fund the build with credits — and Haiku usage on Bedrock draws those credits down before it ever touches your card. Haiku is already cheap; on credits it is free during the build.
Haiku inference on Bedrock is ordinary AWS spend, so it is fully credit-eligible and credits apply automatically against your bill until exhausted — covering Haiku tokens, any Batch and prompt-caching usage, and the supporting services around the workload (Knowledge Bases, a vector store, S3, logging). The relevant pools: AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups); a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) for proving out a GenAI use case; and the competitive Generative AI Accelerator (awards up to $1M for a small cohort of AI-first startups). This is the same structural advantage Bedrock has over the direct Anthropic API for a funded team: credits apply to Claude on Bedrock; they do not apply to the direct API.
The practical mechanic is that most of these pools are partner-filed — requested through the AWS Partner Network (the ACE program), not a public self-serve form — which is why teams route through an AWS partner rather than applying alone. That is the gap CloudRoute fills. CloudRoute matches you to the right credit pool for your stage and to a vetted AWS DevOps/ML partner who both files the credit application and helps build the Haiku workload — the high-volume classifier, the tiered router with Haiku at tier-1, the Batch pipeline over your backlog, prompt caching on the fixed prefix. The customer pays $0: AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice.
Put together with the tier-selection discipline above, the picture for a startup is clean: run the high-volume, simple, and latency-sensitive work on Haiku, cache the repeated context, Batch the non-interactive volume, escalate only the genuinely hard requests to Sonnet or Opus — and run the whole thing on a $25K–$100K (or larger) credit pool while you find product-market fit. You pay real money only once usage, and ideally revenue, has scaled well past the credits. Related: AWS credits for generative-AI startups and Bedrock POC funding for the full credit mechanics.
The decision this page exists to inform, in one place: where Haiku is the right tier and where it is not. Match the request to the cheapest tier that clears the bar — for high-volume and simple work that is Haiku; for harder work, escalate. Representative 2026 figures for relative comparison, not quotes.
| Tier | Intelligence | Speed | Relative cost (input/1M) | Reach for it when | Don't use it for |
|---|---|---|---|---|---|
| Claude Haiku | Good | Fastest | ~$0.25 (lowest) | High-volume classification, routing/triage, simple chat, extraction, sub-agents, anything at scale | Hard reasoning, complex coding, high-stakes outputs |
| Claude Sonnet | Strong | Fast | ~$3 (~12× Haiku) | The production default for real work: RAG, agents, support, coding, content | High-volume simple work Haiku already nails |
| Claude Opus-class | Deepest | Moderate | ~$15 (~60× Haiku) | The genuinely hardest reasoning and high-stakes agentic steps | Anything Haiku or Sonnet can do (wasteful) |
Situation: The product classified and routed inbound items at roughly a million requests a month, and every request was hitting a frontier-tier model — paid out of runway, on a separate vendor invoice, and slower than the UX wanted. The work was simple and well-specified (category labelling plus a few extracted fields), so most of that frontier spend was buying intelligence the task never used. They were already an AWS customer for the rest of their stack.
What CloudRoute did: CloudRoute matched them in under 24 hours to a US-East AWS partner with GenAI experience. The partner (1) moved the classifier onto Claude Haiku via Bedrock's Converse API under the team's existing IAM and billing; (2) introduced a tiered router — Haiku handles the bulk, escalating only ambiguous items to Sonnet; (3) turned on prompt caching for the long fixed labelling prompt; (4) moved the non-interactive backlog onto Batch at roughly half price; and (5) filed a Bedrock POC credit application plus an Activate Portfolio application to fund the workload.
Outcome: The classifier now runs on Haiku under the team's AWS account, and the modeled per-request cost fell by well over an order of magnitude (cheaper tier + caching + Batch) while latency improved — but the decisive change was that the spend now draws down AWS credits instead of runway, so the team pays $0 during the build and early scale. The handful of genuinely ambiguous items still escalate to Sonnet. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.
moved: frontier tier → Haiku + selective escalation · levers: prompt caching + Batch · credits secured: POC + Activate · out-of-pocket: $0
Haiku is already the cheapest Claude; on AWS credits it is $0 during the build — under your existing IAM, VPC, and billing, not the direct Anthropic API. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS partner who moves Haiku onto Bedrock, builds the tiered router, turns on prompt caching, and Batches the bulk. Customer pays $0.