Bedrock gives you a catalog of dozens of models from Anthropic, Amazon, Meta, Mistral, Cohere, AI21 and more behind one API. The hard part is no longer access — it is choosing. This guide is the model-selection framework: how to map a task to the right model family, how to reason about the quality/cost/latency/context tradeoffs, why you should measure instead of guess, how to route across tiers to cut cost, when fine-tuning actually pays off, and a decision matrix you can copy by use case.
People conflate two decisions that happen at different layers. Choosing a provider — Bedrock vs OpenAI vs Azure OpenAI vs Vertex — is a platform decision about data residency, contracts, ecosystem and lock-in. Choosing a model is what you do after you are inside Bedrock, with a single API and a single bill, picking which of dozens of models serves a given task.
The distinction matters because the two decisions have very different reversibility. Switching providers is a migration: new SDK, new auth, new data-flow review, sometimes a new procurement cycle. Switching models inside Bedrock is, in the common case, changing one string — the `modelId` you pass to the Converse or InvokeModel API. The request and response shapes are largely unified, so moving a workload from, say, Claude Haiku to Amazon Nova Lite to compare them is the work of an afternoon, not a quarter.
That asymmetry should shape how you behave. Because model choice is cheap to revisit, you should not agonize over the "perfect" first pick, and you should never let a model selection ossify just because it was chosen eighteen months ago. New models land on Bedrock continuously, prices drop, and a model that was the right call last year may now be beaten on every axis by something half the price. The discipline is not "choose perfectly once" — it is "choose reasonably, measure, and keep the door open."
This guide assumes you have already decided to run on Bedrock. If you have not made that call yet — if you are still weighing Bedrock against OpenAI or Vertex — that is the provider decision, and it is covered separately. Here we stay strictly inside Bedrock and answer the question that follows: given this task, which model?
One more framing note. "Best model" is not a property of a model; it is a property of a model-and-task pair under your constraints. A model that is wasteful for ticket classification can be exactly right for a multi-step agent, and a model that is too weak for code generation can be perfect for extracting fields from an invoice. Every section below pushes you back toward the task, because the task is what actually determines the answer.
Bedrock hosts models from several providers, and the catalog is large enough to be paralyzing if you read it as a flat list. Read it instead as a small number of families, each with an internal ladder from small/cheap/fast to large/capable/expensive. You almost never compare across all of them at once — you pick a family for the job, then pick a rung.
The families below are the ones most teams actually reach for in 2026. Within each, the pattern is the same: a smaller model for high-volume or latency-sensitive work, a mid model for the general default, and a top model for the genuinely hard reasoning. Holding that ladder in your head is most of the battle.
The ladder: Claude Haiku (small, fast, cheap) → Claude Sonnet (the balanced default most production chat and RAG runs on) → Claude Opus (the heavy reasoning tier for agents, hard analysis, and complex code).
Reach for it when: the task rewards careful reasoning, long-context comprehension, tool use, or instruction-following under nuance — multi-step agents, code generation and review, document analysis, and chat where answer quality is the product.
Watch: the Opus tier is the most expensive class on Bedrock by token, so it should be earned by the task, not used as a default. Haiku is genuinely capable and is often the right floor for a routing setup.
The ladder: Nova Micro (text-only, very cheap, very fast) → Nova Lite (low-cost multimodal) → Nova Pro (capable multimodal default) → Nova Premier (the most capable Nova for complex tasks and as a distillation teacher).
Reach for it when: cost-per-token is a first-order constraint, when you need multimodal input (text, image, and video understanding) at low cost, or when you want an AWS-native model with tight Bedrock integration. Nova Micro is frequently the cheapest sensible option for classification and routing.
Watch: on the hardest open-ended reasoning, the top Claude tier still tends to lead; Nova's strength is the price-performance curve, not topping every benchmark.
The ladder: smaller instruct variants for fast/cheap inference → larger instruct variants for stronger reasoning, plus very large flagship sizes for the hardest tasks.
Reach for it when: you want an open-weight model for governance or portability reasons, you anticipate moving the same weights to self-hosting later, or you want competitive quality without a proprietary lock to one model vendor.
Watch: "open weight on Bedrock" still bills per token like any hosted model; the portability benefit is architectural, not a price cut on Bedrock itself.
The ladder: small efficient models for high-throughput tasks → larger models for stronger general reasoning and code.
Reach for it when: you want strong efficiency, solid coding and reasoning at a competitive price, or you have a preference for the Mistral family for governance or familiarity reasons.
Watch: as with every family, confirm the specific variant's context window and multimodal support against your task — the family spans a wide capability range.
What they are for: this is the family you reach for to power RAG retrieval, semantic search, clustering, and classification-by-similarity — not for generation. Amazon Titan Text Embeddings and Cohere Embed turn text into vectors; Cohere Rerank reorders retrieved passages by relevance.
Reach for it when: you are building RAG or search and need to embed a corpus and embed queries. The embeddings model is a separate choice from your generation model — a typical RAG stack pairs an embeddings model (Titan or Cohere) with a generation model (Claude or Nova).
Watch: embedding dimension, max input length, and language coverage differ across these. And critically — you must embed your corpus and your queries with the same model; changing the embeddings model means re-embedding everything.
Two of these families do generation (Claude, Nova, Llama, Mistral) and one does retrieval (Titan/Cohere embeddings). A real application usually uses both — an embeddings model to retrieve, and a generation model to answer. Picking "a Bedrock model" for RAG is actually picking two models, on two different axes.
Once you have a family in mind, you narrow within it on four axes. Almost every real selection comes down to trading these against each other. The trick is knowing which axis your application cannot compromise on — that one becomes the constraint, and the rest become things you optimize subject to it.
Below, each axis is described in terms of what actually changes as you move up or down a family ladder. The numbers vary by model and region and change over time, so treat the magnitudes as orientation, not as a price sheet — the live per-token figures belong on the pricing pages.
Quality is the most misread axis because teams import it from public benchmarks. A model topping a general reasoning leaderboard tells you little about whether it extracts your invoice fields correctly. Quality is real, but it is task-specific, and the only quality number that matters is the one your own eval produces (Section IV). As a rough prior: the top tiers (Claude Opus, Nova Premier, the largest Llama/Mistral) lead on hard, open-ended reasoning; mid tiers (Claude Sonnet, Nova Pro) are excellent for the broad middle; small tiers (Claude Haiku, Nova Micro) are surprisingly strong on well-scoped tasks and weak on genuinely hard ones.
Cost is where the spread is enormous. Across a family ladder, and across the catalog as a whole, per-token price between the cheapest sensible model and the top tier can differ by 40–60× or more. That is the single biggest reason "just use the best model" is bad engineering: you can be paying tens of times more for quality the task does not need. Cost is billed separately for input and output tokens, and output is typically several times more expensive than input — so output-heavy workloads (long generations) and input-heavy workloads (RAG stuffing huge context) have very different cost shapes. Prompt caching and batch inference change this math materially for the right workloads.
Latency has two components people conflate: time-to-first-token (how long before anything appears) and throughput (tokens per second once it starts). For a streaming chat UI, time-to-first-token dominates the felt experience. For a batch job, total throughput is all that matters and first-token latency is irrelevant. Smaller models are faster on both; the top reasoning tiers are slower, and some spend additional time "thinking" before answering, which is great for quality and bad for a latency budget. Match the axis to the surface: a user-facing autocomplete needs a fast small model; an overnight document-analysis job can afford a slow strong one.
Context window sets how many tokens of prompt plus retrieved material plus conversation history the model can consider at once. Large windows let you stuff whole documents or long histories in, which can substitute for retrieval in some designs. But context is not free: longer inputs cost more (you pay per input token) and can slow the request, and very long contexts can dilute the model's attention to the part that matters. The right move is usually the smallest context that fits the task plus good retrieval — not the largest window you can find. Confirm the specific variant's window; it varies widely within and across families.
This is the most important section, and the one most teams skip. The single behavior that separates teams who pick well from teams who argue in Slack is that the good ones build a tiny evaluation set and let it decide. You do not need an ML platform to do this. You need 20–50 real examples and a rubric.
The reason eval beats intuition is that model quality is non-obvious and non-monotonic on your specific task. A model that "feels smarter" in a demo may lose on your actual distribution of inputs; a cheaper model may match the expensive one on 90% of your traffic. The only way to know is to run them side by side on inputs that look like production. Because switching models on Bedrock is a `modelId` change, running the same eval against three candidates is genuinely a few hours of work — there is no excuse to skip it.
A workable eval loop, concretely:
A subtle but important point: hold the prompt constant across candidates, or you are not measuring the model — you are measuring two different prompts. Once you have a winner, then iterate the prompt. Conflating prompt changes with model changes is the most common way eval results get muddied.
The biggest cost win in production is not picking one model — it is picking several and routing between them. Most real traffic is not uniformly hard. A large fraction of requests are easy and a small fraction are hard, and paying the top-tier rate on the easy majority is pure waste.
The pattern is a cascade. Send every request to a cheap, fast model first. If it can answer confidently and the answer passes a quality check, you are done at a fraction of the cost. If it cannot — low confidence, a refusal, a failed validation, or a classifier flag that says "this is hard" — escalate that request to a stronger, more expensive model. Because only the genuinely hard slice reaches the expensive tier, blended cost drops sharply while the hard cases still get the quality they need.
Concretely, a two-tier cascade might run Claude Haiku or Nova Micro as the floor and escalate to Claude Sonnet or Opus for the hard cases. The escalation trigger can be the small model's own self-assessment, a separate lightweight classifier, a confidence threshold, or a validation step that checks the output against rules. Bedrock's intelligent prompt routing can automate part of this by directing prompts to an appropriate model in a family based on the request — useful when you do not want to hand-build the cascade.
The economics are compelling. If 80% of traffic is handled by a model that costs, say, 1/40th of the top tier, and only 20% escalates, the blended cost is a small fraction of running the top model on everything — often a 60–80% reduction — with quality on the hard cases preserved because those still reach the strong model. The eval set from Section IV is what tells you where to set the threshold: it shows you what fraction of traffic the cheap model can actually handle at your quality bar.
A caution: routing adds a moving part. Every escalation is an extra call and extra latency on the hard slice, and a mis-tuned threshold either over-escalates (losing the savings) or under-escalates (losing quality). Start with a simple, well-instrumented two-tier cascade, watch the escalation rate, and only add tiers if the data justifies them. Complexity you cannot measure is complexity you cannot defend.
Fine-tuning is the most over-reached-for tool in the kit. The instinct is to fine-tune early; the reality is that prompting plus retrieval gets most teams where they need to go, and fine-tuning only pays off in specific situations once the simpler levers are exhausted.
Walk the ladder in order. First, prompt engineering — clearer instructions, few-shot examples, structured output. Second, retrieval (RAG) — give the model the right facts at inference time instead of trying to bake them in. Third, prompt caching and the right model tier — cheaper ways to hit your quality and cost targets. Only after those have plateaued does fine-tuning earn its place. Fine-tuning bakes behavior into the weights; it is powerful and it is also the most expensive and least reversible lever, so it should be the last one you pull, not the first.
Fine-tuning is the right call when: you need a consistent style, tone, or output format that prompting cannot reliably enforce; you have a narrow, well-defined task where a smaller fine-tuned model can match a larger general one at a fraction of the inference cost; you have hundreds to thousands of high-quality labeled examples; or you need to reduce prompt length (and therefore per-request cost) by moving instructions into the weights. In these cases a fine-tuned small model can be both cheaper and better than a prompted large one — which is the whole point.
Fine-tuning is the wrong call when the real problem is missing knowledge (use RAG — fine-tuning teaches behavior, not facts), when you do not have clean labeled data (garbage in, garbage out, and now it is baked in), when the task keeps changing (you will re-tune constantly), or when you have not yet exhausted prompting and retrieval (you will spend money to discover the simpler lever would have worked). Bedrock supports fine-tuning and model distillation for several families, and distillation in particular — training a smaller model from a larger one on your task — is an underused way to get top-tier behavior at small-model cost once you have validated the task with prompting.
Prompt → retrieve → cache + right tier → then fine-tune. Most teams that "need fine-tuning" actually need better retrieval or a smaller model with a tighter prompt. Earn the fine-tune by exhausting the cheaper levers first, and bring labeled data when you do.
This is the section to bookmark. Find your use case, read the starting recommendation, and then — this is not optional — validate it with the eval loop from Section IV against your own data. These are strong starting priors, not verdicts; your data gets the final say.
Each row gives the family/tier to start with and the axis that should drive the final pick. Where a use case needs two models (retrieval plus generation), both are noted.
| Use case | Start here | Driving axis | Notes |
|---|---|---|---|
| High-volume chat / support | Claude Sonnet or Nova Pro; floor on Haiku/Nova Micro | Latency + cost | Stream for felt speed. Route easy turns to the small model, escalate hard ones. |
| RAG (retrieval-augmented) | Titan or Cohere embeddings + Claude Sonnet / Nova Pro to generate | Quality (faithfulness) + cost | Two models. Embed corpus and queries with the same embeddings model. Rerank for precision. |
| Agents / tool use / multi-step | Claude Sonnet, escalate to Opus for hard chains | Quality (reasoning) | Reasoning quality compounds across steps — under-spending here breaks the whole chain. |
| Classification / routing / extraction | Nova Micro or Claude Haiku | Cost + latency | Small models shine on well-scoped tasks. Often a fine-tune or distill target later. |
| Code generation / review | Claude Sonnet, Opus for complex; Mistral as an alternative | Quality | Grade on does-it-compile-and-pass-tests, not on plausibility. |
| Vision / multimodal (image, video) | Nova Lite/Pro or a multimodal Claude tier | Quality + cost | Confirm the specific variant supports your modality (image vs video) before committing. |
| Summarization (bulk, offline) | Nova Lite/Micro or Claude Haiku via batch inference | Cost | Latency is irrelevant offline — optimize purely for throughput and price. |
| Semantic search / dedup / clustering | Titan or Cohere embeddings (no generation model) | Quality (retrieval) + cost | Pure embeddings workload. Match dimension and language coverage to your corpus. |
Most bad model choices are not subtle — they are one of a handful of repeatable mistakes. Knowing them by name is half of avoiding them.
Within most generation families on Bedrock the ladder has three meaningful rungs. This is the shape of the tradeoff you are navigating — orientation, not a price sheet. Map your use case to the row whose driving axis matches your hardest constraint.
| Variable | Small tier (Haiku / Nova Micro) | Mid tier (Sonnet / Nova Pro) | Top tier (Opus / Nova Premier) |
|---|---|---|---|
| Relative cost per token | Lowest (the 1× baseline) | Several× the small tier | 40–60×+ the small tier |
| Latency | Fastest, lowest time-to-first-token | Moderate | Slowest; may add thinking time |
| Reasoning depth | Good on well-scoped tasks | Strong general-purpose | Best on hard, open-ended problems |
| Best default for | Classification, routing, bulk summarization | Production chat, RAG generation, most agents | Hard agents, complex code, deep analysis |
| Role in a routing cascade | The floor — handles the easy majority | The common workhorse / escalation target | The escalation tier for the hard minority |
| Fine-tune / distill target? | Yes — cheap to run once specialized | Sometimes | Rarely (usually the distillation teacher) |
Situation: The team had shipped fast by routing every request — classification, retrieval-answer, and escalation drafting — to a single top-tier model "to be safe." Bedrock spend was climbing faster than usage and latency on the chat surface was hurting activation. No eval set existed, so nobody could argue for a cheaper model without "it might be worse" stopping the conversation. They wanted credits to fund the rework and a partner who had done eval-driven model selection before.
What CloudRoute did: Routed within 20 hours to a Bedrock-experienced AWS partner. The partner built a 40-example eval set from real tickets with a graded rubric, ran Nova Micro, Claude Haiku, and Claude Sonnet against it, and found the small tier cleared the quality bar on ~85% of traffic. They re-architected to a two-tier cascade (Nova Micro floor → Claude Sonnet escalation), moved intent classification to Nova Micro, switched bulk summarization to batch inference, and added prompt caching on the RAG system prompt. Eval was checked into CI so future model swaps are a measured decision.
Outcome: Blended Bedrock cost per request fell ~70% (monthly spend from ~$9K to ~$2.8K at higher volume); p95 chat latency improved materially because the small tier answers the easy majority; quality on the hard escalated slice held because those still reach Sonnet. The eval set is now a durable asset re-run whenever a new model lands. The discovery, eval build, and re-architecture ran as an AWS-funded engagement — CloudRoute's commission was paid by the partner from AWS engagement funding; the customer paid $0.
engagement window: ~5 weeks · founder time: ~7 hours · blended cost cut: ~70% · cost to customer: $0
CloudRoute routes you to a vetted AWS partner who builds the eval set, picks the right model and routing, and ships the cost rework — often AWS-funded, so you pay $0. No procurement. No discovery theater.