Bedrock model selection framework · 2026

How to choose a model on Amazon Bedrock — task → family → eval, not vibes.

Bedrock gives you a catalog of dozens of models from Anthropic, Amazon, Meta, Mistral, Cohere, AI21 and more behind one API. The hard part is no longer access — it is choosing. This guide is the model-selection framework: how to map a task to the right model family, how to reason about the quality/cost/latency/context tradeoffs, why you should measure instead of guess, how to route across tiers to cut cost, when fine-tuning actually pays off, and a decision matrix you can copy by use case.

model families on Bedrock
7+
selection axes
4
cost spread (cheapest→top)
40–60×
right way to choose
eval
TL;DR
  • This is a different decision from "which provider." Provider choice is about Bedrock vs OpenAI vs Vertex. Model choice happens after you have committed to Bedrock — you are picking among Claude, Nova, Llama, Mistral, Titan, Cohere and the rest, all behind one API. The good news: switching models on Bedrock is a model-ID string change, so the cost of choosing "wrong" is low and the cost of never measuring is high.
  • Start from the task, not the leaderboard. Map task → model family first (a frontier reasoning model for agents, a small fast model for classification, an embeddings model for RAG retrieval), then narrow within the family on the four axes that actually move: quality, cost, latency, and context window. The most expensive model is rarely the right default.
  • Do not guess — measure. Build a small eval set of 20–50 real examples with a graded rubric, run your 2–3 candidate models against it, and let the numbers decide. Then route: use a cheap model for the easy 80% of traffic and escalate the hard 20% to a frontier model. Reserve fine-tuning for when prompting plus retrieval has plateaued and you have the labeled data to justify it.
framing

IChoosing a model is a different decision from choosing a provider

People conflate two decisions that happen at different layers. Choosing a provider — Bedrock vs OpenAI vs Azure OpenAI vs Vertex — is a platform decision about data residency, contracts, ecosystem and lock-in. Choosing a model is what you do after you are inside Bedrock, with a single API and a single bill, picking which of dozens of models serves a given task.

The distinction matters because the two decisions have very different reversibility. Switching providers is a migration: new SDK, new auth, new data-flow review, sometimes a new procurement cycle. Switching models inside Bedrock is, in the common case, changing one string — the `modelId` you pass to the Converse or InvokeModel API. The request and response shapes are largely unified, so moving a workload from, say, Claude Haiku to Amazon Nova Lite to compare them is the work of an afternoon, not a quarter.

That asymmetry should shape how you behave. Because model choice is cheap to revisit, you should not agonize over the "perfect" first pick, and you should never let a model selection ossify just because it was chosen eighteen months ago. New models land on Bedrock continuously, prices drop, and a model that was the right call last year may now be beaten on every axis by something half the price. The discipline is not "choose perfectly once" — it is "choose reasonably, measure, and keep the door open."

This guide assumes you have already decided to run on Bedrock. If you have not made that call yet — if you are still weighing Bedrock against OpenAI or Vertex — that is the provider decision, and it is covered separately. Here we stay strictly inside Bedrock and answer the question that follows: given this task, which model?

One more framing note. "Best model" is not a property of a model; it is a property of a model-and-task pair under your constraints. A model that is wasteful for ticket classification can be exactly right for a multi-step agent, and a model that is too weak for code generation can be perfect for extracting fields from an invoice. Every section below pushes you back toward the task, because the task is what actually determines the answer.

the field

IIThe Bedrock catalog, organized by what each family is for

Bedrock hosts models from several providers, and the catalog is large enough to be paralyzing if you read it as a flat list. Read it instead as a small number of families, each with an internal ladder from small/cheap/fast to large/capable/expensive. You almost never compare across all of them at once — you pick a family for the job, then pick a rung.

The families below are the ones most teams actually reach for in 2026. Within each, the pattern is the same: a smaller model for high-volume or latency-sensitive work, a mid model for the general default, and a top model for the genuinely hard reasoning. Holding that ladder in your head is most of the battle.

Anthropic Claude — the reasoning and agent workhorse

The ladder: Claude Haiku (small, fast, cheap) → Claude Sonnet (the balanced default most production chat and RAG runs on) → Claude Opus (the heavy reasoning tier for agents, hard analysis, and complex code).

Reach for it when: the task rewards careful reasoning, long-context comprehension, tool use, or instruction-following under nuance — multi-step agents, code generation and review, document analysis, and chat where answer quality is the product.

Watch: the Opus tier is the most expensive class on Bedrock by token, so it should be earned by the task, not used as a default. Haiku is genuinely capable and is often the right floor for a routing setup.

Amazon Nova — the price-performance and multimodal line

The ladder: Nova Micro (text-only, very cheap, very fast) → Nova Lite (low-cost multimodal) → Nova Pro (capable multimodal default) → Nova Premier (the most capable Nova for complex tasks and as a distillation teacher).

Reach for it when: cost-per-token is a first-order constraint, when you need multimodal input (text, image, and video understanding) at low cost, or when you want an AWS-native model with tight Bedrock integration. Nova Micro is frequently the cheapest sensible option for classification and routing.

Watch: on the hardest open-ended reasoning, the top Claude tier still tends to lead; Nova's strength is the price-performance curve, not topping every benchmark.

Meta Llama — open-weight flexibility

The ladder: smaller instruct variants for fast/cheap inference → larger instruct variants for stronger reasoning, plus very large flagship sizes for the hardest tasks.

Reach for it when: you want an open-weight model for governance or portability reasons, you anticipate moving the same weights to self-hosting later, or you want competitive quality without a proprietary lock to one model vendor.

Watch: "open weight on Bedrock" still bills per token like any hosted model; the portability benefit is architectural, not a price cut on Bedrock itself.

Mistral — efficient European-origin models

The ladder: small efficient models for high-throughput tasks → larger models for stronger general reasoning and code.

Reach for it when: you want strong efficiency, solid coding and reasoning at a competitive price, or you have a preference for the Mistral family for governance or familiarity reasons.

Watch: as with every family, confirm the specific variant's context window and multimodal support against your task — the family spans a wide capability range.

Amazon Titan & Cohere — embeddings and retrieval

What they are for: this is the family you reach for to power RAG retrieval, semantic search, clustering, and classification-by-similarity — not for generation. Amazon Titan Text Embeddings and Cohere Embed turn text into vectors; Cohere Rerank reorders retrieved passages by relevance.

Reach for it when: you are building RAG or search and need to embed a corpus and embed queries. The embeddings model is a separate choice from your generation model — a typical RAG stack pairs an embeddings model (Titan or Cohere) with a generation model (Claude or Nova).

Watch: embedding dimension, max input length, and language coverage differ across these. And critically — you must embed your corpus and your queries with the same model; changing the embeddings model means re-embedding everything.

the mental model

Two of these families do generation (Claude, Nova, Llama, Mistral) and one does retrieval (Titan/Cohere embeddings). A real application usually uses both — an embeddings model to retrieve, and a generation model to answer. Picking "a Bedrock model" for RAG is actually picking two models, on two different axes.

how to narrow

IIIThe four axes that actually decide it: quality, cost, latency, context

Once you have a family in mind, you narrow within it on four axes. Almost every real selection comes down to trading these against each other. The trick is knowing which axis your application cannot compromise on — that one becomes the constraint, and the rest become things you optimize subject to it.

Below, each axis is described in terms of what actually changes as you move up or down a family ladder. The numbers vary by model and region and change over time, so treat the magnitudes as orientation, not as a price sheet — the live per-token figures belong on the pricing pages.

Quality — capability on your task, not on a leaderboard

Quality is the most misread axis because teams import it from public benchmarks. A model topping a general reasoning leaderboard tells you little about whether it extracts your invoice fields correctly. Quality is real, but it is task-specific, and the only quality number that matters is the one your own eval produces (Section IV). As a rough prior: the top tiers (Claude Opus, Nova Premier, the largest Llama/Mistral) lead on hard, open-ended reasoning; mid tiers (Claude Sonnet, Nova Pro) are excellent for the broad middle; small tiers (Claude Haiku, Nova Micro) are surprisingly strong on well-scoped tasks and weak on genuinely hard ones.

Cost — the axis with the widest spread

Cost is where the spread is enormous. Across a family ladder, and across the catalog as a whole, per-token price between the cheapest sensible model and the top tier can differ by 40–60× or more. That is the single biggest reason "just use the best model" is bad engineering: you can be paying tens of times more for quality the task does not need. Cost is billed separately for input and output tokens, and output is typically several times more expensive than input — so output-heavy workloads (long generations) and input-heavy workloads (RAG stuffing huge context) have very different cost shapes. Prompt caching and batch inference change this math materially for the right workloads.

Latency — time to first token and tokens per second

Latency has two components people conflate: time-to-first-token (how long before anything appears) and throughput (tokens per second once it starts). For a streaming chat UI, time-to-first-token dominates the felt experience. For a batch job, total throughput is all that matters and first-token latency is irrelevant. Smaller models are faster on both; the top reasoning tiers are slower, and some spend additional time "thinking" before answering, which is great for quality and bad for a latency budget. Match the axis to the surface: a user-facing autocomplete needs a fast small model; an overnight document-analysis job can afford a slow strong one.

Context window — how much you can put in front of the model

Context window sets how many tokens of prompt plus retrieved material plus conversation history the model can consider at once. Large windows let you stuff whole documents or long histories in, which can substitute for retrieval in some designs. But context is not free: longer inputs cost more (you pay per input token) and can slow the request, and very long contexts can dilute the model's attention to the part that matters. The right move is usually the smallest context that fits the task plus good retrieval — not the largest window you can find. Confirm the specific variant's window; it varies widely within and across families.

the core discipline

IVDo not guess — measure with a small eval set

This is the most important section, and the one most teams skip. The single behavior that separates teams who pick well from teams who argue in Slack is that the good ones build a tiny evaluation set and let it decide. You do not need an ML platform to do this. You need 20–50 real examples and a rubric.

The reason eval beats intuition is that model quality is non-obvious and non-monotonic on your specific task. A model that "feels smarter" in a demo may lose on your actual distribution of inputs; a cheaper model may match the expensive one on 90% of your traffic. The only way to know is to run them side by side on inputs that look like production. Because switching models on Bedrock is a `modelId` change, running the same eval against three candidates is genuinely a few hours of work — there is no excuse to skip it.

A workable eval loop, concretely:

  • Collect 20–50 real examples — Pull actual inputs from your domain — real tickets, real documents, real user questions — spanning easy, typical, and hard cases. Twenty good examples beat two hundred synthetic ones.
  • Define a graded rubric — Decide what "good" means for the task: exact-match for classification, a 1–5 quality score for chat, faithfulness-to-source for RAG, does-it-compile-and-pass-tests for code. Write the rubric down before you look at outputs.
  • Run 2–3 candidate models — Pick candidates that bracket the tradeoff — e.g. a small/cheap model, a mid model, and a top model. Run all of them against the same examples with the same prompt.
  • Score and compare on all four axes — Record quality score, cost per request, and latency per request for each candidate. Now you can see the actual quality-per-dollar curve instead of arguing about it.
  • Pick the cheapest model that clears the quality bar — Not the highest scorer — the cheapest one that meets your quality threshold. If the small model clears the bar on 92% of cases, that plus routing the hard 8% up (Section V) often beats running the expensive model on everything.
  • Keep the eval and re-run it — When a new model lands or a price drops, re-run the same eval. The eval set is a durable asset; the model choice is a snapshot. Bedrock's built-in model evaluation tooling can help operationalize this once you outgrow a spreadsheet.

A subtle but important point: hold the prompt constant across candidates, or you are not measuring the model — you are measuring two different prompts. Once you have a winner, then iterate the prompt. Conflating prompt changes with model changes is the most common way eval results get muddied.

the cost lever

VTiered routing: use the cheap model for the easy 80%

The biggest cost win in production is not picking one model — it is picking several and routing between them. Most real traffic is not uniformly hard. A large fraction of requests are easy and a small fraction are hard, and paying the top-tier rate on the easy majority is pure waste.

The pattern is a cascade. Send every request to a cheap, fast model first. If it can answer confidently and the answer passes a quality check, you are done at a fraction of the cost. If it cannot — low confidence, a refusal, a failed validation, or a classifier flag that says "this is hard" — escalate that request to a stronger, more expensive model. Because only the genuinely hard slice reaches the expensive tier, blended cost drops sharply while the hard cases still get the quality they need.

Concretely, a two-tier cascade might run Claude Haiku or Nova Micro as the floor and escalate to Claude Sonnet or Opus for the hard cases. The escalation trigger can be the small model's own self-assessment, a separate lightweight classifier, a confidence threshold, or a validation step that checks the output against rules. Bedrock's intelligent prompt routing can automate part of this by directing prompts to an appropriate model in a family based on the request — useful when you do not want to hand-build the cascade.

The economics are compelling. If 80% of traffic is handled by a model that costs, say, 1/40th of the top tier, and only 20% escalates, the blended cost is a small fraction of running the top model on everything — often a 60–80% reduction — with quality on the hard cases preserved because those still reach the strong model. The eval set from Section IV is what tells you where to set the threshold: it shows you what fraction of traffic the cheap model can actually handle at your quality bar.

A caution: routing adds a moving part. Every escalation is an extra call and extra latency on the hard slice, and a mis-tuned threshold either over-escalates (losing the savings) or under-escalates (losing quality). Start with a simple, well-instrumented two-tier cascade, watch the escalation rate, and only add tiers if the data justifies them. Complexity you cannot measure is complexity you cannot defend.

when prompting plateaus

VIWhen to fine-tune (and when not to)

Fine-tuning is the most over-reached-for tool in the kit. The instinct is to fine-tune early; the reality is that prompting plus retrieval gets most teams where they need to go, and fine-tuning only pays off in specific situations once the simpler levers are exhausted.

Walk the ladder in order. First, prompt engineering — clearer instructions, few-shot examples, structured output. Second, retrieval (RAG) — give the model the right facts at inference time instead of trying to bake them in. Third, prompt caching and the right model tier — cheaper ways to hit your quality and cost targets. Only after those have plateaued does fine-tuning earn its place. Fine-tuning bakes behavior into the weights; it is powerful and it is also the most expensive and least reversible lever, so it should be the last one you pull, not the first.

Fine-tuning is the right call when: you need a consistent style, tone, or output format that prompting cannot reliably enforce; you have a narrow, well-defined task where a smaller fine-tuned model can match a larger general one at a fraction of the inference cost; you have hundreds to thousands of high-quality labeled examples; or you need to reduce prompt length (and therefore per-request cost) by moving instructions into the weights. In these cases a fine-tuned small model can be both cheaper and better than a prompted large one — which is the whole point.

Fine-tuning is the wrong call when the real problem is missing knowledge (use RAG — fine-tuning teaches behavior, not facts), when you do not have clean labeled data (garbage in, garbage out, and now it is baked in), when the task keeps changing (you will re-tune constantly), or when you have not yet exhausted prompting and retrieval (you will spend money to discover the simpler lever would have worked). Bedrock supports fine-tuning and model distillation for several families, and distillation in particular — training a smaller model from a larger one on your task — is an underused way to get top-tier behavior at small-model cost once you have validated the task with prompting.

the order that saves money

Prompt → retrieve → cache + right tier → then fine-tune. Most teams that "need fine-tuning" actually need better retrieval or a smaller model with a tighter prompt. Earn the fine-tune by exhausting the cheaper levers first, and bring labeled data when you do.

the answer key

VIIThe decision matrix by use case

This is the section to bookmark. Find your use case, read the starting recommendation, and then — this is not optional — validate it with the eval loop from Section IV against your own data. These are strong starting priors, not verdicts; your data gets the final say.

Each row gives the family/tier to start with and the axis that should drive the final pick. Where a use case needs two models (retrieval plus generation), both are noted.

starting model picks by use case · Amazon Bedrock · 2026
Use caseStart hereDriving axisNotes
High-volume chat / supportClaude Sonnet or Nova Pro; floor on Haiku/Nova MicroLatency + costStream for felt speed. Route easy turns to the small model, escalate hard ones.
RAG (retrieval-augmented)Titan or Cohere embeddings + Claude Sonnet / Nova Pro to generateQuality (faithfulness) + costTwo models. Embed corpus and queries with the same embeddings model. Rerank for precision.
Agents / tool use / multi-stepClaude Sonnet, escalate to Opus for hard chainsQuality (reasoning)Reasoning quality compounds across steps — under-spending here breaks the whole chain.
Classification / routing / extractionNova Micro or Claude HaikuCost + latencySmall models shine on well-scoped tasks. Often a fine-tune or distill target later.
Code generation / reviewClaude Sonnet, Opus for complex; Mistral as an alternativeQualityGrade on does-it-compile-and-pass-tests, not on plausibility.
Vision / multimodal (image, video)Nova Lite/Pro or a multimodal Claude tierQuality + costConfirm the specific variant supports your modality (image vs video) before committing.
Summarization (bulk, offline)Nova Lite/Micro or Claude Haiku via batch inferenceCostLatency is irrelevant offline — optimize purely for throughput and price.
Semantic search / dedup / clusteringTitan or Cohere embeddings (no generation model)Quality (retrieval) + costPure embeddings workload. Match dimension and language coverage to your corpus.
Every row is a starting point, not a final answer. The eval loop in Section IV — 20–50 real examples, a rubric, 2–3 candidates — is what turns one of these priors into a defensible decision for your specific data.
honest pitfalls

VIIIThe five mistakes that make teams choose badly

Most bad model choices are not subtle — they are one of a handful of repeatable mistakes. Knowing them by name is half of avoiding them.

  • Defaulting to the most expensive model — Using the top tier "to be safe" is the most common and most expensive error. You pay 40–60× for quality the task usually does not need. Start lower; let eval prove you need to go up.
  • Choosing from leaderboards instead of your data — A model topping a public benchmark may lose on your actual inputs. The leaderboard is a prior, not your answer. Your eval set is the only quality number that counts.
  • Never re-evaluating — A choice made eighteen months ago is almost certainly stale — new models, lower prices. Because switching is a modelId change, there is no excuse for letting a selection ossify. Re-run the eval when the field moves.
  • Picking one model for everything — A single global model wastes money on easy traffic and risks quality on hard traffic. Tiered routing — cheap floor, escalate the hard slice — is usually the right architecture, not a single pick.
  • Reaching for fine-tuning first — Fine-tuning before exhausting prompting and retrieval spends money to learn the simpler lever would have worked — and for missing-knowledge problems, fine-tuning is the wrong tool entirely. Walk the ladder.
side by side

The three tiers, compared on the axes that matter

Within most generation families on Bedrock the ladder has three meaningful rungs. This is the shape of the tradeoff you are navigating — orientation, not a price sheet. Map your use case to the row whose driving axis matches your hardest constraint.

VariableSmall tier (Haiku / Nova Micro)Mid tier (Sonnet / Nova Pro)Top tier (Opus / Nova Premier)
Relative cost per tokenLowest (the 1× baseline)Several× the small tier40–60×+ the small tier
LatencyFastest, lowest time-to-first-tokenModerateSlowest; may add thinking time
Reasoning depthGood on well-scoped tasksStrong general-purposeBest on hard, open-ended problems
Best default forClassification, routing, bulk summarizationProduction chat, RAG generation, most agentsHard agents, complex code, deep analysis
Role in a routing cascadeThe floor — handles the easy majorityThe common workhorse / escalation targetThe escalation tier for the hard minority
Fine-tune / distill target?Yes — cheap to run once specializedSometimesRarely (usually the distillation teacher)
The right architecture is rarely one row — it is the small tier as a floor with the mid or top tier as an escalation target. Pick the cheapest tier that clears your quality bar on the easy cases, and route the rest up.
not sure which model your workload needs?
Get matched with a partner who runs the eval and picks the model for you
Start in 3 minutes →
a recent match

A model-selection + cost overhaul — anonymized

inquiry · series-a data/AI startup, support-automation product
Series-A B2B SaaS, ~20 engineers, building a customer-support automation product on Bedrock, already live at ~$9K/month of Bedrock spend

Situation: The team had shipped fast by routing every request — classification, retrieval-answer, and escalation drafting — to a single top-tier model "to be safe." Bedrock spend was climbing faster than usage and latency on the chat surface was hurting activation. No eval set existed, so nobody could argue for a cheaper model without "it might be worse" stopping the conversation. They wanted credits to fund the rework and a partner who had done eval-driven model selection before.

What CloudRoute did: Routed within 20 hours to a Bedrock-experienced AWS partner. The partner built a 40-example eval set from real tickets with a graded rubric, ran Nova Micro, Claude Haiku, and Claude Sonnet against it, and found the small tier cleared the quality bar on ~85% of traffic. They re-architected to a two-tier cascade (Nova Micro floor → Claude Sonnet escalation), moved intent classification to Nova Micro, switched bulk summarization to batch inference, and added prompt caching on the RAG system prompt. Eval was checked into CI so future model swaps are a measured decision.

Outcome: Blended Bedrock cost per request fell ~70% (monthly spend from ~$9K to ~$2.8K at higher volume); p95 chat latency improved materially because the small tier answers the easy majority; quality on the hard escalated slice held because those still reach Sonnet. The eval set is now a durable asset re-run whenever a new model lands. The discovery, eval build, and re-architecture ran as an AWS-funded engagement — CloudRoute's commission was paid by the partner from AWS engagement funding; the customer paid $0.

engagement window: ~5 weeks · founder time: ~7 hours · blended cost cut: ~70% · cost to customer: $0

faq

Common questions

What is the difference between choosing a Bedrock model and choosing a provider?
Choosing a provider (Bedrock vs OpenAI vs Azure OpenAI vs Vertex) is a platform decision about data residency, contracts, ecosystem and lock-in, and it is expensive to reverse. Choosing a Bedrock model happens after you are already on Bedrock: you are picking among Claude, Nova, Llama, Mistral, Titan, Cohere and others behind one API, and switching is usually just a modelId change. This guide is strictly about the second decision.
What is the best model on Amazon Bedrock?
There is no single best model — "best" is a property of a model-and-task pair under your constraints, not of a model alone. The top Claude and Nova tiers lead on hard reasoning, small tiers like Haiku and Nova Micro win on cost and latency for well-scoped tasks, and embeddings models like Titan and Cohere are for retrieval, not generation. Map your task to a family, then pick the cheapest tier that clears your quality bar — proven by an eval set, not a leaderboard.
How do I actually decide between two models?
Build a small eval set: 20–50 real examples from your domain spanning easy, typical, and hard cases, with a written rubric for what "good" means. Run both candidates with the same prompt, score quality and record cost and latency per request, and pick the cheapest model that clears your quality threshold. Because switching models on Bedrock is a modelId change, running this comparison is usually only a few hours of work.
Should I just use the most expensive model to be safe?
No — this is the most common and most expensive mistake. Per-token cost between the cheapest sensible model and the top tier can differ by 40–60× or more, so defaulting to the top tier often means paying tens of times over for quality the task does not need. Start with a smaller model and let your eval set prove whether you actually need to escalate.
What is tiered routing and how much does it save?
Tiered routing sends every request to a cheap, fast model first and escalates only the hard cases to a stronger model, based on confidence, a classifier, or a validation check. Because most traffic is easy, this typically cuts blended cost 60–80% versus running the top model on everything, while preserving quality on the hard slice because those requests still reach the strong model. Bedrock intelligent prompt routing can automate part of this.
When should I fine-tune a Bedrock model?
Fine-tune only after prompting and retrieval have plateaued — and only for behavior, not knowledge. It is the right call when you need a consistent style or output format prompting cannot enforce, when a small fine-tuned model can match a large general one at far lower cost, or when you have hundreds to thousands of clean labeled examples. It is the wrong call for missing facts (use RAG), for changing tasks, or when you lack labeled data. Distillation — training a small model from a large one — is an underused way to get top-tier behavior at small-model cost.
Which model should I use for RAG?
RAG needs two models on two different axes: an embeddings model (Amazon Titan or Cohere) to retrieve, and a generation model (Claude Sonnet or Nova Pro is a common default) to answer. Critically, embed your corpus and your queries with the same embeddings model — changing it means re-embedding everything. Grade the generation model on faithfulness to the retrieved source, and consider a reranker to improve retrieval precision before you reach for a bigger generation model.
How often should I revisit my model choice?
Treat the model choice as a snapshot and the eval set as the durable asset. New models land on Bedrock continuously and prices drop, so a choice made even a year ago may now be beaten on every axis. Whenever a relevant new model or a price change appears, re-run your existing eval against it — because switching is a modelId change, acting on the result is cheap.

Want the cheapest Bedrock model that still clears your quality bar?

CloudRoute routes you to a vetted AWS partner who builds the eval set, picks the right model and routing, and ships the cost rework — often AWS-funded, so you pay $0. No procurement. No discovery theater.

matched within< 24h
typical blended cost cut60–80%
cost to you$0
How to Choose a Model on Amazon Bedrock (2026) · CloudRoute