for AWS partners →Fund your routing build →

FinOps for GenAI · routing playbook · 2026

Tiered model routing on Bedrock — cut LLM cost 40–70% without losing quality (2026).

Most production GenAI bills are inflated by a single habit: sending every request to a frontier model when the majority would be answered identically by a model that costs a fraction as much. Tiered routing fixes that — each request goes to the cheapest model that still clears your quality bar, with a fallback when it does not. This guide covers the routing patterns (classifier, cascade, confidence-based), Amazon Bedrock Intelligent Prompt Routing versus a custom router, the architecture, the quality gates that keep it safe, how to measure the savings honestly with a counterfactual, the pitfalls, and the worked math.

Fund your routing build →→ jump to the worked savings math

typical cost cut

40–70%

price spread, small↔frontier

10–30×

Bedrock IPR savings

up to ~30%

routing patterns covered

TL;DR

Tiered routing sends each request to the cheapest model that meets the quality bar and escalates to a stronger model only when needed. The economics work because the per-token price spread between a small model and a frontier model in the same family is often 10–30×, and on real traffic most requests are easy. Routing the easy slice down typically removes 40–70% of spend with no measurable quality loss — provided every route is gated by an eval set.
Two ways to do it. Amazon Bedrock Intelligent Prompt Routing (IPR) is the managed option: AWS routes each prompt to the best-fit model within a configured family and, in published benchmarks, can cut cost by up to roughly 30% with negligible quality impact — zero infrastructure to run. A custom router (a cheap classifier, a cascade, or confidence-based escalation) routinely beats that because it can route across families, tune the escalation threshold to your own quality tolerance, and use your own evals — at the cost of building and maintaining it.
The hard part is not the routing — it is the guardrails. Without a representative eval set, a quality gate on each tier, and a fallback path, routing trades a smaller bill for silent quality regressions that cost more in rework than they ever saved in tokens. And you cannot claim the savings credibly without a counterfactual: log what the frontier model would have cost on the same traffic and compare. During the build phase, AWS credits (Activate, Bedrock POC, GenAI programs) can absorb the spend entirely so the effective cost is $0 while you tune the router.

first principles

IWhy one model for everything is the most expensive default in GenAI

The largest, most consistent source of waste in production GenAI is not a misconfigured cache or an un-batched job. It is architectural: a single frontier model wired to every code path, answering trivial requests at premium prices because nobody decided otherwise.

Bedrock bills per token, separately for input and output, at rates that differ enormously by model. Within a single model family the spread between the smallest and largest model is frequently more than 10× per token; across families it is larger still, and once you compare a small model to a frontier model the gap is commonly 10–30×. A request that a small model (Claude Haiku-class, Nova Lite/Micro-class, Llama 8B-class) answers indistinguishably from a frontier model (Claude Opus-class, Nova Premier-class) therefore costs you 10–30× more than it needed to, every time it runs.

The reason this default takes hold is understandable. Teams prototype against the most capable model because it "just works," ship it, and never revisit which calls actually need that capability. The frontier model becomes the path of least resistance for every task — classification, extraction, routing, short answers, and genuinely hard reasoning alike — and because each individual request is cheap in isolation, the waste stays invisible until the monthly invoice forces the question.

The insight that makes routing work is that production traffic is not uniformly difficult. On most real workloads the distribution is heavily skewed: a large fraction of requests are easy (a clear category, a short factual answer, a well-formed extraction), a smaller fraction are moderate, and a small tail is genuinely hard. If you measure your own traffic you will usually find that the easy slice — the slice a small model handles perfectly — is the majority of your volume. Paying frontier prices for that majority is the waste tiered routing eliminates.

Tiered routing is the discipline of matching model capability to request difficulty at the granularity of the individual request. Instead of one model for the application, you maintain a tier ladder — a cheap model, a mid model, a frontier model — and a routing decision sends each request to the lowest tier that will still clear the quality bar, escalating only when the request demands it. Done with proper guardrails, it is the highest-impact cost lever in the GenAI FinOps toolkit, and unlike caching or batch it attacks the rate term of the cost formula on the dominant slice of your traffic.

the one number to find first

Before building anything, sample a few hundred real production requests and label them easy / moderate / hard against your quality bar. The fraction that is "easy" is your addressable routing slice — the share of traffic you can move to a cheap model. If 70% of your traffic is easy and the cheap model is 15× cheaper, the theoretical ceiling on this lever is roughly a 65% bill cut. The distribution decides the prize; measure it before you write a line of routing code.

the three patterns

IIThe three routing patterns — classifier, cascade, confidence-based

Every routing system is one of three patterns or a blend of them. They differ in where the decision happens, what it costs to make, and how they fail. Choosing the right one for your workload is the first design decision.

All three share the same goal — send each request to the cheapest sufficient tier — but they answer "how do we decide?" differently. The classifier predicts difficulty up front; the cascade tries cheap first and escalates on failure; confidence-based routing lets the model itself signal when it is unsure. Understanding the trade each one makes is what lets you pick correctly rather than defaulting to whichever you read about first.

Pattern A — Classifier (predictive routing)

A lightweight classifier sits in front of the model tiers and predicts, from the request alone, which tier should handle it. The classifier can be a small fast model (a Haiku-class model prompted to output a difficulty label), a fine-tuned embedding-plus-logistic-regression model, or even a rules engine for well-understood inputs. It runs once per request, adds a few milliseconds and a tiny token cost, and emits a routing decision before any expensive model is invoked.

Strength: it pays the cheap model exactly once and never invokes a higher tier for traffic it routes down — there is no double-spend. Weakness: the classifier can be wrong, sending a hard request to a model that cannot handle it (a false "easy"). That makes classifier quality the whole game, and it is why classifier routing is usually paired with a confidence signal or an output check so a misroute can still be caught. Classifier routing is the right default when request difficulty is reasonably predictable from the input and your volume justifies tuning a classifier.

Pattern B — Cascade (escalate on failure)

The cascade tries the cheapest tier first, evaluates the result, and escalates to the next tier only if the cheap result fails a quality check. The check can be a validator (does the JSON parse, does the answer match a schema, does it pass a regex or a business rule), a self-grade (a cheap model scoring the answer), or a confidence threshold. If the cheap tier passes, you are done at the cheap price; if it fails, you re-run on the stronger model and pay for both.

Strength: the routing decision is grounded in the actual output, not a prediction, so it is robust to inputs the classifier would misjudge. Weakness: escalated requests cost the sum of every tier they touched — a request that fails cheap and mid before succeeding on frontier costs more than going straight to frontier. The cascade only wins when the cheap tier succeeds often enough that the savings on passes outweigh the double-spend on escalations. It shines when you have a cheap, reliable correctness check (structured output, verifiable answers) and the cheap model passes the majority of the time.

Pattern C — Confidence-based (model-signaled escalation)

Confidence-based routing uses a signal from the model itself to decide whether to trust the cheap answer or escalate. The signal can be token log-probabilities (low probability mass on the chosen tokens implies uncertainty), an explicit self-assessment ("rate your confidence 1–10"), or an abstention pattern where the cheap model is instructed to return a sentinel ("I am not sure") rather than guess. Low confidence triggers escalation to a stronger tier.

Strength: it targets escalation precisely at the requests the cheap model is actually unsure about, rather than a blanket difficulty prediction. Weakness: confidence signals are imperfect — models are often confidently wrong, and self-reported confidence is weakly calibrated — so the signal must be validated against your evals before you trust it to gate quality. In practice confidence-based routing is most powerful blended into a cascade or classifier: use the classifier or cheap attempt for the bulk decision, and the confidence signal as the tie-breaker that decides whether to escalate a borderline case.

managed vs custom

IIIBedrock Intelligent Prompt Routing vs a custom router

AWS ships a managed routing primitive — Amazon Bedrock Intelligent Prompt Routing — that implements tiered routing for you with no infrastructure to run. The alternative is building your own. The right answer depends on how much control you need and how much you are willing to operate.

Amazon Bedrock Intelligent Prompt Routing (IPR) lets you call a single routing endpoint instead of a specific model. For each prompt, AWS predicts which model within a configured family will produce a response of comparable quality and routes to the cheapest one that clears that bar, escalating to the larger model when the prompt is hard. In AWS's published benchmarks, IPR can reduce cost by up to roughly 30% on mixed traffic with negligible impact on quality — and because it is fully managed, you write no classifier, run no escalation logic, and maintain no routing service. You point your application at the router and AWS does the rest.

The trade for that simplicity is control. IPR routes within a model family (for example, between two sizes of the same provider's models) rather than across providers, and the routing decision and its threshold are AWS's, tuned to a general notion of quality parity rather than your specific tolerance. For many teams that is exactly right: it captures the easy 20–30% with zero engineering and zero maintenance, and it is the obvious first move while you are still learning your traffic. It is the lever you turn on in an afternoon.

A custom router trades engineering effort for ceiling. Because you own the classifier (or cascade, or confidence logic), you can route across families and providers — sending a trivial classification to the cheapest small model available anywhere in your catalog, not just the small model in one family — and you can set the escalation threshold to your own quality bar. Teams that build their own routing layer frequently report larger savings than the managed router, often well past 30% and into the 40–70% range, precisely because they exploit the full cross-family price spread and tune aggressively against their own evals. The cost is real: you build the router, you maintain the classifier as your traffic drifts, and you own the failure modes.

The pragmatic path is sequential, not either/or. Turn on IPR first to capture the easy savings immediately and to establish a quality-parity baseline you can measure against. Then, if your volume is large enough that the marginal savings justify the engineering, build a custom router for the cross-family routing and the tighter threshold that the managed primitive cannot give you. Many mature stacks run both — IPR within families for the bulk of traffic, a custom layer for the high-volume tasks where cross-family routing moves real money.

the decision rule

Use IPR if you want most of the savings with none of the maintenance, or while you are still learning your traffic. Build a custom router when your volume is large enough that the gap between ~30% (managed, within-family) and 40–70% (custom, cross-family, tuned) is worth an engineering investment and ongoing upkeep. They compose — start managed, add custom where the volume justifies it.

the architecture

IVThe reference architecture for a custom router

A production routing layer is more than a classifier. It is a small pipeline — decision, dispatch, gate, fallback, and telemetry — and each stage exists to keep the savings real and the quality safe.

A robust custom router has five stages, and skipping any of the last three is how routing projects quietly regress quality. The components, in request order:

1. The router (decision) — The classifier, cascade trigger, or confidence reader that decides which tier handles this request. Keep it cheap and fast — a small model or a lightweight model over embeddings — because it runs on every request. Its only job is to emit a tier and a confidence.
2. The dispatcher (tier ladder) — A clean abstraction over your model tiers (cheap → mid → frontier) so the router emits a tier and the dispatcher resolves it to a concrete Bedrock model invocation. Defining tiers as config, not hard-coded model IDs, is what lets you swap models and re-tune without touching the router logic.
3. The quality gate (per tier) — A check on the response before it is returned: schema/parse validation, a business-rule check, a confidence threshold, or a cheap self-grade. The gate is what converts a misroute into an escalation instead of a bad answer reaching the user. This is the single most important component and the one most often omitted.
4. The fallback path (escalation) — When the gate fails, escalate to the next tier and re-run. Cap the escalation depth (cheap → frontier, not an infinite ladder) and define a terminal behavior if even the top tier fails the gate (return best-effort with a flag, or hard-fail to a human). Fallback is what makes routing safe to ship.
5. Telemetry (the counterfactual) — Log, per request: chosen tier, gate result, whether it escalated, tokens and cost at each tier, and — critically — the cost the frontier model would have incurred on the same request. Without this last field you can run the router but you cannot prove it saved money. Telemetry is what turns "we added routing" into "we cut cost 58%, verified."

Two cross-cutting concerns sit on top of this pipeline. Caching interacts with routing: prompt caching is keyed per model, so if the router sends the same conversation to different tiers on different turns you fragment the cache and lose hits — pin a session to one tier for its duration, or scope caching to a system prompt the whole family shares. Latency interacts with cascades: a cascade that escalates adds the latency of every tier it touches, so for interactive paths prefer a classifier (one model call) over a deep cascade, and reserve cascades for asynchronous or batch work where the extra latency is free. Design the router with caching and latency in mind from the start; bolting them on later usually means re-architecting the dispatcher.

the guardrails

VQuality gates and fallback — the part that makes routing safe

Routing without quality gates is not optimization; it is a quality regression with a smaller invoice. The gate is the difference between "the cheap model handles 70% of traffic" and "the cheap model silently mangles 70% of traffic and nobody noticed for a month."

The failure mode of naive routing is specific and dangerous: a small model produces output that is plausible, well-formatted, and wrong. It parses, it reads fluently, it passes a casual glance — and it is incorrect in a way that only a real evaluation would catch. Because the output looks fine, the regression is silent. The bill drops, everyone celebrates, and weeks later you discover the support agent has been giving subtly wrong answers, the extraction pipeline has been dropping a field, or the classifier has been mislabeling a category. The savings were real; so was the damage, and the damage usually costs more.

The defense is a quality gate on every route that can fail. The strongest gates are deterministic and cheap: does the JSON parse and match the schema, does the answer satisfy a business rule (a date in range, a value in an enum), does a verifiable claim check out against a source. Where the task has no deterministic check, a self-grade — a cheap model scoring the cheap answer against the question — or a confidence threshold on log-probabilities serves as a softer gate. The gate runs on the cheap output; if it passes, you keep the cheap answer and the savings; if it fails, you escalate.

The fallback path is the gate's other half. A failed gate must escalate to a stronger tier and re-run, with a capped depth so you never loop, and a defined terminal behavior when even the top tier fails — return a best-effort answer flagged for review, or hard-fail to a human queue, depending on how expensive a wrong answer is in your domain. The combination of gate plus fallback is what makes aggressive routing safe: you can route the majority of traffic down knowing that anything the cheap tier gets wrong is caught and escalated rather than shipped.

None of this works without the foundation: a representative eval set. You cannot set a sensible escalation threshold, validate a confidence signal, or trust a self-grade without a labeled set of real requests with known-good answers. The correct sequence is always — build the eval set, measure each tier against it, choose the threshold that holds your quality bar on the eval set, then ship. Optimization without an eval harness is guessing, and on reasoning-heavy tasks guessing costs more in rework than it ever saves in tokens. The eval set is also what lets you re-validate the router as your traffic drifts, which it will.

route down, never default down

The safe rule: capable model by default, route down only on tasks where your eval set proves the cheap model holds quality. Never start from the cheap model and hope. Multi-step reasoning, long-context synthesis, code generation in unfamiliar stacks, and anything customer-facing where a wrong answer is expensive should escalate readily and route down conservatively. Asymmetric caution — cheap to escalate, expensive to downgrade — is the posture that keeps routing from becoming a regression.

proving it worked

VIMeasuring the savings — the counterfactual is everything

A routing project that cannot prove its savings is indistinguishable from one that quietly broke quality. The only credible measurement is a counterfactual: what would the same traffic have cost on the frontier model, and did quality hold?

The naive way to "measure" routing savings is to compare this month's bill to last month's. This is almost always wrong, because traffic volume, request mix, and prompt sizes all change month to month — a bill that dropped 30% while traffic also dropped 30% saved nothing, and a bill that stayed flat while traffic doubled saved half. Absolute bill comparisons confound the routing effect with everything else moving in your system, and they cannot isolate what the router actually contributed.

The correct measurement is a counterfactual on the same traffic. For every routed request, log both the actual cost (the tier it used, including any escalation) and the cost the frontier-only baseline would have incurred on that exact request — same input tokens, frontier output rate, estimated or measured output length. The savings is the difference, summed over real traffic: savings% = 1 − (Σ actual cost / Σ frontier-baseline cost). Because both numbers are computed on identical requests, the comparison isolates the routing effect from volume and mix changes entirely.

Savings is only half the counterfactual; the other half is quality parity. Periodically — on a sampled slice of live traffic or on your eval set — run the same requests through both the router and the frontier-only baseline and compare outputs on your quality metric (exact match, rubric score, human grade, downstream task success). The routing is only a win if quality on the routed path is within your tolerance of the frontier baseline. Track both numbers together: a savings figure without a parity figure is meaningless, because you can always "save" 100% by returning garbage. The honest report is "X% cheaper at Y% quality parity," and both figures come from the counterfactual.

Operationally, this means the telemetry from the architecture section is not optional instrumentation — it is the product. The per-request frontier-baseline cost field and a periodic parity check are what let you state a defensible savings number to finance, catch quality drift the moment it appears, and decide whether a more aggressive threshold is safe. Teams that skip the counterfactual end up either under-claiming (afraid to trust a number they cannot defend) or over-claiming (a bill drop that was really a traffic drop) — and either way they cannot safely tune the router because they cannot see what their changes do.

what goes wrong

VIIThe pitfalls that turn routing into a regression

Routing is high-leverage, which means its failure modes are also high-leverage. These are the mistakes that turn a 60% savings into a quality incident, a maintenance burden, or a number nobody trusts.

No eval set — The cardinal sin. Without a labeled, representative eval set you cannot set a threshold, validate a confidence signal, or prove parity. You are guessing which traffic is safe to route down, and on reasoning tasks guessing produces plausible-but-wrong output that no one catches. Build the eval set first; everything else depends on it.
Routing down without a gate — Sending traffic to a cheap model with no check on the output is how silent regressions ship. The cheap model's wrong answers look fine and reach users. Every route that can fail needs a quality gate (schema, business rule, self-grade, or confidence) plus a fallback.
Cascades that escalate too often — A cascade only saves money if the cheap tier passes most of the time. If your cheap-tier pass rate is low, escalations dominate and you pay cheap + mid + frontier on many requests — more than going straight to frontier. Measure the pass rate; if it is low, switch to a classifier or raise the cheap tier.
Trusting confidence signals blindly — Models are frequently confident and wrong; self-reported confidence is weakly calibrated. A confidence threshold that has not been validated against your evals will let confident-wrong answers through. Calibrate the signal on labeled data before you let it gate quality.
Cache fragmentation — Prompt caching is keyed per model. A router that bounces a single conversation across tiers shatters the cache and you lose the ~90% read discount you were counting on. Pin a session to one tier for its duration, or scope caching to a shared system prompt. Routing and caching must be designed together.
Latency blow-ups on interactive paths — Deep cascades add the latency of every tier they touch. On a user-facing path that turns a snappy response into a slow one. Use a single-call classifier for interactive traffic and reserve multi-hop cascades for async or batch work where extra latency is free.
Set-and-forget classifier drift — Your traffic distribution changes — new features, new users, new request shapes. A classifier tuned on last quarter's traffic slowly mis-routes more of this quarter's. Re-validate the router against fresh traffic on a schedule, and watch the parity metric for the early warning.
Claiming savings without a counterfactual — Comparing this month's bill to last month's confounds routing with volume and mix changes. The only defensible number is the per-request frontier-baseline counterfactual. Without it, your savings figure is a guess and you cannot safely tune.

the worked example

VIIIThe savings math — a worked example

Routing savings are arithmetic, not magic. Walking one concrete example end to end shows exactly where the 40–70% comes from, why the easy-traffic share dominates the result, and how escalation overhead eats into the ceiling.

Take a workload of 1,000,000 requests/month. Assume each request averages roughly the same token shape, and that the frontier model costs $10 per request-equivalent in tokens while the cheap model in the same family costs $0.70 — a 14× spread, well within the typical 10–30× range. The frontier-only baseline is therefore 1,000,000 × $10 = $10,000/month (units scaled for a clean illustration; the ratios are what matter, not the absolute dollars).

Step 1 — measure the addressable slice. You sample real traffic and label it: 70% of requests are "easy" (the cheap model clears the bar on your eval set), 30% are "hard" (need the frontier model). That 70% is your addressable routing slice.

Step 2 — route with a perfect classifier (the ceiling). If routing were perfect, the 700,000 easy requests run on the cheap model (700,000 × $0.70 = $490) and the 300,000 hard requests run on the frontier model (300,000 × $10 = $3,000). New total: $3,490 versus the $10,000 baseline — a 65% cut. That is the theoretical ceiling for this traffic distribution and this price spread.

Step 3 — subtract real-world escalation overhead. No classifier is perfect, and it is not free. Two overheads eat into the ceiling. First, misroutes: say the classifier sends 10% of hard requests (30,000) down as "easy"; they run cheap, fail the quality gate, and escalate to frontier — so each pays the cheap cost on top of the frontier cost it would have paid anyway, an extra 30,000 × $0.70 ≈ $21. Second, the classifier itself runs on all 1,000,000 requests; kept genuinely cheap (a small model emitting a label) at, say, $0.05 each, that is ≈ $50 of overhead. Add both to the ceiling and the realized total lands around $3,560 — still roughly a 64% cut, because the escalation and classifier overheads are small when the classifier is good and the gate is cheap. (Use a frontier model as your classifier and that $0.05 becomes $10, erasing the savings — the router must be cheap.)

Step 4 — see why the distribution dominates. Re-run the ceiling with a different mix. If only 40% of traffic is easy, the cut falls to about 37% (400,000 × $0.70 + 600,000 × $10 = $6,280 versus $10,000). If 90% is easy, the cut rises to about 84%. The price spread sets the slope, but the easy-traffic share sets the prize — which is exactly why Section I insists you measure your distribution before building anything. The same router on different traffic produces wildly different savings.

the two levers in one line

Routing savings ≈ (easy-traffic share) × (1 − cheap-rate / frontier-rate), minus a small escalation-and-classifier overhead. A 70% easy share at a 14× spread yields a ~65% ceiling; a good classifier and a cheap gate realize most of it. Push the prize up by widening the addressable slice (better evals reveal more "easy" traffic) and by routing across families to find a cheaper small model than the one in your frontier model's family.

the decision table

Routing approaches and patterns, side by side

The managed primitive, the three custom patterns, and the no-routing baseline — what each costs to build, where it routes, its typical savings, and when to choose it. Savings ranges are directional and depend entirely on your traffic distribution (Section VIII).

Approach	Where it routes	Build / maintenance	Typical savings	Best when
No routing (baseline)	Everything to one frontier model	None	0% (the baseline)	Prototype only; never a steady state
Bedrock IPR (managed)	Within a model family, AWS-decided	None — turn it on	up to ~30%	Want savings with zero maintenance; learning traffic
Custom — classifier	Any tier/family, predicted up front	Build + tune classifier; re-validate on drift	40–70% (eval-gated)	Difficulty predictable from input; high volume
Custom — cascade	Cheap first, escalate on gate failure	Build gate + fallback; watch pass rate	40–70% if cheap-tier pass rate is high	Cheap, reliable correctness check exists; async OK
Custom — confidence-based	Escalate when model signals low confidence	Build + calibrate confidence signal	40–70%, signal-dependent	As a tie-breaker blended into classifier/cascade
AWS credits	Funds the spend during build	Partner-filed application	up to 100% during build phase	Build / pre-revenue / migration phase

IPR is the fast first move — most of nothing-to-maintain savings in an afternoon. Custom routers earn the 40–70% band by routing across families and tuning to your own quality bar, at the cost of building and maintaining the router and its gates. In practice mature stacks run IPR within families and a custom layer where cross-family volume justifies it. Credits are the multiplier that takes the optimized bill to zero while you tune.

want this built, gated, and funded?

Get matched with an AWS partner who builds your routing layer (often AWS-funded)

Start in 3 minutes →

a recent match

A classification + chat workload, re-routed — anonymized

inquiry · series-a b2b saas, AI workflow product, US

Series-A B2B SaaS, 16 engineers, AI workflow product on Bedrock running ~$22K/month on-demand

Situation: Every request — document classification, field extraction, short Q&A, and the occasional hard multi-step reasoning task — hit one frontier model on-demand. The team suspected most of that traffic was easy but had no eval set to prove it, no quality gate to route down safely, and no way to measure savings credibly without risking a silent quality regression in a customer-facing product. The bill was scaling faster than seats.

What CloudRoute did: Routed within 24 hours to a US AWS partner with a Bedrock FinOps and GenAI track record. The partner first built a representative eval set from real traffic, which showed ~68% of requests were "easy." They turned on Bedrock Intelligent Prompt Routing within the family to capture the immediate ~25% with zero code, then built a custom cross-family classifier for the high-volume classification and extraction tasks (cheap model + schema/confidence gate + frontier fallback), pinned each chat session to one tier to protect cache hits, and instrumented a per-request frontier-baseline counterfactual to measure savings and parity.

Outcome: On a measured counterfactual over four weeks of real traffic, blended cost fell ~62% at quality parity within tolerance on the eval set; the run-rate dropped from ~$22K toward ~$8.4K at the same volume — before credits. The partner then filed a Bedrock POC + Activate application; approved credits covered the remaining build-phase spend, taking the effective bill to $0 while the router was tuned. CloudRoute's commission was paid by the partner from AWS engagement funding — the customer paid $0.

engagement window: 6 weeks · founder time: ~8 hours · measured cost cut: ~62% at parity · effective bill during build: $0

faq

Common questions

What is tiered model routing on Bedrock?

Tiered model routing sends each request to the cheapest model that still meets your quality bar, escalating to a stronger model only when the request needs it — instead of defaulting every request to one frontier model. It works because the per-token price spread between a small model and a frontier model in the same family is often 10–30×, and on real traffic most requests are easy enough for the small model. Routing the easy slice down typically removes 40–70% of spend. The decision can be made up front by a classifier, after a cheap attempt by a cascade, or from a model confidence signal — and every route is gated by a quality check with a fallback so misroutes escalate rather than ship.

How much can model routing actually save on LLM cost?

For most production workloads, 40–70%. The exact figure is set by two things: the share of your traffic that is "easy" (handled by a cheap model at quality) and the price spread between your cheap and frontier models. A workload where 70% of traffic is easy and the cheap model is ~14× cheaper has a theoretical ceiling around a 65% cut, and a good classifier with a cheap quality gate realizes most of it. Amazon Bedrock Intelligent Prompt Routing, the managed option, delivers up to roughly 30% with zero engineering; a custom router that routes across model families and tunes to your own quality bar typically reaches the higher 40–70% band. Measure your easy-traffic share first — the distribution decides the prize.

What is Amazon Bedrock Intelligent Prompt Routing and how is it different from a custom router?

Amazon Bedrock Intelligent Prompt Routing (IPR) is AWS's managed routing primitive: you call a routing endpoint instead of a specific model, and AWS routes each prompt to the cheapest model within a configured family that will produce comparable quality, escalating to the larger model on hard prompts. In AWS's published benchmarks it can cut cost by up to ~30% with negligible quality impact, and there is no infrastructure to build or maintain. A custom router trades engineering effort for a higher ceiling: because you own the logic you can route across families and providers (not just within one family) and tune the escalation threshold to your own quality tolerance, which is why custom routers often reach 40–70%. The common path is to turn on IPR first for the easy savings, then build a custom layer where your volume justifies the cross-family routing.

Which routing pattern should I use — classifier, cascade, or confidence-based?

Use a classifier when request difficulty is reasonably predictable from the input and your volume justifies tuning one — it predicts the tier up front and never double-spends, but a misroute needs a gate to catch it. Use a cascade when you have a cheap, reliable correctness check (structured output, verifiable answers) and the cheap tier passes most of the time — it grounds the decision in the actual output but pays for every tier an escalated request touches, so it only wins at a high cheap-tier pass rate. Use confidence-based routing as a tie-breaker blended into the other two rather than alone, because model confidence signals are imperfect and must be calibrated against your evals before they can gate quality. Many production routers are a classifier for the bulk decision with a confidence signal and an output gate for the borderline cases.

Does routing to a cheaper model hurt quality?

Only if you route down without measuring. Many tasks — classification, extraction, routing, short factual answers — are handled by a small model indistinguishably from a frontier model, at a fraction of the cost. The danger is on reasoning-heavy, long-context, or customer-facing tasks where a small model can produce plausible-but-wrong output that a weak eval suite misses; that is a silent regression. The safe procedure is to build a representative eval set, measure each tier against your quality bar, set the escalation threshold to hold that bar, and put a quality gate plus a fallback on every route that can fail so misroutes escalate instead of reaching users. The posture is asymmetric: cheap to escalate, conservative to downgrade.

How do I measure routing savings credibly?

With a counterfactual on the same traffic — not a month-over-month bill comparison, which confounds routing with changes in volume, mix, and prompt size. For every routed request, log both the actual cost (the tier used, including any escalation) and the cost the frontier-only baseline would have incurred on that exact request, then compute savings% = 1 − (sum of actual cost / sum of frontier-baseline cost). Pair that with a quality-parity check: periodically run the same requests through both the router and the frontier baseline and compare on your quality metric. The honest report is "X% cheaper at Y% parity" — a savings number without a parity number is meaningless, because you can always "save" 100% by returning garbage.

What are the biggest pitfalls when building a router?

The cardinal one is having no eval set — without it you cannot set a threshold, validate a confidence signal, or prove parity, so you are guessing which traffic is safe to route down. Closely related is routing down with no quality gate, which ships silent regressions. Other common failures: cascades that escalate too often (they only save money if the cheap tier passes most of the time), trusting un-calibrated confidence signals, fragmenting your prompt cache by bouncing one conversation across tiers, blowing up latency with deep cascades on interactive paths, letting a classifier drift as your traffic changes, and claiming savings without a counterfactual. Each one either erodes the savings or trades them for a quality problem.

Does routing interact with prompt caching and batch inference?

Yes, and you have to design for it. Prompt caching is keyed per model, so a router that sends the same conversation to different tiers on different turns fragments the cache and loses the ~90% read discount — pin a session to one tier for its duration, or scope caching to a system prompt the whole family shares. Batch inference composes cleanly with routing: you can route a batch job to a small model and take the flat 50% batch discount on top, and cascades that add latency are free in a batch context. The levers stack because they attack different terms of the cost formula — routing changes the rate via model choice, caching discounts reused input tokens, and batch halves the rate on async work — so a routed, cached, batched workload lands at a small fraction of the naive baseline.

Can AWS credits cover the cost of building and running a router?

During the build phase, often entirely. AWS funds GenAI work through several pools — Activate Portfolio credits, Bedrock POC credits, and the Generative AI programs — that apply directly against Bedrock spend, including the inference cost of building and tuning a routing layer. A team that has already routed, cached, and batched has a small bill to begin with, and credits can absorb the remainder so the effective cost is $0 through the build. The credits are typically partner-filed: an AWS partner submits the application on your behalf, AWS funds the engagement, and the customer pays nothing. Once you are past the credit window, the routing work is what keeps the steady-state bill 40–70% below the naive baseline.

Cut your LLM bill 40–70% with tiered routing — and let AWS fund the build.

CloudRoute routes you to a vetted AWS partner who builds your routing layer (classifier or cascade, quality gates, fallback, and a counterfactual to prove the savings) and files the credit application that can take the bill to $0. Customer pays $0 — AWS funds the engagement.

Get matched in 24h →→ see the data & AI persona detail

typical cost cut40–70%

cost to you$0

matched within< 24h