Most production GenAI bills are inflated by a single habit: sending every request to a frontier model when the majority would be answered identically by a model that costs a fraction as much. Tiered routing fixes that — each request goes to the cheapest model that still clears your quality bar, with a fallback when it does not. This guide covers the routing patterns (classifier, cascade, confidence-based), Amazon Bedrock Intelligent Prompt Routing versus a custom router, the architecture, the quality gates that keep it safe, how to measure the savings honestly with a counterfactual, the pitfalls, and the worked math.
The largest, most consistent source of waste in production GenAI is not a misconfigured cache or an un-batched job. It is architectural: a single frontier model wired to every code path, answering trivial requests at premium prices because nobody decided otherwise.
Bedrock bills per token, separately for input and output, at rates that differ enormously by model. Within a single model family the spread between the smallest and largest model is frequently more than 10× per token; across families it is larger still, and once you compare a small model to a frontier model the gap is commonly 10–30×. A request that a small model (Claude Haiku-class, Nova Lite/Micro-class, Llama 8B-class) answers indistinguishably from a frontier model (Claude Opus-class, Nova Premier-class) therefore costs you 10–30× more than it needed to, every time it runs.
The reason this default takes hold is understandable. Teams prototype against the most capable model because it "just works," ship it, and never revisit which calls actually need that capability. The frontier model becomes the path of least resistance for every task — classification, extraction, routing, short answers, and genuinely hard reasoning alike — and because each individual request is cheap in isolation, the waste stays invisible until the monthly invoice forces the question.
The insight that makes routing work is that production traffic is not uniformly difficult. On most real workloads the distribution is heavily skewed: a large fraction of requests are easy (a clear category, a short factual answer, a well-formed extraction), a smaller fraction are moderate, and a small tail is genuinely hard. If you measure your own traffic you will usually find that the easy slice — the slice a small model handles perfectly — is the majority of your volume. Paying frontier prices for that majority is the waste tiered routing eliminates.
Tiered routing is the discipline of matching model capability to request difficulty at the granularity of the individual request. Instead of one model for the application, you maintain a tier ladder — a cheap model, a mid model, a frontier model — and a routing decision sends each request to the lowest tier that will still clear the quality bar, escalating only when the request demands it. Done with proper guardrails, it is the highest-impact cost lever in the GenAI FinOps toolkit, and unlike caching or batch it attacks the rate term of the cost formula on the dominant slice of your traffic.
Before building anything, sample a few hundred real production requests and label them easy / moderate / hard against your quality bar. The fraction that is "easy" is your addressable routing slice — the share of traffic you can move to a cheap model. If 70% of your traffic is easy and the cheap model is 15× cheaper, the theoretical ceiling on this lever is roughly a 65% bill cut. The distribution decides the prize; measure it before you write a line of routing code.
Every routing system is one of three patterns or a blend of them. They differ in where the decision happens, what it costs to make, and how they fail. Choosing the right one for your workload is the first design decision.
All three share the same goal — send each request to the cheapest sufficient tier — but they answer "how do we decide?" differently. The classifier predicts difficulty up front; the cascade tries cheap first and escalates on failure; confidence-based routing lets the model itself signal when it is unsure. Understanding the trade each one makes is what lets you pick correctly rather than defaulting to whichever you read about first.
A lightweight classifier sits in front of the model tiers and predicts, from the request alone, which tier should handle it. The classifier can be a small fast model (a Haiku-class model prompted to output a difficulty label), a fine-tuned embedding-plus-logistic-regression model, or even a rules engine for well-understood inputs. It runs once per request, adds a few milliseconds and a tiny token cost, and emits a routing decision before any expensive model is invoked.
Strength: it pays the cheap model exactly once and never invokes a higher tier for traffic it routes down — there is no double-spend. Weakness: the classifier can be wrong, sending a hard request to a model that cannot handle it (a false "easy"). That makes classifier quality the whole game, and it is why classifier routing is usually paired with a confidence signal or an output check so a misroute can still be caught. Classifier routing is the right default when request difficulty is reasonably predictable from the input and your volume justifies tuning a classifier.
The cascade tries the cheapest tier first, evaluates the result, and escalates to the next tier only if the cheap result fails a quality check. The check can be a validator (does the JSON parse, does the answer match a schema, does it pass a regex or a business rule), a self-grade (a cheap model scoring the answer), or a confidence threshold. If the cheap tier passes, you are done at the cheap price; if it fails, you re-run on the stronger model and pay for both.
Strength: the routing decision is grounded in the actual output, not a prediction, so it is robust to inputs the classifier would misjudge. Weakness: escalated requests cost the sum of every tier they touched — a request that fails cheap and mid before succeeding on frontier costs more than going straight to frontier. The cascade only wins when the cheap tier succeeds often enough that the savings on passes outweigh the double-spend on escalations. It shines when you have a cheap, reliable correctness check (structured output, verifiable answers) and the cheap model passes the majority of the time.
Confidence-based routing uses a signal from the model itself to decide whether to trust the cheap answer or escalate. The signal can be token log-probabilities (low probability mass on the chosen tokens implies uncertainty), an explicit self-assessment ("rate your confidence 1–10"), or an abstention pattern where the cheap model is instructed to return a sentinel ("I am not sure") rather than guess. Low confidence triggers escalation to a stronger tier.
Strength: it targets escalation precisely at the requests the cheap model is actually unsure about, rather than a blanket difficulty prediction. Weakness: confidence signals are imperfect — models are often confidently wrong, and self-reported confidence is weakly calibrated — so the signal must be validated against your evals before you trust it to gate quality. In practice confidence-based routing is most powerful blended into a cascade or classifier: use the classifier or cheap attempt for the bulk decision, and the confidence signal as the tie-breaker that decides whether to escalate a borderline case.
AWS ships a managed routing primitive — Amazon Bedrock Intelligent Prompt Routing — that implements tiered routing for you with no infrastructure to run. The alternative is building your own. The right answer depends on how much control you need and how much you are willing to operate.
Amazon Bedrock Intelligent Prompt Routing (IPR) lets you call a single routing endpoint instead of a specific model. For each prompt, AWS predicts which model within a configured family will produce a response of comparable quality and routes to the cheapest one that clears that bar, escalating to the larger model when the prompt is hard. In AWS's published benchmarks, IPR can reduce cost by up to roughly 30% on mixed traffic with negligible impact on quality — and because it is fully managed, you write no classifier, run no escalation logic, and maintain no routing service. You point your application at the router and AWS does the rest.
The trade for that simplicity is control. IPR routes within a model family (for example, between two sizes of the same provider's models) rather than across providers, and the routing decision and its threshold are AWS's, tuned to a general notion of quality parity rather than your specific tolerance. For many teams that is exactly right: it captures the easy 20–30% with zero engineering and zero maintenance, and it is the obvious first move while you are still learning your traffic. It is the lever you turn on in an afternoon.
A custom router trades engineering effort for ceiling. Because you own the classifier (or cascade, or confidence logic), you can route across families and providers — sending a trivial classification to the cheapest small model available anywhere in your catalog, not just the small model in one family — and you can set the escalation threshold to your own quality bar. Teams that build their own routing layer frequently report larger savings than the managed router, often well past 30% and into the 40–70% range, precisely because they exploit the full cross-family price spread and tune aggressively against their own evals. The cost is real: you build the router, you maintain the classifier as your traffic drifts, and you own the failure modes.
The pragmatic path is sequential, not either/or. Turn on IPR first to capture the easy savings immediately and to establish a quality-parity baseline you can measure against. Then, if your volume is large enough that the marginal savings justify the engineering, build a custom router for the cross-family routing and the tighter threshold that the managed primitive cannot give you. Many mature stacks run both — IPR within families for the bulk of traffic, a custom layer for the high-volume tasks where cross-family routing moves real money.
Use IPR if you want most of the savings with none of the maintenance, or while you are still learning your traffic. Build a custom router when your volume is large enough that the gap between ~30% (managed, within-family) and 40–70% (custom, cross-family, tuned) is worth an engineering investment and ongoing upkeep. They compose — start managed, add custom where the volume justifies it.
A production routing layer is more than a classifier. It is a small pipeline — decision, dispatch, gate, fallback, and telemetry — and each stage exists to keep the savings real and the quality safe.
A robust custom router has five stages, and skipping any of the last three is how routing projects quietly regress quality. The components, in request order:
Two cross-cutting concerns sit on top of this pipeline. Caching interacts with routing: prompt caching is keyed per model, so if the router sends the same conversation to different tiers on different turns you fragment the cache and lose hits — pin a session to one tier for its duration, or scope caching to a system prompt the whole family shares. Latency interacts with cascades: a cascade that escalates adds the latency of every tier it touches, so for interactive paths prefer a classifier (one model call) over a deep cascade, and reserve cascades for asynchronous or batch work where the extra latency is free. Design the router with caching and latency in mind from the start; bolting them on later usually means re-architecting the dispatcher.
Routing without quality gates is not optimization; it is a quality regression with a smaller invoice. The gate is the difference between "the cheap model handles 70% of traffic" and "the cheap model silently mangles 70% of traffic and nobody noticed for a month."
The failure mode of naive routing is specific and dangerous: a small model produces output that is plausible, well-formatted, and wrong. It parses, it reads fluently, it passes a casual glance — and it is incorrect in a way that only a real evaluation would catch. Because the output looks fine, the regression is silent. The bill drops, everyone celebrates, and weeks later you discover the support agent has been giving subtly wrong answers, the extraction pipeline has been dropping a field, or the classifier has been mislabeling a category. The savings were real; so was the damage, and the damage usually costs more.
The defense is a quality gate on every route that can fail. The strongest gates are deterministic and cheap: does the JSON parse and match the schema, does the answer satisfy a business rule (a date in range, a value in an enum), does a verifiable claim check out against a source. Where the task has no deterministic check, a self-grade — a cheap model scoring the cheap answer against the question — or a confidence threshold on log-probabilities serves as a softer gate. The gate runs on the cheap output; if it passes, you keep the cheap answer and the savings; if it fails, you escalate.
The fallback path is the gate's other half. A failed gate must escalate to a stronger tier and re-run, with a capped depth so you never loop, and a defined terminal behavior when even the top tier fails — return a best-effort answer flagged for review, or hard-fail to a human queue, depending on how expensive a wrong answer is in your domain. The combination of gate plus fallback is what makes aggressive routing safe: you can route the majority of traffic down knowing that anything the cheap tier gets wrong is caught and escalated rather than shipped.
None of this works without the foundation: a representative eval set. You cannot set a sensible escalation threshold, validate a confidence signal, or trust a self-grade without a labeled set of real requests with known-good answers. The correct sequence is always — build the eval set, measure each tier against it, choose the threshold that holds your quality bar on the eval set, then ship. Optimization without an eval harness is guessing, and on reasoning-heavy tasks guessing costs more in rework than it ever saves in tokens. The eval set is also what lets you re-validate the router as your traffic drifts, which it will.
The safe rule: capable model by default, route down only on tasks where your eval set proves the cheap model holds quality. Never start from the cheap model and hope. Multi-step reasoning, long-context synthesis, code generation in unfamiliar stacks, and anything customer-facing where a wrong answer is expensive should escalate readily and route down conservatively. Asymmetric caution — cheap to escalate, expensive to downgrade — is the posture that keeps routing from becoming a regression.
A routing project that cannot prove its savings is indistinguishable from one that quietly broke quality. The only credible measurement is a counterfactual: what would the same traffic have cost on the frontier model, and did quality hold?
The naive way to "measure" routing savings is to compare this month's bill to last month's. This is almost always wrong, because traffic volume, request mix, and prompt sizes all change month to month — a bill that dropped 30% while traffic also dropped 30% saved nothing, and a bill that stayed flat while traffic doubled saved half. Absolute bill comparisons confound the routing effect with everything else moving in your system, and they cannot isolate what the router actually contributed.
The correct measurement is a counterfactual on the same traffic. For every routed request, log both the actual cost (the tier it used, including any escalation) and the cost the frontier-only baseline would have incurred on that exact request — same input tokens, frontier output rate, estimated or measured output length. The savings is the difference, summed over real traffic: savings% = 1 − (Σ actual cost / Σ frontier-baseline cost). Because both numbers are computed on identical requests, the comparison isolates the routing effect from volume and mix changes entirely.
Savings is only half the counterfactual; the other half is quality parity. Periodically — on a sampled slice of live traffic or on your eval set — run the same requests through both the router and the frontier-only baseline and compare outputs on your quality metric (exact match, rubric score, human grade, downstream task success). The routing is only a win if quality on the routed path is within your tolerance of the frontier baseline. Track both numbers together: a savings figure without a parity figure is meaningless, because you can always "save" 100% by returning garbage. The honest report is "X% cheaper at Y% quality parity," and both figures come from the counterfactual.
Operationally, this means the telemetry from the architecture section is not optional instrumentation — it is the product. The per-request frontier-baseline cost field and a periodic parity check are what let you state a defensible savings number to finance, catch quality drift the moment it appears, and decide whether a more aggressive threshold is safe. Teams that skip the counterfactual end up either under-claiming (afraid to trust a number they cannot defend) or over-claiming (a bill drop that was really a traffic drop) — and either way they cannot safely tune the router because they cannot see what their changes do.
Routing is high-leverage, which means its failure modes are also high-leverage. These are the mistakes that turn a 60% savings into a quality incident, a maintenance burden, or a number nobody trusts.
Routing savings are arithmetic, not magic. Walking one concrete example end to end shows exactly where the 40–70% comes from, why the easy-traffic share dominates the result, and how escalation overhead eats into the ceiling.
Take a workload of 1,000,000 requests/month. Assume each request averages roughly the same token shape, and that the frontier model costs $10 per request-equivalent in tokens while the cheap model in the same family costs $0.70 — a 14× spread, well within the typical 10–30× range. The frontier-only baseline is therefore 1,000,000 × $10 = $10,000/month (units scaled for a clean illustration; the ratios are what matter, not the absolute dollars).
Step 1 — measure the addressable slice. You sample real traffic and label it: 70% of requests are "easy" (the cheap model clears the bar on your eval set), 30% are "hard" (need the frontier model). That 70% is your addressable routing slice.
Step 2 — route with a perfect classifier (the ceiling). If routing were perfect, the 700,000 easy requests run on the cheap model (700,000 × $0.70 = $490) and the 300,000 hard requests run on the frontier model (300,000 × $10 = $3,000). New total: $3,490 versus the $10,000 baseline — a 65% cut. That is the theoretical ceiling for this traffic distribution and this price spread.
Step 3 — subtract real-world escalation overhead. No classifier is perfect, and it is not free. Two overheads eat into the ceiling. First, misroutes: say the classifier sends 10% of hard requests (30,000) down as "easy"; they run cheap, fail the quality gate, and escalate to frontier — so each pays the cheap cost on top of the frontier cost it would have paid anyway, an extra 30,000 × $0.70 ≈ $21. Second, the classifier itself runs on all 1,000,000 requests; kept genuinely cheap (a small model emitting a label) at, say, $0.05 each, that is ≈ $50 of overhead. Add both to the ceiling and the realized total lands around $3,560 — still roughly a 64% cut, because the escalation and classifier overheads are small when the classifier is good and the gate is cheap. (Use a frontier model as your classifier and that $0.05 becomes $10, erasing the savings — the router must be cheap.)
Step 4 — see why the distribution dominates. Re-run the ceiling with a different mix. If only 40% of traffic is easy, the cut falls to about 37% (400,000 × $0.70 + 600,000 × $10 = $6,280 versus $10,000). If 90% is easy, the cut rises to about 84%. The price spread sets the slope, but the easy-traffic share sets the prize — which is exactly why Section I insists you measure your distribution before building anything. The same router on different traffic produces wildly different savings.
Routing savings ≈ (easy-traffic share) × (1 − cheap-rate / frontier-rate), minus a small escalation-and-classifier overhead. A 70% easy share at a 14× spread yields a ~65% ceiling; a good classifier and a cheap gate realize most of it. Push the prize up by widening the addressable slice (better evals reveal more "easy" traffic) and by routing across families to find a cheaper small model than the one in your frontier model's family.
The managed primitive, the three custom patterns, and the no-routing baseline — what each costs to build, where it routes, its typical savings, and when to choose it. Savings ranges are directional and depend entirely on your traffic distribution (Section VIII).
| Approach | Where it routes | Build / maintenance | Typical savings | Best when |
|---|---|---|---|---|
| No routing (baseline) | Everything to one frontier model | None | 0% (the baseline) | Prototype only; never a steady state |
| Bedrock IPR (managed) | Within a model family, AWS-decided | None — turn it on | up to ~30% | Want savings with zero maintenance; learning traffic |
| Custom — classifier | Any tier/family, predicted up front | Build + tune classifier; re-validate on drift | 40–70% (eval-gated) | Difficulty predictable from input; high volume |
| Custom — cascade | Cheap first, escalate on gate failure | Build gate + fallback; watch pass rate | 40–70% if cheap-tier pass rate is high | Cheap, reliable correctness check exists; async OK |
| Custom — confidence-based | Escalate when model signals low confidence | Build + calibrate confidence signal | 40–70%, signal-dependent | As a tie-breaker blended into classifier/cascade |
| AWS credits | Funds the spend during build | Partner-filed application | up to 100% during build phase | Build / pre-revenue / migration phase |
Situation: Every request — document classification, field extraction, short Q&A, and the occasional hard multi-step reasoning task — hit one frontier model on-demand. The team suspected most of that traffic was easy but had no eval set to prove it, no quality gate to route down safely, and no way to measure savings credibly without risking a silent quality regression in a customer-facing product. The bill was scaling faster than seats.
What CloudRoute did: Routed within 24 hours to a US AWS partner with a Bedrock FinOps and GenAI track record. The partner first built a representative eval set from real traffic, which showed ~68% of requests were "easy." They turned on Bedrock Intelligent Prompt Routing within the family to capture the immediate ~25% with zero code, then built a custom cross-family classifier for the high-volume classification and extraction tasks (cheap model + schema/confidence gate + frontier fallback), pinned each chat session to one tier to protect cache hits, and instrumented a per-request frontier-baseline counterfactual to measure savings and parity.
Outcome: On a measured counterfactual over four weeks of real traffic, blended cost fell ~62% at quality parity within tolerance on the eval set; the run-rate dropped from ~$22K toward ~$8.4K at the same volume — before credits. The partner then filed a Bedrock POC + Activate application; approved credits covered the remaining build-phase spend, taking the effective bill to $0 while the router was tuned. CloudRoute's commission was paid by the partner from AWS engagement funding — the customer paid $0.
engagement window: 6 weeks · founder time: ~8 hours · measured cost cut: ~62% at parity · effective bill during build: $0
CloudRoute routes you to a vetted AWS partner who builds your routing layer (classifier or cascade, quality gates, fallback, and a counterfactual to prove the savings) and files the credit application that can take the bill to $0. Customer pays $0 — AWS funds the engagement.