Inference, not training, is where production AI bills are won or lost. Training is a one-time capital event; inference is a per-token cost that compounds with every user, every day, forever. This guide maps the three ways to serve models on AWS — Bedrock managed, SageMaker endpoints, and self-hosting on EC2/Inferentia — their real cost models, the specific levers that move each one, and a break-even framework for deciding which to use at your scale.
Most teams budget for training because it has a big, visible number attached. But for any product that actually ships and gets used, inference is the line item that grows without bound. Understanding why reframes every decision below.
Training is a capital event: you spend a fixed sum once (or once per major model version) and you are done. Inference is an operating cost that scales linearly with usage — every request, from every user, on every day the product is live, costs money. A model trained for $200K can easily generate $200K of inference cost per quarter once it is serving real traffic. The asymmetry is the whole point: a training run ends; inference never does.
The cost of a single inference call is driven by three things: the number of tokens processed (input + output), the size and architecture of the model, and the hardware it runs on. Input tokens (the prompt, the retrieved context, the system instructions) are usually cheaper per token than output tokens, because output is generated autoregressively one token at a time while input can be processed in a single forward pass. On most managed platforms an output token costs 3–5× as much as an input token, which is why verbose prompts hurt less than verbose responses.
This is also why retrieval-augmented generation (RAG) quietly inflates bills. Every RAG call stuffs retrieved documents into the prompt, and those documents are billed as input tokens on every single request. A RAG system that retrieves 4,000 tokens of context per query pays for those 4,000 tokens millions of times over the product's life. The fix is rarely "use a cheaper model" — it is "send fewer tokens," through better retrieval, context compression, and caching.
The mental model for the rest of this guide: your inference bill is roughly requests × tokens-per-request × price-per-token, and price-per-token is itself a function of model size and hardware efficiency. Every optimization lever attacks one of those factors. The biggest wins come from attacking the largest factors first, usually token volume and model size, before micro-optimizing the hardware.
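The bill formula above can be sketched in a few lines of Python. This is a minimal illustration, not a pricing calculator: the function name, the traffic numbers, and the per-token prices are all hypothetical placeholders rather than real Bedrock rates.

```python
def monthly_inference_cost(requests, input_tokens, output_tokens,
                           input_price_per_1k, output_price_per_1k):
    """Estimate a monthly bill as requests x tokens-per-request x price-per-token."""
    cost_per_request = (
        input_tokens / 1000 * input_price_per_1k
        + output_tokens / 1000 * output_price_per_1k
    )
    return requests * cost_per_request


# Hypothetical prices (USD per 1,000 tokens); output is priced at 5x input here.
INPUT_PRICE = 0.003
OUTPUT_PRICE = 0.015

# 2M requests/month: 4,000 tokens of retrieved context + 500 of prompt, 300-token answers.
baseline = monthly_inference_cost(2_000_000, 4_500, 300, INPUT_PRICE, OUTPUT_PRICE)
# Same traffic after trimming retrieved context from 4,000 to 1,000 tokens.
trimmed = monthly_inference_cost(2_000_000, 1_500, 300, INPUT_PRICE, OUTPUT_PRICE)

print(f"baseline: ${baseline:,.0f}/month, trimmed: ${trimmed:,.0f}/month")
```

Under these placeholder numbers, trimming the retrieved context alone takes the bill from roughly $36,000 to roughly $18,000 per month, with no change to the model or the hardware.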
Every inference workload on AWS lands on one of three serving substrates. They are not interchangeable; each has a distinct cost shape that makes it cheap in one regime and expensive in another. Choosing the wrong one is the most common and most expensive mistake.
The defining variable is how each option charges for idle capacity. Bedrock charges nothing when no request is in flight. SageMaker real-time endpoints charge the full instance rate whether traffic is zero or saturated. Self-hosted EC2 charges for the instance the moment it boots, idle or not. That single difference — what you pay for nothing — determines which option wins at your traffic profile.
What it is: A serverless API to foundation models (Anthropic Claude, Meta Llama, Amazon Nova/Titan, Mistral, Cohere, and others). You send tokens, you get tokens, AWS owns all the infrastructure. No instances, no scaling, no GPUs to manage.
How it bills: Per input token and per output token, at a published per-model rate. On-demand has zero idle cost — you pay only for tokens actually processed. For steady high volume, Provisioned Throughput reserves dedicated capacity (billed per "model unit" per hour, optionally with 1- or 6-month commitments at a discount) which can beat on-demand once utilization is high. Batch inference processes large jobs asynchronously at roughly 50% of the on-demand token price.
Cost shape: Linear with token volume, zero fixed cost. This is the cheapest option for spiky, low, or unpredictable traffic — you never pay for idle. It becomes relatively expensive only at very high, very steady volume, where the per-token markup over raw hardware cost starts to dominate.
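To make the on-demand versus Provisioned Throughput trade-off concrete, here is a rough break-even sketch. The hourly model-unit rate and the blended per-token price are hypothetical, and the sketch assumes a single model unit can actually sustain the break-even throughput, which you would need to confirm for your model.

```python
def provisioned_break_even_tokens_per_hour(model_unit_hourly_rate, on_demand_price_per_1k):
    """Tokens per hour above which one reserved model unit beats on-demand pricing."""
    return model_unit_hourly_rate / on_demand_price_per_1k * 1000


def cheaper_tier(tokens_per_hour, model_unit_hourly_rate, on_demand_price_per_1k):
    """Compare the hourly on-demand token bill with the flat model-unit rate."""
    on_demand_hourly = tokens_per_hour / 1000 * on_demand_price_per_1k
    return "provisioned" if on_demand_hourly > model_unit_hourly_rate else "on-demand"


# Hypothetical rates: $40 per model unit per hour, $0.004 blended per 1,000 tokens.
break_even = provisioned_break_even_tokens_per_hour(40.0, 0.004)
print(f"break-even: {break_even:,.0f} tokens/hour")
print(cheaper_tier(2_000_000, 40.0, 0.004))   # well below break-even
print(cheaper_tier(20_000_000, 40.0, 0.004))  # well above break-even
```

The comparison only holds for hours when traffic is actually at that level; reserved capacity is billed around the clock, so the average across the day is what matters.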
What it is: Managed model hosting where you bring a model (open-weights or your own fine-tune) and SageMaker runs it on instances you select. Real-time endpoints keep instances warm; Serverless Inference scales to zero; Asynchronous endpoints queue large or bursty requests.
How it bills: Real-time endpoints bill per instance-hour for as long as the endpoint exists, regardless of request volume. A warm ml.g5.xlarge costs the same at 2% utilization as at 90%. Serverless Inference bills per millisecond of compute actually used (with cold starts as the trade-off); Asynchronous bills per instance-hour but can scale to zero between batches.
Cost shape: Real-time is cheap only when utilization is high — the per-instance-hour rate is amortized across many requests. At low utilization it is the most expensive of all three options because you pay full freight for idle GPUs. This is the right choice when you need a custom or fine-tuned open-weights model served with predictable, sustained load.
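The effect of utilization on a warm endpoint can be shown with simple arithmetic. The sketch below assumes a hypothetical instance rate and a hypothetical maximum request throughput; only the ratio between the two results matters.

```python
def cost_per_1k_requests(instance_hourly_rate, max_requests_per_hour, utilization):
    """Effective cost per 1,000 requests on an always-on instance."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    served_per_hour = max_requests_per_hour * utilization
    return instance_hourly_rate / served_per_hour * 1000


# Hypothetical: $1.50/hour instance that can serve 10,000 requests/hour when saturated.
idle_heavy = cost_per_1k_requests(1.50, 10_000, 0.02)
busy = cost_per_1k_requests(1.50, 10_000, 0.90)
print(f"2% utilization:  ${idle_heavy:.2f} per 1k requests")
print(f"90% utilization: ${busy:.2f} per 1k requests")
```

The instance bill is identical in both cases; the per-request cost differs by the ratio of the utilizations (45× here), which is the whole argument for keeping real-time endpoints busy.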
What it is: You run the model yourself on raw EC2 instances — NVIDIA GPU instances (P5/P4d/G6/G5), AWS Inferentia2 (Inf2), or CPU/Graviton for small models — with your own serving stack (vLLM, TGI, TensorRT-LLM, or the Neuron SDK for Inferentia). Maximum control, maximum responsibility.
How it bills: Per instance-hour for the EC2 instance, plus storage and data transfer. On-demand is the list price; Spot Instances cut 60–90% for interruptible workloads; Savings Plans and Reserved Instances cut 30–72% for committed steady use. You also pay — in engineering time — for autoscaling, batching, health checks, model loading, and on-call.
Cost shape: Lowest possible unit cost at high, steady utilization, especially on Inferentia2 or with Spot. But you pay for every idle second, and the operational burden is real. This wins only above the break-even volume and only when you can keep the hardware busy.
On a managed pay-per-token platform you cannot touch the hardware — so every lever is about sending fewer, cheaper tokens, or shifting work to cheaper price tiers. Three levers do almost all the work.
Because Bedrock has zero idle cost, optimization here is purely about the token bill. You are not fighting utilization; you are fighting volume and price tier. The three highest-leverage moves are prompt caching, batch processing, and model routing — in roughly that order of impact for most workloads.
Prompt caching lets you mark a stable prefix of the prompt — a long system instruction, a tool schema, a few-shot block, a retrieved document set — so that repeated calls reuse the already-processed prefix instead of re-billing it at full input price. Cache reads are dramatically cheaper than fresh input tokens (a large fraction off, depending on model), and cache writes carry a small premium on the first call.
The economics are decisive for any workload with a large, repeated context: agents with long tool definitions, chatbots with a fixed persona, RAG systems re-sending the same knowledge base chunks, or document-processing pipelines that reuse the same instructions across thousands of files. On these patterns prompt caching commonly removes 50–90% of input-token cost. The discipline is structural: put everything stable at the front of the prompt and everything variable at the end, so the cacheable prefix is as long as possible.
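A back-of-envelope model of the caching economics looks like this. The write premium (1.25×) and read discount (0.10×) are assumptions chosen for illustration; actual multipliers vary by model, and the sketch ignores cache expiry, so treat the result as an upper bound on savings.

```python
def input_cost_without_cache(calls, prefix_tokens, variable_tokens, price_per_1k):
    """Every call pays full input price for the whole prompt."""
    return calls * (prefix_tokens + variable_tokens) * price_per_1k / 1000


def input_cost_with_cache(calls, prefix_tokens, variable_tokens, price_per_1k,
                          write_multiplier=1.25, read_multiplier=0.10):
    """First call writes the prefix to cache; later calls read it at a discount.

    Multipliers are assumed values, not published rates.
    """
    price = price_per_1k / 1000
    first_call = prefix_tokens * write_multiplier * price + variable_tokens * price
    later_call = prefix_tokens * read_multiplier * price + variable_tokens * price
    return first_call + (calls - 1) * later_call


# Hypothetical workload: 10,000 calls, 4,000-token stable prefix, 500 variable tokens.
uncached = input_cost_without_cache(10_000, 4_000, 500, 0.003)
cached = input_cost_with_cache(10_000, 4_000, 500, 0.003)
savings = 1 - cached / uncached
print(f"input cost: ${uncached:.2f} -> ${cached:.2f} ({savings:.0%} saved)")
```

Note how the saving depends on the prefix-to-variable ratio: the longer the stable prefix relative to the variable tail, the closer the saving gets to the read discount itself.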
Bedrock batch inference runs large jobs asynchronously and prices tokens at roughly 50% of the on-demand rate. Any workload that does not need a synchronous, sub-second response is a candidate: nightly summarization, bulk classification, embeddings generation, dataset labeling, content moderation backfills, evaluation runs. The trade-off is latency — jobs complete on AWS's schedule, not instantly — in exchange for halving the token bill.
The common mistake is running batch-shaped work through the real-time API because that was the first integration built. Auditing your traffic for "does this actually need to be synchronous?" and moving the answer-is-no portion to batch is often a same-day 20–40% reduction on total Bedrock spend, with no model or quality change whatsoever.
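The audit described above can be sketched as a small script. The call types, their monthly costs, and the `needs_sync` flags are hypothetical; in practice they would come from your own spend instrumentation.

```python
from dataclasses import dataclass


@dataclass
class CallType:
    name: str
    monthly_cost: float  # current spend through the real-time API (USD)
    needs_sync: bool     # does a user wait on this response?


def batch_savings(call_types, batch_discount=0.5):
    """Return (dollars saved, fraction of total spend) from moving async work to batch."""
    total = sum(c.monthly_cost for c in call_types)
    movable = sum(c.monthly_cost for c in call_types if not c.needs_sync)
    saved = movable * batch_discount
    return saved, saved / total


# Hypothetical traffic breakdown.
traffic = [
    CallType("chat", 6_000, needs_sync=True),
    CallType("nightly_summarization", 3_000, needs_sync=False),
    CallType("bulk_classification", 1_000, needs_sync=False),
]
saved, fraction = batch_savings(traffic)
print(f"move to batch: save ${saved:,.0f}/month ({fraction:.0%} of spend)")
```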
Not every request needs the largest, most capable model. A routing layer classifies each request by difficulty and sends easy ones (simple extraction, short classification, formatting) to a small fast model and hard ones (multi-step reasoning, nuanced generation) to a frontier model. Because the price gap between a small and a large model on Bedrock can be 10–20×, even routing 40–60% of traffic to the small model produces large savings.
The pattern that holds quality is the cascade: try the cheap model first, and escalate to the expensive model only when a confidence check or validator flags the cheap answer as inadequate. Done well, a cascade captures most of the cost savings of the small model while preserving frontier-model quality on the requests that genuinely need it. The investment is in the router and the validators, not the models themselves.
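A minimal sketch of the cascade pattern follows. The model callables are stubs standing in for real model invocations, and the confidence score and the 0.8 threshold are assumptions; a production validator would be task-specific (schema checks, a classifier, or a self-reported score).

```python
from typing import Callable, Tuple


def cascade(request: str,
            small_model: Callable[[str], Tuple[str, float]],
            large_model: Callable[[str], str],
            min_confidence: float = 0.8) -> Tuple[str, str]:
    """Try the cheap model first; escalate only when its confidence is too low."""
    answer, confidence = small_model(request)
    if confidence >= min_confidence:
        return answer, "small"
    return large_model(request), "large"


# Stub models for illustration only: short requests are treated as "easy".
def stub_small(request: str) -> Tuple[str, float]:
    confidence = 0.95 if len(request.split()) <= 8 else 0.40
    return f"small:{request}", confidence


def stub_large(request: str) -> str:
    return f"large:{request}"


easy = cascade("classify this ticket", stub_small, stub_large)
hard = cascade("compare these three contracts clause by clause and explain the legal risk",
               stub_small, stub_large)
print(easy[1], hard[1])
```

Escalated requests pay for both models, so the cascade only saves money when the small model handles a clear majority of traffic on its own.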
These compound multiplicatively, not additively. A workload that caches a 4,000-token RAG prefix (about 70% off input cost), moves its nightly bulk jobs to batch (50% off that slice), and routes half its traffic to a small model (roughly 10× cheaper on that half) can land at 20–35% of its original Bedrock bill without a measurable quality regression. Measure each lever in isolation first so you know which one is actually carrying the savings.
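The multiplicative compounding can be checked with a short calculation. The input share of the bill (60%) and the batch-eligible share (30%) are hypothetical, and the sketch assumes the three levers apply independently and uniformly across the bill, which real workloads only approximate.

```python
def remaining_bill_fraction(input_share, batch_share,
                            cache_input_cut=0.70, batch_discount=0.50,
                            routed_share=0.50, small_model_price_ratio=0.10):
    """Fraction of the original bill left after caching, batch, and routing.

    Assumes the levers are independent, so their factors multiply.
    """
    after_cache = input_share * (1 - cache_input_cut) + (1 - input_share)
    after_batch = (1 - batch_share) + batch_share * (1 - batch_discount)
    after_routing = (1 - routed_share) + routed_share * small_model_price_ratio
    return after_cache * after_batch * after_routing


# Hypothetical split: 60% of spend is input tokens, 30% of spend is batch-eligible.
remaining = remaining_bill_fraction(input_share=0.60, batch_share=0.30)
print(f"remaining bill: {remaining:.0%} of original")
```

With these assumed shares the three factors are 0.58, 0.85, and 0.55, whose product is about 0.27, inside the 20–35% band quoted above.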
When you run the hardware — whether raw EC2 or a SageMaker real-time endpoint — the enemy is idle capacity and overpriced silicon. The levers here are about utilization and about choosing chips priced for inference rather than training.
The first principle of self-hosting is brutal: an idle GPU is pure loss. A warm p4d left running overnight at 3% utilization is burning the same dollars as one at 90%. So every lever below is ultimately about either keeping the hardware busy (right-sizing, autoscaling, batching) or paying less for each hour it runs (Spot, Savings Plans, cheaper chips).
Most self-hosted inference fleets are over-provisioned because the team sized for peak and never scaled down. Right-sizing means matching instance type to the model's actual memory and throughput needs (a 7B model does not need an 80GB GPU) and configuring autoscaling so instance count tracks real traffic — scaling out under load and, critically, scaling in (or to zero, on Serverless/Asynchronous endpoints) when traffic falls.
Continuous (in-flight) batching is the highest-leverage software lever: servers like vLLM and TGI pack many concurrent requests through the GPU together, raising throughput per instance several-fold versus naive one-request-at-a-time serving. Higher throughput per instance means fewer instances for the same traffic, which means lower cost — often a 2–4× efficiency gain from the serving stack alone, before any hardware change.
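The link between serving-stack throughput and fleet size is simple division. In this sketch the per-instance throughput figures are hypothetical, and it ignores headroom for failover and traffic bursts, which a real capacity plan would add.

```python
import math


def instances_needed(peak_requests_per_sec, requests_per_sec_per_instance):
    """Smallest whole number of instances that covers peak traffic."""
    return math.ceil(peak_requests_per_sec / requests_per_sec_per_instance)


# Hypothetical: 100 req/s peak; naive serving handles 5 req/s per instance,
# continuous batching raises that 3x to 15 req/s per instance.
naive_fleet = instances_needed(100, 5)
batched_fleet = instances_needed(100, 15)
print(f"naive: {naive_fleet} instances, batched: {batched_fleet} instances")
```

Because the instance rate is fixed, the fleet-size ratio is also the cost ratio for the same traffic.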
Spot Instances sell spare EC2 capacity at 60–90% below on-demand, with the catch that AWS can reclaim them on two minutes' notice. For inference this is often acceptable: stateless replicas behind a load balancer can lose a node and recover, and batch/asynchronous jobs can checkpoint and resume. The pattern is a mixed fleet — a baseline of on-demand or Savings-Plan capacity for guaranteed availability, topped up with Spot for the variable load.
Spot is a poor fit for a single-replica real-time endpoint with strict SLAs, because an interruption is a visible outage. It shines for embeddings pipelines, batch scoring, asynchronous endpoints, and any horizontally-scaled fleet where losing one of many replicas degrades gracefully rather than failing hard.
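The mixed-fleet pattern can be priced with a blended-rate sketch. The discounts used here (30% for committed baseline capacity, 70% for Spot) are assumed values picked from within the ranges quoted in this guide; actual Spot prices fluctuate by instance type and Availability Zone.

```python
def blended_hourly_cost(total_instances, baseline_instances, on_demand_rate,
                        baseline_discount=0.30, spot_discount=0.70):
    """Hourly cost of a fleet with a committed baseline topped up with Spot."""
    if baseline_instances > total_instances:
        raise ValueError("baseline cannot exceed total fleet size")
    spot_instances = total_instances - baseline_instances
    baseline_cost = baseline_instances * on_demand_rate * (1 - baseline_discount)
    spot_cost = spot_instances * on_demand_rate * (1 - spot_discount)
    return baseline_cost + spot_cost


# Hypothetical fleet: 10 instances at $1.00/hour on-demand, 4 of them committed baseline.
all_on_demand = 10 * 1.00
mixed = blended_hourly_cost(10, 4, 1.00)
print(f"on-demand: ${all_on_demand:.2f}/h, mixed fleet: ${mixed:.2f}/h")
```

The baseline size is the availability decision: it should cover the traffic you must serve even if every Spot instance is reclaimed at once.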
AWS Inferentia2 (Inf2 instances) is a purpose-built inference accelerator that, for many transformer workloads, delivers materially lower cost-per-token than comparable NVIDIA GPU instances. Commonly cited figures run as high as ~70% lower inference cost, along with meaningfully better performance-per-watt, though the realized saving depends on the model and workload. Trainium is its training-oriented sibling and can also serve inference. The trade-off is the toolchain: you compile and serve models through the AWS Neuron SDK rather than a stock CUDA stack, which adds integration work and means not every model or custom op is supported out of the box.
For high-volume, steady inference on well-supported architectures (Llama-family, many standard transformers), Inferentia2 is frequently the single largest unit-cost lever available — larger than Spot, larger than right-sizing — precisely because it attacks the price of the silicon itself. The decision is usually "is my model supported on Neuron, and is my volume high enough to justify the porting effort?" If both are yes, it is hard to beat on dollars per token.
Not every model needs a GPU. Small models, embeddings, classical ML, and many distilled or quantized sub-3B language models run perfectly well on CPU — and AWS Graviton (Arm-based) instances offer strong price-performance for exactly this. For a high-volume embeddings service or a small classifier, a Graviton fleet can be a fraction of the cost of keeping GPUs warm, with the added benefit that CPU capacity is abundant and cheap on Spot.
The rule of thumb: if the model fits and meets latency on CPU, do not pay for a GPU. Reserve accelerators (GPU/Inferentia) for the models that genuinely need them, and push everything small onto Graviton. This tiering — Graviton for small, Inferentia2 for large, GPU only where required — is how cost-disciplined teams structure a self-hosted fleet.
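The tiering rule can be written down as a tiny decision function. The function name and its two boolean inputs are hypothetical; the answers to both questions have to come from your own latency benchmarks and from checking Neuron support for your architecture.

```python
def pick_hardware_tier(meets_latency_on_cpu: bool, supported_on_neuron: bool) -> str:
    """Apply the tiering rule: CPU if it suffices, then Inferentia2, GPU only if required."""
    if meets_latency_on_cpu:
        return "graviton"
    if supported_on_neuron:
        return "inferentia2"
    return "gpu"


# Hypothetical examples.
print(pick_hardware_tier(meets_latency_on_cpu=True, supported_on_neuron=False))   # small embeddings model
print(pick_hardware_tier(meets_latency_on_cpu=False, supported_on_neuron=True))   # supported large model
print(pick_hardware_tier(meets_latency_on_cpu=False, supported_on_neuron=False))  # custom ops, unsupported
```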
Platform and instance choices move cost by factors of two or three. Model-level optimization can move it by factors of four to ten, and it works on every serving option simultaneously, because a smaller model is cheaper everywhere. Pull this lever before shopping for a cheaper platform or instance.
A model's cost is dominated by its size: parameter count drives the memory it needs, the hardware it fits on, and the compute per token. Make the model smaller without losing the accuracy you need, and every downstream cost shrinks in proportion — fewer/cheaper instances if you self-host, fewer GPUs per request, sometimes a jump to a cheaper hardware tier entirely (GPU → Inferentia, or even GPU → Graviton CPU). The two dominant techniques are quantization and distillation.
Quantization stores and computes model weights at lower numerical precision — FP16 or BF16 instead of FP32, or INT8/FP8/INT4 instead of 16-bit. Halving precision roughly halves memory footprint and increases throughput, which directly cuts the hardware needed per token. Modern post-training quantization (e.g. INT8/FP8 and well-tuned INT4 schemes) typically preserves accuracy within a small tolerance for most production tasks, making it close to free savings.
The practical payoff is often a hardware-tier change: a model that needed two GPUs at FP16 may fit on one at INT8, halving the instance cost outright. On a memory-bound workload, quantization can be the difference between needing an 80GB accelerator and fitting comfortably on a 24GB one. Always validate quality on your own evals after quantizing — the accuracy hit is usually small, but "usually" is not "always," and it is task-dependent.
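The hardware-tier effect of quantization follows from bytes-per-parameter arithmetic. The sketch below counts weight memory only and adds a flat 20% allowance for KV cache and activations; that allowance is an assumption, and real overhead depends on batch size and context length.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0,
                   "int8": 1.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions, precision):
    """Weight footprint in GB: 1e9 params x bytes/param, divided by 1e9 bytes/GB."""
    return params_billions * BYTES_PER_PARAM[precision]


def fits(params_billions, precision, accelerator_gb, overhead_fraction=0.20):
    """True if weights plus an assumed runtime overhead fit in accelerator memory."""
    needed = weight_memory_gb(params_billions, precision) * (1 + overhead_fraction)
    return needed <= accelerator_gb


# A 13B-parameter model against a 24 GB accelerator.
for precision in ("fp16", "int8", "int4"):
    gb = weight_memory_gb(13, precision)
    print(f"{precision}: {gb:.1f} GB weights, fits on 24 GB: {fits(13, precision, 24)}")
```

At FP16 the 13B model's weights alone (26 GB) exceed a 24 GB device, while the INT8 version fits with room to spare, which is the tier change described above.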
Distillation trains a small "student" model to reproduce the behavior of a large "teacher" model on your specific task distribution. The result is a compact model that, on the narrow domain you care about, approaches the quality of a model many times its size — at a fraction of the inference cost. For a focused production task (classification, extraction, a specific style of generation, routing), a well-distilled small model frequently matches a frontier model closely enough while costing 5–15× less to serve.
The trade-off is up-front effort and generality: distillation requires a training pipeline and high-quality teacher outputs, and the student is specialized — it will not generalize beyond its training distribution the way a frontier model does. The pattern that wins is hybrid: distill the small model for the high-volume, narrow, repetitive 80% of traffic, and keep a frontier model (on Bedrock) for the long-tail, hard, or open-ended 20%. You pay frontier prices only where frontier capability is actually required.
Optimize in this order: (1) workload — cache, batch, send fewer tokens; (2) model — quantize, then distill for narrow high-volume tasks; (3) serving option — Bedrock vs SageMaker vs self-host; (4) instance — Inferentia2/Graviton, Spot, Savings Plans. Teams routinely jump straight to step 4 (chasing a cheaper instance) and leave the 5–10× wins in steps 1–2 on the table.
The recurring question is whether to stay on managed Bedrock or self-host on EC2/Inferentia. The honest answer is a break-even calculation — and the break-even is higher than most teams assume once engineering time is priced in.
The seduction of self-hosting is the raw per-token math: at high utilization, an Inferentia2 or Spot-GPU fleet can serve tokens at a fraction of Bedrock's per-token rate. The trap is that the per-token math ignores three real costs — idle capacity, engineering time, and operational risk — that managed platforms absorb on your behalf.
Idle capacity is the silent killer. Bedrock charges you nothing between requests; a self-hosted fleet charges for every idle second. If your traffic is spiky or your utilization sits below ~50%, the idle hours can erase the entire per-token advantage and then some. Self-hosting only wins when you can keep the hardware genuinely busy — high, steady, predictable load.
Then there is the human cost. A production self-hosted inference stack needs autoscaling, continuous batching, model loading and versioning, health checks, GPU monitoring, capacity planning, Spot-interruption handling, and on-call coverage. That is real, ongoing senior-engineer time — frequently the equivalent of a meaningful fraction of an FTE. At a loaded engineering cost, that labor often exceeds the infrastructure savings until volume is large.
As a working heuristic for 2026: below roughly $8K–$15K/month of equivalent on-demand Bedrock spend, managed almost always wins on total cost of ownership once you price the engineering time honestly. Above that band — with steady, predictable traffic and a model that is well-supported on Inferentia2 — self-hosting can cut the unit cost 40–70%, and the savings start to dwarf the operational overhead. The break-even is not a universal constant; it moves with your utilization, your team's existing ML-ops maturity, and how interruptible your workload is. But the shape is consistent: low/spiky volume → managed; high/steady volume → self-host.
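The shape of this heuristic can be reproduced with one line of algebra: self-hosting breaks even when the unit-cost saving on your current spend equals the added engineering cost. The $6,000/month engineering figure below is a hypothetical stand-in for a fraction of a senior engineer, and the sketch assumes the unit-cost reduction already accounts for idle capacity.

```python
def self_host_total(bedrock_monthly_spend, unit_cost_reduction, engineering_monthly_cost):
    """Monthly cost of serving the same traffic self-hosted, including engineering time."""
    return bedrock_monthly_spend * (1 - unit_cost_reduction) + engineering_monthly_cost


def break_even_monthly_spend(engineering_monthly_cost, unit_cost_reduction):
    """Bedrock spend at which self-hosting and managed cost the same."""
    return engineering_monthly_cost / unit_cost_reduction


# Hypothetical: $6,000/month of engineering time, unit cost cut by 40%, 50%, or 70%.
for reduction in (0.40, 0.50, 0.70):
    threshold = break_even_monthly_spend(6_000, reduction)
    print(f"{reduction:.0%} unit-cost cut -> break-even at ${threshold:,.0f}/month")

print(self_host_total(5_000, 0.50, 6_000))   # below break-even: costs more than managed
print(self_host_total(20_000, 0.50, 6_000))  # above break-even: costs less than managed
```

With these assumed inputs the break-even lands between roughly $8,600 and $15,000 per month, which is why both utilization (driving the unit-cost cut) and team maturity (driving the engineering cost) move the threshold.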
The hybrid answer is usually the right one in practice. Run the steady, high-volume, latency-tolerant core of your traffic on a self-hosted Inferentia2 fleet for the unit-cost win, and burst the spiky overflow and the long-tail hard requests to Bedrock so you never pay for idle on the variable portion. This captures most of the self-hosting savings without taking on the full risk of owning 100% of capacity.
Pulling it together into a sequence you can actually run. This is the order of questions a cost-conscious ML team should ask — each step either resolves the decision or passes you to the next.
The framework optimizes for total cost of ownership, not just the headline per-token number. Work through it top to bottom; most teams find their answer in the first three questions.
Across real inference workloads, the same handful of errors account for most of the overspend. None require exotic engineering to fix — they require knowing where to look.
The three serving options compared on the variables that actually drive total cost of ownership. The right answer is entirely a function of your traffic shape and volume — there is no universally cheapest option.
| Variable | Bedrock (managed) | SageMaker real-time | Self-host EC2 (GPU/Inferentia) |
|---|---|---|---|
| Billing unit | Per input/output token | Per instance-hour (warm) | Per instance-hour + storage/transfer |
| Idle cost | Zero | Full instance rate | Full instance rate |
| Cheapest when… | Spiky / low / unpredictable volume | High, sustained utilization | High, steady volume + good utilization |
| Lowest unit cost at scale | No (per-token markup) | Moderate | Yes (esp. Inferentia2 + Spot) |
| Ops burden | None (serverless) | Low–medium (AWS-managed hosting) | High (you own the stack + on-call) |
| Custom / fine-tuned models | Limited (hosted catalog, plus fine-tuning and Custom Model Import for supported models) | Yes (bring your own) | Yes (full control) |
| Biggest cost lever | Caching · batch · routing | Utilization · right-sizing | Inferentia2 · Spot · Savings Plans |
| Best for | Most teams, most of the time | Custom model, steady load | High-volume steady traffic at scale |
Situation: Inference was the largest line on the AWS bill and growing faster than revenue. A RAG pipeline re-sent ~5K tokens of retrieved context plus a long system prompt on every request; every ticket also went to a frontier model regardless of difficulty; nightly transcript summarization ran through the real-time API. The ML lead suspected 40%+ was waste but had no bandwidth to instrument and re-architect it.
What CloudRoute did: Routed within ~24h to a vetted AWS partner with a Bedrock + FinOps track record. The partner instrumented token spend by call type, then: (1) moved the stable RAG prefix + system prompt into prompt caching; (2) added a cascade router sending easy tickets to a small model and escalating only on low confidence; (3) shifted nightly summarization to Bedrock batch; (4) modeled an Inferentia2 self-host for the steady embeddings workload as a phase-two option. Scoped under AWS POC / Well-Architected funding so the engagement was AWS-funded.
Outcome: Inference spend fell from ~$26K to ~$10K/month (~62%) within 6 weeks with no measurable quality regression on their evals: caching removed ~70% of input tokens, the cascade moved ~55% of traffic to the small model, and batch halved the summarization slice. CloudRoute's commission was paid by the partner from AWS engagement funding — the customer paid $0.
Engagement window: 6 weeks · Founder/ML time: ~12 hours · Monthly savings: ~$16K · Cost to customer: $0
CloudRoute routes you to a vetted AWS partner who instruments your token spend and applies the right levers — caching, batch, routing, Inferentia2 — frequently under AWS POC / Well-Architected funding, so the engagement costs you $0.