
AI inference cost · 2026 cross-platform guide

AI inference cost optimization on AWS — the definitive cross-platform guide (2026).

Inference, not training, is where production AI bills are won or lost. Training is a one-time capital event; inference is a per-token cost that compounds with every user, every day, forever. This guide maps the three ways to serve models on AWS — Bedrock managed, SageMaker endpoints, and self-hosting on EC2/Inferentia — their real cost models, the specific levers that move each one, and a break-even framework for deciding which to use at your scale.


serving options: 3 · typical savings: 40–80% · Inferentia2 vs GPU: up to 70% lower $/token · break-even (self-host): ~$8K–$15K/mo

TL;DR

There are three ways to serve models on AWS and they have fundamentally different cost shapes. Bedrock is pure pay-per-token with zero idle cost — cheapest until volume is high and steady. SageMaker real-time endpoints bill per instance-hour whether or not a request arrives — cheapest only when utilization is high. Self-hosting on EC2 (GPU or Inferentia) is the lowest unit cost at scale but you pay for every idle second and own the ops.
The single biggest lever is not the platform; it is the model. A distilled or quantized model that is 4× smaller serves roughly 4× cheaper on identical hardware, and prompt/response caching can remove 30–90% of token volume on repetitive workloads before you touch infrastructure. Optimize the workload first, then the model, then the serving option, then the instance.
The build-vs-buy break-even on AWS in 2026 sits around $8K–$15K/month of equivalent Bedrock spend. Below it, managed (Bedrock) almost always wins on total cost of ownership once you price in engineering time. Above it — with steady, predictable traffic — self-hosting on Inferentia2 or Graviton-backed instances with Spot and Savings Plans can cut the unit cost 40–70%, but only if utilization stays high.

the cost that compounds

I. Why inference, not training, is where the money goes

Most teams budget for training because it has a big, visible number attached. But for any product that actually ships and gets used, inference is the line item that grows without bound. Understanding why reframes every decision below.

Training is a capital event: you spend a fixed sum once (or once per major model version) and you are done. Inference is an operating cost that scales linearly with usage — every request, from every user, on every day the product is live, costs money. A model trained for $200K can easily generate $200K of inference cost per quarter once it is serving real traffic. The asymmetry is the whole point: a training run ends; inference never does.

The cost of a single inference call is driven by three things — the number of tokens processed (input + output), the size and architecture of the model, and the hardware it runs on. Input tokens (the prompt, the retrieved context, the system instructions) are usually cheaper per token than output tokens, because output is generated autoregressively one token at a time while input can be processed in a single forward pass. On most managed platforms output tokens cost 3–5× more than input tokens — which is why verbose prompts hurt less than verbose responses.

This is also why retrieval-augmented generation (RAG) quietly inflates bills. Every RAG call stuffs retrieved documents into the prompt, and those documents are billed as input tokens on every single request. A RAG system that retrieves 4,000 tokens of context per query pays for those 4,000 tokens millions of times over the product's life. The fix is rarely "use a cheaper model" — it is "send fewer tokens," through better retrieval, context compression, and caching.

The mental model for the rest of this guide: your inference bill is roughly (requests × tokens-per-request × price-per-token), and price-per-token is itself a function of model size and hardware efficiency. Every optimization lever attacks one of those factors. The biggest wins come from attacking the factors with the most room to shrink, usually token volume and model size, before micro-optimizing the hardware.
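As a rough illustration of that mental model, the sketch below multiplies the factors out for one month, splitting tokens into input and output. The prices and traffic figures are placeholders chosen to make the arithmetic easy to follow, not published rates for any model.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    requests_per_month: int
    input_tokens: int    # prompt + retrieved context + system instructions
    output_tokens: int   # generated response


def monthly_bill(w: Workload, usd_per_1k_input: float, usd_per_1k_output: float) -> float:
    """requests x tokens-per-request x price-per-token, split by token type."""
    per_request = (
        w.input_tokens / 1000 * usd_per_1k_input
        + w.output_tokens / 1000 * usd_per_1k_output
    )
    return w.requests_per_month * per_request


# Hypothetical prices: output is 5x input, inside the 3-5x range quoted above.
PRICE_IN, PRICE_OUT = 0.003, 0.015

# Same traffic, with 4,000 vs 1,000 tokens of retrieved context per request
# (plus an assumed 500 tokens of question and system prompt).
rag = Workload(requests_per_month=1_000_000, input_tokens=4_500, output_tokens=300)
trimmed = Workload(requests_per_month=1_000_000, input_tokens=1_500, output_tokens=300)

print(f"4,000-token context: ${monthly_bill(rag, PRICE_IN, PRICE_OUT):,.0f}/month")
print(f"1,000-token context: ${monthly_bill(trimmed, PRICE_IN, PRICE_OUT):,.0f}/month")
```

Under these assumed numbers, trimming the retrieved context halves the bill without changing the model or the platform, which is the "send fewer tokens" point made above.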

the three serving options

II. The three ways to serve a model on AWS, and how each one bills

Every inference workload on AWS lands on one of three serving substrates. They are not interchangeable; each has a distinct cost shape that makes it cheap in one regime and expensive in another. Choosing the wrong one is the most common and most expensive mistake.

The defining variable is how each option charges for idle capacity. Bedrock charges nothing when no request is in flight. SageMaker real-time endpoints charge the full instance rate whether traffic is zero or saturated. Self-hosted EC2 charges for the instance the moment it boots, idle or not. That single difference — what you pay for nothing — determines which option wins at your traffic profile.

Option A — Amazon Bedrock (fully managed, pay-per-token)

What it is: A serverless API to foundation models (Anthropic Claude, Meta Llama, Amazon Nova/Titan, Mistral, Cohere, and others). You send tokens, you get tokens, AWS owns all the infrastructure. No instances, no scaling, no GPUs to manage.

How it bills: Per input token and per output token, at a published per-model rate. On-demand has zero idle cost — you pay only for tokens actually processed. For steady high volume, Provisioned Throughput reserves dedicated capacity (billed per "model unit" per hour, optionally with 1- or 6-month commitments at a discount) which can beat on-demand once utilization is high. Batch inference processes large jobs asynchronously at roughly 50% of the on-demand token price.

Cost shape: Linear with token volume, zero fixed cost. This is the cheapest option for spiky, low, or unpredictable traffic — you never pay for idle. It becomes relatively expensive only at very high, very steady volume, where the per-token markup over raw hardware cost starts to dominate.

Option B — SageMaker endpoints (managed hosting, pay-per-instance-hour)

What it is: Managed model hosting where you bring a model (open-weights or your own fine-tune) and SageMaker runs it on instances you select. Real-time endpoints keep instances warm; Serverless Inference scales to zero; Asynchronous endpoints queue large or bursty requests.

How it bills: Real-time endpoints bill per instance-hour for as long as the endpoint exists, regardless of request volume. A warm ml.g5.xlarge costs the same at 2% utilization as at 90%. Serverless Inference bills per millisecond of compute actually used (with cold starts as the trade-off); Asynchronous bills per instance-hour but can scale to zero between batches.

Cost shape: Real-time is cheap only when utilization is high — the per-instance-hour rate is amortized across many requests. At low utilization it is the most expensive of all three options because you pay full freight for idle GPUs. This is the right choice when you need a custom or fine-tuned open-weights model served with predictable, sustained load.

Option C — Self-host on EC2 (GPU or Inferentia, maximum control)

What it is: You run the model yourself on raw EC2 instances — NVIDIA GPU instances (P5/P4d/G6/G5), AWS Inferentia2 (Inf2), or CPU/Graviton for small models — with your own serving stack (vLLM, TGI, TensorRT-LLM, or the Neuron SDK for Inferentia). Maximum control, maximum responsibility.

How it bills: Per instance-hour for the EC2 instance, plus storage and data transfer. On-demand is the list price; Spot Instances cut 60–90% for interruptible workloads; Savings Plans and Reserved Instances cut 30–72% for committed steady use. You also pay — in engineering time — for autoscaling, batching, health checks, model loading, and on-call.

Cost shape: Lowest possible unit cost at high, steady utilization, especially on Inferentia2 or with Spot. But you pay for every idle second, and the operational burden is real. This wins only above the break-even volume and only when you can keep the hardware busy.
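The difference between the cost shapes can be sketched as two functions: one linear in token volume with no fixed cost (Bedrock on-demand), one flat regardless of traffic (a warm real-time endpoint or EC2 fleet). All prices and the fleet size below are hypothetical, and the sketch assumes the fleet can actually carry the volume in question.

```python
HOURS_PER_MONTH = 730


def pay_per_token_cost(million_tokens: float, usd_per_million: float) -> float:
    """Bedrock on-demand shape: linear in volume, zero fixed cost."""
    return million_tokens * usd_per_million


def always_on_cost(instances: int, usd_per_hour: float) -> float:
    """Real-time endpoint or EC2 shape: flat, independent of traffic."""
    return instances * usd_per_hour * HOURS_PER_MONTH


def crossover_million_tokens(usd_per_million: float, instances: int, usd_per_hour: float) -> float:
    """Monthly volume at which the flat fleet becomes cheaper than pay-per-token."""
    return always_on_cost(instances, usd_per_hour) / usd_per_million


# Hypothetical placeholders, not published prices.
TOKEN_PRICE = 4.00      # blended $ per 1M tokens on the managed API
INSTANCE_RATE = 2.00    # $ per instance-hour
FLEET = 2               # instances kept warm

for volume in (100, 500, 1000, 2000):   # millions of tokens per month
    managed = pay_per_token_cost(volume, TOKEN_PRICE)
    fixed = always_on_cost(FLEET, INSTANCE_RATE)
    print(f"{volume:>5}M tokens  managed=${managed:>8,.0f}  always-on=${fixed:>8,.0f}")

print("crossover:", crossover_million_tokens(TOKEN_PRICE, FLEET, INSTANCE_RATE), "M tokens/month")
```

Below the crossover the flat fleet is mostly paying for idle time; above it the per-token markup dominates. Engineering time is deliberately left out here and is priced in later in the guide.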

levers · managed (Bedrock)

III. Cutting cost on Bedrock: caching, batch, and routing

On a managed pay-per-token platform you cannot touch the hardware — so every lever is about sending fewer, cheaper tokens, or shifting work to cheaper price tiers. Three levers do almost all the work.

Because Bedrock has zero idle cost, optimization here is purely about the token bill. You are not fighting utilization; you are fighting volume and price tier. The three highest-leverage moves are prompt caching, batch processing, and model routing — in roughly that order of impact for most workloads.

Prompt caching — stop paying for the same tokens twice

Prompt caching lets you mark a stable prefix of the prompt — a long system instruction, a tool schema, a few-shot block, a retrieved document set — so that repeated calls reuse the already-processed prefix instead of re-billing it at full input price. Cache reads are dramatically cheaper than fresh input tokens (a large fraction off, depending on model), and cache writes carry a small premium on the first call.

The economics are decisive for any workload with a large, repeated context: agents with long tool definitions, chatbots with a fixed persona, RAG systems re-sending the same knowledge base chunks, or document-processing pipelines that reuse the same instructions across thousands of files. On these patterns prompt caching commonly removes 50–90% of input-token cost. The discipline is structural: put everything stable at the front of the prompt and everything variable at the end, so the cacheable prefix is as long as possible.
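A minimal sketch of the caching arithmetic and of the "stable first, variable last" prompt layout. The read and write multipliers are assumptions expressed relative to the normal input price; actual discounts and premiums vary by model, and caches expire, so a real workload pays for more than one cache write.

```python
def input_cost_with_cache(
    calls: int,
    prefix_tokens: int,
    variable_tokens: int,
    usd_per_1k_input: float,
    read_multiplier: float,
    write_multiplier: float,
    cache_writes: int = 1,
) -> float:
    """Input-token cost when a stable prefix is cached.

    read_multiplier / write_multiplier are relative to the normal input price
    (0.1 means a cache read costs 10% of a fresh token). Both are assumptions.
    """
    price = usd_per_1k_input / 1000
    cache_reads = calls - cache_writes
    written = cache_writes * prefix_tokens * price * write_multiplier
    read = cache_reads * prefix_tokens * price * read_multiplier
    fresh = calls * variable_tokens * price      # the variable tail is never cached
    return written + read + fresh


def input_cost_without_cache(
    calls: int, prefix_tokens: int, variable_tokens: int, usd_per_1k_input: float
) -> float:
    return calls * (prefix_tokens + variable_tokens) * usd_per_1k_input / 1000


def build_prompt(stable_parts: list[str], variable_parts: list[str]) -> str:
    """Stable content first, variable content last, so the shared prefix is maximal."""
    return "\n\n".join(stable_parts + variable_parts)


# Hypothetical workload: 10,000 calls sharing a 4,000-token prefix.
before = input_cost_without_cache(10_000, 4_000, 200, usd_per_1k_input=0.003)
after = input_cost_with_cache(
    10_000, 4_000, 200, usd_per_1k_input=0.003,
    read_multiplier=0.10, write_multiplier=1.25,
)
print(f"input cost: ${before:,.2f} -> ${after:,.2f} ({1 - after / before:.0%} lower)")
```

With these placeholder multipliers the input-token cost drops by roughly 86%. The longer the stable prefix relative to the variable tail, the closer the saving gets to the read discount itself.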

Batch inference — half price for anything not real-time

Bedrock batch inference runs large jobs asynchronously and prices tokens at roughly 50% of the on-demand rate. Any workload that does not need a synchronous, sub-second response is a candidate: nightly summarization, bulk classification, embeddings generation, dataset labeling, content moderation backfills, evaluation runs. The trade-off is latency — jobs complete on AWS's schedule, not instantly — in exchange for halving the token bill.

The common mistake is running batch-shaped work through the real-time API because that was the first integration built. Auditing your traffic for "does this actually need to be synchronous?" and moving the answer-is-no portion to batch is often a same-day 20–40% reduction on total Bedrock spend for batch-heavy workloads (the saving is half of whatever share of spend you move), with no model or quality change whatsoever.
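One way to run that audit, sketched under the assumption that you already have monthly spend broken down by call type. The call-type names and dollar figures are invented for illustration; the 50% discount is the approximate figure from the text and should be checked against current pricing.

```python
BATCH_DISCOUNT = 0.50   # "roughly 50%" per the text; verify against current pricing


def batch_savings(spend_by_call_type: dict[str, float], needs_sync: dict[str, bool]) -> dict:
    """Estimate monthly savings from moving non-synchronous call types to batch."""
    movable = {
        name: spend
        for name, spend in spend_by_call_type.items()
        if not needs_sync.get(name, True)      # unknown call types stay real-time
    }
    total = sum(spend_by_call_type.values())
    saved = sum(movable.values()) * BATCH_DISCOUNT
    return {
        "move_to_batch": sorted(movable),
        "monthly_savings": saved,
        "share_of_total": saved / total if total else 0.0,
    }


# Hypothetical monthly spend per call type, in USD.
spend = {"chat": 6000.0, "nightly_summaries": 3000.0, "bulk_classification": 1000.0}
sync = {"chat": True, "nightly_summaries": False, "bulk_classification": False}
print(batch_savings(spend, sync))
```

Defaulting unknown call types to "needs sync" keeps the estimate conservative: nothing is moved to batch unless someone has explicitly confirmed it can wait.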

Model routing — match model size to task difficulty

Not every request needs the largest, most capable model. A routing layer classifies each request by difficulty and sends easy ones (simple extraction, short classification, formatting) to a small fast model and hard ones (multi-step reasoning, nuanced generation) to a frontier model. Because the price gap between a small and a large model on Bedrock can be 10–20×, even routing 40–60% of traffic to the small model produces large savings.

The pattern that holds quality is the cascade: try the cheap model first, and escalate to the expensive model only when a confidence check or validator flags the cheap answer as inadequate. Done well, a cascade captures most of the cost savings of the small model while preserving frontier-model quality on the requests that genuinely need it. The investment is in the router and the validators, not the models themselves.
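The cascade can be sketched in a few lines. The two models and the validator here are stand-in functions, not real API calls; in practice each would wrap a model invocation and a task-specific check. Note that an escalated request pays for both calls, which the cost helper accounts for.

```python
from typing import Callable

Model = Callable[[str], str]
Validator = Callable[[str, str], bool]   # (request, answer) -> good enough?


def cascade(request: str, cheap: Model, frontier: Model, is_adequate: Validator) -> tuple[str, str]:
    """Try the cheap model first; escalate only when the validator rejects its answer."""
    answer = cheap(request)
    if is_adequate(request, answer):
        return answer, "cheap"
    return frontier(request), "frontier"


def blended_cost(escalation_rate: float, cheap_cost: float, frontier_cost: float) -> float:
    """Average cost per request; escalated requests pay for both calls."""
    return cheap_cost + escalation_rate * frontier_cost


# Stand-in models and validator, for illustration only.
def cheap_model(req: str) -> str:
    return "UNSURE" if "why" in req.lower() else f"label:{req[:10]}"


def frontier_model(req: str) -> str:
    return f"detailed answer to: {req}"


def validator(req: str, ans: str) -> bool:
    return ans != "UNSURE"


print(cascade("Classify: refund request", cheap_model, frontier_model, validator))
print(cascade("Why did my invoice double?", cheap_model, frontier_model, validator))
# Hypothetical relative costs: frontier is 15x the cheap model, 45% of traffic escalates.
print(blended_cost(escalation_rate=0.45, cheap_cost=1.0, frontier_cost=15.0))
```

The quality of the validator decides everything: too lenient and bad answers ship, too strict and nearly every request pays for two calls.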

stack the levers

These levers compound; the percentages do not simply add, because each applies to a different slice of the bill or to what the others leave behind. A workload that caches a 4,000-token RAG prefix (about 70% off input cost), moves its nightly bulk jobs to batch (50% off that slice), and routes half its traffic to a small model (roughly 10× cheaper on that half) can land at 20–35% of its original Bedrock bill with no quality regression. Always measure each lever in isolation first so you know which one is actually carrying the savings.
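To make the stacking concrete, the sketch below splits a bill into three slices (real-time input tokens, real-time output tokens, and batch-shaped work) and applies each lever only where it acts. The 60/20/20 split is an assumption for illustration; your own split determines where in the range you land.

```python
def remaining_bill_fraction(
    input_share: float,
    output_share: float,
    batch_share: float,
    cache_input_cut: float = 0.70,
    batch_cut: float = 0.50,
    routed_fraction: float = 0.50,
    small_model_price_ratio: float = 0.10,
) -> float:
    """Fraction of the original bill left after caching, batch, and routing.

    Shares are fractions of the original bill and must sum to 1.
    Routing is applied to real-time traffic only, on top of caching.
    """
    if abs(input_share + output_share + batch_share - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    # Half the traffic at 1/10th the price, half at full price.
    routing_factor = (1 - routed_fraction) + routed_fraction * small_model_price_ratio
    realtime_input = input_share * (1 - cache_input_cut) * routing_factor
    realtime_output = output_share * routing_factor
    batch = batch_share * (1 - batch_cut)
    return realtime_input + realtime_output + batch


# Assumed split of the original bill: 60% real-time input, 20% output, 20% batch-shaped.
print(f"{remaining_bill_fraction(0.6, 0.2, 0.2):.1%} of the original bill remains")
```

Under this assumed split about 31% of the original bill remains. Setting one lever's parameter to zero at a time shows which lever is carrying the result.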

levers · self-host (EC2 / SageMaker)

IV. Cutting cost when you own the instances: right-sizing, Spot, Inferentia, Graviton

When you run the hardware — whether raw EC2 or a SageMaker real-time endpoint — the enemy is idle capacity and overpriced silicon. The levers here are about utilization and about choosing chips priced for inference rather than training.

The first principle of self-hosting is brutal: an idle GPU is pure loss. A warm p4d left running overnight at 3% utilization is burning the same dollars as one at 90%. So every lever below is ultimately about either keeping the hardware busy (right-sizing, autoscaling, batching) or paying less for each hour it runs (Spot, Savings Plans, cheaper chips).

Right-sizing and autoscaling — kill idle capacity

Most self-hosted inference fleets are over-provisioned because the team sized for peak and never scaled down. Right-sizing means matching instance type to the model's actual memory and throughput needs (a 7B model does not need an 80GB GPU) and configuring autoscaling so instance count tracks real traffic — scaling out under load and, critically, scaling in (or to zero, on Serverless/Asynchronous endpoints) when traffic falls.

Continuous (in-flight) batching is the highest-leverage software lever: servers like vLLM and TGI pack many concurrent requests through the GPU together, raising throughput per instance several-fold versus naive one-request-at-a-time serving. Higher throughput per instance means fewer instances for the same traffic, which means lower cost — often a 2–4× efficiency gain from the serving stack alone, before any hardware change.
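The link between serving-stack throughput and fleet cost can be sketched as follows. The throughput figures and hourly rate are hypothetical; per-instance throughput in particular should come from a load test of your own model and serving stack.

```python
import math

HOURS_PER_MONTH = 730


def instances_needed(
    peak_requests_per_sec: float,
    requests_per_sec_per_instance: float,
    headroom: float = 0.2,
) -> int:
    """Instances required to serve peak load with some spare headroom."""
    return math.ceil(peak_requests_per_sec * (1 + headroom) / requests_per_sec_per_instance)


def monthly_fleet_cost(instances: int, usd_per_hour: float) -> float:
    return instances * usd_per_hour * HOURS_PER_MONTH


# Hypothetical figures; measure your own.
PEAK_RPS = 40.0
NAIVE_RPS_PER_INSTANCE = 2.5       # one request at a time
BATCHED_RPS_PER_INSTANCE = 7.0     # continuous batching, assumed ~2.8x gain
RATE = 1.50                        # $ per instance-hour, placeholder

for label, rps in (("naive", NAIVE_RPS_PER_INSTANCE), ("batched", BATCHED_RPS_PER_INSTANCE)):
    n = instances_needed(PEAK_RPS, rps)
    print(f"{label:>8}: {n} instances, ${monthly_fleet_cost(n, RATE):,.0f}/month")
```

This sizes for peak only. With autoscaling, the same calculation run against off-peak traffic gives the scale-in target, which is where most of the idle-capacity savings come from.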

Spot Instances — 60–90% off for fault-tolerant inference

Spot Instances sell spare EC2 capacity at 60–90% below on-demand, with the catch that AWS can reclaim them on two minutes' notice. For inference this is often acceptable: stateless replicas behind a load balancer can lose a node and recover, and batch/asynchronous jobs can checkpoint and resume. The pattern is a mixed fleet — a baseline of on-demand or Savings-Plan capacity for guaranteed availability, topped up with Spot for the variable load.

Spot is a poor fit for a single-replica real-time endpoint with strict SLAs, because an interruption is a visible outage. It shines for embeddings pipelines, batch scoring, asynchronous endpoints, and any horizontally-scaled fleet where losing one of many replicas degrades gracefully rather than failing hard.
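A small sketch of the mixed-fleet arithmetic, plus a check of whether the fleet still meets its minimum replica count after an interruption. The rate and discounts are placeholders inside the ranges quoted above, not quotes for any instance type.

```python
def blended_hourly_cost(
    baseline_instances: int,
    burst_instances: int,
    on_demand_rate: float,
    baseline_discount: float,
    spot_discount: float,
) -> float:
    """Hourly cost of a committed baseline plus a Spot burst tier."""
    baseline = baseline_instances * on_demand_rate * (1 - baseline_discount)
    burst = burst_instances * on_demand_rate * (1 - spot_discount)
    return baseline + burst


def survives_interruption(total_replicas: int, reclaimed: int, min_replicas_for_sla: int) -> bool:
    """True if the fleet still meets its minimum after losing `reclaimed` Spot nodes."""
    return total_replicas - reclaimed >= min_replicas_for_sla


# Hypothetical: $2.00/h on-demand, 30% Savings Plan discount, 70% Spot discount.
mixed = blended_hourly_cost(4, 6, on_demand_rate=2.0, baseline_discount=0.30, spot_discount=0.70)
all_on_demand = blended_hourly_cost(10, 0, on_demand_rate=2.0, baseline_discount=0.0, spot_discount=0.0)
print(f"mixed fleet: ${mixed:.2f}/h vs all on-demand: ${all_on_demand:.2f}/h")

# Worst case: every Spot node is reclaimed at once.
print("SLA holds if all Spot is lost:", survives_interruption(10, reclaimed=6, min_replicas_for_sla=4))
print("single-replica endpoint:", survives_interruption(1, reclaimed=1, min_replicas_for_sla=1))
```

Sizing the committed baseline to the minimum replica count is what lets the Spot tier be reclaimed without a visible outage.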

Inferentia2 and Trainium — silicon priced for inference

AWS Inferentia2 (Inf2 instances) is a purpose-built inference accelerator that, for many transformer workloads, delivers materially lower cost-per-token than comparable NVIDIA GPU instances. Figures of up to ~70% lower inference cost are commonly cited, along with meaningfully better performance-per-watt; treat those as upper bounds and benchmark your own model. Trainium is its training-oriented sibling and can also serve inference. The trade-off is the toolchain: you compile and serve models through the AWS Neuron SDK rather than a stock CUDA stack, which adds integration work and means not every model or custom op is supported out of the box.

For high-volume, steady inference on well-supported architectures (Llama-family, many standard transformers), Inferentia2 is frequently the single largest unit-cost lever available — larger than Spot, larger than right-sizing — precisely because it attacks the price of the silicon itself. The decision is usually "is my model supported on Neuron, and is my volume high enough to justify the porting effort?" If both are yes, it is hard to beat on dollars per token.
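Comparing accelerators on dollars per token needs only an hourly rate, a measured throughput, and a utilization figure. The sketch below uses two unnamed, hypothetical accelerators; the rates and throughputs are invented and should be replaced with your own benchmark results.

```python
def usd_per_million_tokens(usd_per_hour: float, tokens_per_sec: float, utilization: float = 1.0) -> float:
    """Effective cost per 1M tokens for an instance busy `utilization` of the time."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return usd_per_hour / tokens_per_hour * 1_000_000


# Hypothetical candidates: (hourly rate in USD, measured tokens/sec for your model).
candidates = {
    "accelerator_a": (4.00, 2000.0),
    "accelerator_b": (2.00, 1800.0),
}

for utilization in (0.9, 0.3):
    print(f"utilization {utilization:.0%}")
    for name, (rate, tps) in candidates.items():
        cost = usd_per_million_tokens(rate, tps, utilization)
        print(f"  {name}: ${cost:.2f} per 1M tokens")
```

The utilization term matters as much as the chip: the cheaper accelerator at 30% utilization can cost more per token than the pricier one at 90%.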

Graviton — CPU inference for small models

Not every model needs a GPU. Small models, embeddings, classical ML, and many distilled or quantized sub-3B language models run perfectly well on CPU — and AWS Graviton (Arm-based) instances offer strong price-performance for exactly this. For a high-volume embeddings service or a small classifier, a Graviton fleet can be a fraction of the cost of keeping GPUs warm, with the added benefit that CPU capacity is abundant and cheap on Spot.

The rule of thumb: if the model fits and meets latency on CPU, do not pay for a GPU. Reserve accelerators (GPU/Inferentia) for the models that genuinely need them, and push everything small onto Graviton. This tiering — Graviton for small, Inferentia2 for large, GPU only where required — is how cost-disciplined teams structure a self-hosted fleet.

the biggest lever

V. Optimize the model before the infrastructure: quantization and distillation

Platform and instance choices move cost by factors of two or three. Model-level optimization can move it by factors of four to ten, and it works on every serving option simultaneously, because a smaller model is cheaper everywhere. Pull this lever before touching infrastructure.

A model's cost is dominated by its size: parameter count drives the memory it needs, the hardware it fits on, and the compute per token. Make the model smaller without losing the accuracy you need, and every downstream cost shrinks roughly in proportion: fewer or cheaper instances if you self-host, fewer GPUs per replica, and sometimes a jump to a cheaper hardware tier entirely (GPU → Inferentia, or even GPU → Graviton CPU). The two dominant techniques are quantization and distillation.

Quantization — same model, lower precision

Quantization stores and computes model weights at lower numerical precision — FP16 or BF16 instead of FP32, or INT8/FP8/INT4 instead of 16-bit. Halving precision roughly halves memory footprint and increases throughput, which directly cuts the hardware needed per token. Modern post-training quantization (e.g. INT8/FP8 and well-tuned INT4 schemes) typically preserves accuracy within a small tolerance for most production tasks, making it close to free savings.

The practical payoff is often a hardware-tier change: a model that needed two GPUs at FP16 may fit on one at INT8, halving the instance cost outright. On a memory-bound workload, quantization can be the difference between needing an 80GB accelerator and fitting comfortably on a 24GB one. Always validate quality on your own evals after quantizing — the accuracy hit is usually small, but "usually" is not "always," and it is task-dependent.
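The hardware-tier change follows from simple memory arithmetic. The sketch below estimates weight memory at each precision and the number of accelerators needed, using a hypothetical 34B-parameter model and an assumed 20% overhead for KV cache and activations (real overhead depends on batch size and context length).

```python
import math

BITS = {"fp32": 32, "fp16": 16, "bf16": 16, "int8": 8, "fp8": 8, "int4": 4}


def weight_memory_gb(params: float, precision: str) -> float:
    """Memory for the weights alone, in decimal GB."""
    return params * BITS[precision] / 8 / 1e9


def accelerators_needed(params: float, precision: str, device_gb: float, overhead: float = 0.2) -> int:
    """Devices needed for weights plus an assumed overhead for KV cache and activations."""
    required = weight_memory_gb(params, precision) * (1 + overhead)
    return math.ceil(required / device_gb)


PARAMS = 34e9   # hypothetical 34B-parameter model
for precision in ("fp16", "int8", "int4"):
    gb = weight_memory_gb(PARAMS, precision)
    print(
        f"{precision}: {gb:.0f} GB weights, "
        f"{accelerators_needed(PARAMS, precision, 80)} x 80GB or "
        f"{accelerators_needed(PARAMS, precision, 24)} x 24GB"
    )
```

Under these assumptions the model needs two 80GB devices at FP16, one at INT8, and fits on a single 24GB device at INT4. Memory is only half the picture, so quality still has to be validated on your own evals after quantizing.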

Distillation — a smaller model trained to imitate a bigger one

Distillation trains a small "student" model to reproduce the behavior of a large "teacher" model on your specific task distribution. The result is a compact model that, on the narrow domain you care about, approaches the quality of a model many times its size — at a fraction of the inference cost. For a focused production task (classification, extraction, a specific style of generation, routing), a well-distilled small model frequently matches a frontier model closely enough while costing 5–15× less to serve.

The trade-off is up-front effort and generality: distillation requires a training pipeline and high-quality teacher outputs, and the student is specialized — it will not generalize beyond its training distribution the way a frontier model does. The pattern that wins is hybrid: distill the small model for the high-volume, narrow, repetitive 80% of traffic, and keep a frontier model (on Bedrock) for the long-tail, hard, or open-ended 20%. You pay frontier prices only where frontier capability is actually required.

sequence matters

Optimize in this order: (1) workload: cache, batch, send fewer tokens; (2) model: quantize, then distill for narrow high-volume tasks; (3) serving option: Bedrock vs SageMaker vs self-host; (4) instance: Inferentia2/Graviton, Spot, Savings Plans. Teams routinely jump straight to step 4 (chasing a cheaper instance) and leave the 4–10× wins in steps 1–2 on the table.

build vs buy

VI. The build-vs-buy break-even: when self-hosting actually pays off

The recurring question is whether to stay on managed Bedrock or self-host on EC2/Inferentia. The honest answer is a break-even calculation — and the break-even is higher than most teams assume once engineering time is priced in.

The seduction of self-hosting is the raw per-token math: at high utilization, an Inferentia2 or Spot-GPU fleet can serve tokens at a fraction of Bedrock's per-token rate. The trap is that the per-token math ignores three real costs — idle capacity, engineering time, and operational risk — that managed platforms absorb on your behalf.

Idle capacity is the silent killer. Bedrock charges you nothing between requests; a self-hosted fleet charges for every idle second. If your traffic is spiky or your utilization sits below ~50%, the idle hours can erase the entire per-token advantage and then some. Self-hosting only wins when you can keep the hardware genuinely busy — high, steady, predictable load.

Then there is the human cost. A production self-hosted inference stack needs autoscaling, continuous batching, model loading and versioning, health checks, GPU monitoring, capacity planning, Spot-interruption handling, and on-call coverage. That is real, ongoing senior-engineer time — frequently the equivalent of a meaningful fraction of an FTE. At a loaded engineering cost, that labor often exceeds the infrastructure savings until volume is large.

As a working heuristic for 2026: below roughly $8K–$15K/month of equivalent on-demand Bedrock spend, managed almost always wins on total cost of ownership once you price the engineering time honestly. Above that band — with steady, predictable traffic and a model that is well-supported on Inferentia2 — self-hosting can cut the unit cost 40–70%, and the savings start to dwarf the operational overhead. The break-even is not a universal constant; it moves with your utilization, your team's existing ML-ops maturity, and how interruptible your workload is. But the shape is consistent: low/spiky volume → managed; high/steady volume → self-host.
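The break-even can be sketched as a total-cost comparison that prices engineering time alongside instances. Every number below is a placeholder: the fleet size, hourly rate, FTE fraction, and loaded engineering cost are assumptions to be replaced with your own, and the fleet is assumed to be sized to carry the same traffic the Bedrock spend represents.

```python
HOURS_PER_MONTH = 730


def self_host_monthly_tco(
    instances: int,
    usd_per_hour: float,
    engineering_fte: float,
    loaded_fte_monthly_usd: float,
) -> float:
    """Instance cost (paid whether busy or idle) plus the engineering time to run it."""
    infra = instances * usd_per_hour * HOURS_PER_MONTH
    people = engineering_fte * loaded_fte_monthly_usd
    return infra + people


def self_hosting_wins(bedrock_monthly_usd: float, tco: float) -> bool:
    return tco < bedrock_monthly_usd


# Hypothetical: 3 instances at $1.50/h, 0.3 FTE at a $20K/month loaded cost.
tco = self_host_monthly_tco(3, 1.50, engineering_fte=0.3, loaded_fte_monthly_usd=20_000)
print(f"self-host TCO: ${tco:,.0f}/month")

for bedrock_spend in (5_000, 12_000, 20_000):
    verdict = "self-host" if self_hosting_wins(bedrock_spend, tco) else "stay managed"
    print(f"Bedrock ${bedrock_spend:>6,}/month -> {verdict}")
```

With these inputs the engineering line is almost twice the instance line, which is why the break-even sits well above the raw per-token comparison. A team with existing ML-ops tooling would plug in a smaller FTE fraction and see the break-even drop.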

The hybrid answer is usually the right one in practice. Run the steady, high-volume, latency-tolerant core of your traffic on a self-hosted Inferentia2 fleet for the unit-cost win, and burst the spiky overflow and the long-tail hard requests to Bedrock so you never pay for idle on the variable portion. This captures most of the self-hosting savings without taking on the full risk of owning 100% of capacity.

decision framework

VII. A decision framework: which serving option, in which order

Pulling it together into a sequence you can actually run. This is the order of questions a cost-conscious ML team should ask — each step either resolves the decision or passes you to the next.

The framework optimizes for total cost of ownership, not just the headline per-token number. Work through it top to bottom; most teams find their answer in the first three questions.

1. Have you optimized the workload yet? — Before choosing any platform, cache stable prompt prefixes, move non-real-time work to batch, and cut token volume (better retrieval, context compression). These are platform-independent and frequently the biggest single win. Do this first regardless of where you serve.
2. Is your traffic spiky, low, or unpredictable? — If yes → Bedrock on-demand. Zero idle cost makes it the cheapest option whenever utilization would be low. Do not self-host spiky traffic — you will pay for idle GPUs.
3. Do you need a custom/fine-tuned open-weights model, with steady load? — If you need a specific model Bedrock does not host and traffic is sustained → SageMaker real-time endpoint (high utilization) or self-host. If load is bursty → SageMaker Serverless or Asynchronous so you scale to zero.
4. Is equivalent volume above ~$8K–$15K/mo with steady traffic? — If yes, self-hosting becomes worth evaluating. Below that, the engineering and idle cost of self-hosting usually outweighs the per-token savings — stay managed.
5. Is your model well-supported on Inferentia2 / Neuron? — If yes and volume is high → Inferentia2 is typically the lowest unit cost. If the model is small enough → Graviton CPU. Reserve GPUs for models that genuinely require them.
6. Can your workload tolerate interruptions? — If yes (stateless replicas, batch, async) → layer Spot Instances for 60–90% off. If not → cover the SLA-critical baseline with Savings Plans / Reserved capacity and use Spot only for overflow.
7. Could a smaller model do the job? — Always revisit. Quantize the model you are serving; distill a small student for the high-volume narrow tasks and keep a frontier model on Bedrock for the hard long tail. A 4× smaller model is ~4× cheaper on whatever you chose above.
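The seven questions can be read as a small decision function. This is a simplified sketch rather than a complete policy: the field names are invented, the threshold uses the lower edge of the ~$8K–$15K band, and real decisions will weigh factors (latency targets, team maturity) that a handful of booleans cannot capture.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    workload_optimized: bool        # caching, batch, token trimming already done?
    spiky_or_low_traffic: bool
    needs_custom_model: bool
    equivalent_bedrock_usd: float   # monthly, on-demand equivalent
    neuron_supported: bool
    fits_on_cpu: bool
    interruptible: bool


BREAK_EVEN_LOW_USD = 8_000   # lower edge of the guide's ~$8K-$15K band


def recommend(p: Profile) -> list[str]:
    """Walk the seven questions in order and collect the resulting advice."""
    advice: list[str] = []
    if not p.workload_optimized:                                   # Q1
        advice.append("Optimize the workload first: cache, batch, send fewer tokens")
    if p.spiky_or_low_traffic:                                     # Q2, Q3 (bursty)
        advice.append(
            "SageMaker Serverless or Asynchronous" if p.needs_custom_model
            else "Bedrock on-demand"
        )
    elif p.equivalent_bedrock_usd < BREAK_EVEN_LOW_USD:            # Q4 (below the band)
        advice.append(
            "SageMaker real-time endpoint" if p.needs_custom_model
            else "Stay on Bedrock"
        )
    else:                                                          # Q4-Q6 (above the band)
        if p.fits_on_cpu:
            advice.append("Self-host on Graviton CPU")
        elif p.neuron_supported:
            advice.append("Self-host on Inferentia2")
        else:
            advice.append("Self-host on GPU")
        advice.append(
            "Layer Spot Instances" if p.interruptible
            else "Cover the baseline with Savings Plans; Spot for overflow only"
        )
    advice.append("Revisit: could a quantized or distilled model do the job?")  # Q7
    return advice


startup = Profile(True, True, False, 2_000, False, False, False)
scaleup = Profile(False, False, False, 30_000, True, False, True)
print(recommend(startup))
print(recommend(scaleup))
```

Writing the framework down this way is mainly useful for exposing the thresholds: the break-even constant and the definition of "spiky" are the two inputs teams most often disagree on.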

honest answers

VIII. The five most expensive mistakes (and the fix)

Across real inference workloads, the same handful of errors account for most of the overspend. None require exotic engineering to fix — they require knowing where to look.

Running batch-shaped work through the real-time API — Nightly summarization, bulk classification, and embeddings backfills sent synchronously pay full on-demand price for no latency benefit. Fix: move them to Bedrock batch (~50% off) or a scale-to-zero asynchronous endpoint. Often a same-day double-digit-percent cut.
Idle GPUs left warm — A self-hosted or SageMaker real-time endpoint at low utilization is the single most expensive failure mode — full price for idle silicon. Fix: autoscaling that scales in, scale-to-zero for bursty traffic, and right-sized instances.
Re-sending the same context on every request — RAG and agent workloads re-bill long, identical prefixes (knowledge chunks, tool schemas, system prompts) on every call. Fix: prompt caching, with all stable content moved to the front of the prompt. 50–90% off input cost on repetitive workloads.
Using the frontier model for everything — Sending trivial extraction and classification to the largest model wastes the 10–20× price gap to a small one. Fix: a routing/cascade layer that tries a cheap model first and escalates only on low confidence.
Self-hosting before the break-even — Standing up a GPU fleet at low volume to "save money" usually loses money once engineering time and idle hours are counted. Fix: stay on Bedrock until steady volume clears the ~$8K–$15K/month band, then revisit with Inferentia2.

side by side

Bedrock vs SageMaker endpoints vs self-hosted EC2 — the cost models

The three serving options compared on the variables that actually drive total cost of ownership. The right answer is entirely a function of your traffic shape and volume — there is no universally cheapest option.

Variable	Bedrock (managed)	SageMaker real-time	Self-host EC2 (GPU/Inferentia)
Billing unit	Per input/output token	Per instance-hour (warm)	Per instance-hour + storage/transfer
Idle cost	Zero	Full instance rate	Full instance rate
Cheapest when…	Spiky / low / unpredictable volume	High, sustained utilization	High, steady volume + good utilization
Lowest unit cost at scale	No (per-token markup)	Moderate	Yes (esp. Inferentia2 + Spot)
Ops burden	None (serverless)	Low–medium (AWS-managed hosting)	High (you own the stack + on-call)
Custom / fine-tuned models	Limited to hosted catalog	Yes (bring your own)	Yes (full control)
Biggest cost lever	Caching · batch · routing	Utilization · right-sizing	Inferentia2 · Spot · Savings Plans
Best for	Most teams, most of the time	Custom model, steady load	High-volume steady traffic at scale

Hybrid is common and usually optimal: self-host the steady high-volume core on Inferentia2 for unit cost, and burst spiky overflow + long-tail hard requests to Bedrock so you never pay for idle on the variable portion.


a recent match

A RAG bill cut ~62% — anonymized

inquiry · series-b ai support platform, remote-EU

Series-B AI customer-support platform, ~22 engineers, ~$26K/month Bedrock spend (Claude on Bedrock, heavy RAG)

Situation: Inference was the largest line on the AWS bill and growing faster than revenue. A RAG pipeline re-sent ~5K tokens of retrieved context plus a long system prompt on every request; every ticket also went to a frontier model regardless of difficulty; nightly transcript summarization ran through the real-time API. The ML lead suspected 40%+ was waste but had no bandwidth to instrument and re-architect it.

What CloudRoute did: Routed within ~24h to a vetted AWS partner with a Bedrock + FinOps track record. The partner instrumented token spend by call type, then: (1) moved the stable RAG prefix + system prompt into prompt caching; (2) added a cascade router sending easy tickets to a small model and escalating only on low confidence; (3) shifted nightly summarization to Bedrock batch; (4) modeled an Inferentia2 self-host for the steady embeddings workload as a phase-two option. Scoped under AWS POC / Well-Architected funding so the engagement was AWS-funded.

Outcome: Inference spend fell from ~$26K to ~$10K/month (~62%) within 6 weeks with no measurable quality regression on their evals: caching removed ~70% of input tokens, the cascade moved ~55% of traffic to the small model, and batch halved the summarization slice. CloudRoute's commission was paid by the partner from AWS engagement funding — the customer paid $0.

engagement window: 6 weeks · founder/ML time: ~12 hours · monthly savings: ~$16K · cost to customer: $0

faq

Common questions

Is it cheaper to use Bedrock or to self-host my own model on AWS?

It depends almost entirely on volume and traffic shape. Bedrock has zero idle cost and is cheapest for spiky, low, or unpredictable traffic. Self-hosting on EC2 (especially Inferentia2 with Spot and Savings Plans) reaches the lowest unit cost but only at high, steady utilization. The rough 2026 break-even is around $8K–$15K/month of equivalent Bedrock spend once you price in engineering time and idle capacity. Below it, managed usually wins; above it with steady traffic, self-hosting can cut unit cost 40–70%.

What is the single biggest lever for reducing inference cost?

The model, not the platform. Quantization (lower precision, often near-free) and distillation (a small model trained to imitate a big one on your task) can cut cost 4–10× and they apply to every serving option at once. After the model, the biggest workload-level lever is usually prompt caching plus moving non-real-time work to batch. Chase those before optimizing instances — most teams have it backwards.

Why are output tokens more expensive than input tokens?

Input tokens (the prompt) can be processed in a single parallel forward pass, while output tokens are generated autoregressively — one token at a time, each depending on the last — which is far less parallelizable and more compute-intensive per token. On most managed platforms output tokens cost roughly 3–5× more than input tokens, which is why verbose responses hurt the bill more than verbose prompts.

How much can prompt caching actually save?

On workloads with a large, repeated prefix — agents with long tool schemas, chatbots with fixed personas, RAG re-sending the same context — prompt caching commonly removes 50–90% of input-token cost. Cache reads are far cheaper than fresh input tokens; the first call carries a small cache-write premium. The discipline is structural: put everything stable at the front of the prompt and everything variable at the end so the cacheable prefix is as long as possible.

When does AWS Inferentia2 make sense versus NVIDIA GPUs?

Inferentia2 (Inf2) is purpose-built for inference and frequently delivers materially lower cost-per-token than comparable GPU instances — commonly cited up to ~70% lower — with better performance-per-watt. The trade-off is the AWS Neuron SDK toolchain: you compile and serve through Neuron rather than a stock CUDA stack, and not every model or custom op is supported. It is usually the largest unit-cost lever for high-volume, steady inference on well-supported architectures like Llama-family models.

Can I run inference on CPU to save money?

For small models, embeddings, classical ML, and many distilled/quantized sub-3B models — yes. AWS Graviton (Arm) instances offer strong price-performance and are cheap and abundant on Spot. The rule of thumb: if the model fits and meets your latency target on CPU, do not pay for a GPU. Tier your fleet — Graviton for small, Inferentia2 for large, GPU only where genuinely required.

How does batch inference cut cost, and what is the catch?

Bedrock batch inference processes large jobs asynchronously at roughly 50% of the on-demand token price. Any workload that does not need a synchronous response — nightly summarization, bulk classification, embeddings generation, evals, moderation backfills — is a candidate. The only catch is latency: jobs finish on AWS's schedule rather than instantly. Moving batch-shaped work off the real-time API is often a same-day 20–40% reduction with no quality change.

Will any of these optimizations hurt model quality?

Most are quality-neutral by construction: caching, batch, and right-sizing change cost without touching outputs. Model routing/cascades preserve quality when you escalate hard requests to the frontier model on low confidence. The two that can affect quality are quantization (usually a small, task-dependent accuracy hit) and distillation (a specialized student that will not generalize like a frontier model). Always validate on your own evals after any model-level change — the hit is usually small, but it is not guaranteed.

Cut your AWS inference bill — often AWS-funded

CloudRoute routes you to a vetted AWS partner who instruments your token spend and applies the right levers — caching, batch, routing, Inferentia2 — frequently under AWS POC / Well-Architected funding, so the engagement costs you $0.


matched within: < 24h · typical inference savings: 40–80% · cost to you: $0