A neutral reference for choosing between Amazon Bedrock on-demand and Provisioned Throughput in 2026: how per-token billing and reserved model-unit billing actually differ, the break-even volume with a fully worked example and table, the latency and throughput guarantees each one gives you, the cases where Provisioned Throughput is the only option (custom models), where Batch fits as a cheaper third lane, and a verdict by scenario. Plus how AWS credits — and FinOps discipline — make the right choice cheaper, often $0.
Amazon Bedrock gives you two fundamentally different ways to pay for inference on the same models. They are not tiers of the same thing — they are different cost structures, and choosing between them is the largest cost-and-reliability lever on a heavy Bedrock workload. Start by being precise about what each one actually charges for.
On-demand is the default. You call a model, you pay a published rate per 1,000 input tokens and per 1,000 output tokens for that specific model, and you commit to nothing. Capacity is shared across every account in the region and governed by per-account throughput quotas. Your bill is a direct function of usage: send more tokens, pay more; send none, pay nothing. The bill has no floor and no fixed ceiling: it scales linearly with traffic in both directions, up to whatever your quota lets through.
Provisioned Throughput (PT) inverts that. You reserve dedicated inference capacity for one specific model, measured in model units, and pay a flat hourly rate per model unit for as long as the allocation exists — independent of how many tokens you actually push through it. Your bill is a function of reserved capacity and time (model units × hours), not usage. Saturate the unit or leave it idle for an hour and you pay the same. (The dedicated PT page covers the model-unit lifecycle, commitment terms, and buy/manage steps in depth; this page is about choosing between the two.)
The cleanest way to hold the difference in your head: on-demand cost tracks tokens; provisioned cost tracks time. One is a usage meter, the other is a capacity lease. Everything else — the break-even math, the latency contract, the custom-model rule — follows from that single distinction.
A familiar analogy: on-demand is paying per ride; Provisioned Throughput is leasing a dedicated vehicle. Ride occasionally and pay-per-ride is cheaper. Commute heavily every day and the lease is cheaper and always available — but you pay for it even on the days you stay home. The entire decision is figuring out which describes your traffic, and recognizing the cases where you are not allowed to take rides at all (custom models) and must lease.
There is also a crucial third option that the "on-demand vs provisioned" framing tends to hide: Batch, which runs large jobs asynchronously at roughly half the on-demand token rate. For any workload that does not need an answer in real time, Batch quietly beats both of the headline options on cost. We bring it back in Section VI, because a real Bedrock cost plan is usually three lanes, not two.
On-demand = pay per token, commit to nothing, shared best-effort capacity. Provisioned Throughput = reserve model units, pay a flat hourly rate per unit regardless of tokens, get guaranteed throughput/latency, and unlock custom-model hosting. On-demand cost tracks usage; provisioned cost tracks time.
Before any break-even math, it helps to see the two cost curves clearly, because their shapes are what make the decision. One is a sloped line through the origin; the other is a flat line above zero. Where they cross is the whole game.
Picture monthly cost on the vertical axis and monthly token volume on the horizontal axis.
On-demand is a straight line starting at the origin: zero traffic costs zero, and cost rises linearly with tokens. The slope is set by the model's per-token rates (output tokens typically cost several times more than input tokens). Because it passes through zero, on-demand is unbeatable at low volume: there is simply no cost when there is no traffic. The downside is that the line keeps climbing forever: at very high, steady volume, that ever-rising cost is exactly what a flat reserved bill can undercut.
Provisioned Throughput is a horizontal line: a fixed monthly cost (model units × hourly rate × hours in the month) that does not move with usage. At zero traffic you are still paying the full reserved amount; at maximum saturation you pay the same. The flat line sits well above zero, so at low volume it is far more expensive than on-demand. But because it is flat while on-demand keeps climbing, the two lines cross at some volume, and beyond that point the flat bill is the cheaper one.
So the cost comparison is literally the intersection of a sloped line and a flat line. Left of the crossover, on-demand is cheaper; right of it, Provisioned Throughput is. The practical question ("am I left or right of the crossover?") is what Section V works out with real numbers. But notice what the geometry already tells you: Provisioned Throughput can only win on cost if your reserved capacity is genuinely busy, because a flat bill spread over little usage is a terrible per-token effective rate. This is why over-provisioning is the classic PT mistake: you push yourself left of the crossover and quietly overpay.
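The two curves can be sketched as a pair of functions. This is a minimal illustration, assuming the representative figures this page uses in its worked example ($3 and $15 per 1M input and output tokens, $25 per model-unit-hour, a 730-hour month). The function names are hypothetical and not part of any AWS SDK.

```python
HOURS_PER_MONTH = 730  # assumption: average month length in hours


def on_demand_cost(input_mtok, output_mtok, in_rate=3.0, out_rate=15.0):
    """Sloped line: dollars for token volumes given in millions of tokens."""
    return input_mtok * in_rate + output_mtok * out_rate


def provisioned_cost(model_units, hourly_rate=25.0, hours=HOURS_PER_MONTH):
    """Flat line: dollars depend only on reserved capacity and time."""
    return model_units * hourly_rate * hours


def pt_effective_rate(model_units, total_mtok, hourly_rate=25.0):
    """Effective dollars per 1M tokens actually served by the reserved units."""
    return provisioned_cost(model_units, hourly_rate) / total_mtok


# Walk up the volume axis and print both curves side by side.
for in_m, out_m in [(0, 0), (1000, 200), (3000, 600), (8000, 1600)]:
    od = on_demand_cost(in_m, out_m)
    pt = provisioned_cost(2)
    print(f"{in_m + out_m:>6} Mtok  on-demand ${od:>9,.0f}  provisioned ${pt:>9,.0f}")
```

The `pt_effective_rate` helper makes the over-provisioning point concrete: the same fixed bill divided by fewer tokens gives a higher effective per-token price.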
Cost is only half the decision. The two billing models also give you very different performance and reliability guarantees, and for production systems that difference is often the deciding factor before the math even enters the picture.
On-demand capacity is shared and best-effort within your quota. Most of the time it is perfectly fine. But it is multi-tenant: during regional demand surges you can hit throttling — the ThrottlingException that latency-sensitive teams learn to design around — and tail latency is not contractually fixed. You also operate inside per-account, per-model throughput quotas, which can become a ceiling for a high-volume path even when AWS has capacity. On-demand asks you to engineer around variability: retries with backoff, queueing, and often cross-region inference to spread load across regions.
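Retries with backoff are the most basic of those techniques. The sketch below shows exponential backoff with full jitter around any callable. It is illustrative only: `ThrottledError` and `call_with_backoff` are hypothetical names, and in a real boto3 client you would instead catch `botocore.exceptions.ClientError` and check that its error code is `ThrottlingException`.

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for a throttling error raised by a model call (hypothetical)."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(); on ThrottledError, wait with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)  # full jitter: uniform delay in [0, cap)


# Usage with a fake model call that is throttled twice, then succeeds.
calls = {"n": 0}


def flaky_invoke():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ThrottledError("simulated throttle")
    return "ok"


result = call_with_backoff(flaky_invoke, sleep=lambda seconds: None)
print(result, "after", calls["n"], "attempts")
```

Jitter matters on shared capacity: if every throttled client retries on the same schedule, the retries arrive together and prolong the spike.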
A provisioned model unit gives you isolated, guaranteed throughput — a fixed tokens-per-minute capacity that is yours, insulated from other tenants' spikes. Latency is consistent and there is no contention-driven throttling, because nobody else is using your reserved capacity. For an interactive product under a latency SLA, a checkout or onboarding flow that must not stall, or a regulated workflow that has to complete deterministically, that predictability is frequently worth more than any cost comparison. The capacity is reserved, so it behaves the same at 3pm on launch day as at 3am on a quiet night.
There is a useful pairing to keep straight here. Cross-region inference is the on-demand answer to spike-smoothing: it automatically routes each request to a region with available capacity, drawn from a defined set of regions, which reduces throttling. Provisioned Throughput is the reserved-capacity answer. They solve overlapping problems differently: cross-region inference improves the odds on shared capacity without a commitment; PT removes the shared-capacity problem entirely by giving you your own. (See the cross-region-inference sibling for that mechanism in detail.)
The reliability axis is why the decision is genuinely two-dimensional rather than a pure cost optimization. You can be below the cost break-even — where on-demand is cheaper per token — and still correctly choose Provisioned Throughput because best-effort capacity is not an acceptable posture for that specific path. Putting it bluntly: cost decides the default; an SLA can override it. A path that absolutely cannot throttle may justify reserved capacity even when the units will not be fully saturated, because the alternative is breaching the SLA during exactly the traffic spikes that matter most.
On-demand = best-effort, shared, can throttle at spikes, latency not contractually fixed. Provisioned Throughput = isolated, guaranteed throughput and consistent latency, no contention throttling. A hard latency/availability SLA can justify PT below the cost break-even — the guarantee, not the price, is the reason.
For base models, on-demand vs Provisioned Throughput is a genuine choice you can make either way. For an important class of workloads it is not a choice at all: Provisioned Throughput is the only option. This is the case teams most often miss when budgeting a custom-model project, so it is worth being explicit.
The defining rule: most custom models on Bedrock can only be served via Provisioned Throughput. On-demand is reserved for the shared, multi-tenant base models that many accounts call. Anything that is specifically yours needs dedicated capacity to host it, because there is no shared endpoint for a model only your account has. That covers the main ways a model becomes "yours": fine-tuning a base model, distilling one, or importing your own.
The planning consequence is concrete: if your roadmap includes fine-tuning, distillation, or importing a model, the "on-demand vs provisioned" question is partly already answered — that path will be on Provisioned Throughput, and the only remaining question is how few model units can serve the load. Build the standing hosting cost into the budget from day one. The most common custom-model budgeting mistake is pricing only the training and forgetting that the resulting model then sits on a 24/7 hourly charge for as long as it is live. For many narrow use cases, that standing cost is precisely why the team should sanity-check whether a base model on-demand — with good prompting, RAG, or prompt caching — would have been cheaper overall. (The fine-tuning sibling covers that decision.)
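A quick way to see why hosting dominates is to put training and hosting in the same budget. The sketch below is a minimal illustration. The $5,000 training cost, the single model unit, and the 12-month lifetime are hypothetical inputs chosen for the example, and the $25 hourly rate is the illustrative figure used elsewhere on this page, not a quote.

```python
def custom_model_total_cost(training_cost, model_units, hourly_rate,
                            months_live, hours_per_month=730):
    """Total cost of a custom model: one-off training plus standing hosting."""
    hosting = model_units * hourly_rate * hours_per_month * months_live
    return {
        "training": training_cost,
        "hosting": hosting,
        "total": training_cost + hosting,
    }


# Hypothetical project: small fine-tune, one model unit, live for a year.
budget = custom_model_total_cost(
    training_cost=5_000, model_units=1, hourly_rate=25.0, months_live=12
)
for item, dollars in budget.items():
    print(f"{item:>8}: ${dollars:,.0f}")
```

With these assumed inputs the hosting line is many times the training line, which is the budgeting gap the paragraph above warns about.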
Base models: on-demand or Provisioned Throughput — your choice. Custom models (fine-tuned, distilled, imported): Provisioned Throughput only. A fine-tune decision is therefore also a commitment to a standing hourly hosting cost — budget the hosting, not just the training.
For a base model, the cost half of the decision reduces to one number: the volume at which Provisioned Throughput becomes cheaper than on-demand. Below it, stay on-demand; above it, reserve. Here is exactly how to compute your own line, with representative numbers to show the shape. The numbers are illustrative 2026 figures, not quotes — confirm current rates on the AWS pricing page.
The method does not depend on the exact rates: compare the fixed monthly cost of the model units you would need against the per-token on-demand cost of the same traffic. Where the two cross is your break-even.
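In symbols, the comparison looks like this. The notation is introduced here for illustration only: $n$ is the number of model units, $r$ the hourly rate per unit, $H$ the hours in the month, $p_{\mathrm{in}}$ and $p_{\mathrm{out}}$ the on-demand prices per million tokens, $T_{\mathrm{in}}$ and $T_{\mathrm{out}}$ the monthly token volumes in millions, and $k = T_{\mathrm{in}} / T_{\mathrm{out}}$ the input-to-output mix, assumed constant.

```latex
\begin{aligned}
C_{\mathrm{PT}} &= n \, r \, H
  && \text{(fixed: capacity} \times \text{time)} \\
C_{\mathrm{OD}} &= p_{\mathrm{in}} T_{\mathrm{in}} + p_{\mathrm{out}} T_{\mathrm{out}}
  && \text{(variable: tracks tokens)} \\
C_{\mathrm{OD}} = C_{\mathrm{PT}}
  \;&\Longrightarrow\;
  T_{\mathrm{out}}^{*} = \frac{n \, r \, H}{k \, p_{\mathrm{in}} + p_{\mathrm{out}}},
  \qquad T_{\mathrm{in}}^{*} = k \, T_{\mathrm{out}}^{*}
\end{aligned}
```

On-demand is cheaper whenever actual volume is below $(T_{\mathrm{in}}^{*}, T_{\mathrm{out}}^{*})$; Provisioned Throughput is cheaper above it, provided the reserved units can actually serve that volume.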
Step 1 — size the model units. A model unit delivers a published throughput (tokens/minute) for a given model. Take your peak sustained throughput requirement and divide by the per-unit throughput to get the number of units you must reserve to serve the load without throttling. Say a workload needs 2 model units of a Sonnet-class model to serve its peak.
Step 2 — compute the fixed monthly PT cost. Suppose, for illustration, the discounted (longer-commitment) rate for that model is on the order of $25 per model-unit-hour. Two units running continuously across a 730-hour month is 2 × $25 × 730 ≈ $36,500/month, fixed, regardless of token volume.
Step 3 — price the same traffic on-demand. Now cost the actual token throughput at on-demand rates. If those two units, running at roughly half their capacity, would push about 3.0 billion input and 0.6 billion output tokens per month, then at representative Sonnet-class on-demand rates ($3 per 1M input, $15 per 1M output) that is (3,000M × $3) + (600M × $15) = $9,000 + $9,000 = $18,000/month. At that volume on-demand is roughly half the cost of reserving, so do not provision.
Step 4 — find the crossover. Provisioned Throughput only wins once on-demand cost climbs past the ~$36,500 fixed bill. Holding the same 5:1 input:output mix, on-demand reaches ~$36,500/month at roughly 6.1 billion input + 1.2 billion output tokens — about double the volume of the half-utilized case above. In other words, those two model units have to run close to saturation, around the clock, before reservation pays off on cost alone. The table below lays the two curves side by side at three utilization levels so the crossover is visible.
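The four steps can be sketched in a few lines of Python. This is a rough illustration under the same representative rates as above, not a pricing tool. The peak and per-unit throughput figures in step 1 are hypothetical placeholders (this page does not state them), and the function names are made up for the example.

```python
import math


def units_needed(peak_tokens_per_min, unit_tokens_per_min):
    """Step 1: model units required to serve peak sustained throughput."""
    return math.ceil(peak_tokens_per_min / unit_tokens_per_min)


def break_even_volume(model_units, hourly_rate, in_rate, out_rate,
                      in_out_ratio, hours=730):
    """Steps 2 and 4: fixed monthly bill and the volume where on-demand matches it.

    Rates are dollars per 1M tokens; returned volumes are in millions of tokens.
    """
    fixed = model_units * hourly_rate * hours
    out_mtok = fixed / (in_out_ratio * in_rate + out_rate)
    return fixed, in_out_ratio * out_mtok, out_mtok


# Step 1 with hypothetical throughput numbers (assumptions, not AWS figures).
units = units_needed(peak_tokens_per_min=150_000, unit_tokens_per_min=100_000)

# Steps 2 and 4 with the illustrative rates and the 5:1 input:output mix.
fixed, in_m, out_m = break_even_volume(units, 25.0, 3.0, 15.0, in_out_ratio=5)

# Step 3: price the half-utilized traffic at on-demand rates.
half_utilized = 3000 * 3.0 + 600 * 15.0

print(f"units: {units}, fixed PT bill: ${fixed:,.0f}/month")
print(f"on-demand at 3.0B/0.6B: ${half_utilized:,.0f}/month")
print(f"break-even: {in_m / 1000:.1f}B input + {out_m / 1000:.1f}B output tokens")
```

Swapping in your own measured peak, mix, and current published rates gives your own crossover; the structure of the calculation stays the same.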
The blunt lesson: on a base model, Provisioned Throughput beats on-demand on cost only at genuinely high, sustained utilization. If the units would sit half-idle, on-demand is cheaper — often dramatically so. Two factors shift the line in PT's favor beyond raw cost. (1) Reliability: if on-demand throttling would breach an SLA, the guarantee can justify PT below the pure cost crossover. (2) Custom models: there is no on-demand line to compare against, so the question becomes "how few units can serve the load," not "PT or on-demand." For everything else, the rule of thumb holds: provision only when you can keep the units busy.
| Scenario | Monthly tokens (in / out) | On-demand cost/mo | Provisioned cost/mo (2 MU) | Cheaper option |
|---|---|---|---|---|
| ~50% utilization | 3.0B / 0.6B | ≈ $18,000 | ≈ $36,500 (fixed) | On-demand (by ~$18.5K) |
| ~break-even (~100%) | 6.1B / 1.2B | ≈ $36,500 | ≈ $36,500 (fixed) | Tie — the crossover |
| Sustained max + spikes | 8.0B / 1.6B | ≈ $48,000 | ≈ $36,500 (fixed) | Provisioned (by ~$11.5K) |
Provisioned Throughput beats on-demand on cost only when reserved model units run at high, sustained utilization (busy most hours of most days). Idle reserved capacity is pure waste. Below the crossover, on-demand — plus Batch and prompt caching where they fit — is cheaper. Above it, and for guaranteed-SLA or custom-model paths, reserve.
The "on-demand vs provisioned" question quietly assumes you need answers in real time. A large share of GenAI work does not. For anything that tolerates latency, Batch is a third lane that undercuts both — and ignoring it is one of the most common ways teams overpay on Bedrock.
Batch inference processes a large set of inputs asynchronously: you submit a job (typically input data staged in Amazon S3), Bedrock runs it when capacity is available, and you collect the results when it completes. In exchange for giving up real-time responses, the per-token rate is roughly 50% lower than on-demand for supported models. For non-interactive work, that is a structural discount no amount of provisioning tuning can match on the same tokens.
Crucially, Batch is a different billing model from both of the headline options. It is still per-token (like on-demand), so it commits you to nothing and scales with usage — but at about half the rate. It is not reserved capacity (unlike PT), so there is no idle-cost risk. The catch is purely latency: results arrive when the job finishes, not within milliseconds, so Batch is unsuitable for anything a user is waiting on live.
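Because Batch is still per-token, its cost is the on-demand formula with a discount applied. The sketch below assumes the roughly 50% discount described above and the same illustrative Sonnet-class rates. The monthly job size is a hypothetical example, and `batch_cost` is not an AWS API.

```python
BATCH_DISCOUNT = 0.5  # assumption: roughly 50% off on-demand for supported models


def on_demand_cost(input_mtok, output_mtok, in_rate=3.0, out_rate=15.0):
    """Dollars for token volumes given in millions, at on-demand rates."""
    return input_mtok * in_rate + output_mtok * out_rate


def batch_cost(input_mtok, output_mtok, discount=BATCH_DISCOUNT, **rates):
    """Same tokens, same per-token structure, discounted rate."""
    return on_demand_cost(input_mtok, output_mtok, **rates) * (1 - discount)


# Hypothetical nightly job totalling 500M input and 100M output tokens a month.
od = on_demand_cost(500, 100)
batch = batch_cost(500, 100)
print(f"on-demand ${od:,.0f}/month vs batch ${batch:,.0f}/month")
```

Unlike the provisioned comparison, there is no break-even to find here: for work that tolerates the delay, the discounted rate is lower at every volume.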
The classes of work that belong on Batch are large and common: bulk document processing (summarization, extraction, classification across a corpus), embedding generation for a knowledge base or search index, offline evaluation and dataset labeling, nightly enrichment jobs, and any periodic pipeline where the output is consumed later. Moving these off on-demand — and certainly off Provisioned Throughput — is often the single biggest line-item saving on a Bedrock bill. (The Batch sibling covers job mechanics and limits.)
So the real decision is not binary. A well-engineered Bedrock workload usually runs three lanes at once: Batch for everything asynchronous, on-demand (often with prompt caching to cut repeat-context cost) for interactive traffic that is variable or moderate, and Provisioned Throughput reserved only for the one or two hot, steady paths — or any custom model — that genuinely justify dedicated capacity. Picking "on-demand vs provisioned" for the whole account is the wrong altitude; the right move is to route each workload to the lane that fits its latency tolerance and volume shape.
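The per-workload routing can be written down as a small decision function. This is a simplified sketch of the defaults described on this page, not a complete policy. The `Workload` fields and the `choose_lane` name are hypothetical, and real decisions would also weigh compliance constraints and credit runway.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    needs_realtime: bool
    is_custom_model: bool = False
    hard_sla: bool = False
    monthly_on_demand_cost: float = 0.0  # same traffic priced at on-demand rates
    monthly_reserved_cost: float = 0.0   # model units sized for its peak


def choose_lane(w: Workload) -> str:
    """Return the default lane for one workload."""
    if w.is_custom_model:
        return "provisioned"   # custom models need dedicated capacity
    if not w.needs_realtime:
        return "batch"         # async work takes the discounted per-token rate
    if w.hard_sla:
        return "provisioned"   # the guarantee can override the cost math
    if 0 < w.monthly_reserved_cost < w.monthly_on_demand_cost:
        return "provisioned"   # right of the crossover: the flat bill is cheaper
    return "on-demand"


workloads = [
    Workload("nightly document job", needs_realtime=False),
    Workload("bursty chat", True, monthly_on_demand_cost=18_000,
             monthly_reserved_cost=36_500),
    Workload("SLA review path", True, hard_sla=True),
    Workload("fine-tuned classifier", True, is_custom_model=True),
]
for w in workloads:
    print(f"{w.name:<24} -> {choose_lane(w)}")
```

The order of the checks encodes the priorities in the text: hard constraints first (custom model, latency tolerance, SLA), and the cost crossover only once those are settled.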
Batch (~50% off on-demand) for anything asynchronous — bulk processing, embeddings, offline eval. On-demand (with prompt caching) for variable or moderate interactive traffic. Provisioned Throughput only for hot, steady, high-volume paths or custom models. Match each workload to a lane rather than picking one billing model for everything.
Pulling the whole decision into a single set of calls. Find the row that matches your workload; the recommendation column is the default, and the reasoning column tells you when to override it.
These are defaults, not absolutes — an SLA, a compliance constraint, or a credit-funded runway can shift any of them. But for most teams, matching the workload to the row below gets the billing model right on the first try.
One meta-point ties the table together: the question is per-workload, not per-account. Most teams that get Bedrock costs right do not pick a single billing model — they run all three lanes and route each path to the right one, then keep watching utilization and re-routing as traffic shape changes. That ongoing right-sizing — measuring real traffic, computing the break-even, moving paths between lanes, and cleaning up idle reserved capacity — is exactly the FinOps work a vetted partner handles in the engagements CloudRoute routes.
Whichever lane a workload lands in, the spend is ordinary Bedrock spend — which means AWS credits can absorb it, and disciplined cost engineering can make the bill smaller before credits even apply. Together they change the risk calculus of the whole on-demand-vs-provisioned decision.
On-demand tokens, Batch jobs, and Provisioned-Throughput model-unit-hours are all fully credit-eligible — credits in your AWS account apply automatically against each, the same way they apply to fine-tuning, embeddings, and the rest of your bill. The relevant pools are the familiar ones: AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups), a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed at proving out a GenAI use case, and the competitive Generative AI Accelerator (awards up to $1M for a small cohort of AI-first startups).
Why credits matter specifically for this decision: the scariest thing about choosing Provisioned Throughput is paying for reserved capacity during the months before a workload has fully ramped — the "what if the volume does not materialize" risk that keeps teams on more expensive on-demand longer than they should be. When the commitment is drawn from a credit pool rather than runway, that risk is largely defused. You can reserve what a custom model or a high-SLA path needs, run it through launch and ramp, and let credits cover the standing cost while you prove the workload out. The discipline becomes "make the credits last" rather than "protect the bank balance" — which often means the correct billing model gets chosen instead of the cheapest-looking one.
Credits do not remove the need for FinOps, though — they extend it. The same right-sizing that controls a paid bill controls a credit burn: routing each workload to the cheapest lane that fits (Batch wherever latency allows), sizing model units from measured peak rather than a guess, wiring CloudWatch utilization alarms, and decommissioning idle allocations promptly. Done well, the cost-optimization sibling's techniques and the credit pool compound — the credits last far longer because the underlying bill is already lean.
The practical mechanic is that these pools are largely partner-filed — requested through the AWS Partner Network (the ACE program), not a public self-serve form — which is why teams route through an AWS partner rather than applying alone. CloudRoute matches you to the right pool for your stage and to a vetted AWS DevOps/ML partner who both files the credit application and does the actual cost engineering: choosing the lane per workload, running the break-even math on real traffic, sizing and managing any reserved capacity, and keeping the bill lean. The customer pays $0 — AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice. (For the credit mechanics, see AWS credits for generative-AI startups and the Bedrock POC funding page.)
The scannable version of the whole decision: the two headline billing models plus the Batch third lane, across how they bill, cost shape, performance, commitment, custom-model support, and where each one wins. Figures are representative 2026 illustrations, not quotes.
| Variable | On-Demand | Provisioned Throughput | Batch |
|---|---|---|---|
| How you pay | Per token (input + output) | Hourly per model unit | Per token, ~50% off on-demand |
| Cost shape | Scales with usage, from zero | Fixed (capacity × time) | Scales with usage, half-rate |
| Commitment | None — cancel any time | No-commit / 1mo / 6mo terms | None — per job |
| Latency | Real-time, best-effort | Real-time, guaranteed | Asynchronous (job completes later) |
| Throttling risk | Possible at spikes | None (reserved, isolated) | n/a (queued/async) |
| Serves custom models? | No | Yes (the only way) | Base models (where supported) |
| Idle-cost risk | None | High (pays even when idle) | None |
| When it wins | Variable / low / unknown interactive volume; prototypes | High steady volume; SLA paths; custom models | Bulk / offline / async work at any volume |
Situation: The team had defaulted everything to on-demand and watched the Bedrock bill climb as usage grew. They could not tell whether they should move to Provisioned Throughput, and a back-of-envelope estimate suggested reserving capacity for their whole workload would actually cost more than on-demand because much of the traffic was bursty. Meanwhile their nightly document-processing job was being billed at full on-demand rates, and one customer-facing review path was occasionally throttling under load — a contractual SLA risk. They needed someone to run the real break-even math rather than guess.
What CloudRoute did: CloudRoute matched them within 24 hours to a UK-region AWS partner with GenAI cost-engineering experience. The partner pulled real traffic data and routed each workload to the right lane: (1) moved the nightly bulk document job to Batch, cutting that line roughly in half; (2) kept the bursty interactive chat on on-demand and added prompt caching to cut repeat-context cost; (3) ran the break-even calculation on the high-SLA review path, found it ran steady and near-saturation, and reserved one model unit of Provisioned Throughput for it on a 1-month term to start — chosen for the latency guarantee as much as the cost. The partner also filed a Bedrock POC credit application plus an Activate Portfolio application to fund the whole bill.
Outcome: The three-lane split lowered the underlying Bedrock run-rate before credits even applied, the SLA path stopped throttling, and the approved credits then covered the remaining spend — so the team paid $0 through launch and ramp. Once the review path's volume proved out over the month, the partner rolled it onto a 6-month commitment for the deeper rate. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.
lanes: Batch + on-demand + 1×PT MU · run-rate cut before credits, then $0 · credits secured: POC + Activate · out-of-pocket during build: $0
On-demand, Provisioned Throughput, or Batch — the right answer is usually all three, routed per workload. CloudRoute connects you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS partner who runs the break-even math on your real traffic and engineers the mix. Customer pays $0.