
bedrock on-demand vs provisioned · the billing-model decision · 2026

Bedrock on-demand vs Provisioned — which billing model fits your traffic.

A neutral reference for choosing between Amazon Bedrock on-demand and Provisioned Throughput in 2026: how per-token billing and reserved model-unit billing actually differ, the break-even volume with a fully worked example and table, the latency and throughput guarantees each one gives you, the cases where Provisioned Throughput is the only option (custom models), where Batch fits as a cheaper third lane, and a verdict by scenario. Plus how AWS credits — and FinOps discipline — make the right choice cheaper, often $0.


on-demand bills: per token
provisioned bills: per MU / hour
cheaper third lane: Batch (~50%)
cost with credits: $0

TL;DR

On-demand bills per token with zero commitment and best-effort shared capacity; Provisioned Throughput reserves dedicated capacity for one model and bills a flat hourly rate per "model unit" regardless of tokens, with guaranteed throughput and latency. On-demand wins for variable, low, or unknown traffic; Provisioned Throughput wins for steady high volume and for any path that cannot tolerate throttling.
The cost choice between them is a single break-even volume. Below it, on-demand is cheaper; above it, the flat provisioned bill is. The worked example below shows that on a base model, Provisioned Throughput only beats on-demand once reserved capacity runs at high, sustained utilization — idle reserved capacity is pure waste. Two things override the pure cost line: an SLA that throttling would breach, and custom models, which on-demand cannot serve at all.
It is not actually a two-way choice — Batch is the cheaper third lane (around 50% off on-demand) for anything that tolerates latency. The honest pattern: Batch for bulk, on-demand (with prompt caching) for interactive, Provisioned Throughput only for the one or two hot, steady, or custom paths that justify it. All three are credit-eligible Bedrock spend; CloudRoute routes you to a credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted partner to pick and right-size the mix — customer pays $0.

the two billing models

I. Two billing models, one decision

Amazon Bedrock gives you two fundamentally different ways to pay for inference on the same models. They are not tiers of the same thing — they are different cost structures, and choosing between them is the largest cost-and-reliability lever on a heavy Bedrock workload. Start by being precise about what each one actually charges for.

On-demand is the default. You call a model, you pay a published rate per 1,000 input tokens and per 1,000 output tokens for that specific model, and you commit to nothing. Capacity is shared across every account in the region and governed by per-account throughput quotas. Your bill is a direct function of usage: send more tokens, pay more; send none, pay nothing. There is no floor and no ceiling — it scales perfectly with traffic, in both directions.

Provisioned Throughput (PT) inverts that. You reserve dedicated inference capacity for one specific model, measured in model units, and pay a flat hourly rate per model unit for as long as the allocation exists — independent of how many tokens you actually push through it. Your bill is a function of reserved capacity and time (model units × hours), not usage. Saturate the unit or leave it idle for an hour and you pay the same. (The dedicated PT page covers the model-unit lifecycle, commitment terms, and buy/manage steps in depth; this page is about choosing between the two.)

The cleanest way to hold the difference in your head: on-demand cost tracks tokens; provisioned cost tracks time. One is a usage meter, the other is a capacity lease. Everything else — the break-even math, the latency contract, the custom-model rule — follows from that single distinction.

A familiar analogy: on-demand is paying per ride; Provisioned Throughput is leasing a dedicated vehicle. Ride occasionally and pay-per-ride is cheaper. Commute heavily every day and the lease is cheaper and always available — but you pay for it even on the days you stay home. The entire decision is figuring out which describes your traffic, and recognizing the cases where you are not allowed to take rides at all (custom models) and must lease.

There is also a crucial third option that the "on-demand vs provisioned" framing tends to hide: Batch, which runs large jobs asynchronously at roughly half the on-demand token rate. For any workload that does not need an answer in real time, Batch quietly beats both of the headline options on cost. We bring it back in Section VI, because a real Bedrock cost plan is usually three lanes, not two.

the one-line distinction

On-demand = pay per token, commit to nothing, shared best-effort capacity. Provisioned Throughput = reserve model units, pay a flat hourly rate per unit regardless of tokens, get guaranteed throughput/latency, and unlock custom-model hosting. On-demand cost tracks usage; provisioned cost tracks time.

the cost shapes

II. How each one actually bills

Before any break-even math, it helps to see the two cost curves clearly, because their shapes are what make the decision. One is a sloped line through the origin; the other is a flat line above zero. Where they cross is the whole game.

Picture monthly cost on the vertical axis and monthly token volume on the horizontal axis.

So the cost comparison is literally the intersection of a sloped line and a flat line. Left of the crossover, on-demand is cheaper; right of it, Provisioned Throughput is. The practical question — "am I left or right of the crossover?" — is what Section V works out with real numbers. But notice what the geometry already tells you: Provisioned Throughput can only win on cost if your reserved capacity is genuinely busy, because a flat bill spread over little usage is a terrible per-token effective rate. This is why over-provisioning is the classic PT mistake — you push yourself left of the crossover and quietly overpay.

On-demand — a sloped line from zero

On-demand is a straight line starting at the origin: zero traffic costs zero, and cost rises linearly with tokens. The slope is set by the model's per-token rates (output tokens typically cost several times more than input tokens). Because it passes through zero, on-demand is unbeatable at low volume — there is simply no cost when there is no traffic. The downside is that the line keeps climbing forever: at very high, steady volume, that ever-rising cost is exactly what a flat reserved bill can undercut.

Provisioned Throughput — a flat line above zero

Provisioned Throughput is a horizontal line: a fixed monthly cost (model units × hourly rate × hours in the month) that does not move with usage. At zero traffic you are still paying the full reserved amount; at maximum saturation you pay the same. The flat line sits well above zero, so at low volume it is far more expensive than on-demand — but because it is flat while on-demand keeps climbing, the two lines cross at some volume, and beyond that point the flat bill is the cheaper one.
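The two shapes can be sketched in a few lines of Python. This is a rough illustration, not a pricing tool: the rates are the representative figures this page uses in its worked example ($3 and $15 per 1M tokens on-demand, about $25 per model-unit-hour, a 730-hour month), the 5:1 input:output mix is an assumption, and the function names are made up for the sketch.

```python
# Illustrative rates taken from this page's worked example; not AWS quotes.
INPUT_RATE_PER_M = 3.0      # USD per 1M input tokens (on-demand)
OUTPUT_RATE_PER_M = 15.0    # USD per 1M output tokens (on-demand)
MU_HOURLY_RATE = 25.0       # USD per model-unit-hour (provisioned)
HOURS_PER_MONTH = 730


def on_demand_monthly(input_tokens: float, output_tokens: float) -> float:
    """Sloped line through the origin: cost is proportional to tokens."""
    return ((input_tokens / 1e6) * INPUT_RATE_PER_M
            + (output_tokens / 1e6) * OUTPUT_RATE_PER_M)


def provisioned_monthly(model_units: int) -> float:
    """Flat line: cost depends only on reserved units and time."""
    return model_units * MU_HOURLY_RATE * HOURS_PER_MONTH


for billions_in in (0, 1, 3, 6, 8):
    tokens_in = billions_in * 1e9
    tokens_out = tokens_in / 5          # assumed 5:1 input:output mix
    od = on_demand_monthly(tokens_in, tokens_out)
    pt = provisioned_monthly(2)
    cheaper = "on-demand" if od < pt else "provisioned"
    print(f"{billions_in}B in: on-demand ${od:,.0f} vs provisioned ${pt:,.0f} -> {cheaper}")
```

Swapping in your own rates and mix moves the crossover, but the shapes stay the same: one function takes tokens as its argument, the other takes only reserved units.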

beyond cost

III. Latency, throughput, and the reliability contract

Cost is only half the decision. The two billing models also give you very different performance and reliability guarantees, and for production systems that difference is often the deciding factor before the math even enters the picture.

On-demand capacity is shared and best-effort within your quota. Most of the time it is perfectly fine. But it is multi-tenant: during regional demand surges you can hit throttling — the ThrottlingException that latency-sensitive teams learn to design around — and tail latency is not contractually fixed. You also operate inside per-account, per-model throughput quotas, which can become a ceiling for a high-volume path even when AWS has capacity. On-demand asks you to engineer around variability: retries with backoff, queueing, and often cross-region inference to spread load across regions.
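The "retries with backoff" part of that engineering can be sketched generically. The example below is a minimal illustration of exponential backoff with full jitter, not Bedrock client code: `ThrottledError` is a hypothetical stand-in for whatever throttling error your client raises (with boto3 this usually surfaces as a `ClientError` whose error code is `ThrottlingException`), and the delay values are arbitrary assumptions.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error raised by a client."""


def call_with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Retry `call` on throttling with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential ceiling, capped at max_delay, with a random wait below it.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists only so the wait can be replaced in tests. Jitter matters because many clients retrying on the same schedule would otherwise hit the shared capacity again at the same moment.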

A provisioned model unit gives you isolated, guaranteed throughput — a fixed tokens-per-minute capacity that is yours, insulated from other tenants' spikes. Latency is consistent and there is no contention-driven throttling, because nobody else is using your reserved capacity. For an interactive product under a latency SLA, a checkout or onboarding flow that must not stall, or a regulated workflow that has to complete deterministically, that predictability is frequently worth more than any cost comparison. The capacity is reserved, so it behaves the same at 3pm on launch day as at 3am on a quiet night.

There is a useful pairing to keep straight here. Cross-region inference is the on-demand answer to spike-smoothing: it lets Bedrock route each request to whichever region in a predefined set has capacity, which reduces throttling. Provisioned Throughput is the reserved-capacity answer. They solve overlapping problems differently: cross-region inference improves the odds on shared capacity without a commitment; PT removes the shared-capacity problem entirely by giving you your own. (See the cross-region-inference sibling for that mechanism in detail.)

The reliability axis is why the decision is genuinely two-dimensional rather than a pure cost optimization. You can be below the cost break-even — where on-demand is cheaper per token — and still correctly choose Provisioned Throughput because best-effort capacity is not an acceptable posture for that specific path. Putting it bluntly: cost decides the default; an SLA can override it. A path that absolutely cannot throttle may justify reserved capacity even when the units will not be fully saturated, because the alternative is breaching the SLA during exactly the traffic spikes that matter most.

the reliability override

On-demand = best-effort, shared, can throttle at spikes, latency not contractually fixed. Provisioned Throughput = isolated, guaranteed throughput and consistent latency, no contention throttling. A hard latency/availability SLA can justify PT below the cost break-even — the guarantee, not the price, is the reason.

no choice involved

IV. The custom-model rule — when it is not a choice at all

For base models, on-demand vs Provisioned Throughput is a genuine choice you can make either way. For an important class of workloads it is not a choice at all: Provisioned Throughput is the only option. This is the case teams most often miss when budgeting a custom-model project, so it is worth being explicit.

The defining rule: most custom models on Bedrock can only be served via Provisioned Throughput. On-demand is reserved for the shared, multi-tenant base models that many accounts call. Anything that is specifically yours needs dedicated capacity to host it, because there is no shared endpoint for a model only your account has. That covers the main ways a model becomes "yours":

Fine-tuned models — A base model you fine-tuned on your own data is a private artifact unique to your account, so serving it requires Provisioned Throughput for that custom model. This is the recurring cost that surprises teams: the fine-tuning training run was a small one-time charge, but keeping the result available is a standing stream of model-unit-hours for as long as it is deployed.
Distilled models — Model distillation trains a smaller, cheaper model to mimic a larger one for a narrow task. The distilled artifact is custom, so it follows the same rule — serve it on Provisioned Throughput. The trade can still be excellent: a small distilled model on one model unit can be far cheaper at volume than a frontier model on-demand. But you are now in the provisioned cost model, not the on-demand one.
Imported custom-weight models — Where Bedrock supports importing your own model weights (Custom Model Import for supported architectures), the imported model is served from dedicated capacity you provision. No shared on-demand path exists, because the weights are private to you.

The planning consequence is concrete: if your roadmap includes fine-tuning, distillation, or importing a model, the "on-demand vs provisioned" question is partly already answered — that path will be on Provisioned Throughput, and the only remaining question is how few model units can serve the load. Build the standing hosting cost into the budget from day one. The most common custom-model budgeting mistake is pricing only the training and forgetting that the resulting model then sits on a 24/7 hourly charge for as long as it is live. For many narrow use cases, that standing cost is precisely why the team should sanity-check whether a base model on-demand — with good prompting, RAG, or prompt caching — would have been cheaper overall. (The fine-tuning sibling covers that decision.)
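To make the budgeting point concrete, here is a small sketch of the total cost of a custom model over its deployed life. Every number in the example call is hypothetical: the $2,000 training charge is invented for illustration, and the $25 hourly rate and 730-hour month are the representative figures used elsewhere on this page.

```python
def custom_model_total_cost(training_cost: float, model_units: int,
                            hourly_rate: float, months: int,
                            hours_per_month: int = 730) -> float:
    """One-time training charge plus the standing model-unit-hour hosting cost."""
    hosting = model_units * hourly_rate * hours_per_month * months
    return training_cost + hosting


# Hypothetical: a $2,000 fine-tune hosted on 1 model unit at $25/hour for a year.
total = custom_model_total_cost(training_cost=2_000, model_units=1,
                                hourly_rate=25.0, months=12)
hosting_share = (total - 2_000) / total
print(f"total ${total:,.0f}, of which hosting is {hosting_share:.0%}")
```

Under these assumptions hosting dwarfs training, which is the reason to price the hosting line first.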

the rule to remember

Base models: on-demand or Provisioned Throughput — your choice. Custom models (fine-tuned, distilled, imported): Provisioned Throughput only. A fine-tune decision is therefore also a commitment to a standing hourly hosting cost — budget the hosting, not just the training.

the worked math

V. The break-even math — a worked example

For a base model, the cost half of the decision reduces to one number: the volume at which Provisioned Throughput becomes cheaper than on-demand. Below it, stay on-demand; above it, reserve. Here is exactly how to compute your own line, with representative numbers to show the shape. The numbers are illustrative 2026 figures, not quotes — confirm current rates on the AWS pricing page.

The method does not depend on the exact rates: compare the fixed monthly cost of the model units you would need against the per-token on-demand cost of the same traffic. Where the two cross is your break-even.

Step 1 — size the model units. A model unit delivers a published throughput (tokens/minute) for a given model. Take your peak sustained throughput requirement and divide by the per-unit throughput to get the number of units you must reserve to serve the load without throttling. Say a workload needs 2 model units of a Sonnet-class model to serve its peak.
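As a sketch, Step 1 is a ceiling division. The per-unit throughput below (100,000 tokens/minute) is a placeholder assumption, not a published figure; use the number AWS gives you for your model.

```python
import math


def model_units_needed(peak_tokens_per_minute: float,
                       unit_tokens_per_minute: float) -> int:
    """Smallest whole number of model units that covers the sustained peak."""
    if unit_tokens_per_minute <= 0:
        raise ValueError("unit_tokens_per_minute must be positive")
    return math.ceil(peak_tokens_per_minute / unit_tokens_per_minute)


# Hypothetical: 150,000 tokens/minute peak against 100,000 tokens/minute per unit.
print(model_units_needed(150_000, 100_000))
```

Rounding up is the point: a peak that needs 1.5 units still means reserving 2, and the unused half-unit is part of the fixed cost.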

Step 2 — compute the fixed monthly PT cost. Suppose, for illustration, the discounted (longer-commitment) rate for that model is on the order of $25 per model-unit-hour. Two units running continuously across a 730-hour month is 2 × $25 × 730 ≈ $36,500/month, fixed, regardless of token volume.

Step 3 — price the same traffic on-demand. Now cost the actual token throughput at on-demand rates. Suppose those two units, running at roughly half their capacity, push about 3.0 billion input and 0.6 billion output tokens per month. At representative Sonnet-class on-demand rates ($3 per 1M input, $15 per 1M output) that is 3,000 × $3 + 600 × $15 = $9,000 + $9,000 = $18,000/month. At that volume on-demand costs roughly half as much as reserving, so do not provision.

Step 4 — find the crossover. Provisioned Throughput only wins once on-demand cost climbs past the ~$36,500 fixed bill. Holding the same 5:1 input:output mix, on-demand reaches ~$36,500/month at roughly 6.1 billion input + 1.2 billion output tokens — about double the volume of the half-utilized case above. In other words, those two model units have to run close to saturation, around the clock, before reservation pays off on cost alone. The table below lays the two curves side by side at three utilization levels so the crossover is visible.
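The crossover in Step 4 can be solved directly rather than read off a chart. The sketch below assumes a fixed input:output ratio and uses the same illustrative rates as the steps above; the function name is made up for this example.

```python
def break_even_input_tokens(fixed_monthly_cost: float, input_rate_per_m: float,
                            output_rate_per_m: float,
                            input_to_output_ratio: float) -> float:
    """Monthly input tokens at which on-demand cost equals the fixed PT bill."""
    # Cost of 1M input tokens plus the output tokens that accompany them.
    blended_per_m_input = input_rate_per_m + output_rate_per_m / input_to_output_ratio
    return fixed_monthly_cost / blended_per_m_input * 1e6


fixed = 2 * 25.0 * 730                      # 2 units, $25/unit-hour, 730 hours
be_in = break_even_input_tokens(fixed, 3.0, 15.0, 5.0)
be_out = be_in / 5.0
print(f"break-even: {be_in / 1e9:.2f}B input + {be_out / 1e9:.2f}B output per month")
```

With these inputs the result lands at about 6.1 billion input and 1.2 billion output tokens, matching the figure in Step 4. A more output-heavy mix lowers the break-even volume, because output tokens cost more per token.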

The blunt lesson: on a base model, Provisioned Throughput beats on-demand on cost only at genuinely high, sustained utilization. If the units would sit half-idle, on-demand is cheaper — often dramatically so. Two factors shift the line in PT's favor beyond raw cost. (1) Reliability: if on-demand throttling would breach an SLA, the guarantee can justify PT below the pure cost crossover. (2) Custom models: there is no on-demand line to compare against, so the question becomes "how few units can serve the load," not "PT or on-demand." For everything else, the rule of thumb holds: provision only when you can keep the units busy.
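Another way to see the utilization point is to turn the flat bill into an effective per-token rate. The sketch below assumes, as the worked example does, that about 7.3 billion total tokens a month (6.1B input + 1.2B output) is what the two units serve when fully busy; that capacity figure is illustrative, not a published throughput.

```python
def effective_rate_per_m(fixed_monthly_cost: float, full_capacity_tokens: float,
                         utilization: float) -> float:
    """Provisioned cost per 1M tokens actually served at a given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    served_tokens = full_capacity_tokens * utilization
    return fixed_monthly_cost / (served_tokens / 1e6)


# On-demand blended rate at a 5:1 mix: (5 * $3 + 1 * $15) / 6 = $5 per 1M tokens.
ON_DEMAND_BLENDED_PER_M = (5 * 3.0 + 1 * 15.0) / 6

for utilization in (1.0, 0.5, 0.25):
    rate = effective_rate_per_m(36_500, 7.3e9, utilization)
    print(f"{utilization:.0%} busy: ${rate:.2f} per 1M vs on-demand ${ON_DEMAND_BLENDED_PER_M:.2f}")
```

Halving utilization doubles the effective rate, so under these assumptions a half-idle reservation costs about twice the on-demand blended rate for the same tokens.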

worked break-even · 2 Sonnet-class model units · representative 2026 figures (not quotes)

Scenario	Monthly tokens (in / out)	On-demand cost/mo	Provisioned cost/mo (2 MU)	Cheaper option
~50% utilization	3.0B / 0.6B	≈ $18,000	≈ $36,500 (fixed)	On-demand (by ~$18.5K)
~break-even (~100%)	6.1B / 1.2B	≈ $36,500	≈ $36,500 (fixed)	Tie — the crossover
Sustained max + spikes	8.0B / 1.6B	≈ $48,000	≈ $36,500 (fixed)	Provisioned (by ~$11.5K)

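Illustrative only: assumes representative Sonnet-class rates ($3/1M input, $15/1M output on-demand; ~$25/model-unit-hour provisioned) and a 730-hour month. Confirm current rates on the AWS Bedrock pricing page. The utilization labels are approximate and follow from the break-even math, not from a published per-unit throughput; the third row assumes the two units can actually deliver that volume, and if they cannot, a third unit raises the fixed cost to about $54,750. The provisioned column is flat because reserved capacity bills on time, not tokens; the on-demand column rises with volume. Below the crossover on-demand wins; above it provisioned does. Reliability needs or custom-model hosting can justify provisioned below the crossover.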

the break-even rule of thumb

Provisioned Throughput beats on-demand on cost only when reserved model units run at high, sustained utilization (busy most hours of most days). Idle reserved capacity is pure waste. Below the crossover, on-demand — plus Batch and prompt caching where they fit — is cheaper. Above it, and for guaranteed-SLA or custom-model paths, reserve.

the third lane

VI. Where Batch fits — the cheaper third lane

The "on-demand vs provisioned" question quietly assumes you need answers in real time. A large share of GenAI work does not. For anything that tolerates latency, Batch is a third lane that undercuts both — and ignoring it is one of the most common ways teams overpay on Bedrock.

Batch inference processes a large set of inputs asynchronously: you submit a job (typically input data staged in Amazon S3), Bedrock runs it when capacity is available, and you collect the results when it completes. In exchange for giving up real-time responses, the per-token rate is roughly 50% lower than on-demand for supported models. For non-interactive work, that is a structural discount no amount of provisioning tuning can match on the same tokens.

Crucially, Batch is a different billing model from both of the headline options. It is still per-token (like on-demand), so it commits you to nothing and scales with usage — but at about half the rate. It is not reserved capacity (unlike PT), so there is no idle-cost risk. The catch is purely latency: results arrive when the job finishes, not within milliseconds, so Batch is unsuitable for anything a user is waiting on live.
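A quick way to size the Batch opportunity is to estimate what share of current on-demand spend is latency-tolerant. The sketch below applies the roughly 50% discount to that share only; the 40% share and the $18,000 starting bill are hypothetical inputs, and the real discount should be confirmed per model.

```python
BATCH_DISCOUNT = 0.5  # "roughly 50%" per this page; confirm for your model


def monthly_cost_with_batch(on_demand_cost: float, async_share: float,
                            discount: float = BATCH_DISCOUNT) -> float:
    """Cost after moving the latency-tolerant share of token spend to Batch."""
    if not 0.0 <= async_share <= 1.0:
        raise ValueError("async_share must be between 0 and 1")
    realtime_part = on_demand_cost * (1.0 - async_share)
    batch_part = on_demand_cost * async_share * (1.0 - discount)
    return realtime_part + batch_part


before = 18_000.0                                          # all traffic on-demand
after = monthly_cost_with_batch(before, async_share=0.4)   # 40% is hypothetical
print(f"before ${before:,.0f}, after ${after:,.0f}, saved ${before - after:,.0f}")
```

The saving scales linearly with the asynchronous share, which is why auditing workloads for latency tolerance usually comes before any provisioning decision.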

The classes of work that belong on Batch are large and common: bulk document processing (summarization, extraction, classification across a corpus), embedding generation for a knowledge base or search index, offline evaluation and dataset labeling, nightly enrichment jobs, and any periodic pipeline where the output is consumed later. Moving these off on-demand — and certainly off Provisioned Throughput — is often the single biggest line-item saving on a Bedrock bill. (The Batch sibling covers job mechanics and limits.)

So the real decision is not binary. A well-engineered Bedrock workload usually runs three lanes at once: Batch for everything asynchronous, on-demand (often with prompt caching to cut repeat-context cost) for interactive traffic that is variable or moderate, and Provisioned Throughput reserved only for the one or two hot, steady paths — or any custom model — that genuinely justify dedicated capacity. Picking "on-demand vs provisioned" for the whole account is the wrong altitude; the right move is to route each workload to the lane that fits its latency tolerance and volume shape.

three lanes, not two

Batch (~50% off on-demand) for anything asynchronous — bulk processing, embeddings, offline eval. On-demand (with prompt caching) for variable or moderate interactive traffic. Provisioned Throughput only for hot, steady, high-volume paths or custom models. Match each workload to a lane rather than picking one billing model for everything.

the decision, distilled

VII. Verdict by scenario

Pulling the whole decision into a single set of calls. Find the row that matches your workload; the recommendation column is the default, and the reasoning column tells you when to override it.

These are defaults, not absolutes — an SLA, a compliance constraint, or a credit-funded runway can shift any of them. But for most teams, matching the workload to the row below gets the billing model right on the first try.

Scenario	Recommendation	Reasoning
Prototype / early experiment	On-demand	Volume is unknown and the model choice is still moving: commit to nothing, pay only for what you use, and revisit once you have real traffic data. Reserving capacity here is the most common way teams waste money on PT.
Variable or spiky interactive traffic	On-demand, with prompt caching and (if throttling appears) cross-region inference	The sloped-from-zero cost curve rewards traffic that quiets down, and you keep full flexibility to change models.
Bulk / asynchronous / offline jobs	Batch	~50% cheaper per token than on-demand and no reserved-capacity risk. The only cost is latency, which these jobs tolerate by definition.
High, steady, predictable interactive volume	Provisioned Throughput, once the break-even math confirms the units would run busy	Run the worked calculation on real traffic first; if utilization would be high and sustained, reserve. Start with a shorter commitment term until the volume is proven.
Hard latency / availability SLA	Provisioned Throughput, even below the cost break-even	The guaranteed, isolated capacity is the reason: best-effort on-demand can throttle at exactly the spikes that matter. The reliability override applies.
Custom model (fine-tuned / distilled / imported)	Provisioned Throughput (no choice)	On-demand cannot serve it. The only open question is how few model units serve the load; budget the standing hosting cost from day one.

One meta-point ties the table together: the question is per-workload, not per-account. Most teams that get Bedrock costs right do not pick a single billing model — they run all three lanes and route each path to the right one, then keep watching utilization and re-routing as traffic shape changes. That ongoing right-sizing — measuring real traffic, computing the break-even, moving paths between lanes, and cleaning up idle reserved capacity — is exactly the FinOps work a vetted partner handles in the engagements CloudRoute routes.
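The per-workload routing can be written down as a small decision function. This is one possible encoding of the defaults on this page, not a definitive policy: the `Workload` fields are hypothetical names, and `above_break_even` is assumed to come from running the break-even math on measured traffic.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    custom_model: bool = False       # fine-tuned, distilled, or imported
    needs_realtime: bool = True      # a user or system is waiting on the answer
    hard_sla: bool = False           # throttling would breach a commitment
    above_break_even: bool = False   # result of the break-even math


def pick_lane(w: Workload) -> str:
    """Default lane for a workload, following this page's scenario verdicts."""
    if w.custom_model:
        return "provisioned"         # custom-model rule: no on-demand path
    if not w.needs_realtime:
        return "batch"               # latency-tolerant work takes the cheap lane
    if w.hard_sla or w.above_break_even:
        return "provisioned"         # reliability override or busy reserved units
    return "on-demand"


for w in (Workload("nightly-docs", needs_realtime=False),
          Workload("chat"),
          Workload("review-path", hard_sla=True)):
    print(w.name, "->", pick_lane(w))
```

The order of the checks carries the priorities: the custom-model rule first, latency tolerance second, and only then the cost and SLA questions.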

how it becomes $0

VIII. How AWS credits and FinOps make the right choice cheaper

Whichever lane a workload lands in, the spend is ordinary Bedrock spend — which means AWS credits can absorb it, and disciplined cost engineering can make the bill smaller before credits even apply. Together they change the risk calculus of the whole on-demand-vs-provisioned decision.

On-demand tokens, Batch jobs, and Provisioned-Throughput model-unit-hours are all fully credit-eligible — credits in your AWS account apply automatically against each, the same way they apply to fine-tuning, embeddings, and the rest of your bill. The relevant pools are the familiar ones: AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups), a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed at proving out a GenAI use case, and the competitive Generative AI Accelerator (awards up to $1M for a small cohort of AI-first startups).

Why credits matter specifically for this decision: the scariest thing about choosing Provisioned Throughput is paying for reserved capacity during the months before a workload has fully ramped — the "what if the volume does not materialize" risk that keeps teams on more expensive on-demand longer than they should be. When the commitment is drawn from a credit pool rather than runway, that risk is largely defused. You can reserve what a custom model or a high-SLA path needs, run it through launch and ramp, and let credits cover the standing cost while you prove the workload out. The discipline becomes "make the credits last" rather than "protect the bank balance" — which often means the correct billing model gets chosen instead of the cheapest-looking one.

Credits do not remove the need for FinOps, though — they extend it. The same right-sizing that controls a paid bill controls a credit burn: routing each workload to the cheapest lane that fits (Batch wherever latency allows), sizing model units from measured peak rather than a guess, wiring CloudWatch utilization alarms, and decommissioning idle allocations promptly. Done well, the cost-optimization sibling's techniques and the credit pool compound — the credits last far longer because the underlying bill is already lean.
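The "decommission idle allocations" discipline can start as a simple periodic check. The sketch below assumes you already collect hourly utilization samples (as fractions of reserved capacity) from your monitoring; it implies no specific CloudWatch metric name, and the 20% and 80% thresholds are arbitrary illustrative values.

```python
from statistics import mean


def review_allocation(hourly_utilization: list[float],
                      idle_threshold: float = 0.2,
                      busy_threshold: float = 0.8) -> str:
    """Classify a reserved allocation from its average measured utilization."""
    if not hourly_utilization:
        raise ValueError("need at least one utilization sample")
    average = mean(hourly_utilization)
    if average < idle_threshold:
        return "decommission-candidate"   # mostly paying for idle capacity
    if average < busy_threshold:
        return "review"                   # re-run the break-even math
    return "keep"                         # busy enough to justify the flat bill


print(review_allocation([0.05, 0.10, 0.08, 0.12]))
print(review_allocation([0.85, 0.92, 0.88, 0.95]))
```

An average hides spikes, so a path kept for its SLA guarantee rather than its cost would need a peak-based check instead.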

The practical mechanic is that these pools are largely partner-filed — requested through the AWS Partner Network (the ACE program), not a public self-serve form — which is why teams route through an AWS partner rather than applying alone. CloudRoute matches you to the right pool for your stage and to a vetted AWS DevOps/ML partner who both files the credit application and does the actual cost engineering: choosing the lane per workload, running the break-even math on real traffic, sizing and managing any reserved capacity, and keeping the bill lean. The customer pays $0 — AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice. (For the credit mechanics, see AWS credits for generative-AI startups and the Bedrock POC funding page.)

on-demand vs provisioned vs batch

On-Demand vs Provisioned Throughput vs Batch — the three lanes side by side

The scannable version of the whole decision: the two headline billing models plus the Batch third lane, across how they bill, cost shape, performance, commitment, custom-model support, and where each one wins. Figures are representative 2026 illustrations, not quotes.

Variable	On-Demand	Provisioned Throughput	Batch
How you pay	Per token (input + output)	Hourly per model unit	Per token, ~50% off on-demand
Cost shape	Scales with usage, from zero	Fixed (capacity × time)	Scales with usage, half-rate
Commitment	None — cancel any time	No-commit / 1mo / 6mo terms	None — per job
Latency	Real-time, best-effort	Real-time, guaranteed	Asynchronous (job completes later)
Throttling risk	Possible at spikes	None (reserved, isolated)	n/a (queued/async)
Serves custom models?	No	Yes (the only way)	Base models (where supported)
Idle-cost risk	None	High (pays even when idle)	None
When it wins	Variable / low / unknown interactive volume; prototypes	High steady volume; SLA paths; custom models	Bulk / offline / async work at any volume

Provisioned Throughput also has a no-commitment hourly option (highest rate, cancellable any time) for spikes and validation; the 1-month and 6-month terms trade flexibility for a lower hourly rate. Custom (fine-tuned / distilled / imported) models can only run on Provisioned Throughput. On a base model, PT beats on-demand on cost only at high, sustained utilization — see the break-even section. All figures representative as of 2026; confirm on the AWS Bedrock pricing page.

before you pick a billing model

Get AWS credits to fund either path — and a partner to pick and size the mix (you pay $0)


a recent match

A workload split across all three lanes — funded at $0 — anonymized

inquiry · Series-A document-AI SaaS, United Kingdom

Series-A document-AI SaaS, 21 people, a mix of real-time chat, nightly bulk processing, and a high-SLA review path

Situation: The team had defaulted everything to on-demand and watched the Bedrock bill climb as usage grew. They could not tell whether they should move to Provisioned Throughput, and a back-of-envelope estimate suggested reserving capacity for their whole workload would actually cost more than on-demand because much of the traffic was bursty. Meanwhile their nightly document-processing job was being billed at full on-demand rates, and one customer-facing review path was occasionally throttling under load — a contractual SLA risk. They needed someone to run the real break-even math rather than guess.

What CloudRoute did: CloudRoute matched them within 24 hours to a UK-region AWS partner with GenAI cost-engineering experience. The partner pulled real traffic data and routed each workload to the right lane: (1) moved the nightly bulk document job to Batch, cutting that line roughly in half; (2) kept the bursty interactive chat on on-demand and added prompt caching to cut repeat-context cost; (3) ran the break-even calculation on the high-SLA review path, found it ran steady and near-saturation, and reserved one model unit of Provisioned Throughput for it on a 1-month term to start — chosen for the latency guarantee as much as the cost. The partner also filed a Bedrock POC credit application plus an Activate Portfolio application to fund the whole bill.

Outcome: The three-lane split lowered the underlying Bedrock run-rate before credits even applied, the SLA path stopped throttling, and the approved credits then covered the remaining spend — so the team paid $0 through launch and ramp. Once the review path's volume proved out over the month, the partner rolled it onto a 6-month commitment for the deeper rate. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.

lanes: Batch + on-demand + 1×PT MU · run-rate cut before credits, then $0 · credits secured: POC + Activate · out-of-pocket during build: $0

faq

Common questions

What is the difference between Bedrock on-demand and Provisioned Throughput?

On-demand bills per token (input + output) for shared, best-effort capacity with no commitment — your cost scales directly with usage and is zero when idle. Provisioned Throughput reserves dedicated capacity for one model, measured in "model units," and bills a flat hourly rate per unit regardless of tokens, with guaranteed throughput and latency. The core distinction: on-demand cost tracks usage; provisioned cost tracks reserved capacity and time. On-demand suits variable or low traffic; Provisioned Throughput suits steady high volume, SLA-bound paths, and custom models.

When is Provisioned Throughput cheaper than on-demand?

Only when reserved model units run at high, sustained utilization. Because Provisioned Throughput is a flat monthly cost and on-demand scales with usage, the two cross at a break-even volume: below it on-demand is cheaper, above it Provisioned Throughput is. In a representative worked example, two saturated Sonnet-class model units at ~$25/model-unit-hour cost ~$36,500/month, which on-demand only exceeds at very high token volume (~6.1B input + 1.2B output). If your units would sit half-idle, on-demand is far cheaper. Reliability needs or custom models can justify PT below the cost crossover.

How do I calculate the break-even between on-demand and Provisioned Throughput?

Four steps. (1) Size the model units needed from your peak sustained throughput (tokens/minute ÷ per-unit throughput). (2) Compute the fixed monthly PT cost = units × hourly rate × hours per month. (3) Price the same monthly token volume at on-demand rates (input and output separately). (4) The crossover — where on-demand cost rises past the fixed PT bill — is your break-even. Practically, PT only wins on cost when reserved units run busy most hours of most days; idle reserved capacity is pure waste. All rates are representative as of 2026 — confirm on the AWS Bedrock pricing page.

Does on-demand or Provisioned Throughput give better latency?

Provisioned Throughput gives guaranteed, consistent latency and isolated throughput — your reserved capacity is not affected by other tenants' spikes and will not throttle from contention. On-demand is best-effort within your account quota: usually fine, but it can throttle during regional demand surges and its tail latency is not contractually fixed. For an interactive product under a latency SLA, that guarantee can justify Provisioned Throughput even when the cost math alone would favor on-demand. Cross-region inference is the on-demand way to reduce throttling without a commitment.

Can I serve a fine-tuned model with on-demand pricing?

No. Custom models — fine-tuned, distilled, or imported via Custom Model Import — can only be served on Provisioned Throughput, because there is no shared on-demand endpoint for a model unique to your account. This makes the on-demand-vs-provisioned question moot for those paths: they will be on Provisioned Throughput, and the only remaining decision is how few model units serve the load. Budget the standing hourly hosting cost from day one — the fine-tuning training run is a small one-time charge, but keeping the model deployed bills continuously.

Where does Batch fit between on-demand and Provisioned Throughput?

Batch is a third lane that beats both on cost for asynchronous work: it processes large jobs offline at roughly 50% of the on-demand per-token rate, with no commitment and no reserved-capacity risk. The trade-off is latency — results arrive when the job completes, not in real time — so it suits bulk document processing, embedding generation, offline evaluation, and nightly pipelines. A well-engineered Bedrock workload usually runs three lanes: Batch for async work, on-demand (with prompt caching) for variable interactive traffic, and Provisioned Throughput only for hot, steady, or custom paths.

Should I pick one billing model for my whole Bedrock account?

Generally no — the decision is per-workload, not per-account. Different paths have different latency tolerances and volume shapes, so the right approach is to route each to the lane that fits: Batch for anything asynchronous, on-demand for variable or moderate interactive traffic, and Provisioned Throughput reserved only for the one or two hot, steady, high-volume paths (or any custom model) that justify dedicated capacity. Then keep watching utilization and re-route as traffic changes. Picking a single billing model for everything is the most common altitude mistake.

Can AWS credits cover both on-demand and Provisioned Throughput costs?

Yes — on-demand tokens, Batch jobs, and Provisioned-Throughput model-unit-hours are all ordinary Bedrock spend and fully credit-eligible; credits in your AWS account apply automatically against each. This is especially useful for the on-demand-vs-provisioned decision because it defuses the main risk of reserving capacity — paying for it before a workload has ramped — so you can choose the correct billing model rather than the cheapest-looking one. The relevant pools (AWS Activate up to $100K, Bedrock/GenAI POC $10K–$50K, GenAI Accelerator up to $1M) are largely partner-filed. CloudRoute matches you to the right pool and a vetted AWS partner who files the application and engineers the cost mix — customer pays $0.

Pick the right billing model — and let AWS fund it

On-demand, Provisioned Throughput, or Batch — the right answer is usually all three, routed per workload. CloudRoute connects you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS partner who runs the break-even math on your real traffic and engineers the mix. Customer pays $0.


matched within: < 24h
GenAI credit ceiling: up to $1M
cost to you: $0