A neutral reference for Amazon Bedrock Provisioned Throughput in 2026: what a reserved "model unit" actually buys you, how it differs from on-demand, when it is genuinely required (hosting custom and fine-tuned models), the three commitment tiers (no-commit hourly, 1-month, 6-month), the exact break-even math against on-demand with a worked example, how to buy and manage model units, and how AWS credits can fund the commitment so the build costs you $0.
Provisioned Throughput is the mode you reach for when on-demand stops being enough — either because you need guaranteed capacity, or because you are serving a model that on-demand cannot serve at all. Understanding what you are reserving, and what the reservation guarantees, is the whole decision.
Most Bedrock usage starts on the on-demand path: you call a model, you pay a published per-token rate for input and output tokens, and you commit to nothing. Capacity is shared across all accounts in a region and governed by per-account throughput quotas. That is ideal until your traffic is high enough, steady enough, or sensitive enough that shared capacity becomes a liability. Provisioned Throughput (PT) is the answer to that: you reserve dedicated inference capacity for one specific model and pay a flat hourly rate for it.
The unit of reservation is the model unit (MU). A model unit represents a defined, guaranteed amount of throughput for a given model — a certain number of input and output tokens per minute (the exact figures vary by model and are published per model on the Bedrock console). You buy one or more model units of a specific model, and from that moment you are billed an hourly rate per model unit for as long as the provisioned-throughput allocation exists — independent of how many requests you actually send through it. Send zero tokens for an hour and you still pay the hourly rate; saturate the unit and you pay the same hourly rate.
That flat-rate structure is the entire point. On-demand cost is a function of usage (tokens consumed); provisioned cost is a function of time and reserved capacity (model units × hours). PT decouples your bill from your traffic. For a workload with high, predictable volume, a fixed monthly capacity bill is both cheaper and more budgetable than a per-token bill that scales with every request.
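One way to see the decoupling is to compute the effective price per million tokens of a reserved unit at different traffic levels. The sketch below is illustrative only: the $25 per model-unit-hour figure is the representative rate used later on this page, the 730-hour month is an averaging convention, and the function name is made up for this example.

```python
def effective_cost_per_million(model_units, hourly_rate, hours, tokens_served):
    """Fixed capacity bill divided by the tokens actually pushed through it."""
    fixed_bill = model_units * hourly_rate * hours  # does not depend on traffic
    if tokens_served == 0:
        return float("inf")  # capacity was paid for but served nothing
    return fixed_bill / (tokens_served / 1_000_000)


# Same reservation, three different traffic levels (representative numbers).
for tokens in (250_000_000, 1_000_000_000, 5_000_000_000):
    price = effective_cost_per_million(1, 25.0, 730, tokens)
    print(f"{tokens:>13,} tokens/month -> ${price:,.2f} per 1M tokens")
```

The bill is identical in all three rows; only the effective unit price moves, which is why utilization ends up driving the cost comparison with on-demand.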
PT also changes the performance contract. On-demand throughput is best-effort within your quota and can be throttled during regional demand spikes (the ThrottlingException that latency-sensitive teams learn to fear). A provisioned model unit delivers guaranteed, consistent throughput and latency — the capacity is yours, isolated from other tenants' spikes. For a production system with an SLA, that predictability is often worth more than the raw cost comparison.
One framing that helps: on-demand is like paying per ride; Provisioned Throughput is like leasing a dedicated vehicle. If you ride occasionally, pay-per-ride is cheaper. If you are commuting heavily every day, the lease is cheaper and always available — but you pay for it even on the days you stay home. The rest of this page is about finding the point where the lease starts to win, and the cases where you have no choice but to lease.
Provisioned Throughput = reserve dedicated capacity for one Bedrock model in units called model units, billed at a flat hourly rate per unit (optionally discounted with a 1- or 6-month commitment), in exchange for guaranteed throughput and latency and the ability to serve custom models. Cost tracks reserved capacity and time, not tokens consumed.
The choice between on-demand and PT is the largest cost-and-reliability lever on a heavy Bedrock workload. It is not "which is better" — it is "which fits this traffic shape." Get the shape wrong in either direction and you either overpay or get throttled.
The tradeoff has three axes — cost, latency/reliability, and commitment — and they pull in different directions depending on how your traffic behaves.
The honest default for most teams: start on-demand (optionally with prompt caching and Batch for the workloads those suit), measure real, sustained traffic, and only move a specific hot path onto Provisioned Throughput once the volume is high, steady, and predictable enough that the math and the reliability both favor it. Reserving capacity before you have proof of steady volume is the single most common way teams waste money on PT.
Cost. On-demand bills per token, so cost rises and falls exactly with usage; at low or spiky volume you pay almost nothing during quiet periods. PT bills per model-unit-hour, a fixed cost that does not move with usage. The implication is a crossover: below some volume, on-demand's per-token total is less than PT's flat monthly bill; above it, PT is less. The whole cost question reduces to: are you above or below the break-even volume? (Section IV does the math.)
Latency and reliability. On-demand capacity is shared and best-effort within your quota. Most of the time it is fine, but during regional demand surges you can hit throttling, and tail latency is not contractually fixed. A provisioned model unit gives you isolated, guaranteed throughput — consistent latency and no contention with other tenants' spikes. For an interactive product with a latency SLA, or a pipeline that must not stall, this reliability is frequently the deciding factor even before cost. (Cross-region inference is the on-demand answer to spike-smoothing; PT is the reserved-capacity answer — see the cross-region-inference sibling page.)
Commitment. On-demand commits you to nothing — switch models, change volume, or stop entirely with no penalty. PT asks for a commitment: even the no-commit hourly option ties you to paying for the reserved capacity while it exists, and the discounted 1- and 6-month tiers lock the rate (and the spend) for that term. That rigidity is the cost of the guarantee. It is fine for a stable, proven workload; it is a trap for an experiment whose model choice or volume is still moving.
For base models, PT is a cost-and-reliability choice you can take or leave. For an important class of workloads it is not optional at all — it is the only way to run the model. This is the case people most often miss when they budget a custom-model project.
The defining rule: most custom models on Bedrock can only be served via Provisioned Throughput. On-demand is reserved for the shared, multi-tenant base models. Anything that is yours specifically needs dedicated capacity to host it. That covers fine-tuned, distilled, and imported models.
The practical consequence for planning: if your roadmap includes fine-tuning, distillation, or importing a model, build the Provisioned-Throughput hosting cost into the budget from day one. The most common custom-model budgeting mistake is pricing only the training and forgetting that the resulting model then sits on a 24/7 hourly charge for as long as it is deployed. For many narrow use cases that standing cost is exactly why the team should reconsider whether a base model with good prompting or RAG would have been cheaper overall (see the fine-tuning sibling page for that decision).
Base models: on-demand or Provisioned Throughput — your choice. Custom models (fine-tuned, distilled, imported): Provisioned Throughput only. If you fine-tune, you are committing to a standing hourly hosting cost — budget for the hosting, not just the training.
Provisioned Throughput pricing is refreshingly simple compared with per-token math: a model unit has an hourly rate, and the rate drops the longer you commit. The complexity is not in the formula — it is in deciding how many units and which term.
You are billed (number of model units) × (hourly rate for that model) × (hours the allocation exists). The hourly rate depends on two things: which model (larger, more capable models cost more per model-unit-hour, mirroring their higher on-demand token rates) and which commitment term you choose. There are three terms: no commitment (hourly), 1-month, and 6-month.
Two cost realities to internalize. First, the charge is per model — a model unit of one model does not serve a different model; if you run several models on PT you pay for each separately. Second, the charge is continuous: a provisioned allocation left running over a weekend, a forgotten test allocation, or an over-provisioned unit count all burn money silently because the meter runs on time, not usage. The discipline of PT is not the purchase decision alone — it is ongoing right-sizing and cleanup.
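Both realities fall straight out of the billing formula. The following is a minimal sketch, assuming hypothetical allocation names and a hypothetical $40 no-commit rate (only the $25 figure appears elsewhere on this page, and it too is representative), that sums the bill across separate allocations, including a test allocation that was left running for a 60-hour weekend.

```python
from dataclasses import dataclass


@dataclass
class Allocation:
    name: str            # hypothetical label for the allocation
    model_units: int
    hourly_rate: float   # USD per model-unit-hour (illustrative)
    hours_alive: float   # hours the allocation existed in the billing period


def allocation_cost(a: Allocation) -> float:
    """units x hourly rate x hours; traffic does not appear in the formula."""
    return a.model_units * a.hourly_rate * a.hours_alive


allocations = [
    Allocation("prod-chat", model_units=2, hourly_rate=25.0, hours_alive=730),
    # Friday 18:00 to Monday 06:00: 60 hours of a forgotten test allocation.
    Allocation("forgotten-load-test", model_units=1, hourly_rate=40.0, hours_alive=60),
]

total = sum(allocation_cost(a) for a in allocations)
for a in allocations:
    print(f"{a.name}: ${allocation_cost(a):,.0f}")
print(f"total: ${total:,.0f}")
```

Each model's allocation is priced on its own line, and the idle test allocation costs the same as a busy one would have.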
No commitment (hourly). Pay the highest per-hour rate, but cancel any time — you are only on the hook for the hours the allocation actually exists. Best for: short-lived needs (a launch spike, a time-boxed campaign, a load test), validating that PT is the right move before committing to a term, or serving a custom model for a finite project. This is the flexible, no-lock option; you trade the discount for the freedom to turn it off.
1-month commitment. Commit to one month and the hourly rate drops below the no-commit rate. You pay for the full month of reserved capacity regardless of usage. Best for: a workload you are confident is steady for at least a month but whose longer-term volume you are not ready to lock — a recently-launched feature with proven early traffic, for example.
6-month commitment. The longest standard term and the cheapest per-hour rate — the deepest discount in exchange for the deepest lock. You pay for six months of capacity. Best for: a mature, high-volume production path with stable model choice and predictable demand — the classic case for reserving capacity. The risk is obvious: if you switch models or your volume falls inside the term, you are still paying for capacity you no longer need.
| Commitment term | Relative hourly rate | You pay for | Flexibility | Best for |
|---|---|---|---|---|
| No commitment | Highest | Only the hours the allocation exists | Cancel any time | Spikes, tests, validating PT, finite projects |
| 1-month | Mid (discounted) | A full month of capacity | Locked for the month | Proven-steady feature, near-term confidence |
| 6-month | Lowest (deepest discount) | Six months of capacity | Locked for the term | Mature high-volume production, stable model |
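The table can be turned into a rough term picker. This is a sketch under loud assumptions: the three hourly rates are invented for illustration (only their ordering comes from the table), committed terms are modeled as billed in whole terms even if the capacity is needed for less time, and a month is taken as 730 hours.

```python
HOURS_PER_MONTH = 730

# Hypothetical USD rates per model-unit-hour. Only the ordering
# (no-commit > 1-month > 6-month) is taken from the text.
RATES = {"no_commit": 40.0, "one_month": 32.0, "six_month": 25.0}
TERM_HOURS = {"one_month": HOURS_PER_MONTH, "six_month": 6 * HOURS_PER_MONTH}


def term_cost(term: str, model_units: int, needed_hours: int) -> float:
    """Total spend to keep `model_units` available for `needed_hours`."""
    if term == "no_commit":
        billed_hours = needed_hours  # billed only while the allocation exists
    else:
        term_hours = TERM_HOURS[term]
        whole_terms = -(-needed_hours // term_hours)  # ceiling division
        billed_hours = whole_terms * term_hours       # pay for complete terms
    return model_units * RATES[term] * billed_hours


def cheapest_term(model_units: int, needed_hours: int) -> str:
    """Pick the term with the lowest total spend for the expected duration."""
    return min(RATES, key=lambda term: term_cost(term, model_units, needed_hours))


for hours in (100, 730, 6 * HOURS_PER_MONTH):
    print(hours, "hours ->", cheapest_term(1, hours))
```

With these made-up rates, a short test favors no commitment, a single steady month favors the 1-month term, and a full half-year favors the 6-month term. Real rates will move the boundaries, so the function is only a way to structure the comparison.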
The cost decision for a base model comes down to one number: the volume at which Provisioned Throughput becomes cheaper than on-demand. Below it, stay on-demand; above it, reserve. Here is exactly how to compute your own line, with representative numbers to show the shape.
The method is the same regardless of the actual rates: compare the fixed monthly cost of the model units you would need against the per-token cost of the same traffic on on-demand. Where the two lines cross is your break-even.
Step 1 — size the model units. A model unit delivers a published throughput (tokens/minute) for a given model. Take your peak sustained throughput requirement and divide by the per-unit throughput to get the number of model units you must reserve to serve the load without throttling. Say a workload needs sustained capacity that requires 2 model units of a Sonnet-class model to serve at peak.
Step 2 — compute the fixed monthly PT cost. Suppose, for illustration, the 6-month-commitment rate for that model is on the order of $25 per model-unit-hour (representative — confirm the real figure). Two units running continuously for a 730-hour month is 2 × $25 × 730 = $36,500/month, fixed, no matter the token volume.
Step 3 — compute the same traffic on-demand. Now price the actual token throughput at on-demand rates. If the workload actually pushes roughly 3.0 billion input and 0.6 billion output tokens a month through those two units, then at representative Sonnet-class on-demand rates ($3 per 1M input, $15 per 1M output) that is 3,000 × $3 + 600 × $15 = $9,000 + $9,000 = $18,000/month. At that volume on-demand costs about half the fixed PT bill — so do not provision.
Step 4 — find the crossover. PT only wins once on-demand cost climbs past the ~$36,500 fixed PT bill. Keep the same 5:1 input:output mix and on-demand reaches ~$36,500/month at roughly 6.1 billion input + 1.2 billion output tokens — i.e. you need to be running those two model units near saturation, around the clock, before reservation pays off purely on cost. The lesson is blunt: on a base model, Provisioned Throughput beats on-demand on cost only at genuinely high, sustained utilization. If your units would sit half-idle, on-demand is cheaper.
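The four steps can be sketched as a small calculator. The rates and token volumes are the representative figures from the steps above. The peak and per-unit throughput numbers in the sizing call are hypothetical, chosen only so that the result is 2 units. The function names are invented for this example.

```python
import math


def model_units_needed(peak_tokens_per_min: float, per_unit_tokens_per_min: float) -> int:
    """Step 1: round up, since a fraction of a unit cannot be reserved."""
    return max(1, math.ceil(peak_tokens_per_min / per_unit_tokens_per_min))


def provisioned_monthly(units: int, hourly_rate: float, hours: int = 730) -> float:
    """Step 2: fixed bill, independent of traffic."""
    return units * hourly_rate * hours


def on_demand_monthly(input_m: float, output_m: float,
                      input_rate: float, output_rate: float) -> float:
    """Step 3: volumes in millions of tokens, rates in USD per 1M tokens."""
    return input_m * input_rate + output_m * output_rate


def crossover_volume(fixed_cost: float, input_rate: float,
                     output_rate: float, in_out_ratio: float):
    """Step 4: monthly (input, output) volume in millions of tokens
    at which the on-demand bill equals `fixed_cost`."""
    # Cost of 1M input tokens plus the output tokens that accompany them.
    blended = input_rate + output_rate / in_out_ratio
    input_m = fixed_cost / blended
    return input_m, input_m / in_out_ratio


# Hypothetical throughput figures; real per-unit numbers vary by model.
units = model_units_needed(peak_tokens_per_min=150_000, per_unit_tokens_per_min=100_000)
pt_cost = provisioned_monthly(units, hourly_rate=25.0)
od_cost = on_demand_monthly(3_000, 600, input_rate=3.0, output_rate=15.0)
be_in, be_out = crossover_volume(pt_cost, 3.0, 15.0, in_out_ratio=5)

print(f"units={units}  PT=${pt_cost:,.0f}  on-demand=${od_cost:,.0f}")
print(f"crossover: {be_in / 1000:.1f}B input + {be_out / 1000:.1f}B output tokens/month")
```

Swapping in your own rates, mix, and measured peak gives your own crossover. The structure matters more than the specific numbers used here.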
Two things shift the line in PT's favor beyond raw cost. (1) Reliability: if on-demand throttling would breach an SLA, the guaranteed capacity can justify PT below the pure cost crossover. (2) Custom models: if the model is fine-tuned or distilled, there is no on-demand line to compare against — PT is the only option, and the question becomes "how few model units can serve the load," not "PT or on-demand." For everything else, the rule of thumb stands: provision only when you can keep the units busy.
Provisioned Throughput beats on-demand on cost only when reserved model units run at high, sustained utilization (roughly speaking, busy most hours of most days). Idle reserved capacity is pure waste. Below the crossover, on-demand — plus Batch and prompt caching where they fit — is cheaper. Above it, and for guaranteed-SLA or custom-model paths, reserve.
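As a compact restatement, the rule of thumb can be written as a decision function. This is only a sketch of the reasoning on this page: the inputs are assumed to come from your own measurements, the reliability branch still needs human judgment, and the function name and reason strings are invented here.

```python
def choose_serving_mode(is_custom_model: bool,
                        sla_at_risk_from_throttling: bool,
                        on_demand_monthly: float,
                        provisioned_monthly: float):
    """Return (mode, reason) following the rule of thumb in the text."""
    if is_custom_model:
        # No on-demand line exists; the real question is how few units suffice.
        return "provisioned", "custom model: no on-demand option to compare against"
    if on_demand_monthly > provisioned_monthly:
        return "provisioned", "above the cost crossover"
    if sla_at_risk_from_throttling:
        return "provisioned", "below the crossover, but guaranteed capacity may justify it"
    return "on-demand", "below the crossover with no reliability driver"


print(choose_serving_mode(False, False, 18_000, 36_500))
print(choose_serving_mode(True, False, 0, 18_250))
```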
Provisioning throughput is a few clicks (or an API call), but managing it well — picking the term, right-sizing the unit count, and not leaking idle capacity — is where the cost discipline lives. Here is the lifecycle end to end.
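You purchase Provisioned Throughput from the Amazon Bedrock console (Provisioned throughput section) or programmatically via the API/SDK/CloudFormation. The flow is the same either way: choose the model, the number of model units, and the commitment term, then create the allocation.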
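For the SDK route, a minimal sketch with boto3 might look like the following. It uses the Bedrock control-plane call `create_provisioned_model_throughput`. The allocation name, model identifier, and region are placeholders, and the helper functions are hypothetical wrappers written for this example. Running `purchase` needs AWS credentials and starts billing immediately, so the request builder is kept separate and can be inspected first.

```python
import uuid

VALID_TERMS = (None, "OneMonth", "SixMonths")  # None means no commitment


def build_purchase_request(name, model_id, model_units, commitment=None):
    """Assemble keyword arguments for create_provisioned_model_throughput."""
    if commitment not in VALID_TERMS:
        raise ValueError(f"commitment must be one of {VALID_TERMS}")
    if model_units < 1:
        raise ValueError("model_units must be at least 1")
    request = {
        "clientRequestToken": str(uuid.uuid4()),  # idempotency token
        "provisionedModelName": name,
        "modelId": model_id,       # base model ID or custom model ARN
        "modelUnits": model_units,
    }
    if commitment is not None:
        # Omitting this field requests the no-commitment hourly option.
        request["commitmentDuration"] = commitment
    return request


def purchase(request, region="us-east-1"):
    """Create the allocation. This starts the hourly meter."""
    import boto3  # imported lazily so the builder works without AWS set up

    client = boto3.client("bedrock", region_name=region)
    response = client.create_provisioned_model_throughput(**request)
    return response["provisionedModelArn"]


# Placeholder values; substitute a real model identifier before purchasing.
request = build_purchase_request("demo-pt", "example-model-id", 1)
print(request)
```

The returned provisioned model ARN is what you would then pass as the model identifier when invoking the model, so that traffic lands on the reserved capacity.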
A clean operating pattern that many teams converge on: serve interactive traffic on-demand with prompt caching, run bulk jobs on Batch, and reserve Provisioned Throughput only for the one or two hot, steady paths (or any custom model) that genuinely justify it — then watch CloudWatch and resize. PT is not an all-or-nothing switch for the account; it is a surgical tool for specific paths. The cost-engineering value is in applying it precisely and cleaning it up rigorously — exactly the kind of ongoing FinOps work a vetted partner handles in the engagements CloudRoute routes.
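The "watch and resize" half of that pattern comes down to comparing observed traffic against reserved capacity. The sketch below assumes you have already exported hourly token totals for an allocation (for example from CloudWatch) and know its capacity per hour. Both inputs, the 20% idle threshold, and the function name are illustrative assumptions.

```python
def utilization_report(hourly_tokens, capacity_tokens_per_hour, idle_threshold=0.2):
    """Summarize how busy a reservation was over a series of hourly samples."""
    if not hourly_tokens:
        raise ValueError("need at least one hourly sample")
    # Cap at 1.0: a unit cannot serve more than its reserved capacity.
    ratios = [min(tokens / capacity_tokens_per_hour, 1.0) for tokens in hourly_tokens]
    idle = sum(1 for ratio in ratios if ratio < idle_threshold)
    return {
        "mean_utilization": sum(ratios) / len(ratios),
        "idle_hours": idle,
        "idle_share": idle / len(ratios),
    }


# Four hypothetical hourly samples against a 6M tokens/hour reservation.
report = utilization_report([0, 0, 3_000_000, 6_000_000], 6_000_000)
print(report)
```

A persistently high idle share is the signal to shrink the unit count or delete the allocation. A mean close to 1.0 suggests the path is near the point where another unit is needed.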
Provisioned Throughput is the most "committed" line on a Bedrock bill — a standing hourly charge for months. That is precisely the kind of cost AWS credits are designed to absorb, which changes the risk calculus of reserving capacity at all.
Provisioned-Throughput charges are ordinary Bedrock spend, so they are fully credit-eligible — credits in your AWS account apply automatically against the model-unit-hours just as they do against on-demand tokens, fine-tuning, and embeddings. The relevant pools are the familiar ones: AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups), a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed at proving out a GenAI use case, and the competitive Generative AI Accelerator (awards up to $1M for a small cohort of AI-first startups).
Why this matters specifically for PT: the scariest thing about a commitment term is paying for reserved capacity during the months before a workload has fully ramped — the "what if volume does not materialize" risk. When the commitment is drawn from a credit pool rather than runway, that risk is largely defused. You can reserve the capacity a fine-tuned model or a high-SLA path needs, run it through launch and ramp, and let credits cover the standing hourly cost while you prove the workload out. Cost discipline becomes "make the credits last" rather than "protect the bank balance."
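"Make the credits last" is a simple runway calculation. The sketch below assumes the representative $25 per model-unit-hour rate and 730-hour month used in the break-even example. The credit balance and the function name are illustrative, and other Bedrock spend drawn from the same pool is passed in separately.

```python
def credit_runway_months(credit_balance, model_units, hourly_rate,
                         other_monthly_spend=0.0, hours_per_month=730):
    """Months a credit balance covers a standing PT bill plus other spend."""
    monthly = model_units * hourly_rate * hours_per_month + other_monthly_spend
    if monthly <= 0:
        raise ValueError("monthly spend must be positive")
    return credit_balance / monthly


for units in (1, 2):
    months = credit_runway_months(100_000, units, 25.0)
    print(f"{units} unit(s): {months:.1f} months of runway")
```

Comparing the result against the commitment term shows how much of a 1- or 6-month lock the pool actually covers at a given unit count.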
The practical mechanic is that these pools are largely partner-filed — requested through the AWS Partner Network (the ACE program), not a public self-serve form — which is why teams route through an AWS partner rather than applying alone. CloudRoute matches you to the right pool for your stage and to a vetted AWS DevOps/ML partner who both files the credit application and does the actual cost engineering: sizing the model units from real traffic, choosing the commitment term, wiring CloudWatch alarms on utilization, and cleaning up idle allocations. The customer pays $0 — AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice. (For the credit mechanics, see AWS credits for generative-AI startups and the Bedrock POC funding page.)
The scannable version of the whole decision: on-demand against the two main provisioned commitment tiers, across cost shape, performance, commitment, and the workloads each one wins. Figures are representative 2026 illustrations, not quotes.
| Variable | On-Demand | Provisioned — 1-month | Provisioned — 6-month |
|---|---|---|---|
| How you pay | Per token (input + output) | Hourly per model unit | Hourly per model unit |
| Cost shape | Scales with usage | Fixed for the month | Fixed for the term |
| Relative hourly rate | n/a (usage-priced) | Discounted vs no-commit | Cheapest per hour |
| Throughput / latency | Best-effort within quota | Guaranteed, isolated | Guaranteed, isolated |
| Throttling risk | Possible at spikes | None within reserved capacity | None within reserved capacity |
| Commitment | None — cancel any time | 1 month locked | 6 months locked |
| Serves custom models? | No | Yes | Yes |
| When it wins | Variable / low / unknown volume; prototypes | Proven-steady volume, near-term confidence | Mature high-volume production; stable model |
Situation: The team had fine-tuned a domain-specific model on their proprietary data and discovered at deployment time that serving it required Provisioned Throughput — an ongoing hourly cost they had not budgeted, because they had priced only the one-time fine-tuning run. On top of that, their interactive product path could not tolerate on-demand throttling under load. They needed reserved capacity for both, but were wary of committing months of standing cost out of a runway earmarked for hiring.
What CloudRoute did: CloudRoute matched them within 24 hours to a Singapore-region AWS partner with GenAI cost-engineering experience. The partner (1) sized the model units from real traffic — one unit for the custom model, one for the SLA path — rather than over-provisioning; (2) put the proven, steady custom-model path on a 6-month commitment for the deepest rate and kept the still-ramping SLA path on a 1-month commitment; (3) wired CloudWatch utilization alarms so idle capacity would be caught; and (4) filed a Bedrock POC credit application plus an Activate Portfolio application to fund the whole commitment.
Outcome: The reserved capacity went live with guaranteed throughput on both paths, and the entire standing PT cost — plus the rest of the Bedrock bill — was covered by the approved credits, so the team paid $0 during launch and ramp. As volume on the SLA path proved out, the partner rolled it onto a 6-month commitment too. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.
reserved: 2 model units · terms: 6mo + 1mo · credits secured: POC + Activate · out-of-pocket during build: $0
Whether you need Provisioned Throughput to serve a fine-tuned model or to guarantee an SLA at scale, the standing hourly cost is exactly what AWS credits absorb. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS partner to size and manage the commitment. Customer pays $0.