A neutral reference for Amazon Bedrock service quotas in 2026: the two dimensions that actually throttle production workloads — requests-per-minute (RPM) and tokens-per-minute (TPM), set per model — what the 429 ThrottlingException really means, exactly how to request a quota increase, the two levers that raise the ceiling instead of just retrying into it (cross-region inference and Provisioned Throughput), how to design for limits with backoff and queueing, and when Batch is the right tool for bulk. All figures are representative as of 2026 — confirm current values in the AWS Service Quotas console.
Amazon Bedrock is a shared, multi-tenant service, so access to each foundation model is governed by service quotas. Most of those quotas never matter to you. Two of them — the per-minute request and token limits — are the ones that throttle real production traffic, and they are the ones worth understanding precisely.
A service quota (formerly "service limit") is the maximum value AWS permits for a given resource or rate in your account. Bedrock publishes dozens of them — limits on concurrent fine-tuning jobs, on the number of custom models, on Knowledge Base sizes, on Agents, on Batch job counts, and many more. The overwhelming majority are generous enough that you will never touch them. The two that govern day-to-day inference, and that produce nearly all the throttling teams hit in production, are the per-minute runtime invocation quotas.
Those two dimensions are requests per minute (RPM) and tokens per minute (TPM). RPM caps how many model-invocation API calls you can make in a rolling minute. TPM caps how many tokens — input plus output, combined — you can process through that model in a rolling minute. Both are enforced at once: you are throttled the moment you cross either ceiling, whichever you hit first. A workload that makes few calls but each with a huge context can hit TPM long before RPM; a workload that makes many tiny calls can hit RPM long before TPM.
The most important structural fact is the scope. Bedrock on-demand quotas are applied per account, per AWS Region, and per model. That has three consequences worth internalizing. First, the quota for one model says nothing about another — Claude Sonnet, Claude Haiku, Llama, and Amazon Nova each carry their own RPM/TPM numbers. Second, the quota in one region is independent of another — your us-east-1 ceiling and your eu-west-1 ceiling are separate budgets (this is exactly what cross-region inference exploits; see Section IV). Third, the quota is per account, so every workload, environment, and team sharing an AWS account draws from the same pool unless you separate them across accounts.
Quotas also come in two flavours: adjustable and non-adjustable. An adjustable quota can be raised on request (Section III). A non-adjustable quota is a hard ceiling AWS does not lift on demand — for those, the answer is an architectural lever (cross-region inference, Provisioned Throughput, or Batch), not a support ticket. Whether a specific RPM/TPM limit is adjustable depends on the model and is shown in the Service Quotas console, which is the authoritative source for both the current value and its adjustability.
A final clarification, because it trips people up: quotas are not the same as on-demand pricing, and neither is the same as Provisioned Throughput. Pricing is what each token costs. Quotas are how fast you are allowed to spend on-demand. Provisioned Throughput is a separate capacity model where you reserve dedicated throughput and step outside the shared on-demand quota pool entirely. The rest of this page keeps those three straight.
Bedrock quotas = per-account, per-Region, per-model ceilings on inference rate, enforced on two dimensions at once: requests per minute (RPM) and tokens per minute (TPM). Cross either and you get an HTTP 429 ThrottlingException. Some are adjustable (raise via Service Quotas); some are hard (use a throughput lever instead).
There is no single Bedrock rate limit — the numbers differ by model, by region, and over time as AWS adjusts defaults. The table below is a representative shape, not a quote, to show how the limits scale. Always confirm your actual values in the Service Quotas console for your account and region.
The pattern across models is consistent even though the exact figures move. Smaller, cheaper, faster models (Haiku-class, Amazon Nova Micro/Lite, small Llama variants) carry higher default RPM and TPM, because they are cheap to serve and AWS provisions more shared capacity for them. Larger frontier models (Claude Opus / Sonnet-class, large Llama, Nova Premier) carry lower default RPM and TPM, because each call is more expensive to serve. So the same traffic pattern can sail through on a small model and throttle on a frontier model — a real consideration when you pick a model for a high-throughput path.
Read the figures below as illustrative orders of magnitude as of 2026, useful for reasoning about headroom and for sizing, not as values to hardcode. AWS revises defaults, and your account may already carry increases. The authoritative number for any given model in any given region is the one shown in the Service Quotas console under the Amazon Bedrock service.
Two practical takeaways from the shape. First, budget headroom on the frontier models: if your hottest path runs on an Opus- or Sonnet-class model, that is where you will throttle first, so it is the quota to check and likely the one to raise. Second, model choice is a throughput decision, not only a quality decision: routing the bulk of cheap, high-volume calls to a small model (and reserving the frontier model for the calls that genuinely need it) both lowers cost and stays clear of the tighter frontier limits.
| Model class | Typical role | Relative RPM | Relative TPM | Adjustable? |
|---|---|---|---|---|
| Small / fast (Haiku-class, Nova Micro/Lite) | High-volume, latency-sensitive, cheap | Highest | Highest | Often yes |
| Mid (Sonnet-class, mid Llama, Nova Pro) | Balanced quality / cost workhorse | Moderate | Moderate | Often yes |
| Large / frontier (Opus-class, large Llama, Nova Premier) | Hardest reasoning, lowest throughput | Lowest | Lowest | Sometimes (model-dependent) |
| Embeddings (Titan / Nova / Cohere embed) | Vectorising for RAG at scale | High | High | Often yes |
| Image / video (Nova Canvas / Reel, Stability) | Generation, lower call rates | Low (per-image) | n/a (per-image) | Model-dependent |
When you cross a quota, Bedrock does not queue your request or slow it down — it rejects it. Understanding exactly what that rejection looks like, and what it does not mean, is the difference between a resilient client and one that falls over under load.
When a request would exceed your RPM or TPM quota for a model, the Bedrock runtime returns an error with HTTP status 429 and the error code ThrottlingException (you may also see TooManyRequestsException phrasing in some SDKs). The message typically reads along the lines of "Too many requests, please wait before trying again." This is a transient, retryable error — it does not mean your request was malformed, your credentials are wrong, or the model is down. It means: right now, in this rolling minute, this model in this region is at capacity for your account.
Crucially, a 429 is not a 400-class validation error (a bad prompt, an oversized payload, an unsupported parameter) and not a 500-class server error (an internal Bedrock fault). The correct response to a 429 is to wait and retry; the correct response to a 400 is to fix the request and not retry; the correct response to a 500 is usually a cautious retry as well. Conflating them is a common bug — teams either retry validation errors forever (wasting calls and never succeeding) or fail fast on throttling (turning a brief capacity blip into a user-facing error).
Throttling is evaluated continuously against the rolling minute, so it is inherently bursty: a workload averaging well under its quota can still throttle if its traffic arrives in spikes that momentarily exceed the per-minute rate. This is why the average-utilisation view lies. A path that averages 30% of its TPM but bursts to 150% for ten seconds will throw 429s during those bursts. Smoothing the burst — through queueing and concurrency caps (Section V) — often eliminates throttling without raising the quota at all.
There is also a relationship between the two dimensions worth naming: because TPM counts input plus output tokens, a sudden shift in workload shape can trip TPM unexpectedly. A feature that starts sending much larger documents, or a prompt change that balloons output length, can push you over TPM even though your request count (RPM) has not moved. When throttling appears "out of nowhere," a change in average tokens-per-call is a frequent culprit — check token volume, not just call volume.
A 429 ThrottlingException is transient and retryable — wait and retry with backoff, do not fail fast and do not fix the request. It is not a 400 (bad request — fix and stop) and not a 500 (server error). Throttling is evaluated per rolling minute, so bursty traffic throttles even when the average is well under quota. TPM counts input + output, so larger prompts or longer outputs can trip it without any change in request count.
When the throttling is real demand rather than a fixable burst, the direct fix is to raise the adjustable quota. This is a self-service request for most RPM/TPM limits, escalating to AWS Support for larger asks. Here is the path end to end.
The front door is the Service Quotas console (Service Quotas → AWS services → Amazon Bedrock). It lists every Bedrock quota, the current applied value for your account, the AWS default, and whether the quota is adjustable. Find the specific per-model RPM or TPM quota you are hitting, choose "Request quota increase," enter the new value you need, and submit. For modest increases on adjustable quotas this is often approved automatically or within a short review; larger increases route to a human reviewer.
The lifecycle of a request is straightforward:
Two limits on what a quota increase can do. First, AWS will not raise a quota beyond what it can serve from shared on-demand capacity for that model and region, so very large asks may be partially granted or pointed toward Provisioned Throughput. Second, a higher quota raises the ceiling but does nothing for bursty traffic that throttles below the ceiling — if your average is far under quota and you still see 429s, the fix is smoothing the burst (Section V), not a bigger number. Raise the quota when sustained demand genuinely exceeds it; smooth the traffic when bursts are the cause.
Raise the adjustable RPM/TPM quota via the Service Quotas console (escalating to AWS Support for large asks) when sustained demand exceeds it. If the quota is non-adjustable, or the asks get very large, or the model/region is capacity-constrained, the real fix is an architectural lever — cross-region inference, Provisioned Throughput, or Batch (next two sections), not a bigger number.
Beyond raising on-demand numbers, two features change the capacity model itself. One spreads on-demand load across regions to smooth spikes; the other reserves dedicated capacity that does not share the on-demand quota pool at all. Both raise effective throughput without you simply retrying into the same ceiling.
A quota increase makes the on-demand bucket bigger. These two levers do something different: they give you more buckets, or a private bucket.
The decision rule is simple. If throttling comes from bursts below the ceiling, smooth them (cross-region inference, plus client-side queueing in Section VI). If it comes from sustained demand above an adjustable ceiling, raise the quota. If you need guaranteed throughput, are running a custom model, or have hit a non-adjustable wall, reserve Provisioned Throughput. If the work is bulk and not time-sensitive, take it off the real-time path with Batch (Section VII).
Because on-demand quotas are scoped per region, the capacity available to you across several regions is the sum of the per-region quotas. Cross-region inference uses an inference profile that lets Bedrock automatically route a request to one of several regions on your behalf, drawing on the combined capacity rather than a single region's. The practical effect is spike-smoothing: a burst that would throttle in one region is spread across the profile's regions, so the effective RPM/TPM headroom is larger and transient throttling drops sharply. It stays on the on-demand pricing model (no commitment) and is often the first lever to reach for when throttling comes from spiky on-demand traffic rather than a sustained ceiling. (See the cross-region-inference sibling for the routing detail and data-residency considerations.)
Where cross-region inference adds shared buckets, Provisioned Throughput (PT) gives you a private one. You reserve dedicated capacity for a specific model — measured in "model units," each delivering a guaranteed tokens-per-minute throughput — and pay a flat hourly rate per unit. Because that capacity is yours, it is not subject to the shared on-demand RPM/TPM quotas at all: no contention with other tenants, guaranteed throughput, and consistent latency with no 429s from the shared pool. PT is also the only way to serve most custom (fine-tuned, distilled, imported) models. The trade is commitment and cost: the hourly charge accrues whether or not the capacity is used, so PT wins for high, steady, predictable volume and for guaranteed-SLA paths, not for variable or experimental traffic. (See the provisioned-throughput sibling for the break-even math.)
| Lever | What it changes | Pricing model | Best when | Removes 429s? |
|---|---|---|---|---|
| Quota increase | Raises the adjustable on-demand RPM/TPM ceiling | On-demand (per token) | Sustained on-demand demand exceeds the current ceiling | If the ceiling was the cause |
| Cross-region inference | Pools per-region capacity across regions | On-demand (per token) | Spiky on-demand traffic throttling in one region | Largely — smooths spikes |
| Provisioned Throughput | Reserves dedicated capacity outside the shared pool | Hourly per model unit | High, steady volume; SLA paths; custom models | Yes — within reserved capacity |
| Batch inference | Moves bulk work off the real-time path entirely | Per token (~50% cheaper) | Large, non-urgent, asynchronous jobs | N/A — not real-time |
Even with healthy quotas, any high-traffic Bedrock client will eventually meet a 429 — at a burst, during a regional surge, or on a frontier model. A client designed for limits turns that from an outage into a brief, invisible slowdown. Three patterns do most of the work.
These patterns are not exotic; they are the standard discipline for calling any rate-limited API. The difference is that with generative AI, calls are expensive and latency-sensitive, so getting them right has outsized value.
Two supporting practices complete the picture. Observe the limits: emit metrics and CloudWatch alarms on your 429 rate, your per-minute request and token volume, and your retry counts, so you see throttling trending up before it becomes an incident — and so you know whether the right fix is a quota increase, cross-region inference, or just better smoothing. Degrade deliberately: decide in advance what happens when retries are exhausted — fall back to a smaller, higher-quota model, return a graceful "try again" to the user, or shed lower-priority work — rather than surfacing a raw 429. Designed together, backoff, queueing, concurrency caps, observability, and a fallback make throttling a non-event.
On a 429, do not retry immediately and do not retry at a fixed interval. Wait a short delay, then double it on each subsequent failure (exponential backoff), and add a small random component (jitter) so that many clients throttled at the same instant do not all retry in lockstep and re-throttle together (the "thundering herd"). Cap the number of retries and the maximum delay so a request fails cleanly rather than hanging forever. The AWS SDKs implement retry with backoff for throttling errors out of the box — confirm it is enabled and tune the retry count and mode (standard or adaptive) for your workload rather than disabling it.
If your traffic is inherently bursty — a batch of users hitting "generate" at once, an upstream event fan-out, a cron that fires a thousand jobs — put a queue (for example Amazon SQS) between the producers and the Bedrock callers, and drain it at a controlled rate matched to your quota. This converts an uncontrolled spike into a steady stream the quota can absorb, trading a little latency for the elimination of throttling. For asynchronous work this is almost always the right shape; for interactive work, a smaller in-process queue or token-bucket rate limiter on the client achieves the same smoothing.
Bound the number of in-flight Bedrock requests your application makes at once (a semaphore, a worker pool, or a concurrency limit on your async runtime). A concurrency cap keeps you from launching a thousand simultaneous calls the instant load arrives — which would blow through RPM in a single second — and instead holds throughput just under the quota. Combined with backoff (for the 429s that still slip through) and a queue (to hold the overflow), a concurrency cap is the third leg of a client that degrades gracefully instead of failing.
(1) Exponential backoff with jitter on 429s, with a retry cap. (2) A queue in front of bursty producers, drained at a controlled rate. (3) A client-side concurrency cap so a load spike cannot blow RPM in one second. (4) CloudWatch alarms on 429 rate and per-minute token volume. (5) A deliberate fallback (smaller model or graceful retry) when retries are exhausted. With these, throttling is a slowdown, not an outage.
A large share of the throttling teams fight is self-inflicted: they push bulk, non-urgent work through the real-time API and then battle RPM/TPM to get it done. For that work the right move is not a bigger quota — it is a different API.
Batch inference processes a large set of inputs asynchronously: you submit a job pointing at your inputs in Amazon S3, Bedrock works through them on its own schedule, and the results land back in S3 when the job completes. Because it runs on a separate, asynchronous path, it does not compete with your interactive traffic for the real-time RPM/TPM quota — so a million-record backfill no longer threatens to throttle the live product. Batch also typically prices at roughly 50% of on-demand, so moving bulk work to it cuts both throttling and cost. (Batch has its own job-level quotas — on concurrent jobs and job size — which are generous and separate from the real-time limits; see the batch-inference sibling.)
The fit is clear once you separate urgent from bulk. Anything a user is waiting on — a chat reply, an interactive agent step, a live classification — belongs on the real-time path, protected by backoff and queueing. Anything that can complete in minutes-to-hours rather than milliseconds — overnight document enrichment, dataset labelling, embedding a large corpus for RAG, bulk summarisation, evaluation runs over a test set — belongs on Batch. Routing bulk to Batch frees your real-time quota for the traffic that actually needs low latency, which is often the single most effective thing a throttled team can do.
Put the levers together and a coherent production architecture emerges: interactive traffic on the real-time API with cross-region inference and client-side resilience; hot, steady or custom paths on Provisioned Throughput; bulk, non-urgent work on Batch; and a quota increase where sustained on-demand demand has genuinely outgrown the default. Designing that mix — and the backoff, queueing, observability, and fallback around it — is exactly the kind of production engineering a vetted AWS partner does in the engagements CloudRoute routes.
If a user is waiting on it, keep it on the real-time API (with backoff + queueing + cross-region inference). If it can finish in minutes-to-hours, move it to Batch — a separate asynchronous path that does not touch the real-time RPM/TPM quota and prices ~50% cheaper. Moving bulk off the real-time path frees your quota for the traffic that actually needs low latency.
Designing for Bedrock limits is real engineering — quota planning, cross-region routing, Provisioned-Throughput sizing, queueing, backoff, observability. That work, and the inference spend it manages, is exactly what AWS credits and a vetted partner are for, which is how the build can cost you $0.
Every cost involved here — on-demand tokens, Provisioned-Throughput model-unit-hours, Batch jobs, embeddings, the supporting SQS and CloudWatch — is ordinary AWS spend, and credit-eligible: credits in your account apply automatically against it. The relevant pools are the familiar ones: AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups), a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed at proving out a GenAI use case at production load, and the competitive Generative AI Accelerator (awards up to $1M for a small cohort of AI-first startups). Credits let you stress-test against real quotas, validate cross-region and Provisioned-Throughput strategies under load, and run the bulk Batch jobs — all without the bill being the thing that limits the experiment.
Why this matters for a quota-and-limits problem specifically: getting throughput right often means buying capacity (Provisioned Throughput) or running heavy validation (load tests, Batch backfills) before the product is fully monetised. Funding that from a credit pool rather than runway changes the calculus — you can architect for the peak you expect, prove it holds, and let credits absorb the cost through launch and ramp. Cost discipline becomes "make the credits last" rather than "avoid the experiment."
The practical mechanic is that these pools are largely partner-filed — requested through the AWS Partner Network (the ACE program), not a public self-serve form — which is why teams route through an AWS partner rather than applying alone. CloudRoute matches you to the right pool for your stage and to a vetted AWS DevOps/ML partner who both files the credit application and does the actual production engineering: planning quotas against projected peak, wiring cross-region inference, sizing and managing Provisioned Throughput, building the queue-and-backoff resilience, moving bulk to Batch, and standing up the CloudWatch alarms. The customer pays $0 — AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice. (For the credit mechanics, see AWS credits for generative-AI startups and the Bedrock POC funding page.)
The scannable version of the whole decision: the four ways to get more Bedrock throughput, side by side, across what each changes, its pricing model, its commitment, and the throttling problem it actually solves. Figures and behaviour are representative 2026 illustrations, not quotes.
| Variable | Quota increase | Cross-region inference | Provisioned Throughput | Batch inference |
|---|---|---|---|---|
| What it changes | Raises adjustable RPM/TPM ceiling | Pools per-region capacity | Reserves dedicated capacity | Moves bulk off real-time |
| Pricing model | On-demand (per token) | On-demand (per token) | Hourly per model unit | Per token, ~50% of on-demand |
| Commitment | None | None | 1mo / 6mo (or hourly no-commit) | None (per job) |
| Latency profile | Real-time | Real-time | Real-time, guaranteed | Asynchronous (mins–hours) |
| Solves which throttling | Sustained demand above ceiling | Spiky on-demand bursts | Need for guaranteed / custom-model capacity | Bulk competing with live traffic |
| Serves custom models? | No | No | Yes | Base models |
| Lead time | Minutes–days (approval) | Immediate (enable profile) | Immediate (purchase) | Per-job runtime |
| When it wins | On-demand demand outgrew the default | Interactive traffic with spikes | Hot steady paths, SLAs, custom models | Backfills, labelling, evals, RAG ingest |
Situation: Ahead of a big customer go-live, load tests on their real-time agent were throwing frequent 429 ThrottlingExceptions on the frontier model they used for the hardest reasoning steps — its on-demand TPM was the tightest ceiling in the account. At the same time, a nightly job that re-embedded their entire knowledge base for RAG was running through the real-time API and competing with daytime traffic for the same quota. They had no backoff strategy, no queue, and were nervous about committing to reserved capacity out of a runway earmarked for hiring before they knew the real peak.
What CloudRoute did: CloudRoute matched them within 24 hours to a UK-region AWS partner with GenAI production experience. The partner (1) split model usage — cheap, high-volume calls to a small high-quota model and only the genuinely hard steps to the frontier model — cutting frontier-model load; (2) enabled cross-region inference on the interactive path to pool capacity and smooth spikes; (3) added exponential backoff with jitter, an SQS queue in front of bursty producers, and a client-side concurrency cap; (4) moved the nightly re-embedding job entirely onto Batch, off the real-time quota and at ~50% cost; (5) filed a quota-increase request for the frontier model with real peak numbers as headroom for proven demand; and (6) filed a Bedrock POC application plus an Activate Portfolio application to fund the load testing and the standing spend.
Outcome: The go-live ran without throttling: cross-region inference plus the client-side resilience absorbed the interactive spikes, the frontier-model quota increase landed before launch, and the nightly Batch job stopped touching the live quota entirely. The entire load-test and launch spend was covered by the approved credits, so the team paid $0 through launch and ramp. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.
levers used: cross-region + Batch + quota increase + backoff/queue · 429s at go-live: ~0 · credits secured: POC + Activate · out-of-pocket during build: $0
Quota planning, cross-region inference, Provisioned-Throughput sizing, queueing and backoff, moving bulk to Batch — designing a Bedrock workload that does not throttle is real engineering. CloudRoute routes you to a vetted AWS partner who builds it, and to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) to fund it. Customer pays $0.