for AWS partners →Build it on AWS credits →

bedrock quotas & rate limits · the reference · 2026

Amazon Bedrock quotas and limits — the rate limits that throttle you, and how to lift them.

A neutral reference for Amazon Bedrock service quotas in 2026: the two dimensions that actually throttle production workloads — requests-per-minute (RPM) and tokens-per-minute (TPM), set per model — what the 429 ThrottlingException really means, exactly how to request a quota increase, the two levers that raise the ceiling instead of just retrying into it (cross-region inference and Provisioned Throughput), how to design for limits with backoff and queueing, and when Batch is the right tool for bulk. All figures are representative as of 2026 — confirm current values in the AWS Service Quotas console.

Build it on AWS credits →→ jump to requesting an increase

throttling signal

HTTP 429

quota dimensions

RPM · TPM

scope

per model · per region

build cost with credits

TL;DR

Amazon Bedrock enforces per-model service quotas on two dimensions: requests-per-minute (RPM) and tokens-per-minute (TPM). Exceed either and the API returns an HTTP 429 ThrottlingException. Quotas are per account, per AWS Region, and per model — so a quota on Claude Sonnet in us-east-1 is separate from the same model in eu-west-1 and from a different model in the same region.
Two things raise the ceiling rather than just absorbing throttling. Quota increases (requested through the Service Quotas console or AWS Support) lift the on-demand RPM/TPM numbers where the limit is adjustable. The throughput levers change the capacity model entirely: cross-region inference spreads on-demand load across regions to smooth spikes, and Provisioned Throughput reserves dedicated capacity with guaranteed throughput and no shared-pool throttling. Batch sidesteps real-time limits for bulk work.
Designing for limits is not optional at scale: exponential backoff with jitter on 429s, a queue in front of bursty producers, and a client-side concurrency cap turn throttling from an outage into a brief slowdown. CloudRoute routes you to a vetted AWS partner who designs that production architecture — and to an AWS credit pool (Activate up to $100K, Bedrock/GenAI POC $10K–$50K, GenAI Accelerator up to $1M) that funds the build. Customer pays $0; AWS funds it.

the concept

IWhat Bedrock quotas are — and the two dimensions that matter

Amazon Bedrock is a shared, multi-tenant service, so access to each foundation model is governed by service quotas. Most of those quotas never matter to you. Two of them — the per-minute request and token limits — are the ones that throttle real production traffic, and they are the ones worth understanding precisely.

A service quota (formerly "service limit") is the maximum value AWS permits for a given resource or rate in your account. Bedrock publishes dozens of them — limits on concurrent fine-tuning jobs, on the number of custom models, on Knowledge Base sizes, on Agents, on Batch job counts, and many more. The overwhelming majority are generous enough that you will never touch them. The two that govern day-to-day inference, and that produce nearly all the throttling teams hit in production, are the per-minute runtime invocation quotas.

Those two dimensions are requests per minute (RPM) and tokens per minute (TPM). RPM caps how many model-invocation API calls you can make in a rolling minute. TPM caps how many tokens — input plus output, combined — you can process through that model in a rolling minute. Both are enforced at once: you are throttled the moment you cross either ceiling, whichever you hit first. A workload that makes few calls but each with a huge context can hit TPM long before RPM; a workload that makes many tiny calls can hit RPM long before TPM.

The most important structural fact is the scope. Bedrock on-demand quotas are applied per account, per AWS Region, and per model. That has three consequences worth internalizing. First, the quota for one model says nothing about another — Claude Sonnet, Claude Haiku, Llama, and Amazon Nova each carry their own RPM/TPM numbers. Second, the quota in one region is independent of another — your us-east-1 ceiling and your eu-west-1 ceiling are separate budgets (this is exactly what cross-region inference exploits; see Section IV). Third, the quota is per account, so every workload, environment, and team sharing an AWS account draws from the same pool unless you separate them across accounts.

Quotas also come in two flavours: adjustable and non-adjustable. An adjustable quota can be raised on request (Section III). A non-adjustable quota is a hard ceiling AWS does not lift on demand — for those, the answer is an architectural lever (cross-region inference, Provisioned Throughput, or Batch), not a support ticket. Whether a specific RPM/TPM limit is adjustable depends on the model and is shown in the Service Quotas console, which is the authoritative source for both the current value and its adjustability.

A final clarification, because it trips people up: quotas are not the same as on-demand pricing, and neither is the same as Provisioned Throughput. Pricing is what each token costs. Quotas are how fast you are allowed to spend on-demand. Provisioned Throughput is a separate capacity model where you reserve dedicated throughput and step outside the shared on-demand quota pool entirely. The rest of this page keeps those three straight.

the one-line definition

Bedrock quotas = per-account, per-Region, per-model ceilings on inference rate, enforced on two dimensions at once: requests per minute (RPM) and tokens per minute (TPM). Cross either and you get an HTTP 429 ThrottlingException. Some are adjustable (raise via Service Quotas); some are hard (use a throughput lever instead).

the numbers (with a caveat)

IIRepresentative RPM and TPM figures per model

There is no single Bedrock rate limit — the numbers differ by model, by region, and over time as AWS adjusts defaults. The table below is a representative shape, not a quote, to show how the limits scale. Always confirm your actual values in the Service Quotas console for your account and region.

The pattern across models is consistent even though the exact figures move. Smaller, cheaper, faster models (Haiku-class, Amazon Nova Micro/Lite, small Llama variants) carry higher default RPM and TPM, because they are cheap to serve and AWS provisions more shared capacity for them. Larger frontier models (Claude Opus / Sonnet-class, large Llama, Nova Premier) carry lower default RPM and TPM, because each call is more expensive to serve. So the same traffic pattern can sail through on a small model and throttle on a frontier model — a real consideration when you pick a model for a high-throughput path.

Read the figures below as illustrative orders of magnitude as of 2026, useful for reasoning about headroom and for sizing, not as values to hardcode. AWS revises defaults, and your account may already carry increases. The authoritative number for any given model in any given region is the one shown in the Service Quotas console under the Amazon Bedrock service.

Two practical takeaways from the shape. First, budget headroom on the frontier models: if your hottest path runs on an Opus- or Sonnet-class model, that is where you will throttle first, so it is the quota to check and likely the one to raise. Second, model choice is a throughput decision, not only a quality decision: routing the bulk of cheap, high-volume calls to a small model (and reserving the frontier model for the calls that genuinely need it) both lowers cost and stays clear of the tighter frontier limits.

representative on-demand RPM / TPM by model class · illustrative 2026 shape, not a quote

Model class	Typical role	Relative RPM	Relative TPM	Adjustable?
Small / fast (Haiku-class, Nova Micro/Lite)	High-volume, latency-sensitive, cheap	Highest	Highest	Often yes
Mid (Sonnet-class, mid Llama, Nova Pro)	Balanced quality / cost workhorse	Moderate	Moderate	Often yes
Large / frontier (Opus-class, large Llama, Nova Premier)	Hardest reasoning, lowest throughput	Lowest	Lowest	Sometimes (model-dependent)
Embeddings (Titan / Nova / Cohere embed)	Vectorising for RAG at scale	High	High	Often yes
Image / video (Nova Canvas / Reel, Stability)	Generation, lower call rates	Low (per-image)	n/a (per-image)	Model-dependent

Relative, not absolute — exact RPM/TPM differ by model, region, and date, and your account may already carry increases. Confirm current values per model per region in the AWS Service Quotas console (Amazon Bedrock). The pattern that holds: smaller models = higher limits; frontier models = lower limits; you are throttled when you cross RPM or TPM, whichever comes first.

what 429 means

IIIHow throttling works — the 429 ThrottlingException

When you cross a quota, Bedrock does not queue your request or slow it down — it rejects it. Understanding exactly what that rejection looks like, and what it does not mean, is the difference between a resilient client and one that falls over under load.

When a request would exceed your RPM or TPM quota for a model, the Bedrock runtime returns an error with HTTP status 429 and the error code ThrottlingException (you may also see TooManyRequestsException phrasing in some SDKs). The message typically reads along the lines of "Too many requests, please wait before trying again." This is a transient, retryable error — it does not mean your request was malformed, your credentials are wrong, or the model is down. It means: right now, in this rolling minute, this model in this region is at capacity for your account.

Crucially, a 429 is not a 400-class validation error (a bad prompt, an oversized payload, an unsupported parameter) and not a 500-class server error (an internal Bedrock fault). The correct response to a 429 is to wait and retry; the correct response to a 400 is to fix the request and not retry; the correct response to a 500 is usually a cautious retry as well. Conflating them is a common bug — teams either retry validation errors forever (wasting calls and never succeeding) or fail fast on throttling (turning a brief capacity blip into a user-facing error).

Throttling is evaluated continuously against the rolling minute, so it is inherently bursty: a workload averaging well under its quota can still throttle if its traffic arrives in spikes that momentarily exceed the per-minute rate. This is why the average-utilisation view lies. A path that averages 30% of its TPM but bursts to 150% for ten seconds will throw 429s during those bursts. Smoothing the burst — through queueing and concurrency caps (Section V) — often eliminates throttling without raising the quota at all.

There is also a relationship between the two dimensions worth naming: because TPM counts input plus output tokens, a sudden shift in workload shape can trip TPM unexpectedly. A feature that starts sending much larger documents, or a prompt change that balloons output length, can push you over TPM even though your request count (RPM) has not moved. When throttling appears "out of nowhere," a change in average tokens-per-call is a frequent culprit — check token volume, not just call volume.

the 429 rule

A 429 ThrottlingException is transient and retryable — wait and retry with backoff, do not fail fast and do not fix the request. It is not a 400 (bad request — fix and stop) and not a 500 (server error). Throttling is evaluated per rolling minute, so bursty traffic throttles even when the average is well under quota. TPM counts input + output, so larger prompts or longer outputs can trip it without any change in request count.

lifting the ceiling

IVHow to request a quota increase

When the throttling is real demand rather than a fixable burst, the direct fix is to raise the adjustable quota. This is a self-service request for most RPM/TPM limits, escalating to AWS Support for larger asks. Here is the path end to end.

The front door is the Service Quotas console (Service Quotas → AWS services → Amazon Bedrock). It lists every Bedrock quota, the current applied value for your account, the AWS default, and whether the quota is adjustable. Find the specific per-model RPM or TPM quota you are hitting, choose "Request quota increase," enter the new value you need, and submit. For modest increases on adjustable quotas this is often approved automatically or within a short review; larger increases route to a human reviewer.

The lifecycle of a request is straightforward:

Identify the exact quota — In Service Quotas, filter to Amazon Bedrock and locate the precise quota — it names the model and the dimension (e.g. on-demand requests-per-minute or tokens-per-minute for a specific model). Confirm it is marked adjustable; if it is not, a support ticket will not raise it and you need a throughput lever instead (Section V).
Request the new value — Click "Request quota increase" and enter the target value. Ask for realistic headroom above your measured peak — not a round number plucked from the air. Reviewers approve well-justified, proportionate asks faster than vague large ones.
Justify with real numbers (for larger asks) — For substantial increases, open or attach an AWS Support case describing the workload: current peak RPM/TPM, projected peak, the use case, and the region. Concrete traffic data and a clear business reason materially speed approval.
Track the request — Service Quotas shows request status (pending, approved, denied). Modest adjustable increases can clear quickly; large or non-standard ones can take longer and may involve back-and-forth with the Bedrock team.
Plan around lead time — A quota increase is not instant capacity for a launch tomorrow. Request well ahead of a known traffic event (a launch, a marketing push, a seasonal peak) so the approval lands before the demand does.

Two limits on what a quota increase can do. First, AWS will not raise a quota beyond what it can serve from shared on-demand capacity for that model and region, so very large asks may be partially granted or pointed toward Provisioned Throughput. Second, a higher quota raises the ceiling but does nothing for bursty traffic that throttles below the ceiling — if your average is far under quota and you still see 429s, the fix is smoothing the burst (Section V), not a bigger number. Raise the quota when sustained demand genuinely exceeds it; smooth the traffic when bursts are the cause.

increase vs architecture

Raise the adjustable RPM/TPM quota via the Service Quotas console (escalating to AWS Support for large asks) when sustained demand exceeds it. If the quota is non-adjustable, or the asks get very large, or the model/region is capacity-constrained, the real fix is an architectural lever — cross-region inference, Provisioned Throughput, or Batch (next two sections), not a bigger number.

changing the capacity model

VThe throughput levers — cross-region inference and Provisioned Throughput

Beyond raising on-demand numbers, two features change the capacity model itself. One spreads on-demand load across regions to smooth spikes; the other reserves dedicated capacity that does not share the on-demand quota pool at all. Both raise effective throughput without you simply retrying into the same ceiling.

A quota increase makes the on-demand bucket bigger. These two levers do something different: they give you more buckets, or a private bucket.

The decision rule is simple. If throttling comes from bursts below the ceiling, smooth them (cross-region inference, plus client-side queueing in Section VI). If it comes from sustained demand above an adjustable ceiling, raise the quota. If you need guaranteed throughput, are running a custom model, or have hit a non-adjustable wall, reserve Provisioned Throughput. If the work is bulk and not time-sensitive, take it off the real-time path with Batch (Section VII).

Cross-region inference — more buckets

Because on-demand quotas are scoped per region, the capacity available to you across several regions is the sum of the per-region quotas. Cross-region inference uses an inference profile that lets Bedrock automatically route a request to one of several regions on your behalf, drawing on the combined capacity rather than a single region's. The practical effect is spike-smoothing: a burst that would throttle in one region is spread across the profile's regions, so the effective RPM/TPM headroom is larger and transient throttling drops sharply. It stays on the on-demand pricing model (no commitment) and is often the first lever to reach for when throttling comes from spiky on-demand traffic rather than a sustained ceiling. (See the cross-region-inference sibling for the routing detail and data-residency considerations.)

Provisioned Throughput — a private bucket

Where cross-region inference adds shared buckets, Provisioned Throughput (PT) gives you a private one. You reserve dedicated capacity for a specific model — measured in "model units," each delivering a guaranteed tokens-per-minute throughput — and pay a flat hourly rate per unit. Because that capacity is yours, it is not subject to the shared on-demand RPM/TPM quotas at all: no contention with other tenants, guaranteed throughput, and consistent latency with no 429s from the shared pool. PT is also the only way to serve most custom (fine-tuned, distilled, imported) models. The trade is commitment and cost: the hourly charge accrues whether or not the capacity is used, so PT wins for high, steady, predictable volume and for guaranteed-SLA paths, not for variable or experimental traffic. (See the provisioned-throughput sibling for the break-even math.)

four ways to get more Bedrock throughput · what each one changes

Lever	What it changes	Pricing model	Best when	Removes 429s?
Quota increase	Raises the adjustable on-demand RPM/TPM ceiling	On-demand (per token)	Sustained on-demand demand exceeds the current ceiling	If the ceiling was the cause
Cross-region inference	Pools per-region capacity across regions	On-demand (per token)	Spiky on-demand traffic throttling in one region	Largely — smooths spikes
Provisioned Throughput	Reserves dedicated capacity outside the shared pool	Hourly per model unit	High, steady volume; SLA paths; custom models	Yes — within reserved capacity
Batch inference	Moves bulk work off the real-time path entirely	Per token (~50% cheaper)	Large, non-urgent, asynchronous jobs	N/A — not real-time

These are complementary, not mutually exclusive. A common production shape: cross-region inference for interactive on-demand traffic, Provisioned Throughput for the one or two hottest steady paths or any custom model, Batch for bulk, and a quota increase where on-demand demand genuinely outgrows the default. Figures and behaviour are representative as of 2026 — confirm current details in the AWS docs and Service Quotas console.

resilient by design

VIDesigning for limits — backoff, queueing, and concurrency caps

Even with healthy quotas, any high-traffic Bedrock client will eventually meet a 429 — at a burst, during a regional surge, or on a frontier model. A client designed for limits turns that from an outage into a brief, invisible slowdown. Three patterns do most of the work.

These patterns are not exotic; they are the standard discipline for calling any rate-limited API. The difference is that with generative AI, calls are expensive and latency-sensitive, so getting them right has outsized value.

Two supporting practices complete the picture. Observe the limits: emit metrics and CloudWatch alarms on your 429 rate, your per-minute request and token volume, and your retry counts, so you see throttling trending up before it becomes an incident — and so you know whether the right fix is a quota increase, cross-region inference, or just better smoothing. Degrade deliberately: decide in advance what happens when retries are exhausted — fall back to a smaller, higher-quota model, return a graceful "try again" to the user, or shed lower-priority work — rather than surfacing a raw 429. Designed together, backoff, queueing, concurrency caps, observability, and a fallback make throttling a non-event.

Exponential backoff with jitter

On a 429, do not retry immediately and do not retry at a fixed interval. Wait a short delay, then double it on each subsequent failure (exponential backoff), and add a small random component (jitter) so that many clients throttled at the same instant do not all retry in lockstep and re-throttle together (the "thundering herd"). Cap the number of retries and the maximum delay so a request fails cleanly rather than hanging forever. The AWS SDKs implement retry with backoff for throttling errors out of the box — confirm it is enabled and tune the retry count and mode (standard or adaptive) for your workload rather than disabling it.

A queue in front of bursty producers

If your traffic is inherently bursty — a batch of users hitting "generate" at once, an upstream event fan-out, a cron that fires a thousand jobs — put a queue (for example Amazon SQS) between the producers and the Bedrock callers, and drain it at a controlled rate matched to your quota. This converts an uncontrolled spike into a steady stream the quota can absorb, trading a little latency for the elimination of throttling. For asynchronous work this is almost always the right shape; for interactive work, a smaller in-process queue or token-bucket rate limiter on the client achieves the same smoothing.

A client-side concurrency cap

Bound the number of in-flight Bedrock requests your application makes at once (a semaphore, a worker pool, or a concurrency limit on your async runtime). A concurrency cap keeps you from launching a thousand simultaneous calls the instant load arrives — which would blow through RPM in a single second — and instead holds throughput just under the quota. Combined with backoff (for the 429s that still slip through) and a queue (to hold the overflow), a concurrency cap is the third leg of a client that degrades gracefully instead of failing.

the resilience checklist

(1) Exponential backoff with jitter on 429s, with a retry cap. (2) A queue in front of bursty producers, drained at a controlled rate. (3) A client-side concurrency cap so a load spike cannot blow RPM in one second. (4) CloudWatch alarms on 429 rate and per-minute token volume. (5) A deliberate fallback (smaller model or graceful retry) when retries are exhausted. With these, throttling is a slowdown, not an outage.

sidestepping real-time limits

VIIWhen the answer is Batch — bulk work off the real-time path

A large share of the throttling teams fight is self-inflicted: they push bulk, non-urgent work through the real-time API and then battle RPM/TPM to get it done. For that work the right move is not a bigger quota — it is a different API.

Batch inference processes a large set of inputs asynchronously: you submit a job pointing at your inputs in Amazon S3, Bedrock works through them on its own schedule, and the results land back in S3 when the job completes. Because it runs on a separate, asynchronous path, it does not compete with your interactive traffic for the real-time RPM/TPM quota — so a million-record backfill no longer threatens to throttle the live product. Batch also typically prices at roughly 50% of on-demand, so moving bulk work to it cuts both throttling and cost. (Batch has its own job-level quotas — on concurrent jobs and job size — which are generous and separate from the real-time limits; see the batch-inference sibling.)

The fit is clear once you separate urgent from bulk. Anything a user is waiting on — a chat reply, an interactive agent step, a live classification — belongs on the real-time path, protected by backoff and queueing. Anything that can complete in minutes-to-hours rather than milliseconds — overnight document enrichment, dataset labelling, embedding a large corpus for RAG, bulk summarisation, evaluation runs over a test set — belongs on Batch. Routing bulk to Batch frees your real-time quota for the traffic that actually needs low latency, which is often the single most effective thing a throttled team can do.

Put the levers together and a coherent production architecture emerges: interactive traffic on the real-time API with cross-region inference and client-side resilience; hot, steady or custom paths on Provisioned Throughput; bulk, non-urgent work on Batch; and a quota increase where sustained on-demand demand has genuinely outgrown the default. Designing that mix — and the backoff, queueing, observability, and fallback around it — is exactly the kind of production engineering a vetted AWS partner does in the engagements CloudRoute routes.

urgent vs bulk

If a user is waiting on it, keep it on the real-time API (with backoff + queueing + cross-region inference). If it can finish in minutes-to-hours, move it to Batch — a separate asynchronous path that does not touch the real-time RPM/TPM quota and prices ~50% cheaper. Moving bulk off the real-time path frees your quota for the traffic that actually needs low latency.

how it becomes $0

VIIIProduction architecture, funded by AWS credits

Designing for Bedrock limits is real engineering — quota planning, cross-region routing, Provisioned-Throughput sizing, queueing, backoff, observability. That work, and the inference spend it manages, is exactly what AWS credits and a vetted partner are for, which is how the build can cost you $0.

Every cost involved here — on-demand tokens, Provisioned-Throughput model-unit-hours, Batch jobs, embeddings, the supporting SQS and CloudWatch — is ordinary AWS spend, and credit-eligible: credits in your account apply automatically against it. The relevant pools are the familiar ones: AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups), a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed at proving out a GenAI use case at production load, and the competitive Generative AI Accelerator (awards up to $1M for a small cohort of AI-first startups). Credits let you stress-test against real quotas, validate cross-region and Provisioned-Throughput strategies under load, and run the bulk Batch jobs — all without the bill being the thing that limits the experiment.

Why this matters for a quota-and-limits problem specifically: getting throughput right often means buying capacity (Provisioned Throughput) or running heavy validation (load tests, Batch backfills) before the product is fully monetised. Funding that from a credit pool rather than runway changes the calculus — you can architect for the peak you expect, prove it holds, and let credits absorb the cost through launch and ramp. Cost discipline becomes "make the credits last" rather than "avoid the experiment."

The practical mechanic is that these pools are largely partner-filed — requested through the AWS Partner Network (the ACE program), not a public self-serve form — which is why teams route through an AWS partner rather than applying alone. CloudRoute matches you to the right pool for your stage and to a vetted AWS DevOps/ML partner who both files the credit application and does the actual production engineering: planning quotas against projected peak, wiring cross-region inference, sizing and managing Provisioned Throughput, building the queue-and-backoff resilience, moving bulk to Batch, and standing up the CloudWatch alarms. The customer pays $0 — AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice. (For the credit mechanics, see AWS credits for generative-AI startups and the Bedrock POC funding page.)

the four throughput levers

Quota increase vs cross-region inference vs Provisioned Throughput vs Batch

The scannable version of the whole decision: the four ways to get more Bedrock throughput, side by side, across what each changes, its pricing model, its commitment, and the throttling problem it actually solves. Figures and behaviour are representative 2026 illustrations, not quotes.

Variable	Quota increase	Cross-region inference	Provisioned Throughput	Batch inference
What it changes	Raises adjustable RPM/TPM ceiling	Pools per-region capacity	Reserves dedicated capacity	Moves bulk off real-time
Pricing model	On-demand (per token)	On-demand (per token)	Hourly per model unit	Per token, ~50% of on-demand
Commitment	None	None	1mo / 6mo (or hourly no-commit)	None (per job)
Latency profile	Real-time	Real-time	Real-time, guaranteed	Asynchronous (mins–hours)
Solves which throttling	Sustained demand above ceiling	Spiky on-demand bursts	Need for guaranteed / custom-model capacity	Bulk competing with live traffic
Serves custom models?	No	No	Yes	Base models
Lead time	Minutes–days (approval)	Immediate (enable profile)	Immediate (purchase)	Per-job runtime
When it wins	On-demand demand outgrew the default	Interactive traffic with spikes	Hot steady paths, SLAs, custom models	Backfills, labelling, evals, RAG ingest

These combine in production: cross-region inference + client-side resilience for interactive traffic, Provisioned Throughput for the hottest steady or custom paths, Batch for bulk, and a quota increase where sustained on-demand demand outgrew the default. Confirm current quotas, behaviour, and pricing in the AWS Service Quotas console and the Bedrock docs.

before you fight 429s by hand

Get a partner to design the production architecture — and AWS credits to fund it (you pay $0)

Get matched in 24h →

a recent match

A launch that was throttling on a frontier model — re-architected and funded at $0 — anonymized

inquiry · Series-A AI support platform, London

Series-A AI customer-support platform, 28 people, a real-time agent on a Sonnet-class model plus a large nightly knowledge-base re-ingest

Situation: Ahead of a big customer go-live, load tests on their real-time agent were throwing frequent 429 ThrottlingExceptions on the frontier model they used for the hardest reasoning steps — its on-demand TPM was the tightest ceiling in the account. At the same time, a nightly job that re-embedded their entire knowledge base for RAG was running through the real-time API and competing with daytime traffic for the same quota. They had no backoff strategy, no queue, and were nervous about committing to reserved capacity out of a runway earmarked for hiring before they knew the real peak.

What CloudRoute did: CloudRoute matched them within 24 hours to a UK-region AWS partner with GenAI production experience. The partner (1) split model usage — cheap, high-volume calls to a small high-quota model and only the genuinely hard steps to the frontier model — cutting frontier-model load; (2) enabled cross-region inference on the interactive path to pool capacity and smooth spikes; (3) added exponential backoff with jitter, an SQS queue in front of bursty producers, and a client-side concurrency cap; (4) moved the nightly re-embedding job entirely onto Batch, off the real-time quota and at ~50% cost; (5) filed a quota-increase request for the frontier model with real peak numbers as headroom for proven demand; and (6) filed a Bedrock POC application plus an Activate Portfolio application to fund the load testing and the standing spend.

Outcome: The go-live ran without throttling: cross-region inference plus the client-side resilience absorbed the interactive spikes, the frontier-model quota increase landed before launch, and the nightly Batch job stopped touching the live quota entirely. The entire load-test and launch spend was covered by the approved credits, so the team paid $0 through launch and ramp. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.

levers used: cross-region + Batch + quota increase + backoff/queue · 429s at go-live: ~0 · credits secured: POC + Activate · out-of-pocket during build: $0

faq

Common questions

What are Amazon Bedrock service quotas?

Service quotas are the maximum values AWS permits for Bedrock resources and rates in your account. The two that govern day-to-day inference — and cause nearly all production throttling — are the per-model requests-per-minute (RPM) and tokens-per-minute (TPM) limits. They are scoped per account, per AWS Region, and per model, so each model in each region has its own RPM/TPM budget, and every workload sharing an account draws from the same pool. Other quotas (concurrent fine-tuning jobs, custom-model counts, Knowledge Base sizes, Agents, Batch jobs) exist too but are usually generous enough to ignore.

What is the difference between RPM and TPM in Bedrock?

RPM (requests per minute) caps how many model-invocation API calls you can make in a rolling minute. TPM (tokens per minute) caps how many tokens — input plus output combined — you can process through that model per rolling minute. Both are enforced at once, and you are throttled the instant you cross either, whichever comes first. A workload of many tiny calls tends to hit RPM first; a workload of few calls with very large contexts or long outputs tends to hit TPM first. Because TPM counts input + output, larger prompts or longer responses can trip it without your request count changing at all.

What does a 429 ThrottlingException mean in Bedrock?

A 429 ThrottlingException means you exceeded your RPM or TPM quota for that model in that region in the current rolling minute. It is transient and retryable — it does not mean your request was malformed (that is a 400) or that Bedrock failed internally (that is a 500). The correct response is to wait and retry with exponential backoff and jitter, not to fail fast and not to "fix" the request. Throttling is evaluated per rolling minute, so bursty traffic can trigger 429s even when your average usage is well under the quota.

How do I request a Bedrock quota increase?

Open the AWS Service Quotas console, go to AWS services and select Amazon Bedrock, find the specific per-model RPM or TPM quota you are hitting, confirm it is marked adjustable, click "Request quota increase," and enter a realistic value above your measured peak. Modest increases on adjustable quotas are often approved automatically or after a short review; larger asks route to AWS Support and are approved faster when you supply concrete traffic data (current and projected peak RPM/TPM, the use case, the region). Request ahead of any known traffic event — approval is not instant capacity.

Why am I being throttled when my average usage is below the quota?

Because Bedrock evaluates quotas against a rolling minute, so bursts matter more than averages. Traffic that averages 30% of your TPM but spikes to 150% for a few seconds will throw 429s during those spikes. The fix is usually not a higher quota but smoothing the burst: put a queue in front of bursty producers and drain it at a controlled rate, add a client-side concurrency cap so a load spike cannot launch a thousand calls in one second, and use cross-region inference to pool per-region capacity. Also check whether average tokens-per-call has grown, since larger prompts or outputs can trip TPM without any change in request count.

How do cross-region inference and Provisioned Throughput help with rate limits?

They change the capacity model rather than just raising a number. On-demand quotas are per region, so cross-region inference uses an inference profile to route requests across several regions automatically, pooling their capacity and smoothing spikes — it stays on-demand with no commitment, and is the first lever for bursty on-demand throttling. Provisioned Throughput reserves dedicated capacity (model units, each with guaranteed throughput) at a flat hourly rate; that capacity is outside the shared on-demand pool, so it has no shared-quota throttling and guarantees throughput — and it is the only way to serve most custom models. Use cross-region for spikes, Provisioned Throughput for steady high volume, SLA paths, or custom models.

When should I use Batch instead of fighting real-time quotas?

Use Batch for any bulk work that does not need a sub-second response — overnight document enrichment, dataset labelling, embedding a large corpus for RAG, bulk summarisation, evaluation runs. Batch processes inputs from S3 asynchronously on a separate path, so it does not compete with interactive traffic for the real-time RPM/TPM quota, and it typically prices at roughly 50% of on-demand. Keep anything a user is waiting on (chat replies, live agent steps) on the real-time API with backoff and queueing, and move the bulk to Batch — that alone frees real-time quota and cuts cost.

How should I design a Bedrock client to handle throttling?

Combine five patterns. (1) Exponential backoff with jitter on 429s, with a retry cap so requests fail cleanly rather than hanging — the AWS SDKs do this for throttling out of the box, so confirm it is enabled and tune the retry mode. (2) A queue (for example SQS) in front of bursty producers, drained at a rate matched to your quota. (3) A client-side concurrency cap so a spike cannot blow RPM in one second. (4) CloudWatch alarms on your 429 rate and per-minute token volume so you see throttling trending up before it becomes an incident. (5) A deliberate fallback — a smaller, higher-quota model or a graceful retry — when retries are exhausted. Together these make throttling a brief slowdown rather than an outage.

Can AWS credits cover the cost of building and load-testing a Bedrock workload?

Yes. On-demand tokens, Provisioned-Throughput model-unit-hours, Batch jobs, embeddings, and the supporting SQS and CloudWatch are all ordinary, credit-eligible AWS spend — credits in your account apply automatically. This is especially useful for a quota/throughput problem because getting throughput right often means buying reserved capacity or running heavy load tests and Batch backfills before the product is fully monetised. The relevant pools (AWS Activate up to $100K, Bedrock/GenAI POC $10K–$50K, GenAI Accelerator up to $1M) are largely partner-filed via the AWS Partner Network. CloudRoute matches you to the right pool and a vetted AWS partner who files the application and designs the production architecture — customer pays $0, AWS funds it.

Architect past the limits — let AWS fund the build

Quota planning, cross-region inference, Provisioned-Throughput sizing, queueing and backoff, moving bulk to Batch — designing a Bedrock workload that does not throttle is real engineering. CloudRoute routes you to a vetted AWS partner who builds it, and to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) to fund it. Customer pays $0.

Get matched in 24h →→ see the AI-team persona detail

matched within< 24h

GenAI credit ceilingup to $1M

cost to you$0