A complete, neutral reference for Bedrock batch inference in 2026: what it is (asynchronous, high-throughput jobs at roughly half the on-demand price), how to submit one end-to-end (JSONL records in S3, results back to S3, via CreateModelInvocationJob), which models support it, the workloads it is built for, the quotas and realistic turnaround, the cost math versus on-demand, and how it combines with the other Bedrock cost levers. Plus how AWS credits make the whole bill $0 to build.
Batch inference is the mode you reach for whenever you have a lot of model work to do and no need for any individual answer right now. It is the clearest example on Bedrock of trading latency for a large, predictable cost cut — and a surprising share of real GenAI workloads fit its shape better than the default real-time path.
On the default on-demand path you send one request and wait for one response, in real time, paying the full per-token rate. That is the right model for anything a human is waiting on — a chatbot reply, an agent step, an autocomplete. But a large class of GenAI work is not latency-sensitive at all: classifying ten million support tickets overnight, generating embeddings for an entire document corpus, enriching a dataset, scoring a model evaluation suite, producing synthetic training data. For all of these, nobody is staring at a spinner — you just need the whole job done by some deadline, cheaply.
Batch inference serves exactly that. You assemble all your requests into a single job (a file of inputs in Amazon S3), submit it, and Bedrock processes the requests asynchronously in the background, writing the results to S3 when finished. Because you have relinquished real-time responsiveness — results arrive in minutes to hours, not milliseconds — Bedrock charges you roughly half the on-demand token rate for the same model and the same tokens. The output is identical to what on-demand would produce; only the delivery model and the price differ.
The mental model: on-demand is a conversation, batch is a queue. On-demand optimizes for the latency of each individual request; batch optimizes for the cost and throughput of a large set of requests where individual latency does not matter. The ~50% discount is not a promotion — it reflects that asynchronous, schedulable work is cheaper for AWS to serve, and they pass that saving on.
This makes batch the single easiest cost win on Bedrock for batch-shaped work. There is no model-quality tradeoff, no architectural risk, no reserved-capacity commitment — you take a workload that does not need to be real-time, point it at the batch path instead of the on-demand path, and your token bill for that workload roughly halves. The only real cost is engineering the job submission and tolerating the turnaround.
Batch inference runs a big set of prompts as one asynchronous job — you give Bedrock a file of inputs in S3, it processes them in the background and writes results back to S3 — at roughly 50% of the on-demand price. The trade is latency (minutes-to-hours, not real-time), which is free to give up on any bulk job a human is not waiting on.
A batch job has a deliberately simple shape: a file of inputs in S3 in, a file of outputs in S3 out, one API call to kick it off. Understanding the five moving parts — the input format, the S3 locations, the IAM role, the job-creation call, and the output — is all you need to run one end to end.
The workflow has five steps, in order:
Two properties of that design are worth internalizing. First, every record is independent — there is no shared conversation state across lines, so batch is for embarrassingly-parallel work (classify each item, embed each chunk), not for multi-turn or chained reasoning. Second, the whole thing is file-in / file-out through S3, which makes it trivial to wire into a data pipeline: a job is just "transform this S3 dataset with a model," and it slots naturally beside Glue, Athena, Step Functions, or whatever orchestrates your data flow.
Conceptually each JSONL line looks like: a recordId (so you can join the answer back to the source row) plus a modelInput (the exact request body the chosen model takes on-demand — prompt/messages, max tokens, temperature, etc.). The output JSONL mirrors it: the same recordId plus a modelOutput with the model's response. Because modelInput is just the normal per-model body, moving a workload from on-demand to batch is largely a matter of writing the same requests to a file instead of sending them one at a time.
Exact field names, per-model body shapes, the precise job states, and any console-vs-API differences evolve — treat this as the durable shape and confirm the specifics in the current AWS Bedrock documentation for batch inference.
Batch is supported on a broad set of the text and embedding models on Bedrock — which is the point, because the bulk jobs batch is built for are overwhelmingly classification, extraction, summarization, and embeddings on exactly those models.
As of 2026, batch inference is available across many of the high-volume text models on Bedrock — including Anthropic Claude tiers, Amazon Nova text models, and various open-weight families (Llama, Mistral) — and the embedding models (Amazon Titan Text Embeddings, Cohere Embed) that power corpus backfills. Coverage broadens over time and not every model in the catalog supports batch, so confirm batch availability for your specific model in the current AWS Bedrock documentation before designing a job around it.
The practical model choice for batch is the same discipline as everywhere else on Bedrock, only more so: pick the cheapest model that clears the quality bar, because batch is where volume is highest. A nightly job over hundreds of millions of tokens should almost never run on a frontier model if a small, fast model (Amazon Nova Micro/Lite, Claude Haiku, a small Llama/Mistral) produces good-enough results — the per-token rate difference, multiplied by batch-scale volume and then halved by the batch discount, compounds into the difference between a $40 job and a $4,000 one. See amazon-bedrock-pricing for the full per-model price table.
For embeddings specifically, batch is the natural home for the initial backfill of a large corpus and for periodic re-embedding when content or the embedding model changes — both are high-volume, non-interactive, and perfectly parallel, which is the exact batch sweet spot.
Batch support, per-model input body shapes, and quotas vary by model and region and change as AWS ships updates. Confirm batch availability and limits for your exact model in the current AWS Bedrock docs — this page gives the durable mechanics and representative economics, not a frozen capability matrix.
Batch is the right call whenever the work is high-volume and nobody is waiting on any single answer. Here are the canonical workloads it was built for — and the bright line that tells you when to stay on the real-time path instead.
When not to use batch: anything a human or a synchronous system is actively waiting on. Interactive chat, agent loops that branch on each step, autocomplete, real-time moderation in a request path, or any flow where a result is needed within seconds belongs on on-demand (often with prompt caching). The bright line is simple: "Is something waiting on this specific answer right now?" If yes, real-time. If no — if you just need the whole job done by a deadline — batch, and take the ~50% discount.
Batch trades latency for cost, so the honest question is not "how fast" but "how predictable." Turnaround depends on job size, model, region capacity, and how many jobs you are running — and there are quotas that shape how you structure work. Here is what to expect and how to plan around it.
Turnaround. A batch job is asynchronous and scheduled against available capacity, so completion time ranges from minutes for a small job to many hours for a very large one. It is not a real-time SLA — Bedrock processes the job as capacity allows, and a large job submitted during a busy window can sit before it runs. The right way to use batch is to build the deadline into your schedule: submit the nightly enrichment job with hours of headroom before anyone needs the results, rather than expecting a fixed completion time. For planning, size from the job's total token volume and the model's throughput, and add margin.
Quotas and limits. Bedrock applies account- and region-level quotas to batch — for example caps on the number of concurrent/in-progress batch jobs, and limits on input file size and the number of records per job. Large datasets are therefore chunked into multiple input files / multiple jobs rather than one giant submission. The specific numbers vary by model and region and change over time, and many are adjustable via a Service Quotas increase request, so check the current Bedrock quotas page rather than hard-coding a limit. The practical guidance: design the pipeline to split large work into appropriately-sized jobs and to handle queuing, retries on failed records, and partial completion gracefully.
Failure modes to plan for. Individual records can fail (malformed input, a record exceeding limits) without failing the whole job; the output reflects which records succeeded, so your pipeline should reconcile inputs against outputs by record id and re-submit failures. Jobs can also be stopped. None of this is unusual for batch systems — it just means treating batch as a data pipeline with reconciliation, not a single fire-and-forget call.
| Dimension | On-Demand | Batch inference |
|---|---|---|
| Latency | Real-time (ms–seconds) | Asynchronous (minutes–hours) |
| Price per token | Baseline (1×) | ~50% of on-demand |
| Interface | Per-request API call | JSONL file in S3 → job → JSONL out |
| Concurrency model | Throughput limits per account | Job/record quotas; chunk large work |
| Best for | Anything a human waits on | Bulk, non-interactive jobs |
| State | Can be multi-turn | Each record independent |
The ~50% discount is easy to state and easy to under-appreciate. Here is a concrete, representative example so you can see the absolute dollars and reproduce the calculation for your own job.
The job. A nightly enrichment task: summarize and tag 2,000,000 documents/month, each averaging 1,500 input tokens and producing 250 output tokens, on a small, fast model (Amazon Nova Lite-class). Monthly volume: 2M × 1,500 = 3,000M (3B) input tokens and 2M × 250 = 500M output tokens.
On-demand. At Nova Lite's representative rates of $0.06 / 1M input and $0.24 / 1M output: input = 3,000 × $0.06 = $180; output = 500 × $0.24 = $120 → ≈ $300/month.
Batch (~50% off). The same tokens at roughly half the rate: input ≈ $90, output ≈ $60 → ≈ $150/month. Same model, same output, same two million documents — half the bill, in exchange for letting the job run overnight instead of in real time.
Now compound it with model choice. If that same job had been run on a frontier model (say a Sonnet-class model at $3 / $15 per 1M) it would cost roughly 3,000 × $3 + 500 × $15 = $9,000 + $7,500 = ~$16,500/month on-demand, or ~$8,250 on batch. The lesson the arithmetic teaches twice: for bulk work, model choice is the first lever and batch is the second, and using both together — cheapest adequate model, run on the batch path — is the difference between a $150 job and a $16,500 one for identical throughput. Right-size the model first, then halve it with batch.
These remain representative 2026 illustrations — your numbers depend on token volumes, the model, and the region, and rates change. Always confirm current pricing on the AWS Bedrock pricing page; see amazon-bedrock-pricing-calculator to model your own job.
| Model | On-demand / mo | Batch (~50%) / mo | Saving from batch | Notes |
|---|---|---|---|---|
| Amazon Nova Lite | ~$300 | ~$150 | ~$150 | Right-sized for enrichment |
| Claude Haiku | ~$1,375 | ~$690 | ~$685 | Fast, slightly pricier |
| Claude Sonnet | ~$16,500 | ~$8,250 | ~$8,250 | Overkill for this job |
Batch is one of four Bedrock pricing modes, and the smart move is rarely "use batch for everything" — it is to route each path of a product to its best mode. Here is how batch composes with the others so you can cost-tune a whole system rather than picking one global setting.
The four ways to pay on Bedrock are On-Demand (per token, real-time, no commitment), Batch (~50% off, asynchronous), Provisioned Throughput (reserved capacity at a flat hourly rate), and prompt caching (a discount on repeated input on a real-time path). They are not mutually exclusive across a product — a single application typically uses several for different paths.
Batch + model right-sizing is the highest-impact pairing, as the worked example showed: choose the cheapest model that clears the bar, then halve it on the batch path. The two levers multiply, and for bulk work the combined effect is often more than an order of magnitude versus a frontier model on-demand.
Batch vs. prompt caching is mostly an either/or by traffic shape, not a stack. Caching discounts a repeated prefix on interactive traffic; batch discounts non-interactive bulk work. You route a workload to one or the other: real-time chat/agents → on-demand + caching; offline bulk → batch. Both are ways to stop overpaying, applied to opposite traffic shapes. (Within a batch job, records are independent, so the prefix-reuse pattern caching exploits is generally not the relevant lever there — model choice is.)
Batch vs. Provisioned Throughput is a choice about predictability and latency. Provisioned reserves dedicated capacity for steady, high, real-time volume (and is required to serve most custom fine-tuned models); batch is for asynchronous bulk where you would rather pay per token at half price than reserve capacity by the hour. If your bulk work is genuinely continuous and latency matters, Provisioned may win; if it is periodic and async, batch almost always does.
The right mental model for a real product: route each path to its cheapest adequate mode. Serve interactive traffic On-Demand with prompt caching; run nightly enrichment, embeddings backfills, evals, and synthetic-data generation on Batch with a right-sized model; reserve Provisioned Throughput only for an always-hot real-time path or a custom model. See amazon-bedrock-pricing for all four modes side by side, amazon-bedrock-prompt-caching for the caching lever, and amazon-bedrock-provisioned-throughput for the reserved-capacity path.
| Lever | What it cuts | Traffic it fits | Relationship to batch |
|---|---|---|---|
| Batch | ~50% off token rate | Bulk, non-interactive | — (this is it) |
| On-Demand | Nothing (baseline) | Variable / interactive | The alternative for real-time work |
| Prompt caching | Repeated-input cost + latency | Interactive, repeated prefix | Either/or by traffic shape |
| Provisioned Throughput | Caps cost at high steady volume | Steady high real-time volume; custom models | Choice by predictability/latency |
| Model right-sizing | Per-token rate (cheaper model) | Any | Multiplies with batch — biggest combo |
Everything above is about shrinking a batch bill you pay AWS directly. For most startups and many companies the more relevant move is to not pay it at all during the build — because AWS will frequently fund generative-AI workloads with credits, and batch spend draws those credits down before it touches your card.
AWS runs several credit programs specifically to put GenAI workloads on AWS, and Bedrock usage — batch and on-demand inference, fine-tuning, embeddings, and the supporting services (S3, the data pipeline) — is fully credit-eligible. The relevant pools: AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups); a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed at proving out a GenAI use case; and the competitive Generative AI Accelerator (credit awards up to $1M for a small cohort of AI-first startups). Credits apply automatically against your AWS bill until exhausted.
The practical mechanic is that most of these pools are partner-filed — requested through the AWS Partner Network (the ACE program), not a public self-serve form. That is why teams route through an AWS partner rather than applying alone, and it is the gap CloudRoute fills. CloudRoute matches you to the right credit pool for your stage and to a vetted AWS DevOps/ML partner who both files the credit application and helps build the workload — including the batch pipeline itself: the S3 layout, the JSONL job submission, the orchestration and reconciliation, and the model right-sizing that makes a big enrichment or embeddings job cheap. The customer pays $0: AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice.
There is a clean synergy worth naming. Batch is frequently the very first heavy Bedrock workload a team runs — the initial embeddings backfill to stand up RAG, the first large enrichment or evaluation pass. Those one-time, high-volume jobs can be exactly the spike that a Bedrock POC credit pool is designed to absorb: prove the use case, backfill the corpus, run the evals, all funded. A team that combines batch (and model right-sizing) with a credit pool can do an enormous amount of bulk processing while paying nothing out of pocket. Related: see the cross-cluster pages on AWS credits for generative-AI startups and Bedrock POC funding for the full credit mechanics.
Batch is one of three ways to buy model capacity on Bedrock (prompt caching is a modifier on the on-demand path rather than a fourth capacity mode). Here they are side by side so the choice for any given workload is obvious. Figures are representative 2026 illustrations, not quotes.
| Dimension | On-Demand | Batch inference | Provisioned Throughput |
|---|---|---|---|
| How you pay | Per token, no commitment | Per token, async job | Flat hourly per model unit |
| Relative cost | Baseline (highest/token) | ~50% of on-demand | Flat — wins at high steady volume |
| Latency | Real-time (ms–s) | Async (minutes–hours) | Real-time, guaranteed |
| Commitment | None | None | 1–6 months for best rate |
| Best for | Anything a human waits on | Bulk, non-interactive jobs | Steady high volume; custom models |
| Throughput | Shared, per-account limits | High via parallel job | Reserved & guaranteed |
| Watch out for | Throttling at spikes | Turnaround; job/record quotas | Paid even when idle |
Situation: To ship their product they had to embed a ~40M-document corpus for a vector index and generate model-derived metadata (summaries, tags) for each document — a large one-time backfill, plus periodic re-embedding as the corpus grew. Their first instinct was to loop on-demand calls on a capable model, which both modeled into the high four figures and risked throttling, and they had no runway to spend on a one-time backfill.
What CloudRoute did: CloudRoute matched them in under 24 hours to a US AWS partner with data-pipeline and Bedrock experience. The partner (1) moved the entire backfill to batch inference — JSONL inputs chunked into appropriately-sized jobs in S3, results written back to S3 and reconciled by record id; (2) right-sized the work onto a small embedding model plus a fast text model for the enrichment instead of a frontier model; (3) built reconciliation and retry handling for failed records and chunked around the batch quotas; and (4) filed a Bedrock POC credit application plus an Activate application to fund the backfill and early usage.
Outcome: The full 40M-document backfill ran via batch at roughly half the on-demand token rate, on right-sized models, completing overnight across chunked jobs — and the entire cost was absorbed by the approved credits, so the team paid $0 to stand up their search index and launch. The same batch pipeline now runs the periodic re-embedding. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.
corpus: ~40M docs · path: batch (~50% off) + right-sized models · credits secured: POC + Activate · out-of-pocket: $0
Batch inference runs your bulk jobs at ~50% of on-demand. AWS credits can cover what is left. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS partner who builds the batch pipeline, model right-sizing, and reconciliation. Customer pays $0.