A complete, neutral reference for running AI21 Labs' Jamba models on Amazon Bedrock in 2026: the hybrid SSM-Transformer (Mamba) plus Mixture-of-Experts architecture and why it changes the long-context economics; the headline 256K-token context window that is Jamba's differentiator; the model IDs (Jamba 1.5 Mini and Large) and how to enable access; per-model pricing; where Jamba is strong (long-document processing, RAG over large corpora, structured JSON output and tool use); a clear decision on when to pick Jamba versus Claude or Llama for long context; a minimal Converse API call; and how AWS credits make running Jamba $0.
Jamba is AI21 Labs' family of foundation models, available natively on Amazon Bedrock as one of the model providers behind Bedrock's single managed API — alongside Anthropic's Claude, Amazon's own Nova and Titan, Meta Llama, Mistral, Cohere, and others. What sets Jamba apart from almost everything else in that catalog is not a benchmark headline; it is the architecture, and the very long context window that architecture makes affordable.
AI21 Labs is an established foundation-model lab, and Jamba is its production model line built around a single thesis: that a hybrid architecture can serve very long context far more cheaply than a conventional Transformer. On Bedrock the family is offered as two members — Jamba 1.5 Mini, the smaller, faster, lower-cost model, and Jamba 1.5 Large, the larger, more capable model — and both expose the same headline feature: a 256K-token context window, among the largest available on Bedrock. That is roughly the size of a long book or several hundred pages of dense documents in one request.
Practically, that means Jamba is the model you reach for when the problem is shaped like a lot of text: a stack of contracts to compare, a quarter of support transcripts to summarize, a large codebase or specification to reason over, or a RAG application that wants to stuff many retrieved chunks into a single prompt rather than aggressively trimming them. The same job is possible on other Bedrock models, but Jamba is engineered so that the long-context case stays fast and cost-controlled rather than ballooning, which is the whole point of the next section.
Both Jamba models on Bedrock are instruction-tuned for following directions, support structured output (notably native JSON, useful for downstream parsing), and support tool use / function calling for agentic and grounded workflows. They are also multilingual. All of this is reached through the same Bedrock API surface and the same IAM/VPC controls as every other model, so adopting Jamba is an integration change, not a platform change.
One caveat, stated once and meant throughout: exact model version names, model IDs, regional availability, context-window sizes, and per-token prices all change as AI21 ships new Jamba generations and AWS updates Bedrock. The figures and identifiers here are representative as of 2026 to convey structure and relative cost. Always confirm the current model IDs in the Bedrock model catalog and current rates on the AWS Bedrock pricing page before you build or budget.
Jamba = a hybrid SSM-Transformer + MoE model with a 256K-token context window. Two members on Bedrock — Jamba 1.5 Mini (fast, cheap) and Jamba 1.5 Large (more capable). Reach for it when the job is long-document processing or RAG over a large corpus and you want long context without a frontier-model price.
Almost every other model in the Bedrock catalog is a pure Transformer. Jamba is not, and the difference is the reason its long context is practical rather than a headline. Understanding the architecture at a high level tells you exactly when Jamba is the right tool.
A standard Transformer uses self-attention, where every token attends to every other token. That is what makes Transformers so capable — but it is also why long context is expensive: attention cost grows roughly with the square of the sequence length, and the memory needed to hold the running state (the KV cache) grows with length too. Double the context and the attention work rises about fourfold. On very long inputs this is what makes a pure-Transformer model slow, memory-hungry, and costly — and why many models cap context well below 256K.
State-space models (SSMs) — the family popularized by the Mamba architecture — take a different approach. Instead of all-pairs attention, an SSM processes the sequence with a recurrence whose cost grows linearly with length and whose memory footprint stays roughly constant as context grows. That makes SSMs dramatically more efficient on long sequences. The trade-off is that pure SSMs can be weaker than attention at certain tasks that need precise, content-based lookup across the whole context (for example, pulling an exact fact from far back in the input).
Jamba's design is to interleave both: most layers are efficient Mamba/SSM layers, with Transformer attention layers placed at intervals to recover the precise-recall and in-context-learning strengths that attention provides. The result is a model that keeps much of the SSM efficiency on long context while keeping much of the Transformer quality — a deliberate hybrid rather than a compromise. This is the structural reason Jamba can offer a 256K window with a flatter cost-and-latency curve than a same-size pure Transformer would.
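The scaling argument above can be made concrete with a toy cost model. The sketch below is illustrative only: the unit costs, the 32-layer depth, and the one-attention-layer-in-eight ratio are assumed values chosen for the example, not figures stated on this page, and real cost also depends on hardware, batching, and implementation.

```python
def attention_cost(n_tokens: int) -> int:
    """All-pairs attention: work grows with the square of sequence length."""
    return n_tokens * n_tokens


def ssm_cost(n_tokens: int) -> int:
    """Recurrent state-space scan: work grows linearly with sequence length."""
    return n_tokens


def stack_cost(n_tokens: int, n_layers: int = 32, attention_every: int = 8) -> int:
    """Toy cost of a stack where one layer in `attention_every` is attention.

    attention_every=1 models a pure Transformer; larger values model a hybrid.
    Both the depth and the ratio are assumptions for illustration.
    """
    n_attention = n_layers // attention_every
    n_ssm = n_layers - n_attention
    return n_attention * attention_cost(n_tokens) + n_ssm * ssm_cost(n_tokens)


if __name__ == "__main__":
    for n in (32_000, 64_000, 128_000, 256_000):
        pure = stack_cost(n, attention_every=1)
        hybrid = stack_cost(n, attention_every=8)
        print(f"{n:>7} tokens: hybrid is {hybrid / pure:.1%} of pure-Transformer cost")
```

In this toy model the hybrid stack still carries a quadratic term, but with a much smaller coefficient, which is the sense in which the cost curve is flatter rather than flat.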
On top of the hybrid stack, Jamba uses a Mixture-of-Experts (MoE) design. In an MoE, the model has many "expert" sub-networks but a router activates only a small subset for any given token, so the model has a large total parameter count (for capability) while only a fraction of those parameters do work on each token (for efficiency). The net effect across the whole design: a large, capable model that is comparatively economical to run — especially on the long-context workloads it is built for.
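The routing idea can be sketched in a few lines. This is a minimal top-k gating example with toy scalar "experts"; real MoE layers route vectors through learned networks, and the expert count and k=2 used here are assumptions for illustration rather than Jamba's actual configuration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Expert = Callable[[float], float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of router scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Return (expert_index, weight) for the k highest-scoring experts.

    Weights are renormalized over the selected experts so they sum to 1.
    """
    ranked = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_scores[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_forward(x: float, experts: Sequence[Expert], router_scores: Sequence[float], k: int = 2) -> float:
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_scores, k))
```

Only `k` of the experts are evaluated per input, which is why total parameter count and per-token compute can diverge.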
Pure-Transformer attention cost scales roughly with the square of context length; an SSM scales about linearly and keeps memory roughly flat. By interleaving SSM layers with attention layers (and adding MoE), Jamba keeps long context far cheaper and faster than a same-size Transformer — which is exactly why the 256K window is usable in production rather than just on a spec sheet.
Context window — the amount of text a model can consider in a single request — is the single number that most often decides whether Jamba is the right model. At 256K tokens, Jamba sits among the longest-context options on Bedrock, and on big-input workloads that window is the difference between one clean call and a brittle chunking pipeline.
To make 256K tokens concrete: a token is roughly three-quarters of a word in English, so 256K tokens is on the order of 180,000–200,000 words — comparable to a long novel, or several hundred pages of contracts, filings, transcripts, logs, or documentation. Everything you place in that window is available to the model at once, with no need to pre-summarize or drop material to make it fit.
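The arithmetic behind that estimate is simple enough to write down. The sketch below uses the three-quarters-of-a-word heuristic from the paragraph above; the 500-words-per-page figure for dense documents and the treatment of 256K as 256,000 are assumptions, and a real tokenizer will give different counts for any specific text.

```python
def approx_words(n_tokens: int, words_per_token: float = 0.75) -> int:
    """Rough English word count for a token budget (a heuristic, not a tokenizer)."""
    return round(n_tokens * words_per_token)


def approx_pages(n_tokens: int, words_per_page: int = 500, words_per_token: float = 0.75) -> int:
    """Rough page count, assuming a fixed number of words per page."""
    return approx_words(n_tokens, words_per_token) // words_per_page


if __name__ == "__main__":
    print(approx_words(256_000))  # 192000
    print(approx_pages(256_000))  # 384
```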
Why does that matter so much? Two reasons. First, it removes engineering you would otherwise have to build. When a document is larger than the context window, you must split it, process the pieces separately, and stitch the results back together — a map-reduce pattern that loses cross-references, double-counts, and is fiddly to get right. A 256K window lets a large document, or a large set of retrieved chunks, go into the model whole, so cross-document reasoning ("does clause 14 in contract A conflict with the indemnity in contract B?") works in a single call.
Second, it changes how you build RAG. Retrieval-augmented generation works by fetching relevant chunks from a knowledge base and feeding them to the model as context. With a small window you can afford only a handful of chunks, so retrieval quality has to be near-perfect or the answer is missing the relevant passage. A 256K window lets you pass far more retrieved context — more chunks, longer chunks, more documents — which makes the whole pipeline more forgiving of imperfect retrieval and better at questions whose answer is spread across many sources. (See the rag-on-aws sibling for the full pattern.)
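One way to picture the difference is as a packing problem: given retrieved chunks in rank order, how many fit in the budget? The following is a minimal sketch, assuming a crude four-characters-per-token estimate and a greedy cut-off; the function names are hypothetical, and a production pipeline would count tokens with the model's own tokenizer.

```python
from typing import List, Tuple


def estimate_tokens(text: str) -> int:
    """Crude token estimate: about 4 characters per token (assumed heuristic)."""
    return max(1, len(text) // 4)


def pack_chunks(ranked_chunks: List[str], budget_tokens: int) -> Tuple[List[str], int]:
    """Keep chunks in rank order until the next one would exceed the budget."""
    kept: List[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept, used
```

With a small budget the cut-off lands after a handful of chunks, so a relevant passage ranked slightly too low is lost; with a budget near 256K (minus whatever is reserved for instructions and output) far more of the ranked list survives.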
The honest counterpoint, stated plainly: a long window is a capability, not a free lunch. Input is billed per token, so a 256K-token prompt costs far more than a 4K-token one on the same model — filling the window every call is expensive and often unnecessary. And research across the field shows models can attend less reliably to the deep middle of a very long context than to its start and end, so retrieval and good prompt structure still matter even when everything fits. The right discipline is to use the long window when the task genuinely needs it, and to lean on prompt caching for any large fixed prefix (see amazon-bedrock-prompt-caching) so you are not re-paying for the same context on every request.
Before you can call Jamba on Bedrock, you do one small but mandatory thing: request model access in your account. Foundation models on Bedrock are off by default; turning Jamba on is a one-time, no-cost step in the console.
Enabling access. In the Bedrock console, open Model access, find the AI21 Jamba models you want, and request access. For most models this is granted effectively immediately; some prompt for brief use-case details. There is no charge for enabling access — you only pay when you actually call a model. Access is per-account and per-region, so if you operate in several regions, enable Jamba in each region you will call from. Where you need extra availability or throughput, cross-region inference profiles can route calls across a set of regions (see the amazon-bedrock-cross-region-inference sibling).
Model IDs. Every model on Bedrock is invoked by a model ID — a string identifying the provider, model, and version. AI21's models are namespaced under AI21, so Jamba IDs are of the shape ai21.jamba-… (for example, an identifier for Jamba 1.5 Mini versus Jamba 1.5 Large, each with a version suffix). You pass this ID to the API to choose which Jamba model answers a request, so moving a workload from Mini to Large is a change of model-ID string. Because IDs advance with each generation, do not hard-code a guessed value — read the current ID from the Bedrock model catalog (console) or list it via the API/CLI, and treat it as configuration rather than a literal in your code.
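A minimal sketch of both habits follows: listing IDs from the catalog and reading the chosen ID from configuration. It assumes the boto3 control-plane client (`boto3.client("bedrock")`) and its `list_foundation_models` call; the `JAMBA_MODEL_ID` environment variable and both function names are hypothetical, introduced only for this example.

```python
import os
from typing import Any, List


def jamba_model_ids(bedrock_client: Any, prefix: str = "ai21.jamba") -> List[str]:
    """Return the IDs of catalog models whose ID starts with `prefix`.

    `bedrock_client` is expected to be boto3.client("bedrock"), the
    control-plane client, not "bedrock-runtime".
    """
    response = bedrock_client.list_foundation_models()
    return sorted(
        summary["modelId"]
        for summary in response.get("modelSummaries", [])
        if summary["modelId"].startswith(prefix)
    )


def configured_model_id(env_var: str = "JAMBA_MODEL_ID") -> str:
    """Read the model ID from configuration instead of hard-coding it.

    JAMBA_MODEL_ID is a hypothetical variable name used for this example.
    """
    model_id = os.environ.get(env_var)
    if not model_id:
        raise RuntimeError(f"Set {env_var} to a model ID taken from the Bedrock catalog")
    return model_id
```

Typical use would be `jamba_model_ids(boto3.client("bedrock", region_name="us-east-1"))` once during setup, then storing the chosen ID in configuration.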
Permissions. The IAM principal making the call needs permission for the relevant Bedrock invoke actions (and, if you use cross-region inference profiles, permission on the profile). A least-privilege policy scoped to the specific Jamba model ARNs you intend to use is the recommended posture. Once access is granted and IAM is in place, you are ready to call Jamba — the Converse snippet later in this page shows the minimal request.
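As a sketch of what least privilege could look like, the helper below builds a policy document scoped to specific foundation-model ARNs. The ARN layout and the two action names reflect the commonly documented Bedrock conventions but should be verified against the current IAM reference; the function name and `Sid` are hypothetical, and cross-region inference profiles would need their own resource entries.

```python
import json
from typing import Dict, List


def jamba_invoke_policy(region: str, model_ids: List[str]) -> Dict:
    """Build an IAM policy document allowing invocation of specific models only."""
    resources = [
        f"arn:aws:bedrock:{region}::foundation-model/{model_id}"
        for model_id in model_ids
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "InvokeSelectedJambaModels",
                "Effect": "Allow",
                "Action": [
                    "bedrock:InvokeModel",
                    "bedrock:InvokeModelWithResponseStream",
                ],
                "Resource": resources,
            }
        ],
    }
```

Passing the result through `json.dumps(..., indent=2)` gives a document that can be reviewed before it is attached to a role.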
Read the current model ID (ai21.jamba-…) from the model catalog or via the API; do not hard-code a guessed version string.

Jamba on Bedrock is billed per token: a rate per 1,000 input tokens (everything you send, including the long context) and a higher rate per 1,000 output tokens (everything Jamba generates). Mini is the low-cost tier; Large is the higher-quality, higher-price tier. With long context the input side dominates the bill, so the per-input-token rate matters most.
The table gives representative 2026 on-demand rates for the two Jamba models, shown per 1,000 and per 1,000,000 tokens (the per-million column is the per-1K figure × 1,000; providers increasingly quote per-million). Use it to rank the models and sanity-check a budget — not as an audited price sheet. Two cost levers sit on top of these rates and are not in the table: Batch (submit non-interactive work as an async job for roughly half the on-demand price — ideal for bulk long-document processing) and prompt caching (stop re-paying full input price for a repeated prefix such as a fixed instruction block or a reference document). Both matter a great deal precisely because Jamba's workloads tend to be large-input. See amazon-bedrock-pricing and amazon-bedrock-prompt-caching.
| Jamba model | Context | Input / 1K | Output / 1K | Input / 1M | Output / 1M | Cost position |
|---|---|---|---|---|---|---|
| Jamba 1.5 Mini | 256K | $0.0002 | $0.0004 | $0.20 | $0.40 | Low — fast, high-volume, cheap long context |
| Jamba 1.5 Large | 256K | $0.002 | $0.008 | $2.00 | $8.00 | Mid — higher quality for harder long-context work |
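
To sanity-check a budget against the table above, a small estimator is enough. The rates below are copied from the table and are representative, not audited; the `discount` parameter is a simplification for modeling the roughly-half Batch price, and the function name is hypothetical.

```python
# Representative per-1K-token rates copied from the table above (USD).
RATES_PER_1K = {
    "mini": {"input": 0.0002, "output": 0.0004},
    "large": {"input": 0.002, "output": 0.008},
}


def request_cost(tier: str, input_tokens: int, output_tokens: int, discount: float = 0.0) -> float:
    """Estimated USD cost of one request.

    `discount` is a fraction of the on-demand price, e.g. 0.5 to model Batch.
    """
    rate = RATES_PER_1K[tier]
    on_demand = (input_tokens / 1000) * rate["input"] + (output_tokens / 1000) * rate["output"]
    return on_demand * (1.0 - discount)


if __name__ == "__main__":
    for tier in ("mini", "large"):
        print(tier, round(request_cost(tier, 200_000, 1_000), 4))
```

At these rates a 200,000-token prompt with a 1,000-token reply comes to about $0.04 on Mini and about $0.41 on Large, and in both cases nearly all of it is input.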
Jamba is not trying to be the strongest general-purpose frontier model. It is engineered to win a specific, common, and expensive class of work: tasks that are large in input. Mapped to concrete capabilities, here is where it is the right pick.
This is the home-turf use case. Summarizing, analyzing, or answering questions over a single very large document — a 200-page contract, a financial filing, a research dossier, a long deposition transcript — fits the 256K window without chunking. Because the whole document is in context, the model can resolve cross-references and reason about the document as a coherent whole rather than as disconnected fragments, and the flat-ish long-context cost curve keeps the per-document price sensible even at scale (especially on Mini, and especially via Batch for bulk jobs).
For retrieval-augmented generation, the long window lets you pass many more retrieved chunks into a single call than a short-context model allows — more documents, longer passages, more of the knowledge base per question. That makes the pipeline more robust to imperfect retrieval and better at questions whose answer is distributed across many sources. Jamba pairs naturally with Bedrock Knowledge Bases (managed RAG) as the generation model behind a large retrieval set. (See amazon-bedrock-knowledge-bases and rag-on-aws.)
Jamba models support structured output, notably the ability to return valid JSON conforming to a shape you specify. That is exactly what you want when the model's output feeds another system — extraction pipelines, form-filling, populating a database, or any step where you parse the response programmatically. Reliable JSON output removes the brittle "ask for JSON and hope" post-processing that otherwise surrounds LLM integrations, and it pairs well with the long-document case (extract structured fields from a big unstructured document in one pass).
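Even with reliable JSON output, it is usually worth validating the reply at the boundary before it reaches another system. The sketch below shows one way to do that with the standard library; the function name and the field names in the usage note are hypothetical, and a stricter pipeline might validate against a full JSON Schema instead.

```python
import json
from typing import Any, Dict, Iterable


def parse_extraction(reply_text: str, required_keys: Iterable[str]) -> Dict[str, Any]:
    """Parse a model reply as JSON and check that the expected fields are present."""
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"Reply is missing fields: {missing}")
    return data
```

For example, `parse_extraction(reply, ["party", "term_months"])` either returns a dictionary with both fields or raises a `ValueError` that the caller can log or retry on.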
Jamba supports tool use: you describe tools (functions, APIs, queries) and the model decides when to call them and with what arguments, then folds the results into its answer. This is the basis of agentic and grounded workflows — letting the model look things up, take actions, and ground responses in live data — and on Bedrock it is exposed through the Converse API's tool fields, so a Jamba-backed agent is built the same way as any other Bedrock agent. (See amazon-bedrock-agents.)
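The loop described above can be sketched against the Converse request and response shapes. This is a minimal illustration, not a production agent: the `lookup_contract` tool, its schema, and the `run_with_tools` helper are hypothetical, `client` is assumed to be a `boto3.client("bedrock-runtime")`, and the model ID is passed in from configuration.

```python
from typing import Any, Callable, Dict, List

# Hypothetical tool definition, in the shape the Converse API expects.
TOOL_CONFIG = {
    "tools": [
        {
            "toolSpec": {
                "name": "lookup_contract",
                "description": "Fetch the text of a contract by its identifier.",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {"contract_id": {"type": "string"}},
                        "required": ["contract_id"],
                    }
                },
            }
        }
    ]
}


def run_with_tools(client: Any, model_id: str, prompt: str,
                   handlers: Dict[str, Callable[[dict], dict]], max_turns: int = 5) -> str:
    """Call Converse, execute any requested tools, and return the final text."""
    messages: List[dict] = [{"role": "user", "content": [{"text": prompt}]}]
    for _ in range(max_turns):
        resp = client.converse(modelId=model_id, messages=messages, toolConfig=TOOL_CONFIG)
        message = resp["output"]["message"]
        messages.append(message)  # keep the assistant turn in the history
        if resp["stopReason"] != "tool_use":
            return "".join(block.get("text", "") for block in message["content"])
        results = []
        for block in message["content"]:
            if "toolUse" in block:
                call = block["toolUse"]
                output = handlers[call["name"]](call["input"])
                results.append({"toolResult": {
                    "toolUseId": call["toolUseId"],
                    "content": [{"json": output}],
                }})
        # Tool results go back to the model as the next user turn.
        messages.append({"role": "user", "content": results})
    raise RuntimeError("Tool loop did not finish within max_turns")
```

The `max_turns` cap is a design choice to stop a misbehaving loop from running indefinitely.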
Jamba is one of several Bedrock models that can handle long inputs. The honest framing is that this is a workload-and-budget decision, not a "which model is best" decision — and the right answer depends on what you are optimizing for on the long-context job in front of you.
Pick Jamba when the workload is large-input and cost-sensitive. Its reason to exist is efficient long context: if you are routinely sending very large prompts — long documents, big RAG contexts, bulk processing — and you want the 256K window without paying a frontier-model rate for every token, Jamba (especially Mini) is the natural fit. The hybrid SSM-Transformer + MoE design is precisely what keeps that long-context path fast and economical. It is the value pick for "a lot of text, at scale."
Pick Claude when the long-context job also needs top-tier reasoning. Claude on Bedrock offers a large context window together with top-tier reasoning, vision, and extended thinking, so for a long-context task that is also genuinely hard (intricate multi-document analysis, nuanced synthesis, high-stakes agentic steps over long inputs), Claude (Sonnet or Opus) is often worth its higher per-token price. Use Claude when quality on the hard part dominates; use Jamba when efficient throughput over large inputs dominates. Many teams run both behind one Converse API and route accordingly. (See claude-on-amazon-bedrock.)
Pick Llama when you want open-weight flexibility and a strong general model. Meta's Llama models on Bedrock are capable, widely supported open-weight models with competitive pricing and good general performance; some generations offer large context too. Choose Llama when you value the open-weight ecosystem, want portability across environments, or already standardize on it — recognizing that its long-context cost curve is conventional-Transformer-shaped rather than SSM-efficient, so on the very largest inputs Jamba's architectural edge can tell. (See amazon-bedrock-models for the full provider line-up.)
The meta-point, true across this whole cluster: because every model sits behind the same Bedrock API, this is not a one-way door. Start with whichever fits, benchmark the candidates on your own documents and prompts — long-context behavior in particular varies by task in ways leaderboards do not capture — and re-tier as prices and capabilities move, without re-plumbing your application. The comparison table below puts the three side by side.
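Because the choice is only a model-ID string, per-request routing can be a small lookup. The sketch below is one possible shape, assuming the application already labels each request with a difficulty; the labels, the configuration keys, and the function name are all hypothetical, and the actual model IDs are expected to come from configuration.

```python
from typing import Dict

# Hypothetical mapping from a request label to a configuration key.
ROUTES = {
    "easy": "jamba_mini",      # high-volume, cost-sensitive requests
    "hard": "jamba_large",     # harder long-context analysis
    "frontier": "claude",      # quality on the hard part dominates
}


def choose_model(difficulty: str, model_ids: Dict[str, str]) -> str:
    """Return the configured model ID for a request's difficulty label."""
    try:
        return model_ids[ROUTES[difficulty]]
    except KeyError as exc:
        raise ValueError(f"No route or model ID configured for {difficulty!r}") from exc
```

Re-tiering then means editing the mapping or the configuration, not the calling code.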
The recommended way to call Jamba (and any chat model) on Bedrock is the Converse API, a single, model-agnostic interface for multi-turn messages, system prompts, tool use, and structured output. Because it is model-agnostic, the same code calls Jamba Mini or Large by changing only the model ID, and can call Claude or Llama the same way.
A minimal text request with the AWS SDK looks like the snippet below (Python / boto3). You create a Bedrock Runtime client, call converse with a model ID and a list of messages, and read the reply from the response. Swapping modelId between the Jamba Mini and Large IDs is the only change needed to move a request between the two tiers — and the same call shape would target Claude or Llama instead.
    import boto3

    # Bedrock Runtime is the data-plane client used for inference calls.
    client = boto3.client("bedrock-runtime", region_name="us-east-1")

    resp = client.converse(
        modelId="ai21.jamba-<mini|large>-<version>",  # from the model catalog
        messages=[
            {
                "role": "user",
                "content": [{"text": "Summarize the indemnity terms across these contracts: ..."}],
            }
        ],
        system=[{"text": "You are a precise legal-analysis assistant. Answer only from the documents provided."}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
    )

    # The reply text is the first content block of the returned message.
    print(resp["output"]["message"]["content"][0]["text"])
That is the whole pattern for a basic call. For Jamba's signature workloads you extend it the same model-agnostic way: place a long document or many retrieved chunks in the message content to exploit the 256K window; add tool use (a toolConfig describing your functions, with a multi-step loop to feed results back) for agents; request structured JSON via your prompt and schema for extraction pipelines; and use streaming (the converse_stream variant) for token-by-token output on long generations. The API surface barely changes as you add capabilities — that is the point of Converse. The exact model ID string must come from the Bedrock model catalog; the placeholder above is illustrative, not a literal value.
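The streaming variant can be sketched the same way. The helper below places a long document and a question in one user message and yields text as it arrives; `stream_text` is a hypothetical name, `client` is assumed to be a `boto3.client("bedrock-runtime")`, the `maxTokens` value is arbitrary, and the model ID is again expected to come from the catalog.

```python
from typing import Any, Iterator


def stream_text(client: Any, model_id: str, document: str, question: str) -> Iterator[str]:
    """Yield text fragments from converse_stream as they arrive."""
    response = client.converse_stream(
        modelId=model_id,
        messages=[{
            "role": "user",
            # The long document and the question travel as separate text blocks.
            "content": [{"text": document}, {"text": question}],
        }],
        inferenceConfig={"maxTokens": 2048},
    )
    for event in response["stream"]:
        # Only contentBlockDelta events carry generated text; others are skipped.
        delta = event.get("contentBlockDelta", {}).get("delta", {})
        if "text" in delta:
            yield delta["text"]
```

A caller would typically loop over the generator and print each fragment with `end=""` and `flush=True` to show output as it is generated.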
The Converse API is model-agnostic: one interface for messages, system prompts, tool use, and structured output across every Bedrock model. Switching Jamba Mini ↔ Large — or swapping Jamba for Claude or Llama to compare them on your long-context task — is a change to modelId, not a rewrite. Build once, route per request.
Everything above prices Jamba on Bedrock if you pay AWS directly. For most startups and many companies the relevant number is different — because AWS will frequently fund the build with credits, and Jamba usage on Bedrock draws those credits down before it ever touches your card. Long-context workloads are large-input by nature, so this matters even more here than on a lightweight model.
Jamba inference on Bedrock is ordinary AWS spend, so it is fully credit-eligible and credits apply automatically against your bill until exhausted — covering Jamba input and output tokens, any Batch and prompt-caching usage, plus the supporting services a long-context app leans on (Knowledge Bases, the vector store behind RAG, S3 for the documents, logging). That is significant precisely because the headline 256K-context use cases consume a lot of input tokens: the credit pool absorbs exactly the spend that would otherwise grow fastest. The relevant pools: AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups); a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed at proving out a GenAI use case; and the competitive Generative AI Accelerator (awards up to $1M for a small cohort of AI-first startups).
The practical mechanic is that most of these pools are partner-filed — requested through the AWS Partner Network (the ACE program), not a public self-serve form — which is why teams route through an AWS partner rather than applying alone. That is the gap CloudRoute fills. CloudRoute matches you to the right credit pool for your stage and to a vetted AWS DevOps/ML partner who both files the credit application and helps build the Jamba workload — the long-document pipeline, the large-context RAG over Knowledge Bases, the structured-extraction step, the tool-using agent, prompt caching on the fixed prefix. The customer pays $0 — AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice.
Put together with Jamba's own cost advantages — efficient long context, Mini for the cheap bulk path, Batch and caching on top — the picture for a startup is: build the long-document or big-RAG product on the model tier each request actually needs, cache the repeated context, and run the whole thing on a $25K–$100K (or larger) credit pool while you find product-market fit — paying real money only once usage, and ideally revenue, has scaled past the credits. Related: AWS credits for generative-AI startups and Bedrock POC funding for the full credit mechanics.
The core decision in one place: three Bedrock options for long-input work, compared on context, architecture, the cost shape that matters at scale, and the job each suits. Match the workload to the model that optimizes what you actually care about. Representative 2026 figures for relative comparison, not quotes.
| Model | Context window | Architecture | Long-context cost shape | Best for | Reach for it when |
|---|---|---|---|---|---|
| AI21 Jamba (Mini / Large) | 256K (very long) | Hybrid SSM (Mamba) + Transformer + MoE | Flattest — SSM efficiency keeps big inputs cheap | Long-document processing, big-corpus RAG, structured extraction | A lot of text, at scale, cost-sensitive |
| Anthropic Claude (Sonnet / Opus) | Large | Transformer (frontier) | Higher per token, but top reasoning | Long-context work that is also genuinely hard | Quality on the hard part dominates |
| Meta Llama | Large (varies by gen) | Transformer (open-weight) | Conventional Transformer scaling | Open-weight flexibility, strong general use | You value open weights / portability |
Situation: The product had to read and cross-reference large contract bundles — often 150–300 pages — and answer questions whose answers spanned multiple documents. On their existing short-context frontier model they were forced into an aggressive chunk-and-stitch pipeline that lost cross-references and was expensive per query, and the inference bill was climbing out of runway as usage grew. They were already an AWS customer and wanted long context, lower cost, and to stop paying for it out of pocket.
What CloudRoute did: CloudRoute matched them in under 24 hours to an EU-West AWS partner with GenAI and RAG experience. The partner (1) moved the long-document and RAG generation onto AI21 Jamba via Bedrock's Converse API to use the 256K window, letting whole contract bundles and far more retrieved context go in per call; (2) routed the easy, high-volume queries to Jamba 1.5 Mini and reserved Jamba 1.5 Large for the hard multi-document analyses; (3) wired structured JSON output into the extraction step and prompt caching onto the fixed instruction prefix; and (4) filed a Bedrock POC credit application plus an Activate Portfolio application to fund the workload.
Outcome: The chunk-and-stitch pipeline was retired in favor of whole-document calls, cross-reference accuracy improved, and the Mini/Large split plus caching cut the modeled per-query cost substantially — but the decisive change was that the spend now draws down AWS credits instead of runway, so the team pays $0 during the build and early scale. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.
window used: 256K · pattern: whole-doc + big-RAG, Mini/Large split, JSON + caching · credits secured: POC + Activate · out-of-pocket: $0
Jamba's 256K window and SSM-efficient architecture make big-document and large-RAG workloads affordable — and on Bedrock the spend draws down AWS credits instead of your card, under your existing IAM, VPC, and billing. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS partner who builds the Jamba long-document or RAG pipeline, splits traffic across Mini and Large, and turns on caching. Customer pays $0.