Everything you need to call foundation models on AWS from code: the two runtime APIs (InvokeModel vs Converse, plus ConverseStream), the exact request and response shapes, tool use (function calling), system prompts, inference parameters, the boto3 and JavaScript SDKs, token streaming, and how to handle throttling and errors. Reference-grade, with copy-pasteable snippets — and the cost reality of running it in production.
Before any code, get the map straight. "The Bedrock API" is really two API families: a control plane for managing the service and a runtime for actually calling models. For sending a prompt and getting a completion, you live entirely in the runtime, where there are two operations to choose between — InvokeModel and Converse — each with a streaming variant.
The control plane (the bedrock service in the SDKs) is where you manage the service itself: list foundation models, create and manage Guardrails, fine-tuning jobs, model evaluation, provisioned-throughput model units, Knowledge Bases, and Agents. You touch it at setup and configuration time, not on every request. The runtime plane (the bedrock-runtime service) is the hot path — it is where inference happens, and it exposes exactly four operations that matter for day-to-day model calls.
Those four runtime operations are InvokeModel and InvokeModelWithResponseStream (the original, lower-level pair) and Converse and ConverseStream (the modern, unified pair). The non-streaming operations return the full completion in one response; the streaming operations return tokens incrementally as the model generates them. Choosing between the InvokeModel family and the Converse family is the single most consequential API decision you make, so it is worth being precise about the difference.
InvokeModel is provider-shaped. You send a request whose body is a JSON blob in the exact format that specific model provider defined — Anthropic's body looks different from Meta's, which looks different from Amazon Nova's — and you parse a provider-specific response body. It gives you maximum fidelity to each provider's native parameters, but it means your code branches per provider and a model swap can be a rewrite.
Converse is provider-agnostic on your side; the provider-specific translation happens under the hood. You send one canonical request schema (messages, an optional system prompt, inference config, and an optional tool config), and Bedrock translates it to whatever the underlying model expects, returning one canonical response schema regardless of provider. Multi-turn conversation, system prompts, tool use (function calling), and streaming are all first-class. Switching from, say, Claude to Llama to Nova is usually a one-line change to the modelId string. For chat and agentic applications, Converse is the recommended default; the rest of this reference centers on it and notes where InvokeModel is still the right tool.
InvokeModel — single response, provider-specific body (max control, provider-specific code).
InvokeModelWithResponseStream — streamed response, provider-specific body.
Converse — single response, one unified schema across all chat models (recommended default).
ConverseStream — streamed response, same unified schema.
Control-plane operations (ListFoundationModels, Guardrails, fine-tuning, etc.) live in the separate bedrock service, not bedrock-runtime.
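To make the "provider-shaped versus unified" distinction concrete, here is a minimal sketch that builds the same one-line prompt three ways. The helper names are hypothetical, and the native body fields (anthropic_version, max_gen_len) reflect the providers' formats as commonly documented; check each provider's current parameter reference before relying on them.

```python
import json

def anthropic_invoke_body(prompt: str, max_tokens: int = 512) -> str:
    """InvokeModel body in Anthropic's native Messages format (a JSON string)."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": [{"type": "text", "text": prompt}]}],
    })

def llama_invoke_body(prompt: str, max_tokens: int = 512) -> str:
    """InvokeModel body in Meta Llama's native format: different keys, no messages list."""
    return json.dumps({"prompt": prompt, "max_gen_len": max_tokens})

def converse_kwargs(model_id: str, prompt: str, max_tokens: int = 512) -> dict:
    """Converse keyword arguments: the same shape for every chat model."""
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": max_tokens},
    }
```

With InvokeModel, swapping providers means swapping the body builder and the response parser; with Converse, only the model_id argument changes.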
The honest rule is short: use Converse for anything conversational, and use InvokeModel for the cases Converse does not cover. The table makes the trade-offs explicit so the choice is a five-second decision rather than a research project.
Reach for Converse / ConverseStream when you are building a chatbot, an assistant, an agent, or any multi-turn experience; when you want portability across models so you can A/B or fall back between providers; when you need tool use (function calling) with a consistent shape; or when you simply want the least provider-specific code. This is the overwhelming majority of new applications.
Reach for InvokeModel / InvokeModelWithResponseStream when you are calling a non-conversational modality that Converse does not model — image generation (Stability, Nova Canvas), video (Nova Reel), or text embeddings (Titan, Cohere Embed) — or when you need a niche provider-specific parameter that Converse intentionally does not expose. Embeddings and image endpoints are the most common legitimate reasons teams still use InvokeModel in 2026.
| If you need… | Use | Request body | Streaming variant | Notes |
|---|---|---|---|---|
| Chat / multi-turn / agents | Converse | Unified schema | ConverseStream | Recommended default for text/chat |
| Model portability (swap providers) | Converse | Unified schema | ConverseStream | Switch model = change modelId only |
| Tool use (function calling) | Converse | toolConfig in schema | ConverseStream | Consistent across providers |
| Text embeddings | InvokeModel | Provider-specific | n/a | Titan / Cohere Embed; no Converse equivalent |
| Image generation | InvokeModel | Provider-specific | n/a | Stability / Nova Canvas |
| Video generation | InvokeModel (async) | Provider-specific | n/a | Nova Reel; uses async invocation |
| A provider-only parameter | InvokeModel | Provider-specific | WithResponseStream | Only when Converse omits it |
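As an illustration of the embeddings row, the sketch below calls InvokeModel with a Titan-style body. It assumes the Titan text-embedding format (an inputText field in the request, an embedding field in the response) and an illustrative model ID; FakeRuntime is a hypothetical stand-in so the example runs offline, and in real use you would pass boto3.client("bedrock-runtime") instead.

```python
import io
import json

def embed_text(client, text: str, model_id: str = "amazon.titan-embed-text-v2:0") -> list:
    """Return the embedding vector for `text` via InvokeModel (provider-specific body)."""
    resp = client.invoke_model(
        modelId=model_id,  # illustrative; copy the exact ID from the console
        body=json.dumps({"inputText": text}),
        contentType="application/json",
        accept="application/json",
    )
    # The response body is a byte stream holding provider-specific JSON.
    payload = json.loads(resp["body"].read())
    return payload["embedding"]

class FakeRuntime:
    """Offline stand-in that mimics the invoke_model response shape."""
    def invoke_model(self, **kwargs):
        request = json.loads(kwargs["body"])
        vector = [float(len(request["inputText"])), 0.0]
        return {"body": io.BytesIO(json.dumps({"embedding": vector}).encode())}

vector = embed_text(FakeRuntime(), "abc")
```

Note that both the request and the response are opaque JSON to the SDK, which is exactly the per-provider parsing that Converse removes for chat models.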
Three prerequisites, then one call. The prerequisites are the same regardless of language: enable model access in the Bedrock console for the model you want, attach an IAM policy that allows the runtime action, and pick a Region the model is available in. Then the call itself is a handful of lines.
Prerequisite 1: model access. In the Bedrock console, open Model access and enable the model you intend to call (for example a workhorse chat model); access is per-Region and per-model, and some models require accepting the provider's license first.
Prerequisite 2: IAM. Grant the calling principal the runtime actions it needs, ideally scoped to specific model ARNs. Converse is authorized by bedrock:InvokeModel and ConverseStream by bedrock:InvokeModelWithResponseStream; the Converse operations do not have separate IAM actions of their own, so those two actions cover all four runtime operations.
Prerequisite 3: Region. Configure the SDK with a Region where the model is live; frontier models often land in US Regions first. With those in place, the call below returns a completion.
In Python you use boto3 and the bedrock-runtime client. No API key is required: by default Bedrock uses standard AWS credentials (IAM role, environment, or profile), the same as every other AWS SDK call.
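For the IAM prerequisite, a least-privilege policy can be sketched as a small builder. This assumes the foundation-model ARN format arn:aws:bedrock:REGION::foundation-model/MODEL_ID and relies on Converse and ConverseStream being authorized by the InvokeModel and InvokeModelWithResponseStream actions respectively; the function name and Sid are hypothetical.

```python
import json

def bedrock_invoke_policy(region: str, model_ids: list) -> dict:
    """Build an IAM policy document allowing runtime calls to specific models only."""
    resources = [f"arn:aws:bedrock:{region}::foundation-model/{m}" for m in model_ids]
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "InvokeSelectedModels",
            "Effect": "Allow",
            # These two actions also authorize Converse and ConverseStream.
            "Action": ["bedrock:InvokeModel", "bedrock:InvokeModelWithResponseStream"],
            "Resource": resources,
        }],
    }

# Model ID is illustrative; copy the exact ID from the console.
print(json.dumps(bedrock_invoke_policy("us-east-1", ["anthropic.claude-sonnet"]), indent=2))
```

If you call through a cross-region inference profile, the profile has its own ARN and would likely need to be listed as a resource as well.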
The converse method takes a modelId, a messages list, and an inferenceConfig; the assistant's text is at response["output"]["message"]["content"][0]["text"], and token counts come back in response["usage"]. Model IDs below are illustrative — copy the exact current ID from the Bedrock console.
In Node.js you use the modular v3 client @aws-sdk/client-bedrock-runtime: instantiate a BedrockRuntimeClient, build a ConverseCommand with the same logical fields (camelCase here — modelId, messages, inferenceConfig), and send it. The response shape mirrors the Python one: response.output.message.content[0].text and response.usage.
# Python — boto3
import boto3

brt = boto3.client("bedrock-runtime", region_name="us-east-1")
resp = brt.converse(
    modelId="anthropic.claude-sonnet",  # illustrative — copy exact ID from console
    messages=[{"role": "user", "content": [{"text": "List three uses for embeddings."}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)
print(resp["output"]["message"]["content"][0]["text"])
print(resp["usage"])  # {inputTokens, outputTokens, totalTokens}

// JavaScript — @aws-sdk/client-bedrock-runtime (v3)
import { BedrockRuntimeClient, ConverseCommand } from "@aws-sdk/client-bedrock-runtime";

const client = new BedrockRuntimeClient({ region: "us-east-1" });
const resp = await client.send(new ConverseCommand({
  modelId: "anthropic.claude-sonnet", // illustrative
  messages: [{ role: "user", content: [{ text: "List three uses for embeddings." }] }],
  inferenceConfig: { maxTokens: 512, temperature: 0.2 },
}));
console.log(resp.output.message.content[0].text);
Converse is worth learning once because the shape is stable across every model. A request has two required top-level fields and a handful of optional ones; a response has three fields you use constantly. Internalize these and you can read or write any Bedrock chat call without per-provider documentation.
A Converse request has these top-level fields. modelId (required) is the model or inference-profile identifier. messages (required) is the conversation: an ordered array of turns, each { role: "user" | "assistant", content: [...] }, where content is a list of content blocks — most commonly { text: "..." }, but also image blocks, document blocks, and (for tool use) toolUse / toolResult blocks. system (optional) is a separate array of system-prompt blocks — Converse keeps the system prompt out of the message list so it is unambiguous. inferenceConfig (optional) holds the sampling parameters. toolConfig (optional) declares the tools the model may call. Two further optional fields are guardrailConfig (attach a Bedrock Guardrail by ID) and additionalModelRequestFields (an escape hatch to pass a provider-specific parameter Converse does not model directly).
A Converse response has three fields you use constantly. output.message is the assistant turn — same { role, content: [...] } structure as an input message, so you can append it straight back onto messages to continue the conversation. stopReason tells you why generation stopped: "end_turn" (the model finished normally), "max_tokens" (it hit your maxTokens limit), "stop_sequence" (it emitted one of your stop sequences), "tool_use" (it wants to call a tool — see section VI), or "guardrail_intervened" (a Guardrail blocked the content). usage reports inputTokens, outputTokens, and totalTokens — the numbers your bill and your rate limits are based on, so log them.
Maintaining a conversation is therefore mechanical: send messages, take output.message from the response, append it to messages, append the next user turn, and call again. There is no server-side session — Bedrock is stateless, and you own the transcript. That statelessness is deliberate: it keeps the API simple, keeps your data in your control, and lets you trim, summarize, or cache history however you like.
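The append-and-resend cycle can be sketched in a few lines. The function name send_turn is hypothetical, and EchoClient is an offline stand-in for boto3.client("bedrock-runtime") that returns responses in the Converse shape.

```python
def send_turn(client, model_id: str, messages: list, user_text: str, max_tokens: int = 512) -> str:
    """Append a user turn, call Converse, append the assistant turn, return its text."""
    messages.append({"role": "user", "content": [{"text": user_text}]})
    resp = client.converse(
        modelId=model_id,
        messages=messages,
        inferenceConfig={"maxTokens": max_tokens},
    )
    assistant = resp["output"]["message"]
    messages.append(assistant)  # same {role, content} shape, so it goes straight back in
    return "".join(block.get("text", "") for block in assistant["content"])

class EchoClient:
    """Offline stand-in: replies with the number of user turns it has been sent."""
    def converse(self, **kwargs):
        n_user = sum(1 for m in kwargs["messages"] if m["role"] == "user")
        reply = {"role": "assistant", "content": [{"text": f"reply {n_user}"}]}
        return {"output": {"message": reply}, "stopReason": "end_turn"}

history = []
first = send_turn(EchoClient(), "illustrative-model-id", history, "hello")
second = send_turn(EchoClient(), "illustrative-model-id", history, "and again")
```

Because the caller owns history, trimming or summarizing old turns is just list manipulation before the next call.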
| Field | Where | Required | Purpose |
|---|---|---|---|
| modelId | request | Yes | Model or cross-region inference-profile ID to invoke |
| messages | request | Yes | Ordered turns; each { role, content[] } of typed blocks |
| system | request | No | System-prompt blocks, kept separate from messages |
| inferenceConfig | request | No | maxTokens, temperature, topP, stopSequences |
| toolConfig | request | No | Tool (function) declarations + tool-choice policy |
| guardrailConfig | request | No | Attach a Bedrock Guardrail by id/version |
| output.message | response | — | Assistant turn; append back onto messages to continue |
| stopReason | response | — | end_turn / max_tokens / stop_sequence / tool_use / guardrail_intervened |
| usage | response | — | inputTokens, outputTokens, totalTokens (billing + limits) |
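A small helper that reads the three response fields in one place makes it harder to forget a stopReason branch. This is a sketch under the field names listed in the table; summarize_response and its output keys are hypothetical.

```python
def summarize_response(resp: dict) -> dict:
    """Collect the text, stop reason, and token counts from a Converse response."""
    blocks = resp["output"]["message"]["content"]
    text = "".join(block["text"] for block in blocks if "text" in block)
    stop = resp["stopReason"]
    usage = resp.get("usage", {})
    return {
        "text": text,
        "stopReason": stop,
        "truncated": stop == "max_tokens",          # answer was cut off by maxTokens
        "wants_tool": stop == "tool_use",           # caller must run a tool and continue
        "blocked": stop == "guardrail_intervened",  # a Guardrail intervened
        "inputTokens": usage.get("inputTokens", 0),
        "outputTokens": usage.get("outputTokens", 0),
    }

sample = {
    "output": {"message": {"role": "assistant", "content": [{"text": "Hel"}, {"text": "lo"}]}},
    "stopReason": "max_tokens",
    "usage": {"inputTokens": 7, "outputTokens": 2, "totalTokens": 9},
}
summary = summarize_response(sample)
```

Logging the returned dictionary per request gives you the token counts alongside the reason generation stopped.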
Two levers control behavior on every call: the system prompt (what the model is and how it should behave) and the inference parameters (how it samples tokens). Converse exposes both cleanly and identically across models.
The system prompt sets persona, rules, output format, and guardrails-in-prose. In Converse it is the top-level system field — an array of blocks such as [{ "text": "You are a precise support assistant. Answer only from provided context. If unsure, say so." }] — kept deliberately separate from the user/assistant messages so there is never ambiguity about what is instruction versus conversation. Keeping the system prompt stable across turns also makes it a prime candidate for prompt caching, which bills repeated context at a steep discount instead of re-charging full input price every call.
The inference parameters live in inferenceConfig and are consistent across providers: maxTokens caps the length of the response (and thus the most expensive half of the bill); temperature controls randomness (low — for example 0–0.3 — for deterministic extraction and tool use; higher for creative drafting); topP is nucleus sampling, an alternative randomness control you usually tune instead of, not together with, temperature; and stopSequences is a list of strings that, when generated, halt the model (useful for forcing structured output or delimiting sections). A small number of model-specific knobs (such as topK on some providers) are not part of the unified config — pass those through additionalModelRequestFields when a given model supports them.
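One way to keep the unified parameters and the provider-specific escape hatch apart is a builder like the sketch below. The function name is hypothetical, and the top_k key in the usage line is an assumption about what one provider accepts; use whatever name the target model's documentation specifies.

```python
def sampling_kwargs(max_tokens, temperature=None, top_p=None, stop=None, extra=None) -> dict:
    """Build the inferenceConfig (unified) and additionalModelRequestFields (provider-specific)."""
    config = {"maxTokens": max_tokens}
    if temperature is not None:
        config["temperature"] = temperature
    if top_p is not None:
        config["topP"] = top_p
    if stop:
        config["stopSequences"] = list(stop)
    kwargs = {"inferenceConfig": config}
    if extra:
        # Passed through to the model as-is; not validated by the unified schema.
        kwargs["additionalModelRequestFields"] = dict(extra)
    return kwargs

kwargs = sampling_kwargs(400, temperature=0.1, extra={"top_k": 50})
# Usage: brt.converse(modelId=..., messages=..., **kwargs)
```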
A practical default for production reasoning and tool use is a low temperature (0.0–0.2) with a sensible maxTokens ceiling, which keeps outputs deterministic, cheaper, and easier to test. Raise temperature only where variety is the point. Always set maxTokens explicitly — leaving it unbounded is the most common cause of surprise output-token costs and of responses truncating in ways you did not plan for.
resp = brt.converse(
    modelId="anthropic.claude-sonnet",
    system=[{"text": "You are a precise support assistant. Answer only from the provided context; if unsure, say you don't know."}],
    messages=[{"role": "user", "content": [{"text": "What is our refund window?"}]}],
    inferenceConfig={"maxTokens": 400, "temperature": 0.1, "topP": 0.9, "stopSequences": ["\n\nUSER:"]},
)
Tip: hold the system prompt constant across turns and enable prompt caching so you are not billed full input price to re-process it on every request.
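A hedged sketch of what enabling caching can look like: to the best of my knowledge, Converse marks the end of a cacheable prefix with a cachePoint block and reports cacheReadInputTokens and cacheWriteInputTokens in usage on models that support caching. Treat those names as assumptions and confirm them, along with per-model support and minimum prefix sizes, in the current documentation. The helper names are hypothetical.

```python
def cached_system(prompt_text: str) -> list:
    """System blocks with a cache checkpoint placed after the stable prompt text."""
    return [
        {"text": prompt_text},
        {"cachePoint": {"type": "default"}},  # assumed block name; see lead-in
    ]

def cache_counters(usage: dict) -> dict:
    """Read cache token counts from usage, defaulting to 0 when a model omits them."""
    return {
        "read": usage.get("cacheReadInputTokens", 0),    # assumed field name
        "written": usage.get("cacheWriteInputTokens", 0),  # assumed field name
    }

system_blocks = cached_system("You are a precise support assistant.")
# Usage: brt.converse(modelId=..., system=system_blocks, messages=...)
```

Logging the read counter per request is a quick way to confirm the cache is actually being hit rather than rewritten on every call.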
Tool use — also called function calling — is how a model goes from talking to doing: it asks your code to run a function, you run it, you hand back the result, and the model continues with that information in hand. Converse standardizes this loop across every model that supports it.
You declare tools in toolConfig.tools. Each tool is a toolSpec with a name, a natural-language description (the model uses this to decide when to call it — write it well), and an inputSchema expressed as a JSON Schema describing the parameters. You can also set toolConfig.toolChoice to influence whether the model may call a tool (auto), must call some tool (any), or must call one specific tool. The model never executes anything itself — it only emits a request to call a tool.
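The three tool-choice policies map to three small dictionaries, sketched below with a hypothetical builder. Not every model supports the "any" and "tool" policies, so treat those as options to verify per model.

```python
def make_tool_config(tool_specs: list, choice: str = "auto", tool_name: str = None) -> dict:
    """Wrap toolSpec dicts into a toolConfig with the requested tool-choice policy."""
    if choice == "auto":
        tool_choice = {"auto": {}}   # model decides whether to call a tool
    elif choice == "any":
        tool_choice = {"any": {}}    # model must call some tool
    elif choice == "tool":
        if not tool_name:
            raise ValueError("choice='tool' requires tool_name")
        tool_choice = {"tool": {"name": tool_name}}  # model must call this tool
    else:
        raise ValueError(f"unknown choice: {choice}")
    return {"tools": [{"toolSpec": spec} for spec in tool_specs], "toolChoice": tool_choice}

order_spec = {
    "name": "get_order_status",
    "description": "Look up the current status of a customer order by its ID.",
    "inputSchema": {"json": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    }},
}
forced = make_tool_config([order_spec], choice="tool", tool_name="get_order_status")
```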
The loop has four steps. (1) You send the user message plus toolConfig. (2) If the model decides to use a tool, it returns stopReason: "tool_use" and an assistant message whose content includes a toolUse block — { toolUseId, name, input } — telling you which tool and with what arguments. (3) Your code runs the actual function (query a database, call an internal API, hit a pricing service) and sends a new user message containing a toolResult block that echoes the same toolUseId and carries the function's output. (4) The model incorporates the result and produces its final answer (or requests another tool). You append each turn to messages as you go, exactly as in any other multi-turn conversation — the only new content-block types are toolUse (from the model) and toolResult (from you).
This is the same primitive that Amazon Bedrock Agents build on top of: an Agent automates this plan-call-observe loop, including the orchestration and prompt construction, when you would rather configure tools than write the loop yourself. Use raw Converse tool use when you want full control of the loop in your own code; use Agents when you want the loop managed for you.
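MODEL = "anthropic.claude-sonnet"  # illustrative; copy the exact ID from the console
messages = [{"role": "user", "content": [{"text": "Where is my order?"}]}]  # example prompt
tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "get_order_status",
            "description": "Look up the current status of a customer order by its ID.",
            "inputSchema": {"json": {"type": "object", "properties": {"order_id": {"type": "string"}}, "required": ["order_id"]}}
        }
    }]
}
resp = brt.converse(modelId=MODEL, messages=messages, toolConfig=tool_config)
if resp["stopReason"] == "tool_use":
    # Find the toolUse block the model emitted.
    tool_use = next(b["toolUse"] for b in resp["output"]["message"]["content"] if "toolUse" in b)
    # Next: run get_order_status(**tool_use["input"]), append the assistant message and a
    # user message with a toolResult block (same toolUseId), then call converse again.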
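The four-step loop can be sketched end to end as follows. The names run_tool_loop and handlers are hypothetical, handlers are assumed to return JSON-serializable dicts, and ScriptedClient is an offline stand-in that requests one tool call and then answers; in real use you would pass boto3.client("bedrock-runtime").

```python
def run_tool_loop(client, model_id, messages, tool_config, handlers, max_rounds=5):
    """Call Converse, execute requested tools, feed results back, until a final answer."""
    for _ in range(max_rounds):
        resp = client.converse(modelId=model_id, messages=messages, toolConfig=tool_config)
        assistant = resp["output"]["message"]
        messages.append(assistant)
        if resp["stopReason"] != "tool_use":
            return "".join(b.get("text", "") for b in assistant["content"])
        results = []
        for block in assistant["content"]:
            if "toolUse" not in block:
                continue
            call = block["toolUse"]
            try:
                output = handlers[call["name"]](**call["input"])
                result = {"toolUseId": call["toolUseId"], "content": [{"json": output}], "status": "success"}
            except Exception as exc:  # report failures to the model instead of crashing
                result = {"toolUseId": call["toolUseId"], "content": [{"text": str(exc)}], "status": "error"}
            results.append({"toolResult": result})
        messages.append({"role": "user", "content": results})
    raise RuntimeError("tool loop did not finish within max_rounds")

class ScriptedClient:
    """Offline stand-in: first asks for get_order_status, then answers from the result."""
    def __init__(self):
        self.calls = 0
    def converse(self, **kwargs):
        self.calls += 1
        if self.calls == 1:
            use = {"toolUseId": "t1", "name": "get_order_status", "input": {"order_id": "A1"}}
            return {"stopReason": "tool_use",
                    "output": {"message": {"role": "assistant", "content": [{"toolUse": use}]}}}
        status = kwargs["messages"][-1]["content"][0]["toolResult"]["content"][0]["json"]["status"]
        return {"stopReason": "end_turn",
                "output": {"message": {"role": "assistant", "content": [{"text": f"Order is {status}."}]}}}

transcript = [{"role": "user", "content": [{"text": "Where is order A1?"}]}]
answer = run_tool_loop(ScriptedClient(), "illustrative-model-id", transcript, {"tools": []},
                       {"get_order_status": lambda order_id: {"status": "shipped"}})
```

The max_rounds cap is a design choice worth keeping: it bounds cost if a model keeps requesting tools without converging.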
For anything a human watches — a chat UI, a coding assistant, a long answer — you want tokens to appear as they are generated rather than after the whole completion is done. That is what the streaming operations are for, and Converse makes the event shape uniform.
Call ConverseStream (or InvokeModelWithResponseStream for the lower-level API) and instead of a single response you get an event stream you iterate over. The stream is a sequence of typed events: messageStart (the assistant turn begins, with its role); contentBlockStart (a content block opens; for a tool call it carries the toolUseId and tool name); a series of contentBlockDelta events (each carrying a small chunk of text, the part you append to the UI as it arrives); contentBlockStop (the block is complete); messageStop (the turn is complete, with the stopReason); and a final metadata event that carries the usage token counts and latency metrics. Tool use streams too: the model's toolUse arguments arrive incrementally across deltas, which you accumulate before running the tool.
Streaming is purely a transport choice — it does not change pricing (you pay for the same input and output tokens) or the logical request shape (the request is the same as a non-streaming Converse call). It only changes when you receive the output. The trade-off is in your code: streaming improves perceived latency dramatically but means you handle a stream of events and assemble the final message yourself, rather than reading one field off one response.
stream = brt.converse_stream(
    modelId="anthropic.claude-sonnet",
    messages=[{"role": "user", "content": [{"text": "Explain prompt caching in two sentences."}]}],
    inferenceConfig={"maxTokens": 300},
)
for event in stream["stream"]:
    if "contentBlockDelta" in event:
        # Not every delta carries text (tool-use deltas do not), so default to "".
        print(event["contentBlockDelta"]["delta"].get("text", ""), end="")  # stream chunks to the UI
    elif "metadata" in event:
        print(event["metadata"]["usage"])  # final token counts arrive last
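Assembling the final message yourself, including streamed tool arguments, can be sketched as below. It assumes the event shapes described above (contentBlockIndex on block events, tool input arriving as JSON string fragments); assemble_stream is a hypothetical helper, and SAMPLE_EVENTS is hand-written data standing in for stream["stream"].

```python
import json

def assemble_stream(events):
    """Fold ConverseStream events into (message, stopReason, usage)."""
    blocks, role, stop_reason, usage = {}, "assistant", None, {}
    for event in events:
        if "messageStart" in event:
            role = event["messageStart"]["role"]
        elif "contentBlockStart" in event:
            start = event["contentBlockStart"]
            tool = start.get("start", {}).get("toolUse")
            if tool:  # a tool call opens: remember its id and name, collect input later
                blocks[start["contentBlockIndex"]] = {
                    "toolUse": {"toolUseId": tool["toolUseId"], "name": tool["name"], "input": ""}}
        elif "contentBlockDelta" in event:
            item = event["contentBlockDelta"]
            index, delta = item["contentBlockIndex"], item["delta"]
            if "text" in delta:
                blocks.setdefault(index, {"text": ""})["text"] += delta["text"]
            elif "toolUse" in delta:
                blocks[index]["toolUse"]["input"] += delta["toolUse"]["input"]
        elif "messageStop" in event:
            stop_reason = event["messageStop"]["stopReason"]
        elif "metadata" in event:
            usage = event["metadata"].get("usage", {})
    content = []
    for index in sorted(blocks):
        block = blocks[index]
        if "toolUse" in block:  # parse the accumulated JSON fragments once complete
            raw = block["toolUse"]["input"]
            block["toolUse"]["input"] = json.loads(raw) if raw else {}
        content.append(block)
    return {"role": role, "content": content}, stop_reason, usage

SAMPLE_EVENTS = [
    {"messageStart": {"role": "assistant"}},
    {"contentBlockDelta": {"contentBlockIndex": 0, "delta": {"text": "Checking"}}},
    {"contentBlockDelta": {"contentBlockIndex": 0, "delta": {"text": " now."}}},
    {"contentBlockStop": {"contentBlockIndex": 0}},
    {"contentBlockStart": {"contentBlockIndex": 1,
                           "start": {"toolUse": {"toolUseId": "t1", "name": "get_order_status"}}}},
    {"contentBlockDelta": {"contentBlockIndex": 1, "delta": {"toolUse": {"input": '{"order_'}}}},
    {"contentBlockDelta": {"contentBlockIndex": 1, "delta": {"toolUse": {"input": 'id": "A1"}'}}}},
    {"contentBlockStop": {"contentBlockIndex": 1}},
    {"messageStop": {"stopReason": "tool_use"}},
    {"metadata": {"usage": {"inputTokens": 10, "outputTokens": 5, "totalTokens": 15}}},
]
message, stop_reason, usage = assemble_stream(SAMPLE_EVENTS)
```

The assembled message has the same shape as a non-streaming output.message, so it can be appended to the transcript unchanged.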
The difference between a demo and a production integration is almost entirely error handling — and on Bedrock the error you will meet most is throttling. Know the exceptions, retry the right ones correctly, and design for the rate limits up front.
Bedrock rate-limits the runtime per account, per model, and per Region along two axes: requests per minute and tokens per minute. Exceed either and the API returns ThrottlingException with HTTP status 429. This is the normal signal to slow down, not a bug. The correct response is exponential backoff with jitter — wait, then retry with progressively longer, randomized delays so a fleet of clients does not retry in lockstep. The AWS SDKs implement adaptive/standard retry modes that handle a baseline of this automatically; for high-throughput services you typically add your own backoff and a concurrency limiter on top, and you queue or shed load rather than hammering the API.
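A minimal sketch of exponential backoff with full jitter follows. It reads the error code from exc.response["Error"]["Code"], which is how botocore's ClientError exposes it; the function names, defaults, and the FakeClientError and make_flaky helpers (used only to exercise the loop offline) are hypothetical.

```python
import random
import time

def error_code(exc) -> str:
    """Extract the AWS error code from a botocore-style exception, or '' if absent."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code", "")

def call_with_backoff(fn, retry_codes=("ThrottlingException",), max_attempts=6,
                      base=0.5, cap=20.0, sleep=time.sleep, rng=random.random):
    """Call fn(); on a retryable error wait a random delay up to base * 2**attempt, capped."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if error_code(exc) not in retry_codes or attempt == max_attempts - 1:
                raise  # not retryable, or out of attempts
            delay = rng() * min(cap, base * 2 ** attempt)  # full jitter
            sleep(delay)

class FakeClientError(Exception):
    """Offline stand-in with the same .response shape as botocore's ClientError."""
    def __init__(self, code):
        super().__init__(code)
        self.response = {"Error": {"Code": code}}

def make_flaky(failures, code="ThrottlingException"):
    """Return a callable that raises `code` `failures` times, then returns 'ok'."""
    state = {"left": failures}
    def fn():
        if state["left"] > 0:
            state["left"] -= 1
            raise FakeClientError(code)
        return "ok"
    return fn

# Usage: call_with_backoff(lambda: brt.converse(modelId=..., messages=...))
```

Randomizing the whole delay rather than adding a small jitter term spreads a fleet's retries most evenly, at the cost of occasionally retrying very quickly.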
Beyond retrying, there are three structural ways to get more headroom. Request a quota increase for the specific model in Service Quotas when your steady-state demand genuinely exceeds the defaults. Use cross-region inference profiles so a request can be served from one of several Regions in a geography, spreading load and improving availability without you managing the routing (see cross-region inference). Or buy Provisioned Throughput to reserve dedicated capacity with guaranteed throughput for high, steady volume. Which one fits depends on whether your problem is occasional spikes (backoff + cross-region), a higher steady ceiling (quota increase), or guaranteed latency at scale (provisioned throughput).
The other exceptions are ordinary AWS API errors and should be handled distinctly because most are not retryable. AccessDeniedException means the IAM principal lacks the action or the model is not enabled in Model access — fix the policy or enable access, do not retry. ValidationException means a malformed request (bad parameter, oversized payload, wrong content block) — fix the request, do not retry. ResourceNotFoundException usually means a wrong or unavailable modelId for that Region. ModelTimeoutException and ServiceUnavailableException / InternalServerException (5xx) are transient and are safe to retry with backoff. A clean integration branches on the exception type: retry throttling and 5xx with backoff; surface validation and access errors to the developer immediately.
| Exception | HTTP | Meaning | Retry? | Fix |
|---|---|---|---|---|
| ThrottlingException | 429 | Request/token rate exceeded | Yes — backoff + jitter | Backoff; quota increase; cross-region; provisioned throughput |
| ModelTimeoutException | 408/5xx | Model took too long | Yes — backoff | Retry; shorten prompt/maxTokens |
| ServiceUnavailable / Internal | 5xx | Transient server-side issue | Yes — backoff | Retry with backoff |
| ValidationException | 400 | Malformed request / bad params | No | Fix the request body |
| AccessDeniedException | 403 | Missing IAM perms or model not enabled | No | Add IAM action; enable Model access |
| ResourceNotFoundException | 404 | Unknown/unavailable modelId in Region | No | Correct modelId or switch Region |
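Branching on the exception type can be reduced to a lookup, sketched here from the table above. The action labels and function name are hypothetical, and the code is read from the botocore-style exc.response["Error"]["Code"] field.

```python
ERROR_ACTIONS = {
    "ThrottlingException": "retry",
    "ModelTimeoutException": "retry",
    "ServiceUnavailableException": "retry",
    "InternalServerException": "retry",
    "ValidationException": "fix_request",
    "AccessDeniedException": "fix_permissions",
    "ResourceNotFoundException": "fix_model_or_region",
}

def classify_error(exc) -> str:
    """Map an AWS error to an action; unknown codes are surfaced rather than retried."""
    response = getattr(exc, "response", None) or {}
    code = response.get("Error", {}).get("Code", "")
    return ERROR_ACTIONS.get(code, "surface")

class SampleError(Exception):
    """Offline stand-in carrying a botocore-style response dict."""
    def __init__(self, code):
        super().__init__(code)
        self.response = {"Error": {"Code": code}}

action = classify_error(SampleError("AccessDeniedException"))
```

Defaulting unknown codes to "surface" is deliberate: retrying an error you do not recognize tends to hide real bugs behind added latency.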
A single scannable map of the runtime operations so the right call is obvious. "Unified schema" means the same request/response shape across all chat models; "provider-specific" means the body matches each model vendor's native format.
| Operation | Schema | Streaming? | Tool use | Best for | Avoid for |
|---|---|---|---|---|---|
| Converse | Unified | No | Yes (built-in) | Chat, agents, portable text apps | Embeddings, images, video |
| ConverseStream | Unified | Yes | Yes (streamed) | Chat UIs, coding assistants, long answers | Batch/offline jobs |
| InvokeModel | Provider-specific | No | Provider-defined | Embeddings, images, provider-only params | New portable chat apps |
| InvokeModelWithResponseStream | Provider-specific | Yes | Provider-defined | Streaming a provider-specific text body | When Converse would do |
Situation: The team had a working Converse prototype but kept hitting ThrottlingException (429) under real traffic, had no consistent retry/backoff strategy, and were calling a frontier model for every request — including trivial classifications — so the projected inference bill at launch scale was alarming. They also wanted tool use (function calling) wired to their internal APIs without hand-rolling and maintaining the whole plan-call-observe loop, and needed streaming in the chat UI.
What CloudRoute did: Routed within 19 hours to a US-East AWS partner with a Bedrock + production-GenAI track record. The partner hardened the integration: exponential backoff with jitter plus a concurrency limiter around every runtime call, a Service Quotas increase on the workhorse model, and cross-region inference profiles for spike headroom. They added model routing (a small fast model for classification, the frontier model only for hard reasoning), turned on prompt caching for the large stable system prompt, moved tool use onto a clean Converse toolConfig loop, and switched the chat UI to ConverseStream. In parallel they filed a Bedrock/GenAI proof-of-concept credit application and an Activate Portfolio application.
Outcome: GenAI POC credits ($25K) approved in under 2 weeks and Portfolio ($100K) shortly after — the first several months of Bedrock inference were credit-funded. 429s dropped to negligible under load, and per-request cost fell sharply thanks to routing plus prompt caching. The assistant shipped to general availability in 4 weeks with streaming and tool use in production. CloudRoute's commission was paid by the partner from AWS engagement funding; the customer paid $0.
time-to-match: < 24h · credits secured: $125K · 429 rate: negligible post-fix · cost to customer: $0
CloudRoute routes you to a vetted AWS partner who files your Bedrock/GenAI credit application (Activate Portfolio up to $100K, GenAI POC $10K–$50K, GenAI Accelerator up to $1M) and, if you need hands, builds the production integration — Converse, tool use, streaming, retries, cost controls — with you. AWS funds the credits and the engagement. You pay $0.