The two features that separate a demo from a shipped product: ConverseStream for token-by-token output that feels instant, and tool use (function calling) that lets the model query your APIs and act on the result. This is the full reference — how the streaming event format works and how to parse it, how to define tools and run the toolUse/toolResult loop, multi-tool selection and forced tool choice, how to stream and call tools in the same turn, and how to handle throttling, partial responses, and timeouts without breaking the UX.
Streaming and tool use are the two capabilities that turn a Bedrock chat call into something a user will actually tolerate and trust: streaming makes it feel fast, and tool use makes it able to do things instead of just talk. On Bedrock, both are delivered through the same modern interface — the Converse API — which is why they belong on one page.
The Converse API is Bedrock's unified way to call any chat model. Instead of the older InvokeModel path, where the request and response JSON are shaped differently for every provider, Converse gives you one request and response schema across every model — Anthropic Claude, Meta Llama, Mistral, Amazon Nova, Cohere, and the rest. The relevance here is direct: streaming and tool use are features of the Converse schema, not of any single provider. Write the streaming-parse loop once and it works whether the modelId points at Claude Sonnet or Nova Pro; define a tool once in the Converse toolConfig format and any tool-capable model on the platform can call it. Switching models is usually a one-line change.
There are two entry points. Converse is the unary call: you send messages, you get one complete response back. ConverseStream is the streaming call: you send the same request but receive the response as a sequence of small events over a persistent connection, so you can render the answer as it is generated rather than waiting for the whole thing. Tool use works identically in both — the difference is only whether the model's output (including any tool request) arrives all at once or incrementally.
Why this matters for what you build: a non-streaming, no-tools assistant can only answer from the model's parametric knowledge and makes the user stare at a spinner for the full generation time. Add streaming and the perceived latency collapses — the user reads the first sentence while the rest is still being written. Add tool use and the assistant can look up a live order status, run a database query, call your pricing service, or trigger an action, then ground its answer in the real result. Combine them and you get the now-standard experience: a fast, streaming assistant that can also reach into your systems mid-conversation.
The rest of this guide treats each in turn — streaming first (the event format and how to parse it), then tool use (defining tools and running the request/result loop), then the two combined, and finally the error-handling and latency details that decide whether the thing survives real traffic. For the broader platform context, see the full Amazon Bedrock guide and the Bedrock API reference.
On Bedrock, streaming (ConverseStream) and tool use / function calling are both features of the unified Converse API — so one streaming-parse loop and one tool definition work across every chat model on the platform, and switching models is a one-line modelId change rather than a rewrite.
ConverseStream returns the response not as one JSON object but as an ordered stream of typed events. To render token-by-token you only have to recognise a handful of event types and pull the text out of the right one. Once you have seen the shape, the parse loop is short and identical across models.
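You call converse_stream (boto3) / ConverseStream with exactly the same arguments you would pass to converse: modelId, messages, inferenceConfig, and optionally system and toolConfig. What comes back is a stream you iterate. Each item is a small dictionary keyed by event type. The sequence for a normal text answer is predictable: messageStart (carrying the role), a run of contentBlockDelta events (each carrying a fragment of the answer in delta.text), contentBlockStop, messageStop (carrying the stopReason), and finally metadata (token usage and latency metrics).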
The whole pattern is: iterate the stream, and on each contentBlockDelta that carries delta.text, emit that text. Everything else is bookkeeping.
import boto3

brt = boto3.client("bedrock-runtime", region_name="us-east-1")

resp = brt.converse_stream(
    # placeholder: use a full model ID or inference profile ID; swap to switch models
    modelId="anthropic.claude-sonnet",
    messages=[{"role": "user", "content": [{"text": "Explain prompt caching in 4 sentences."}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)

for event in resp["stream"]:
    if "contentBlockDelta" in event:
        # text deltas carry "text"; tool-input deltas carry "toolUse" instead
        text = event["contentBlockDelta"]["delta"].get("text", "")
        print(text, end="", flush=True)  # render as it arrives
    elif "messageStop" in event:
        stop = event["messageStop"]["stopReason"]
    elif "metadata" in event:
        usage = event["metadata"]["usage"]  # log inputTokens/outputTokens
Streaming uses a distinct permission. ConverseStream is authorised by bedrock:InvokeModelWithResponseStream, the streaming sibling of bedrock:InvokeModel (which is the action that authorises the unary Converse call); without it the stream call is denied. Over the wire the response is delivered as an event stream on a long-lived HTTP connection. The SDK handles the framing, so on the server you simply iterate, as above.
To get tokens to a web client, re-emit each delta over a streaming transport your frontend understands. The common choices are Server-Sent Events (SSE) or a WebSocket from your backend: your server iterates the Bedrock stream and forwards each delta.text to the browser as it arrives, so the round trip stays incremental end to end. Avoid buffering the whole answer on the server and sending it at once — that throws away the entire latency benefit of streaming.
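As a rough sketch of that forwarding step, the generator below turns Bedrock stream events into SSE frames. The function name `sse_frames` and the JSON payload shape (`{"text": ...}` and `{"done": ...}`) are assumptions made for illustration, not part of any Bedrock API, and the hand-written events stand in for `resp["stream"]`.

```python
import json
from typing import Iterable, Iterator


def sse_frames(stream: Iterable[dict]) -> Iterator[str]:
    """Yield one SSE frame per text delta, then a final 'done' frame."""
    for event in stream:
        if "contentBlockDelta" in event:
            text = event["contentBlockDelta"]["delta"].get("text")
            if text:  # tool-input deltas carry no "text" and are skipped here
                payload = json.dumps({"text": text})
                yield f"data: {payload}\n\n"
        elif "messageStop" in event:
            payload = json.dumps({"done": event["messageStop"]["stopReason"]})
            yield f"data: {payload}\n\n"


# Hand-written events standing in for resp["stream"] from converse_stream.
fake_stream = [
    {"messageStart": {"role": "assistant"}},
    {"contentBlockDelta": {"contentBlockIndex": 0, "delta": {"text": "Hello"}}},
    {"contentBlockDelta": {"contentBlockIndex": 0, "delta": {"text": " world"}}},
    {"contentBlockStop": {"contentBlockIndex": 0}},
    {"messageStop": {"stopReason": "end_turn"}},
]
frames = list(sse_frames(fake_stream))
```

Because this is a generator, each frame can be flushed to the client as soon as it is produced. How you attach it to a response (for example with a `text/event-stream` content type) depends on your web framework.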
Streaming does not make generation faster; total tokens and total cost are unchanged. What it changes is perceived latency: the user sees the first words after time-to-first-token (typically a fraction of a second) instead of after the full generation completes (which for a long answer can be many seconds). For any interactive chat or assistant UI, ConverseStream should be the default.
Tool use — also called function calling — is how a Bedrock model goes from describing an action to requesting one. You declare the tools the model is allowed to call; when it decides a tool would help, it returns a structured request instead of a final answer; your code executes the tool and hands the result back; the model then answers using that result. It is a loop, and getting the loop right is the whole skill.
The mental model is important: the model never runs your code. It only emits a structured request to call a named tool with specific arguments. Your application is what actually calls the database, hits the API, or runs the function — then returns the output to the model. This keeps execution, credentials, and side effects entirely on your side; the model just decides what to call and reads what came back.
You pass a toolConfig alongside your messages. Each tool has a name, a natural-language description (this is how the model decides when to use it — write it carefully), and an inputSchema expressed as JSON Schema describing the arguments. The JSON Schema both tells the model what arguments to produce and lets you validate what it returns.
toolConfig = {
    "tools": [{
        "toolSpec": {
            "name": "get_order_status",
            "description": "Look up the current status of a customer order by its ID.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {"order_id": {"type": "string", "description": "The order ID, e.g. A-10293."}},
                "required": ["order_id"]
            }}
        }
    }]
}
When the model decides to call a tool, the Converse response comes back with stopReason = "tool_use", and the assistant message contains a toolUse content block: a toolUseId (a correlation handle you must echo back), the tool name, and an input object matching your schema. Critically, you must append the assistant's message (with the toolUse block) to your messages list before you reply — the conversation has to record that the model asked, or the follow-up turn will be malformed.
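A minimal sketch of that step, assuming the unary Converse response shape (`stopReason` at the top level, the assistant turn under `output.message`). The helper name `extract_tool_uses` is hypothetical, and the hand-written response stands in for a real `brt.converse(...)` result.

```python
def extract_tool_uses(response: dict, messages: list) -> list:
    """Record the assistant turn in history and return its toolUse blocks."""
    assistant_message = response["output"]["message"]
    messages.append(assistant_message)  # history must show that the model asked
    if response["stopReason"] != "tool_use":
        return []
    return [block["toolUse"] for block in assistant_message["content"] if "toolUse" in block]


# Hand-written response in the Converse shape (illustrative values only).
fake_response = {
    "stopReason": "tool_use",
    "output": {"message": {"role": "assistant", "content": [
        {"text": "Let me check that order."},
        {"toolUse": {"toolUseId": "tooluse_123", "name": "get_order_status",
                     "input": {"order_id": "A-10293"}}},
    ]}},
}
messages = [{"role": "user", "content": [{"text": "Where is order A-10293?"}]}]
calls = extract_tool_uses(fake_response, messages)
```

Appending the whole assistant message, rather than only the toolUse block, keeps any text the model produced before the request in the history as well.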
Now your code runs the actual function (calls the orders service, queries the database, whatever the tool represents) and sends the output back as a new user-role message containing a toolResult block. The toolResult block must carry the same toolUseId so the model knows which request it answers, plus the content (the data) and a status of success or error. Then you call Converse again with the updated messages.
# after parsing the toolUse block (name, tool_use_id, tool_input):
result = get_order_status(**tool_input)  # YOUR code runs the tool
messages.append({"role": "user", "content": [{
    "toolResult": {
        "toolUseId": tool_use_id,
        "content": [{"json": result}],
        "status": "success"
    }
}]})
final = brt.converse(modelId=MODEL, messages=messages, toolConfig=toolConfig)
On the follow-up call the model reads the toolResult and usually returns a normal text answer grounded in your data (stopReason = "end_turn"). But it may decide it needs another tool first, in which case you get another tool_use stop and repeat the loop. Production code therefore wraps the request, the tool execution, and the toolResult reply in a bounded while loop that keeps going while stopReason == "tool_use", with a hard cap on iterations so a model that loops cannot run forever.
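One way the bounded loop might look, as a sketch rather than a definitive implementation. The names `run_tool_loop`, `handlers`, and `FakeClient` are hypothetical; `FakeClient` imitates only the `converse` method of the bedrock-runtime client so the loop can be exercised without AWS.

```python
def run_tool_loop(client, model_id, messages, tool_config, handlers, max_iterations=5):
    """Call Converse until the model stops asking for tools or the cap is hit."""
    for _ in range(max_iterations):
        response = client.converse(modelId=model_id, messages=messages, toolConfig=tool_config)
        assistant_message = response["output"]["message"]
        messages.append(assistant_message)  # record what the model said or asked
        if response["stopReason"] != "tool_use":
            return response
        results = []
        for block in assistant_message["content"]:
            if "toolUse" not in block:
                continue
            call = block["toolUse"]
            output = handlers[call["name"]](**call["input"])  # your code runs the tool
            results.append({"toolResult": {
                "toolUseId": call["toolUseId"],  # echo the correlation handle
                "content": [{"json": output}],
                "status": "success",
            }})
        messages.append({"role": "user", "content": results})
    raise RuntimeError("tool loop exceeded max_iterations")


class FakeClient:
    """Stand-in for the real client: one tool request, then a text answer."""

    def __init__(self):
        self.calls = 0

    def converse(self, **kwargs):
        self.calls += 1
        if self.calls == 1:
            content = [{"toolUse": {"toolUseId": "t1", "name": "get_order_status",
                                    "input": {"order_id": "A-10293"}}}]
            return {"stopReason": "tool_use",
                    "output": {"message": {"role": "assistant", "content": content}}}
        return {"stopReason": "end_turn",
                "output": {"message": {"role": "assistant", "content": [{"text": "It shipped."}]}}}


history = [{"role": "user", "content": [{"text": "Where is order A-10293?"}]}]
handlers = {"get_order_status": lambda order_id: {"order_id": order_id, "status": "shipped"}}
# The empty tools list is ignored by the fake; a real call needs real toolSpec entries.
final = run_tool_loop(FakeClient(), "example-model-id", history, {"tools": []}, handlers)
```

Raising when the cap is reached is one policy among several; returning a fallback message to the user would be another reasonable choice.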
(1) Always append the assistant's toolUse message to history before replying. (2) Echo the exact toolUseId back in your toolResult. (3) Loop while stopReason == "tool_use", bounded by a max-iterations cap. Break any of these and the conversation desynchronises or the call errors.
Real assistants expose more than one tool, sometimes need several at once, and occasionally must be forced to use a specific tool. The Converse toolConfig handles all three — the model picks from the set, can request multiple tools in a single turn, and obeys a toolChoice directive when you set one.
Multiple tools. Put several toolSpec entries in the tools array and the model chooses which (if any) fits the user's request based on the descriptions you wrote. This is the most common setup — a support assistant might expose get_order_status, issue_refund, and search_knowledge_base, and the model routes to the right one per message. The quality of routing is largely a function of how clearly each tool's description is written; vague descriptions cause wrong-tool selection.
Parallel / multiple calls in one turn. A model can return more than one toolUse block in a single response when a request needs several independent lookups (for example, the status of three different orders). Your handler should therefore iterate all toolUse blocks in the message, execute them (you can run them concurrently since they are independent), and return one toolResult block per request — each with its matching toolUseId — in the next message. Returning a result for only one of several requests will desynchronise the turn.
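A sketch of a handler for several toolUse blocks in one assistant message, assuming the tools are independent and safe to run on threads. `run_all_tool_uses` and the `handlers` mapping are hypothetical names used for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_all_tool_uses(assistant_message: dict, handlers: dict) -> dict:
    """Run every toolUse block concurrently; return one user message with all results."""
    calls = [b["toolUse"] for b in assistant_message["content"] if "toolUse" in b]

    def run_one(call: dict) -> dict:
        output = handlers[call["name"]](**call["input"])
        return {"toolResult": {"toolUseId": call["toolUseId"],
                               "content": [{"json": output}],
                               "status": "success"}}

    with ThreadPoolExecutor(max_workers=max(1, len(calls))) as pool:
        results = list(pool.map(run_one, calls))  # map preserves request order
    return {"role": "user", "content": results}


# Hand-written assistant message requesting three lookups (illustrative values).
assistant_message = {"role": "assistant", "content": [
    {"toolUse": {"toolUseId": f"t{i}", "name": "get_order_status",
                 "input": {"order_id": order_id}}}
    for i, order_id in enumerate(["A-1", "A-2", "A-3"])
]}
handlers = {"get_order_status": lambda order_id: {"order_id": order_id, "status": "shipped"}}
reply = run_all_tool_uses(assistant_message, handlers)
```

All three results travel back in a single user message, one toolResult per toolUseId, which is what keeps the turn synchronised.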
Forcing a tool choice. By default toolChoice is auto — the model decides whether to use a tool at all. You can override it: {"any": {}} forces the model to call some tool (it may not answer in plain text), and {"tool": {"name": "get_order_status"}} forces a specific tool. Forced choice is how you guarantee structured output — e.g. always extract fields into a schema — rather than hoping the model decides to. Note that not every model supports every toolChoice mode, and forcing a tool disables plain-text answers for that call; check the model's capabilities in the AWS docs.
| toolChoice | Behaviour | Plain-text answer allowed? | Use it when |
|---|---|---|---|
| auto (default) | Model decides: use a tool, or just answer | Yes | General assistants — let the model route naturally |
| any ({"any": {}}) | Model must call some tool from the set | No (must pick a tool) | You always want an action/structured call, not prose |
| tool ({"tool": {"name": ...}}) | Model must call the named tool | No (that tool only) | Guaranteed structured extraction into one schema |
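The three modes in the table can be sketched as a small helper that attaches a toolChoice directive to an existing toolConfig. The helper name `with_tool_choice` and its `mode` argument are assumptions for illustration; only the resulting dictionary shapes come from the Converse schema described above.

```python
import copy


def with_tool_choice(tool_config, mode="auto", name=None):
    """Return a copy of tool_config with a toolChoice directive attached."""
    if mode == "tool":
        if not name:
            raise ValueError("mode='tool' needs a tool name")
        choice = {"tool": {"name": name}}
    elif mode in ("auto", "any"):
        choice = {mode: {}}
    else:
        raise ValueError(f"unknown toolChoice mode: {mode}")
    config = copy.deepcopy(tool_config)  # leave the caller's config untouched
    config["toolChoice"] = choice
    return config


base = {"tools": [{"toolSpec": {
    "name": "get_order_status",
    "description": "Look up the current status of a customer order by its ID.",
    "inputSchema": {"json": {"type": "object",
                             "properties": {"order_id": {"type": "string"}},
                             "required": ["order_id"]}},
}}]}
forced = with_tool_choice(base, mode="tool", name="get_order_status")
any_tool = with_tool_choice(base, mode="any")
```

Pass the returned dictionary as `toolConfig`. Since support for each mode varies by model, it is worth confirming the mode against the model's documented capabilities before relying on it.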
The experience users now expect — a fast assistant that also reaches into live systems — requires streaming and tool use working together in a single turn. ConverseStream supports this directly: the stream surfaces a tool request mid-flight, you pause to execute it, then resume streaming the model's grounded answer.
The flow is the streaming event loop from Section II with a tool branch added. While iterating the stream, watch the events: a contentBlockStart may announce a toolUse block (carrying the tool name and toolUseId), and the subsequent contentBlockDelta events carry fragments of the tool's input JSON in delta.toolUse.input — you concatenate those fragments and parse the assembled string into the arguments object once the block stops. When the stream ends with stopReason == "tool_use", you have a complete tool request.
At that point you do exactly what the non-streaming loop does: append the assistant message, execute the tool, append a toolResult message echoing the toolUseId, and call converse_stream again. The second stream is the model's final answer, now grounded in your tool's output, and you render its deltas token-by-token. So a single user turn can produce: a short streamed preamble, a tool call, a pause while you fetch data, then a streamed final answer. To the user it reads as one smooth response with a brief "looking that up…" beat in the middle — which is exactly the moment to show a tool-call status indicator (see the latency section).
The key parsing nuance unique to streaming-plus-tools is that tool arguments arrive in pieces. With unary Converse you get the whole input object at once; with ConverseStream you must accumulate delta.toolUse.input string fragments and JSON-parse them only after contentBlockStop. Attempting to parse a partial fragment will fail — buffer first, parse once.
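The buffering rule can be sketched as a single pass over the events. This is a minimal illustration, assuming the event shapes described above; `parse_stream` is a hypothetical name, and the hand-written events stand in for `resp["stream"]`.

```python
import json


def parse_stream(stream):
    """Collect text, completed tool requests, and the stop reason from a stream."""
    text_parts, tool_uses, stop_reason = [], [], None
    open_tools = {}  # contentBlockIndex -> tool request still being assembled
    for event in stream:
        if "contentBlockStart" in event:
            start = event["contentBlockStart"]
            if "toolUse" in start["start"]:
                open_tools[start["contentBlockIndex"]] = {**start["start"]["toolUse"], "fragments": []}
        elif "contentBlockDelta" in event:
            block = event["contentBlockDelta"]
            delta = block["delta"]
            if "text" in delta:
                text_parts.append(delta["text"])
            elif "toolUse" in delta:
                open_tools[block["contentBlockIndex"]]["fragments"].append(delta["toolUse"]["input"])
        elif "contentBlockStop" in event:
            tool = open_tools.pop(event["contentBlockStop"]["contentBlockIndex"], None)
            if tool is not None:
                raw = "".join(tool.pop("fragments"))
                tool["input"] = json.loads(raw) if raw else {}  # parse once, after the block ends
                tool_uses.append(tool)
        elif "messageStop" in event:
            stop_reason = event["messageStop"]["stopReason"]
    return "".join(text_parts), tool_uses, stop_reason


events = [
    {"messageStart": {"role": "assistant"}},
    {"contentBlockDelta": {"contentBlockIndex": 0, "delta": {"text": "Checking. "}}},
    {"contentBlockStop": {"contentBlockIndex": 0}},
    {"contentBlockStart": {"contentBlockIndex": 1,
                           "start": {"toolUse": {"toolUseId": "t1", "name": "get_order_status"}}}},
    {"contentBlockDelta": {"contentBlockIndex": 1, "delta": {"toolUse": {"input": '{"order_'}}}},
    {"contentBlockDelta": {"contentBlockIndex": 1, "delta": {"toolUse": {"input": 'id": "A-10293"}'}}}},
    {"contentBlockStop": {"contentBlockIndex": 1}},
    {"messageStop": {"stopReason": "tool_use"}},
]
text, tools, stop = parse_stream(events)
```

Each entry in `tools` ends up with `toolUseId`, `name`, and `input`, which is the shape of a toolUse content block, so it can be placed directly into the assistant message you append to history.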
1) Stream the first turn. 2) On a toolUse block, accumulate the input-JSON fragments and parse after contentBlockStop. 3) On stopReason == tool_use, append the assistant message, run the tool, append a toolResult with the matching toolUseId. 4) Call converse_stream again and stream the grounded final answer. 5) Cap the loop. This is the standard production shape for an interactive Bedrock assistant.
Demos ignore failure; production cannot. Two surfaces fail independently — the Bedrock call itself (throttling, timeouts, a dropped stream) and your tool execution (a downstream API is slow or errors). Handling both cleanly is most of what separates a robust assistant from a flaky one.
Throttling and retries. The most common Bedrock error under load is ThrottlingException: you have exceeded the account/model requests-per-minute or tokens-per-minute quota for that Region. Treat it as expected: retry with exponential backoff and jitter, and cap the attempts. The AWS SDKs include configurable retry modes (including adaptive retries), but you should still design for throttling rather than assume it away: request a quota increase for steady load, and consider Provisioned Throughput or cross-region inference if you regularly hit limits. Other retryable conditions include ModelTimeoutException and transient ServiceUnavailableException; non-retryable ones (ValidationException, AccessDeniedException) mean fix the request, not retry it.
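A hand-rolled version of that retry policy, as a sketch. It assumes the exception exposes a botocore-style `.response["Error"]["Code"]`, which is how boto3's `ClientError` reports the error code; `call_with_backoff`, `error_code`, and `FakeThrottle` are hypothetical names, and the delay values are arbitrary starting points.

```python
import random
import time

RETRYABLE = {"ThrottlingException", "ModelTimeoutException", "ServiceUnavailableException"}


def error_code(exc):
    """Read the AWS error code from a botocore-style exception, if present."""
    return getattr(exc, "response", {}).get("Error", {}).get("Code")


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(); retry retryable errors with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if error_code(exc) not in RETRYABLE or attempt == max_attempts - 1:
                raise  # non-retryable, or out of attempts
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # jitter spreads out concurrent retries


class FakeThrottle(Exception):
    """Mimics the .response attribute of a throttled botocore ClientError."""
    response = {"Error": {"Code": "ThrottlingException"}}


attempts = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise FakeThrottle()
    return "ok"


result = call_with_backoff(flaky, sleep=lambda seconds: None)  # no real sleeping in the demo
```

In real use `fn` would be something like `lambda: brt.converse(...)`. Errors without a retryable code, such as a validation failure, are re-raised on the first attempt.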
Dropped or partial streams. A long-lived stream can break mid-flight (network blip, timeout). Because you have been rendering deltas, you may have a partial answer on screen. Decide the policy up front: either retry the whole turn (simplest, but the user sees a restart) or mark the message as incomplete and offer a "continue" affordance. Always wrap the stream iteration in try/except so a mid-stream error is caught rather than crashing the request handler, and log the last received event for diagnosis.
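A sketch of the "keep the partial text and flag it" policy. `consume_stream` and the keys of the dictionary it returns are assumptions for illustration, and the broad `except Exception` should be narrowed to the stream and network errors your SDK actually raises.

```python
def consume_stream(stream, on_text):
    """Render text deltas; on a mid-stream failure keep what arrived and flag it."""
    parts, last_event, error, stop_reason = [], None, None, None
    try:
        for event in stream:
            last_event = event  # kept for diagnosis if the stream breaks
            if "contentBlockDelta" in event:
                text = event["contentBlockDelta"]["delta"].get("text", "")
                if text:
                    parts.append(text)
                    on_text(text)
            elif "messageStop" in event:
                stop_reason = event["messageStop"]["stopReason"]
    except Exception as exc:  # assumption: narrow this in real code
        error = exc
    return {
        "text": "".join(parts),
        "complete": error is None and stop_reason is not None,
        "stop_reason": stop_reason,
        "last_event": last_event,
        "error": error,
    }


def broken_stream():
    """Simulates a connection that drops after two deltas."""
    yield {"contentBlockDelta": {"contentBlockIndex": 0, "delta": {"text": "Partial "}}}
    yield {"contentBlockDelta": {"contentBlockIndex": 0, "delta": {"text": "ans"}}}
    raise ConnectionError("stream dropped")


shown = []
outcome = consume_stream(broken_stream(), shown.append)
```

When `complete` is false the caller can either retry the turn or mark the message as incomplete and offer a "continue" action, using `text` as what the user has already seen.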
Tool execution failures. Your tool is calling something that can fail or hang. Put a timeout on every tool call so a slow downstream cannot freeze the whole turn, and when a tool fails, do not crash — return a toolResult with status: "error" and a short message describing the failure. The model will read that and can apologise, retry differently, or fall back gracefully, which is far better UX than a 500. Make tools idempotent where possible (especially write actions like refunds) so a retried call cannot double-execute, and validate the model's tool arguments against your JSON Schema before you act on them — never trust the arguments blindly for anything with side effects.
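The sketch below combines the timeout, the error toolResult, and a pre-flight argument check. It is deliberately minimal: `missing_required` only checks required keys and is a stand-in for full JSON Schema validation, and `run_tool_safely` is a hypothetical name. Python cannot kill a thread, so after a timeout the worker keeps running in the background; the tool itself should also set its own network timeouts.

```python
import concurrent.futures
import time


def missing_required(schema, args):
    """Tiny stand-in for real JSON Schema validation: list absent required keys."""
    return [key for key in schema.get("required", []) if key not in args]


def run_tool_safely(handler, call, schema, timeout_s=5.0):
    """Run one tool with validation and a timeout; always return a toolResult block."""
    def result(status, payload):
        return {"toolResult": {"toolUseId": call["toolUseId"],
                               "content": [{"json": payload}], "status": status}}

    missing = missing_required(schema, call["input"])
    if missing:
        return result("error", {"error": f"missing arguments: {missing}"})
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(handler, **call["input"])
        return result("success", future.result(timeout=timeout_s))
    except concurrent.futures.TimeoutError:
        return result("error", {"error": f"tool timed out after {timeout_s}s"})
    except Exception as exc:
        return result("error", {"error": str(exc)})
    finally:
        pool.shutdown(wait=False)  # do not block the turn on a hung worker thread


schema = {"type": "object", "required": ["order_id"]}
ok = run_tool_safely(lambda order_id: {"status": "shipped"},
                     {"toolUseId": "t1", "input": {"order_id": "A-1"}}, schema)
slow = run_tool_safely(lambda order_id: time.sleep(0.3),
                       {"toolUseId": "t2", "input": {"order_id": "A-1"}}, schema, timeout_s=0.05)
bad = run_tool_safely(lambda order_id: {}, {"toolUseId": "t3", "input": {}}, schema)
```

Every path returns a toolResult carrying the original toolUseId, so the conversation stays well-formed whether the tool succeeded, failed, or was never run.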
Guardrails and limits. Layer Bedrock Guardrails over the call to filter unsafe content and block prompt-injection-style attempts to misuse tools, and set a sane maxTokens so a runaway generation is bounded. Together these make the assistant safe to point at production systems.
| Failure | Where it happens | Handle it by |
|---|---|---|
| ThrottlingException | Bedrock call under load | Exponential backoff + jitter, capped retries; quota increase / provisioned throughput |
| Dropped / partial stream | Mid-stream network or timeout | try/except around iteration; retry turn or offer "continue"; keep partial text |
| ModelTimeoutException | Long generation | Retry with backoff; lower maxTokens; consider a faster model |
| Tool downstream slow/down | Your tool execution | Per-tool timeout; return toolResult status:"error"; let the model recover |
| Bad tool arguments | Model-generated input | Validate against JSON Schema before acting; idempotent write tools |
| ValidationException / AccessDeniedException | Malformed request / IAM | Do NOT retry: fix the request shape or the IAM policy (bedrock:InvokeModel + bedrock:InvokeModelWithResponseStream) |
Streaming and tool use are as much UX features as engineering ones. The same backend can feel instant or sluggish depending on how you surface progress. A few patterns reliably make an interactive Bedrock assistant feel responsive even when a tool call adds a real pause.
Render the first token immediately. The entire point of streaming is time-to-first-token: get the first words on screen the instant they arrive and the assistant feels alive, regardless of total length. Do not gate rendering on the full response, and do not buffer on the server. If you can, start a typing indicator the moment the request is sent and replace it with real text on the first contentBlockDelta.
Make the tool-call pause legible. When the stream stops on a tool_use and you go off to execute, the user is momentarily waiting with no new tokens. Fill that gap with an explicit status — "Checking your order status…", "Searching the knowledge base…" — derived from the tool name. A visible, named action reads as competence; a frozen cursor reads as a hang. Then resume streaming the grounded answer.
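Deriving the status line from the tool name can be as simple as a lookup with a fallback. The mapping and the `status_for` helper below are illustrative assumptions; the labels would normally live alongside your tool definitions.

```python
TOOL_STATUS = {
    "get_order_status": "Checking your order status…",
    "search_knowledge_base": "Searching the knowledge base…",
}


def status_for(tool_name: str) -> str:
    """Map a tool name to a user-facing status line, with a generic fallback."""
    if tool_name in TOOL_STATUS:
        return TOOL_STATUS[tool_name]
    return f"Running {tool_name.replace('_', ' ')}…"


known = status_for("get_order_status")
fallback = status_for("issue_refund")
```

In a streaming handler, the natural moment to send this to the client is when a contentBlockStart event announces a toolUse block, since the tool name is already known at that point.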
Choose the model for the latency budget. First-token and per-token speed differ by model: small, fast models (Nova Lite/Micro, Claude Haiku, the smaller Mistral models) stream noticeably quicker than frontier models. A strong pattern is to route the easy, latency-sensitive turns to a fast model and escalate only hard reasoning to a frontier model, all through the same Converse code, since only the modelId changes. See Amazon Nova and Claude on Bedrock for the per-model trade-offs.
Mind cost while you optimise feel. Streaming is billed identically to non-streaming, and a tool-use turn means two (or more) model calls — the initial request plus the post-toolResult answer — each billed for its tokens. The system prompt and tool schema are resent on every call in the loop, which is exactly the repeated context that prompt caching is designed to discount. For multi-tool agents that loop several times per user turn, caching the stable system prompt and tool definitions is the highest-leverage cost move; the broader cost mechanics live in the Bedrock pricing coverage.
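As a sketch of how the stable prefix might be marked for caching, the helper below appends a `cachePoint` block after the system prompt and after the tool definitions. This assumes the chosen model supports prompt caching for both system and tool content, which varies by model, so verify it in the AWS docs; `with_cache_points` is a hypothetical name.

```python
import copy

CACHE_POINT = {"cachePoint": {"type": "default"}}


def with_cache_points(system_text, tool_config):
    """Build system and toolConfig values that each end in a cache checkpoint."""
    system = [{"text": system_text}, copy.deepcopy(CACHE_POINT)]
    cached_tools = copy.deepcopy(tool_config)  # leave the caller's config untouched
    cached_tools["tools"].append(copy.deepcopy(CACHE_POINT))
    return system, cached_tools


tool_config = {"tools": [{"toolSpec": {
    "name": "get_order_status",
    "description": "Look up the current status of a customer order by its ID.",
    "inputSchema": {"json": {"type": "object",
                             "properties": {"order_id": {"type": "string"}},
                             "required": ["order_id"]}},
}}]}
system, cached_tools = with_cache_points("You are a support assistant.", tool_config)
```

The two values would be passed as `system=` and `toolConfig=` on every call in the loop. The usage block of each response is where you would confirm that cached tokens are actually being read rather than rewritten.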
(1) Render the first delta instantly — never buffer. (2) Show a named status during the tool-call pause. (3) Route latency-sensitive turns to a fast model, escalate only when needed. (4) Cache the repeated system prompt + tool schema so the multi-call loop stays cheap. Get these four right and a tool-using, streaming assistant feels fast even when it is doing real work.
Everything above is the do-it-yourself approach: you own the tool loop, the streaming parse, the error handling, the history. Bedrock also offers a managed alternative — Bedrock Agents — that wraps the same toolUse/toolResult mechanic in a higher-level service. Knowing when to use which saves a lot of rework.
Hand-rolled Converse tool use (this guide) gives you maximum control: you decide exactly how the loop runs, how tools execute, how errors recover, how the UI streams, and how state is stored. It is the right choice when you want a tight, custom integration, when tools are simple application functions, when you need bespoke streaming UX, or when you want zero additional managed-service surface. The cost is that you write and maintain the orchestration.
Bedrock Agents takes over the orchestration: you register action groups (your APIs/Lambda functions) and optionally a Knowledge Base, and the Agent plans the steps, constructs the tool calls, invokes your functions, and assembles the answer — the toolUse/toolResult loop is managed for you. Reach for Agents when the workflow is genuinely multi-step, when you want managed planning and built-in retrieval, or when you would rather configure than build. The trade-off is less low-level control over the exact loop and streaming. The full treatment is in Amazon Bedrock Agents.
A useful rule: start with raw Converse tool use for a focused assistant calling a handful of functions with custom UX; graduate to Agents when the orchestration itself becomes the hard part — many tools, multi-step plans, managed RAG, and you no longer want to own the loop. Both sit on the same Bedrock foundation, so the underlying model choice, security model, and cost levers carry across.
The two axes that define a Bedrock chat call are whether the response streams and whether tools are in play. The table below compares the unary and streaming calls on both axes so you can pick the right call shape for each surface of your app. Behaviour and schema are identical across models; only the modelId changes.
| Capability | Converse (unary) | ConverseStream (streaming) | Notes |
|---|---|---|---|
| Response delivery | One complete object | Stream of typed events | Stream renders token-by-token; total tokens/cost identical |
| Best for | Background / batch / single-shot | Interactive chat & assistant UIs | Default to streaming for anything a user watches |
| IAM action | bedrock:InvokeModel | bedrock:InvokeModelWithResponseStream | Streaming needs the separate stream permission |
| Tool use supported? | Yes (toolUse/toolResult) | Yes (toolUse arrives as event blocks) | Same loop; streamed tool input arrives in fragments |
| Tool input arrival | Whole input object at once | delta.toolUse.input fragments to concatenate | Buffer fragments, JSON-parse after contentBlockStop |
| toolChoice control | auto / any / specific tool | auto / any / specific tool | Model-dependent; forcing a tool suppresses free text |
| Typical use | Extraction, enrichment, cron jobs | Support bots, copilots, chat agents | Combine: stream + tool call + stream final answer |
Situation: The team had a working Converse prototype but it felt like a demo, not a product: answers appeared all at once after a long pause, and the assistant could only describe what to do; it could not actually look up a shipment's live status or trigger a re-route. They needed real streaming UX plus tool use against their internal APIs, with the whole thing resilient to throttling and to their own services occasionally timing out. No one on the team had built the streamed-tool-call loop before, and the inference bill for an always-on copilot was an open question.
What CloudRoute did: Routed within 19 hours to a US AWS partner with a GenAI-application track record. The partner rebuilt the assistant on ConverseStream: token-by-token streaming over SSE, tool use against three internal tools (get_shipment_status, reschedule_pickup, search_help_center) via the toolUse/toolResult loop, a named status indicator during each tool-call pause, per-tool timeouts with status:"error" fallbacks, and exponential-backoff retries on ThrottlingException. Cost was controlled with model routing — Nova Lite for classification and easy turns, Claude Sonnet only for hard answers — plus prompt caching on the system prompt and tool schema resent every loop. In parallel the partner filed a Bedrock/GenAI proof-of-concept credit application and an Activate Portfolio application.
Outcome: GenAI POC credits ($25K) approved in under 2 weeks and Portfolio ($100K) shortly after, so the first ~6 months of copilot inference were fully credit-funded. Time-to-first-token dropped to a fraction of a second, the assistant now answers with live shipment data, and tool failures degrade gracefully instead of 500-ing. Shipped to production in 4 weeks. CloudRoute's commission was paid by the partner from AWS engagement funding; the customer paid $0.
time-to-match: < 24h · credits secured: $125K · first-token latency: sub-second · cost to customer: $0
CloudRoute routes you to a vetted AWS partner who files your Bedrock/GenAI credit application (Activate Portfolio up to $100K, GenAI POC $10K–$50K, GenAI Accelerator up to $1M) and, if you need hands, builds the Converse streaming + tool-use workload with you. AWS funds the credits and the engagement. You pay $0.