
bedrock streaming + tool use · the 2026 builder guide

Bedrock streaming & tool use — the Converse API, in production.

Q: What is the difference between Converse and ConverseStream on Bedrock?

Both call a Bedrock chat model with the same request schema (modelId, messages, inferenceConfig, optional system and toolConfig). Converse returns one complete response object; ConverseStream returns the response as a sequence of events you iterate so you can render the answer token-by-token as it is generated. Streaming reduces perceived latency (the user sees the first words almost immediately) but does not change total tokens or cost. Use ConverseStream for any interactive, user-facing chat; use unary Converse for background, batch, or single-shot calls.

Q: How do I parse the ConverseStream event stream?

Iterate the stream and branch on event type. The sequence is messageStart (role), contentBlockStart (block begins), repeated contentBlockDelta (the text chunks in delta.text — append these to the UI), contentBlockStop (block ends), messageStop (carries stopReason such as end_turn, tool_use, or max_tokens), and a final metadata event (carries token usage and latency). For plain text you mostly just emit delta.text on each contentBlockDelta; everything else is bookkeeping. You need the bedrock:InvokeModelWithResponseStream IAM permission to call it.

Q: What is tool use (function calling) on Amazon Bedrock and how does the loop work?

Tool use lets a model request that your application run a named function with structured arguments, then answer using the result. You pass a toolConfig declaring each tool (name, description, JSON-Schema inputSchema). When the model wants a tool, the response has stopReason "tool_use" and a toolUse block with a toolUseId, name, and input. You append that assistant message to history, run the function yourself, append a user message with a toolResult block echoing the same toolUseId, then call Converse again so the model answers using the data. The model never runs your code — it only requests calls; your app executes them.

Q: Can I use multiple tools, and can the model call several at once?

Yes. Put multiple toolSpec entries in the tools array and the model chooses which fits each message based on the descriptions you write. A model can also return more than one toolUse block in a single turn when a request needs several independent lookups; your handler should execute all of them (concurrently if independent) and return one matching toolResult per request — each with its own toolUseId — in the next message. Returning a result for only some of the requests will desynchronise the turn.

Q: How do I force the model to call a specific tool?

Set toolChoice in the toolConfig. The default is auto (the model decides whether to use a tool). {"any": {}} forces the model to call some tool from the set, and {"tool": {"name": "your_tool"}} forces a specific named tool. Forced choice is how you guarantee structured output — for example, always extracting fields into one schema. Note that forcing a tool suppresses free-text answers for that call, and not every model supports every toolChoice mode, so confirm support in the AWS docs.

Q: Can I stream and use tools in the same turn?

Yes — this is the standard production pattern. You call converse_stream; if the model requests a tool, the toolUse block arrives across stream events (the tool input JSON comes as fragments in delta.toolUse.input that you concatenate and parse after the block stops), and the stream ends with stopReason "tool_use". You then append the assistant message, execute the tool, append a toolResult with the matching toolUseId, and call converse_stream again to stream the grounded final answer. To the user it reads as one response with a brief "looking that up" pause in the middle.

Q: How should I handle errors like throttling, dropped streams, and tool failures?

Treat them as expected. For ThrottlingException, retry with exponential backoff and jitter (capped), and request a quota increase or use provisioned throughput for steady load; never retry ValidationException or AccessDeniedException — fix the request or IAM instead. Wrap stream iteration in try/except so a mid-stream drop is caught, and decide whether to retry the turn or keep the partial answer. For tools, put a timeout on every call, return a toolResult with status "error" on failure so the model can recover gracefully, make write tools idempotent, and validate the model's arguments against your JSON Schema before acting.

Q: Does streaming or tool use change how much Bedrock costs, and when should I use Bedrock Agents instead?

Streaming is billed identically to non-streaming — same tokens, same price; it only changes perceived latency. Tool use does affect cost because a tool-using turn is two or more model calls (the initial request plus the post-toolResult answer), each billed for its tokens, and the system prompt and tool schema are resent every call in the loop — exactly the repeated context that prompt caching discounts, so caching the stable system prompt and tool definitions is the highest-leverage cost move for multi-tool agents. Hand-rolled Converse tool use (owning the loop) is right for a focused assistant calling a handful of functions with custom UX; graduate to Bedrock Agents when the orchestration itself is the hard part — many tools, multi-step plans, managed retrieval via a Knowledge Base. Both use the same toolUse/toolResult mechanic on the same foundation. AWS credits can also fund the inference bill outright — Activate Portfolio (up to $100K), Bedrock/GenAI POC ($10K–$50K), and the GenAI Accelerator (up to $1M); CloudRoute routes you to a partner who files them and you pay $0.

The two features that separate a demo from a shipped product: ConverseStream for token-by-token output that feels instant, and tool use (function calling) that lets the model query your APIs and act on the result. This is the full reference — how the streaming event format works and how to parse it, how to define tools and run the toolUse/toolResult loop, multi-tool selection and forced tool choice, how to stream and call tools in the same turn, and how to handle throttling, partial responses, and timeouts without breaking the UX.


one API for both: Converse · first-token latency drop: major · provider-specific code: none · cost vs non-streaming: identical

TL;DR

Amazon Bedrock exposes streaming and tool use through one interface: the Converse API. Converse / ConverseStream replace provider-specific InvokeModel bodies with a single schema across Claude, Llama, Mistral, Nova, Cohere and the rest, so the same streaming and tool-calling code works across every model that supports those features; switching models is usually a one-line modelId change.
Streaming (ConverseStream) returns the answer as a sequence of events — messageStart, contentBlockDelta (the text tokens), contentBlockStop, messageStop, plus a final metadata event with token usage. You render deltas as they arrive so the user sees the first words in a fraction of the time. Tool use lets the model emit a structured toolUse request; your code runs the function, returns a toolResult, and calls Converse again so the model can answer with the data. Forced tool choice can require a specific tool or any tool.
You can stream and use tools in the same turn: the stream pauses on a toolUse, you execute and reply with a toolResult, and the model streams its final answer. Production-readiness is mostly error handling and latency UX — retries with backoff on ThrottlingException, partial-response handling on a dropped stream, idempotent tools, and timeouts on every tool call. GenAI bills scale fast; CloudRoute routes you to AWS credits (Activate Portfolio up to $100K, Bedrock/GenAI POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted partner to build it — you pay $0.

the core idea

I. Two features, one API: why streaming and tool use live in Converse

Streaming and tool use are the two capabilities that turn a Bedrock chat call into something a user will actually tolerate and trust: streaming makes it feel fast, and tool use makes it able to do things instead of just talk. On Bedrock, both are delivered through the same modern interface — the Converse API — which is why they belong on one page.

The Converse API is Bedrock's unified way to call any chat model. Instead of the older InvokeModel path, where the request and response JSON are shaped differently for every provider, Converse gives you one request and response schema across every model — Anthropic Claude, Meta Llama, Mistral, Amazon Nova, Cohere, and the rest. The relevance here is direct: streaming and tool use are features of the Converse schema, not of any single provider. Write the streaming-parse loop once and it works whether the modelId points at Claude Sonnet or Nova Pro; define a tool once in the Converse toolConfig format and any tool-capable model on the platform can call it. Switching models is usually a one-line change.

There are two entry points. Converse is the unary call: you send messages, you get one complete response back. ConverseStream is the streaming call: you send the same request but receive the response as a sequence of small events over a persistent connection, so you can render the answer as it is generated rather than waiting for the whole thing. Tool use works identically in both — the difference is only whether the model's output (including any tool request) arrives all at once or incrementally.
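As a rough illustration of the two entry points sharing one request, the sketch below builds the request once and reads the answer through either call. The helper names `build_request` and `ask` are hypothetical, and `client` is assumed to be a boto3 `bedrock-runtime` client (or anything exposing `converse` and `converse_stream` with the same shapes).

```python
def build_request(model_id, user_text, system_text=None, max_tokens=512):
    """Build keyword arguments that both converse and converse_stream accept."""
    request = {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
        "inferenceConfig": {"maxTokens": max_tokens},
    }
    if system_text:
        request["system"] = [{"text": system_text}]
    return request


def ask(client, request, stream=False):
    """Return the full answer text via the unary or the streaming entry point."""
    if not stream:
        response = client.converse(**request)
        blocks = response["output"]["message"]["content"]
        return "".join(block.get("text", "") for block in blocks)
    response = client.converse_stream(**request)
    parts = []
    for event in response["stream"]:
        # Only contentBlockDelta events carry text; other events are skipped.
        delta = event.get("contentBlockDelta", {}).get("delta", {})
        parts.append(delta.get("text", ""))
    return "".join(parts)
```

In a real UI you would render each delta as it arrives rather than joining them; the join here only shows that both paths produce the same text from the same request.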

Why this matters for what you build: a non-streaming, no-tools assistant can only answer from the model's parametric knowledge and makes the user stare at a spinner for the full generation time. Add streaming and the perceived latency collapses — the user reads the first sentence while the rest is still being written. Add tool use and the assistant can look up a live order status, run a database query, call your pricing service, or trigger an action, then ground its answer in the real result. Combine them and you get the now-standard experience: a fast, streaming assistant that can also reach into your systems mid-conversation.

The rest of this guide treats each in turn — streaming first (the event format and how to parse it), then tool use (defining tools and running the request/result loop), then the two combined, and finally the error-handling and latency details that decide whether the thing survives real traffic. For the broader platform context, see the full Amazon Bedrock guide and the Bedrock API reference.

the one-sentence framing

On Bedrock, streaming (ConverseStream) and tool use / function calling are both features of the unified Converse API, so one streaming-parse loop and one tool definition work across the chat models that support them, and switching models is usually a one-line modelId change rather than a rewrite.

token-by-token output

II. Streaming with ConverseStream: the event format and how to parse it

ConverseStream returns the response not as one JSON object but as an ordered stream of typed events. To render token-by-token you only have to recognise a handful of event types and pull the text out of the right one. Once you have seen the shape, the parse loop is short and identical across models.

You call converse_stream (boto3) / ConverseStream with exactly the same arguments you would pass to converse — modelId, messages, inferenceConfig, and optionally system and toolConfig. What comes back is a stream you iterate. Each item is a small dictionary keyed by event type. The sequence for a normal text answer is predictable:

messageStart — Sent once at the beginning. Tells you the role of the message being produced (almost always "assistant"). Use it to open a new message bubble in the UI.
contentBlockStart — Marks the start of a content block. For plain text this is often empty; for a tool call it carries the toolUse block's name and toolUseId (covered in the combined section).
contentBlockDelta — The workhorse event, emitted many times. For text, delta.text holds the next chunk of generated tokens — append it to the bubble as it arrives. For a streamed tool call, delta.toolUse.input holds a fragment of the JSON arguments to concatenate.
contentBlockStop — Closes the current content block. A single response can contain multiple blocks (for example a text block followed by a tool-use block).
messageStop — Sent once at the end. Carries stopReason — "end_turn" for a normal finish, "tool_use" when the model wants to call a tool, "max_tokens" if it hit the output limit, "stop_sequence", etc. Branch on this value.
metadata — The final event. Carries usage (inputTokens, outputTokens, totalTokens) and metrics (such as latencyMs). Read it to log cost and latency — streaming is billed identically to non-streaming, so this is where you capture token counts.

A minimal streaming parse loop (python / boto3)

The whole pattern is: iterate the stream, and on each contentBlockDelta that carries delta.text, emit that text. Everything else is bookkeeping.

import boto3

# Runtime client; the Region must be one where the chosen model is enabled.
brt = boto3.client("bedrock-runtime", region_name="us-east-1")
resp = brt.converse_stream(
  modelId="anthropic.claude-sonnet",  # placeholder: use a full model ID or inference profile ID from your account
  messages=[{"role": "user", "content": [{"text": "Explain prompt caching in 4 sentences."}]}],
  inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)
stop, usage = None, None
for event in resp["stream"]:
  if "contentBlockDelta" in event:
    # Text deltas carry the next chunk; tool-use deltas have no "text" key.
    text = event["contentBlockDelta"]["delta"].get("text", "")
    print(text, end="", flush=True)  # render as it arrives
  elif "messageStop" in event:
    stop = event["messageStop"]["stopReason"]  # e.g. end_turn, tool_use, max_tokens
  elif "metadata" in event:
    usage = event["metadata"]["usage"]  # log inputTokens/outputTokens

IAM, transport, and surfacing it to a browser

Streaming uses a distinct permission and a distinct under-the-hood API. Converse is authorised by the bedrock:InvokeModel action, and ConverseStream by bedrock:InvokeModelWithResponseStream (its streaming sibling); there is no separate bedrock:Converse action, so grant both if you use both calls. Without the stream permission the stream call is denied. Over the wire the response arrives as an event stream over a single long-lived HTTP connection; the SDK handles the framing, so on the server you simply iterate, as above.

To get tokens to a web client, re-emit each delta over a streaming transport your frontend understands. The common choices are Server-Sent Events (SSE) or a WebSocket from your backend: your server iterates the Bedrock stream and forwards each delta.text to the browser as it arrives, so the round trip stays incremental end to end. Avoid buffering the whole answer on the server and sending it at once — that throws away the entire latency benefit of streaming.
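One way to sketch the server-side relay, assuming you choose SSE: a generator that turns ConverseStream events into SSE frames as they arrive. The function name `sse_frames`, the JSON payload shape, and the `done` event name are illustrative choices, not part of any Bedrock API; the input is assumed to be the iterable you get from `resp["stream"]`.

```python
import json


def sse_frames(bedrock_events):
    """Yield one Server-Sent Events frame per text delta, then a final 'done' frame."""
    for event in bedrock_events:
        if "contentBlockDelta" in event:
            text = event["contentBlockDelta"]["delta"].get("text")
            if text:
                # JSON-encode so newlines inside the chunk cannot break SSE framing.
                yield "data: " + json.dumps({"text": text}) + "\n\n"
        elif "messageStop" in event:
            payload = {"stopReason": event["messageStop"]["stopReason"]}
            yield "event: done\ndata: " + json.dumps(payload) + "\n\n"
```

Because it is a generator, nothing is buffered: each frame can be written to the HTTP response as soon as the corresponding delta is read, for example as the body of a streaming response served with `Content-Type: text/event-stream` in whichever web framework you use.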

why stream at all — perceived vs total latency

Streaming does not make generation faster; total tokens and total cost are unchanged. What it changes is perceived latency: the user sees the first words after time-to-first-token (typically a fraction of a second) instead of after the full generation completes (which for a long answer can be many seconds). For any interactive chat or assistant UI, ConverseStream should be the default.

function calling

III. Tool use (function calling): defining tools and running the loop

Tool use — also called function calling — is how a Bedrock model goes from describing an action to requesting one. You declare the tools the model is allowed to call; when it decides a tool would help, it returns a structured request instead of a final answer; your code executes the tool and hands the result back; the model then answers using that result. It is a loop, and getting the loop right is the whole skill.

The mental model is important: the model never runs your code. It only emits a structured request to call a named tool with specific arguments. Your application is what actually calls the database, hits the API, or runs the function — then returns the output to the model. This keeps execution, credentials, and side effects entirely on your side; the model just decides what to call and reads what came back.

Step 1 — Define the tools (toolConfig)

You pass a toolConfig alongside your messages. Each tool has a name, a natural-language description (this is how the model decides when to use it — write it carefully), and an inputSchema expressed as JSON Schema describing the arguments. The JSON Schema both tells the model what arguments to produce and lets you validate what it returns.

toolConfig = {
  "tools": [{
    "toolSpec": {
      "name": "get_order_status",
      "description": "Look up the current status of a customer order by its ID.",
      "inputSchema": {"json": {
        "type": "object",
        "properties": {"order_id": {"type": "string", "description": "The order ID, e.g. A-10293."}},
        "required": ["order_id"]
      }}
    }
  }]
}

Step 2 — The model returns a toolUse request

When the model decides to call a tool, the Converse response comes back with stopReason = "tool_use", and the assistant message contains a toolUse content block: a toolUseId (a correlation handle you must echo back), the tool name, and an input object matching your schema. Critically, you must append the assistant's message (with the toolUse block) to your messages list before you reply — the conversation has to record that the model asked, or the follow-up turn will be malformed.
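A minimal sketch of that step for a unary Converse response, assuming the response dictionary has the usual `stopReason` and `output.message` fields. The helper name `handle_tool_request` is hypothetical.

```python
def handle_tool_request(response, messages):
    """Record the assistant turn in history and return the toolUse blocks it contains.

    Returns an empty list (and leaves history untouched) when the model
    did not stop to request a tool.
    """
    if response.get("stopReason") != "tool_use":
        return []
    assistant_message = response["output"]["message"]
    # The conversation must record that the model asked before you reply.
    messages.append(assistant_message)
    return [
        block["toolUse"]
        for block in assistant_message["content"]
        if "toolUse" in block
    ]
```

Each returned item carries `toolUseId`, `name`, and `input`, which is everything you need to run the tool and build the matching toolResult.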

Step 3 — Execute and return a toolResult

Now your code runs the actual function (calls the orders service, queries the database, whatever the tool represents) and sends the output back as a new user-role message containing a toolResult block. The toolResult block must carry the same toolUseId so the model knows which request it answers, plus the content (the data) and a status of success or error. Then you call Converse again with the updated messages.

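# after parsing the toolUse block (name, tool_use_id, tool_input),
# and after appending the assistant message that contained it to messages:
result = get_order_status(**tool_input)  # YOUR code runs the tool; it should return a JSON-serialisable dict
messages.append({"role": "user", "content": [{
  "toolResult": {
    "toolUseId": tool_use_id,  # echo the ID from the toolUse block
    "content": [{"json": result}],
    "status": "success"
  }
}]})
# resend the same toolConfig so the model can read the result or request another tool
final = brt.converse(modelId=MODEL, messages=messages, toolConfig=toolConfig)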

Step 4 — The model answers (or asks for another tool)

On the follow-up call the model reads the toolResult and usually returns a normal text answer grounded in your data (stopReason = "end_turn"). But it may decide it needs another tool first — in which case you get another tool_use stop and repeat the loop. Production code therefore wraps Steps 2–4 in a bounded while loop that keeps going while stopReason == "tool_use", with a hard cap on iterations so a model that loops cannot run forever.
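The bounded loop could look roughly like the sketch below. `run_tool_loop` and the `run_tool(name, tool_input)` callback are hypothetical names; `client` is assumed to be a boto3 `bedrock-runtime` client, and the default cap of five iterations is an arbitrary illustration rather than a recommended value.

```python
def run_tool_loop(client, model_id, messages, tool_config, run_tool, max_turns=5):
    """Call Converse repeatedly while the model keeps asking for tools.

    run_tool(name, tool_input) is your dispatcher; it must return a dict.
    Returns the final assistant message, or raises if the cap is hit.
    """
    for _ in range(max_turns):
        response = client.converse(
            modelId=model_id, messages=messages, toolConfig=tool_config
        )
        assistant = response["output"]["message"]
        messages.append(assistant)  # record the turn before replying
        if response["stopReason"] != "tool_use":
            return assistant
        results = []
        for block in assistant["content"]:
            if "toolUse" in block:
                call = block["toolUse"]
                output = run_tool(call["name"], call["input"])
                results.append({"toolResult": {
                    "toolUseId": call["toolUseId"],  # echo the same ID back
                    "content": [{"json": output}],
                    "status": "success",
                }})
        messages.append({"role": "user", "content": results})
    raise RuntimeError("tool loop exceeded max_turns without a final answer")
```

Raising on the cap is one policy; returning a fallback message to the user is another. Either way the loop cannot run forever.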

the toolUse / toolResult contract — three rules

(1) Always append the assistant's toolUse message to history before replying. (2) Echo the exact toolUseId back in your toolResult. (3) Loop while stopReason == "tool_use", bounded by a max-iterations cap. Break any of these and the conversation desynchronises or the call errors.

selection & control

IV. Multiple tools, parallel calls, and forcing a tool choice

Real assistants expose more than one tool, sometimes need several at once, and occasionally must be forced to use a specific tool. The Converse toolConfig handles all three — the model picks from the set, can request multiple tools in a single turn, and obeys a toolChoice directive when you set one.

Multiple tools. Put several toolSpec entries in the tools array and the model chooses which (if any) fits the user's request based on the descriptions you wrote. This is the most common setup — a support assistant might expose get_order_status, issue_refund, and search_knowledge_base, and the model routes to the right one per message. The quality of routing is largely a function of how clearly each tool's description is written; vague descriptions cause wrong-tool selection.

Parallel / multiple calls in one turn. A model can return more than one toolUse block in a single response when a request needs several independent lookups (for example, the status of three different orders). Your handler should therefore iterate all toolUse blocks in the message, execute them (you can run them concurrently since they are independent), and return one toolResult block per request — each with its matching toolUseId — in the next message. Returning a result for only one of several requests will desynchronise the turn.
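A sketch of handling several toolUse blocks in one turn, using a thread pool from the standard library. The function name `run_tools_concurrently` and the `handlers` mapping (tool name to a Python callable returning a dict) are assumptions for illustration; threads suit I/O-bound tools, and other concurrency models would work equally well.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tools_concurrently(assistant_message, handlers, max_workers=4):
    """Run every toolUse block in the message and return one user message
    holding a toolResult per request, in the order the model asked."""
    calls = [b["toolUse"] for b in assistant_message["content"] if "toolUse" in b]
    if not calls:
        return None

    def run_one(call):
        try:
            output = handlers[call["name"]](**call["input"])
            content, status = [{"json": output}], "success"
        except Exception as exc:  # unknown tool or tool failure
            content, status = [{"text": str(exc)}], "error"
        return {"toolResult": {
            "toolUseId": call["toolUseId"],
            "content": content,
            "status": status,
        }}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_one, calls))  # map preserves input order
    return {"role": "user", "content": results}
```

A failed tool still yields a toolResult (with status "error"), so every request gets an answer and the turn stays in sync.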

Forcing a tool choice. By default toolChoice is auto — the model decides whether to use a tool at all. You can override it: {"any": {}} forces the model to call some tool (it may not answer in plain text), and {"tool": {"name": "get_order_status"}} forces a specific tool. Forced choice is how you guarantee structured output — e.g. always extract fields into a schema — rather than hoping the model decides to. Note that not every model supports every toolChoice mode, and forcing a tool disables plain-text answers for that call; check the model's capabilities in the AWS docs.

converse toolChoice modes · what each does and when to use it

toolChoice	Behaviour	Plain-text answer allowed?	Use it when
auto (default)	Model decides: use a tool, or just answer	Yes	General assistants — let the model route naturally
any ({"any": {}})	Model must call some tool from the set	No (must pick a tool)	You always want an action/structured call, not prose
tool ({"tool": {"name": ...}})	Model must call the named tool	No (that tool only)	Guaranteed structured extraction into one schema

toolChoice support varies by model — some support only auto. Forcing "any" or a specific tool suppresses free-text answers for that call, so it is best for extraction/routing steps, not conversational turns. Confirm per-model support in the Bedrock documentation.
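To make the forced-extraction pattern concrete, here is a small sketch that builds a toolConfig pinning one tool and then reads the structured arguments back out of the response. The helper names and the `record_contact` tool in the usage note are hypothetical; whether a given model honours the `tool` mode still needs checking per model.

```python
def forced_extraction_config(tool_spec):
    """Build a toolConfig that forces the model to call exactly this tool."""
    return {
        "tools": [{"toolSpec": tool_spec}],
        "toolChoice": {"tool": {"name": tool_spec["name"]}},
    }


def extracted_fields(response):
    """Return the input object of the first toolUse block, or None if absent."""
    for block in response["output"]["message"]["content"]:
        if "toolUse" in block:
            return block["toolUse"]["input"]
    return None
```

For example, with a `record_contact` spec whose inputSchema requires `name` and `email`, you would pass `forced_extraction_config(spec)` as `toolConfig` and treat `extracted_fields(response)` as the extracted record. You never execute the tool here; its schema is only a vehicle for structured output.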

the production pattern

V. Combining streaming and tool use in one turn

The experience users now expect — a fast assistant that also reaches into live systems — requires streaming and tool use working together in a single turn. ConverseStream supports this directly: the stream surfaces a tool request mid-flight, you pause to execute it, then resume streaming the model's grounded answer.

The flow is the streaming event loop from Section II with a tool branch added. While iterating the stream, watch the events: a contentBlockStart may announce a toolUse block (carrying the tool name and toolUseId), and the subsequent contentBlockDelta events carry fragments of the tool's input JSON in delta.toolUse.input — you concatenate those fragments and parse the assembled string into the arguments object once the block stops. When the stream ends with stopReason == "tool_use", you have a complete tool request.

At that point you do exactly what the non-streaming loop does: append the assistant message, execute the tool, append a toolResult message echoing the toolUseId, and call converse_stream again. The second stream is the model's final answer, now grounded in your tool's output, and you render its deltas token-by-token. So a single user turn can produce: a short streamed preamble, a tool call, a pause while you fetch data, then a streamed final answer. To the user it reads as one smooth response with a brief "looking that up…" beat in the middle — which is exactly the moment to show a tool-call status indicator (see the latency section).

The key parsing nuance unique to streaming-plus-tools is that tool arguments arrive in pieces. With unary Converse you get the whole input object at once; with ConverseStream you must accumulate delta.toolUse.input string fragments and JSON-parse them only after contentBlockStop. Attempting to parse a partial fragment will fail — buffer first, parse once.
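The buffer-then-parse rule can be sketched as a single pass over the events that rebuilds a Converse-style assistant message. `collect_stream` is a hypothetical helper; it assumes each event is a dict shaped like the ConverseStream events described in Section II, and it parses tool input once the stream is exhausted, which is necessarily after every contentBlockStop.

```python
import json


def collect_stream(events, on_text=lambda chunk: None):
    """Assemble (assistant_message, stop_reason) from ConverseStream events."""
    text_parts = {}   # contentBlockIndex -> list of text chunks
    tool_calls = {}   # contentBlockIndex -> id, name, and input fragments
    stop_reason = None
    for event in events:
        if "contentBlockStart" in event:
            start = event["contentBlockStart"]
            tool = start.get("start", {}).get("toolUse")
            if tool:
                tool_calls[start["contentBlockIndex"]] = {
                    "toolUseId": tool["toolUseId"],
                    "name": tool["name"],
                    "fragments": [],
                }
        elif "contentBlockDelta" in event:
            index = event["contentBlockDelta"]["contentBlockIndex"]
            delta = event["contentBlockDelta"]["delta"]
            if "text" in delta:
                text_parts.setdefault(index, []).append(delta["text"])
                on_text(delta["text"])  # render immediately
            elif "toolUse" in delta:
                tool_calls[index]["fragments"].append(delta["toolUse"]["input"])
        elif "messageStop" in event:
            stop_reason = event["messageStop"]["stopReason"]

    content = []
    for index in sorted(set(text_parts) | set(tool_calls)):
        if index in text_parts:
            content.append({"text": "".join(text_parts[index])})
        else:
            call = tool_calls[index]
            raw = "".join(call["fragments"])  # buffer first, parse once
            content.append({"toolUse": {
                "toolUseId": call["toolUseId"],
                "name": call["name"],
                "input": json.loads(raw) if raw else {},
            }})
    return {"role": "assistant", "content": content}, stop_reason
```

The returned message has the same shape a unary Converse call would give you, so it can be appended to history and fed to the same tool-execution code.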

streaming + tools, in five beats

1) Stream the first turn. 2) On a toolUse block, accumulate the input-JSON fragments and parse after contentBlockStop. 3) On stopReason == tool_use, append the assistant message, run the tool, append a toolResult with the matching toolUseId. 4) Call converse_stream again and stream the grounded final answer. 5) Cap the loop. This is the standard production shape for an interactive Bedrock assistant.

making it survive real traffic

VI. Error handling, retries, and resilient tool execution

Demos ignore failure; production cannot. Two surfaces fail independently — the Bedrock call itself (throttling, timeouts, a dropped stream) and your tool execution (a downstream API is slow or errors). Handling both cleanly is most of what separates a robust assistant from a flaky one.

Throttling and retries. The most common Bedrock error under load is ThrottlingException: you have exceeded the account/model requests-per-minute or tokens-per-minute quota for that Region. Treat it as expected: retry with exponential backoff and jitter, and cap the attempts. The AWS SDKs include configurable adaptive retries, but you should still design for throttling rather than assume it away. Request a quota increase for steady load, and consider Provisioned Throughput or cross-region inference if you regularly hit limits. Other retryable conditions include ModelTimeoutException and a transient ServiceUnavailableException; non-retryable ones (ValidationException, AccessDeniedException) mean fix the request, not retry it.
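A hand-rolled version of capped backoff with full jitter might look like this. It assumes exceptions expose the error code the way botocore's `ClientError` does (`exc.response["Error"]["Code"]`); the function names, the retryable set, and the delay values are illustrative assumptions, and the SDK's built-in retry modes may already cover part of this.

```python
import random
import time

RETRYABLE = {"ThrottlingException", "ModelTimeoutException", "ServiceUnavailableException"}


def error_code(exc):
    """Read the AWS error code from a ClientError-style exception, if present."""
    return getattr(exc, "response", {}).get("Error", {}).get("Code")


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(), retrying only retryable errors with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if error_code(exc) not in RETRYABLE or last_attempt:
                raise  # validation/permission errors and exhausted retries surface
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))  # full jitter spreads out retries
```

Usage would be along the lines of `call_with_backoff(lambda: brt.converse(**request))`. Retrying a streaming call this way restarts the whole turn, so apply it before any tokens have been shown to the user.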

Dropped or partial streams. A long-lived stream can break mid-flight (network blip, timeout). Because you have been rendering deltas, you may have a partial answer on screen. Decide the policy up front: either retry the whole turn (simplest, but the user sees a restart) or mark the message as incomplete and offer a "continue" affordance. Always wrap the stream iteration in try/except so a mid-stream error is caught rather than crashing the request handler, and log the last received event for diagnosis.
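As a sketch of the try/except wrapper, the function below keeps whatever text arrived, records the type of the last event seen, and reports whether the stream finished cleanly. `read_stream` and the shape of its return value are hypothetical; which exceptions a real dropped connection raises depends on the SDK and transport, so this catches broadly.

```python
def read_stream(events, on_text):
    """Consume ConverseStream events, surviving a mid-stream failure.

    Returns a dict with the text received so far, a 'complete' flag,
    the last event type seen, and the error (if any) for logging.
    """
    parts, stop_reason, last_event = [], None, None
    try:
        for event in events:
            last_event = next(iter(event), None)  # each event has one top-level key
            if "contentBlockDelta" in event:
                text = event["contentBlockDelta"]["delta"].get("text", "")
                if text:
                    parts.append(text)
                    on_text(text)
            elif "messageStop" in event:
                stop_reason = event["messageStop"]["stopReason"]
    except Exception as exc:
        return {"text": "".join(parts), "complete": False,
                "last_event": last_event, "error": repr(exc)}
    return {"text": "".join(parts), "complete": stop_reason is not None,
            "last_event": last_event, "error": None}
```

The caller can then apply whichever policy it chose: retry the turn when `complete` is false, or keep the partial text and offer a "continue" action.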

Tool execution failures. Your tool is calling something that can fail or hang. Put a timeout on every tool call so a slow downstream cannot freeze the whole turn, and when a tool fails, do not crash — return a toolResult with status: "error" and a short message describing the failure. The model will read that and can apologise, retry differently, or fall back gracefully, which is far better UX than a 500. Make tools idempotent where possible (especially write actions like refunds) so a retried call cannot double-execute, and validate the model's tool arguments against your JSON Schema before you act on them — never trust the arguments blindly for anything with side effects.
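A sketch of a guarded tool call, combining a pre-check on arguments, a timeout, and an error toolResult. The names are hypothetical, and the argument check only looks at `required` keys as a stand-in: a full JSON Schema validator (for instance the third-party `jsonschema` package) would be the more thorough option. Note that a thread-based timeout stops waiting for the tool but does not cancel work already running.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def missing_required(schema, arguments):
    """Simplified check: list required keys absent from the model's arguments."""
    return [key for key in schema.get("required", []) if key not in arguments]


def safe_tool_result(call, handler, schema, timeout_s=5.0):
    """Run one tool call and always return a toolResult block, never raise."""
    def result(content, status):
        return {"toolResult": {"toolUseId": call["toolUseId"],
                               "content": content, "status": status}}

    missing = missing_required(schema, call["input"])
    if missing:
        return result([{"text": "missing required arguments: " + ", ".join(missing)}], "error")

    pool = ThreadPoolExecutor(max_workers=1)
    try:
        output = pool.submit(handler, **call["input"]).result(timeout=timeout_s)
    except FutureTimeout:
        return result([{"text": "tool timed out after %.1fs" % timeout_s}], "error")
    except Exception as exc:
        return result([{"text": "tool failed: %s" % exc}], "error")
    finally:
        pool.shutdown(wait=False)  # do not block the turn on a hung tool
    return result([{"json": output}], "success")
```

Because the abandoned call may still complete in the background, idempotency for write tools matters even more when timeouts are in play.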

Guardrails and limits. Layer Bedrock Guardrails over the call to filter unsafe content and block prompt-injection-style attempts to misuse tools, and set a sane maxTokens so a runaway generation is bounded. Together these make the assistant safe to point at production systems.

common failure modes in streaming + tool-use apps · and the fix

Failure	Where it happens	Handle it by
ThrottlingException	Bedrock call under load	Exponential backoff + jitter, capped retries; quota increase / provisioned throughput
Dropped / partial stream	Mid-stream network or timeout	try/except around iteration; retry turn or offer "continue"; keep partial text
ModelTimeoutException	Long generation	Retry with backoff; lower maxTokens; consider a faster model
Tool downstream slow/down	Your tool execution	Per-tool timeout; return toolResult status:"error"; let the model recover
Bad tool arguments	Model-generated input	Validate against JSON Schema before acting; idempotent write tools
ValidationException / AccessDeniedException	Malformed request / IAM	Do NOT retry; fix the request shape or the IAM policy (bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream)

Rule of thumb: retry transient/throttling errors with backoff; never retry validation or permission errors. Always echo the toolUseId, always cap the tool loop, always timeout tool calls. Error codes and retry guidance evolve — confirm current behaviour in the AWS Bedrock Runtime API docs.

how it should feel

VII. Latency UX: making streaming and tool calls feel fast

Streaming and tool use are as much UX features as engineering ones. The same backend can feel instant or sluggish depending on how you surface progress. A few patterns reliably make an interactive Bedrock assistant feel responsive even when a tool call adds a real pause.

Render the first token immediately. The entire point of streaming is time-to-first-token: get the first words on screen the instant they arrive and the assistant feels alive, regardless of total length. Do not gate rendering on the full response, and do not buffer on the server. If you can, start a typing indicator the moment the request is sent and replace it with real text on the first contentBlockDelta.

Make the tool-call pause legible. When the stream stops on a tool_use and you go off to execute, the user is momentarily waiting with no new tokens. Fill that gap with an explicit status — "Checking your order status…", "Searching the knowledge base…" — derived from the tool name. A visible, named action reads as competence; a frozen cursor reads as a hang. Then resume streaming the grounded answer.
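One lightweight way to drive that status line is to key it off the contentBlockStart event, which names the tool before its arguments have finished streaming. The mapping and the fallback label below are illustrative; the tool names are the example tools used earlier in this guide.

```python
STATUS_LABELS = {
    "get_order_status": "Checking your order status…",
    "search_knowledge_base": "Searching the knowledge base…",
}


def status_from_event(event, default="Working on it…"):
    """Return a user-facing status if this event announces a tool call, else None."""
    tool = event.get("contentBlockStart", {}).get("start", {}).get("toolUse")
    if tool is None:
        return None
    return STATUS_LABELS.get(tool["name"], default)
```

Calling this on every event inside the stream loop lets the UI show the named action as soon as the model commits to a tool, rather than only after the stream has stopped.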

Choose the model for the latency budget. First-token and per-token speed differ by model: small, fast models (for example Nova Lite/Micro, Claude Haiku, or the smaller Mistral models) typically stream noticeably quicker than frontier models. A strong pattern is to route the easy, latency-sensitive turns to a fast model and escalate only hard reasoning to a frontier model, all through the same Converse code, since only the modelId changes. See Amazon Nova and Claude on Bedrock for the per-model trade-offs.

Mind cost while you optimise feel. Streaming is billed identically to non-streaming, and a tool-use turn means two (or more) model calls — the initial request plus the post-toolResult answer — each billed for its tokens. The system prompt and tool schema are resent on every call in the loop, which is exactly the repeated context that prompt caching is designed to discount. For multi-tool agents that loop several times per user turn, caching the stable system prompt and tool definitions is the highest-leverage cost move; the broader cost mechanics live in the Bedrock pricing coverage.
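In the Converse schema, prompt caching is requested by placing a `cachePoint` block after the content you want cached. The sketch below marks the system prompt and the tool definitions; the helper name is hypothetical, and whether caching applies (supported models, minimum prefix size, pricing) is something to confirm in the current Bedrock documentation.

```python
def with_cache_points(system_text, tool_specs):
    """Return (system, toolConfig) with a cache checkpoint after each stable prefix."""
    system = [
        {"text": system_text},
        {"cachePoint": {"type": "default"}},  # cache everything up to here
    ]
    tool_config = {
        "tools": [{"toolSpec": spec} for spec in tool_specs]
        + [{"cachePoint": {"type": "default"}}],
    }
    return system, tool_config
```

You would pass the results as `system=` and `toolConfig=` on every call in the tool loop. Keep the cached prefix byte-for-byte stable between calls, since any change to the system prompt or tool definitions means the cached prefix no longer matches.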

the latency-UX checklist

(1) Render the first delta instantly — never buffer. (2) Show a named status during the tool-call pause. (3) Route latency-sensitive turns to a fast model, escalate only when needed. (4) Cache the repeated system prompt + tool schema so the multi-call loop stays cheap. Get these four right and a tool-using, streaming assistant feels fast even when it is doing real work.

build it vs let bedrock manage it

VIII. Raw Converse tool use vs Bedrock Agents: which to reach for

Everything above is the do-it-yourself approach: you own the tool loop, the streaming parse, the error handling, the history. Bedrock also offers a managed alternative — Bedrock Agents — that wraps the same toolUse/toolResult mechanic in a higher-level service. Knowing when to use which saves a lot of rework.

Hand-rolled Converse tool use (this guide) gives you maximum control: you decide exactly how the loop runs, how tools execute, how errors recover, how the UI streams, and how state is stored. It is the right choice when you want a tight, custom integration, when tools are simple application functions, when you need bespoke streaming UX, or when you want zero additional managed-service surface. The cost is that you write and maintain the orchestration.

Bedrock Agents takes over the orchestration: you register action groups (your APIs/Lambda functions) and optionally a Knowledge Base, and the Agent plans the steps, constructs the tool calls, invokes your functions, and assembles the answer — the toolUse/toolResult loop is managed for you. Reach for Agents when the workflow is genuinely multi-step, when you want managed planning and built-in retrieval, or when you would rather configure than build. The trade-off is less low-level control over the exact loop and streaming. The full treatment is in Amazon Bedrock Agents.

A useful rule: start with raw Converse tool use for a focused assistant calling a handful of functions with custom UX; graduate to Agents when the orchestration itself becomes the hard part — many tools, multi-step plans, managed RAG, and you no longer want to own the loop. Both sit on the same Bedrock foundation, so the underlying model choice, security model, and cost levers carry across.

at a glance

Converse vs ConverseStream — and unary tool use vs streamed tool use

The two axes that define a Bedrock chat call are whether the response streams and whether tools are in play. This maps the four combinations so you can pick the right call shape for each surface of your app. Behaviour and schema are identical across models — only the modelId changes.

Capability	Converse (unary)	ConverseStream (streaming)	Notes
Response delivery	One complete object	Stream of typed events	Stream renders token-by-token; total tokens/cost identical
Best for	Background / batch / single-shot	Interactive chat & assistant UIs	Default to streaming for anything a user watches
IAM action	bedrock:InvokeModel	bedrock:InvokeModelWithResponseStream	Streaming needs the extra stream permission
Tool use supported?	Yes (toolUse/toolResult)	Yes (toolUse arrives as event blocks)	Same loop; streamed tool input arrives in fragments
Tool input arrival	Whole input object at once	delta.toolUse.input fragments to concatenate	Buffer fragments, JSON-parse after contentBlockStop
toolChoice control	auto / any / specific tool	auto / any / specific tool	Model-dependent; forcing a tool suppresses free text
Typical use	Extraction, enrichment, cron jobs	Support bots, copilots, chat agents	Combine: stream + tool call + stream final answer

For user-facing surfaces use ConverseStream and render the first token immediately; for offline/structured work, unary Converse (often with forced toolChoice for guaranteed schema) is simpler. Confirm per-model streaming and toolChoice support in the AWS Bedrock documentation.

shipping a streaming, tool-using assistant?

Get AWS credits to fund the inference — and a vetted partner to build the Bedrock workload. You pay $0.

Get matched in 24h →

a recent match

A streaming, tool-using support agent — funded by AWS credits, anonymized

inquiry · seed-stage logistics-SaaS, US, building a customer-facing copilot

Seed-stage B2B logistics SaaS, 14 people, building a customer-facing support copilot over their orders and shipments APIs; needed live data in answers, not just canned text

Situation: The team had a working Converse prototype but it felt like a demo, not a product: answers appeared all at once after a long pause, and the assistant could only describe what to do — it could not actually look up a shipment's live status or trigger a re-route. They needed real streaming UX plus tool use against their internal APIs, with the whole thing resilient to throttling and to their own services occasionally timing out. No one on the team had built the streamed-tool-call loop before, and the inference bill for an always-on copilot was an open question.

What CloudRoute did: Routed within 19 hours to a US AWS partner with a GenAI-application track record. The partner rebuilt the assistant on ConverseStream: token-by-token streaming over SSE, tool use against three internal tools (get_shipment_status, reschedule_pickup, search_help_center) via the toolUse/toolResult loop, a named status indicator during each tool-call pause, per-tool timeouts with status:"error" fallbacks, and exponential-backoff retries on ThrottlingException. Cost was controlled with model routing — Nova Lite for classification and easy turns, Claude Sonnet only for hard answers — plus prompt caching on the system prompt and tool schema resent every loop. In parallel the partner filed a Bedrock/GenAI proof-of-concept credit application and an Activate Portfolio application.

Outcome: GenAI POC credits ($25K) approved in under 2 weeks and Portfolio ($100K) shortly after, so the first ~6 months of copilot inference were fully credit-funded. Time-to-first-token dropped to a fraction of a second, the assistant now answers with live shipment data, and tool failures degrade gracefully instead of 500-ing. Shipped to production in 4 weeks. CloudRoute's commission was paid by the partner from AWS engagement funding; the customer paid $0.

time-to-match: < 24h · credits secured: $125K · first-token latency: sub-second · cost to customer: $0

faq

Common questions

What is the difference between Converse and ConverseStream on Bedrock?

Both call a Bedrock chat model with the same request schema (modelId, messages, inferenceConfig, optional system and toolConfig). Converse returns one complete response object; ConverseStream returns the response as a sequence of events you iterate so you can render the answer token-by-token as it is generated. Streaming reduces perceived latency (the user sees the first words almost immediately) but does not change total tokens or cost. Use ConverseStream for any interactive, user-facing chat; use unary Converse for background, batch, or single-shot calls.
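
A minimal sketch of the two call shapes side by side. The `client` argument is assumed to be a boto3 `bedrock-runtime` client (or anything exposing the same `converse` and `converse_stream` methods); the helper names are illustrative, not part of any SDK.

```python
from typing import Any, Dict, Iterator


def build_request(model_id: str, prompt: str, max_tokens: int = 512) -> Dict[str, Any]:
    """Request kwargs shared by Converse and ConverseStream."""
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": max_tokens},
    }


def ask_unary(client: Any, model_id: str, prompt: str) -> str:
    """Converse: one complete response object; join its text blocks."""
    response = client.converse(**build_request(model_id, prompt))
    blocks = response["output"]["message"]["content"]
    return "".join(block["text"] for block in blocks if "text" in block)


def ask_streaming(client: Any, model_id: str, prompt: str) -> Iterator[str]:
    """ConverseStream: yield each text chunk as soon as it arrives."""
    response = client.converse_stream(**build_request(model_id, prompt))
    for event in response["stream"]:
        delta = event.get("contentBlockDelta", {}).get("delta", {})
        if "text" in delta:
            yield delta["text"]
```

A UI would typically write each chunk from `ask_streaming` straight to the socket or SSE response, while a batch job would call `ask_unary` and store the returned string.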

How do I parse the ConverseStream event stream?

Iterate the stream and branch on event type. The sequence is messageStart (role), contentBlockStart (block begins), repeated contentBlockDelta (the text chunks in delta.text — append these to the UI), contentBlockStop (block ends), messageStop (carries stopReason such as end_turn, tool_use, or max_tokens), and a final metadata event (carries token usage and latency). For plain text you mostly just emit delta.text on each contentBlockDelta; everything else is bookkeeping. You need the bedrock:InvokeModelWithResponseStream IAM permission to call it.
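
The branching described above can be sketched as a small consumer. It takes any iterable of event dicts (in practice, `response["stream"]` from `converse_stream`) and a callback that pushes text to the UI. `StreamResult` and `consume_stream` are illustrative names, and the sketch handles text only.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, Optional


@dataclass
class StreamResult:
    role: Optional[str] = None
    text: str = ""
    stop_reason: Optional[str] = None
    usage: Dict[str, Any] = field(default_factory=dict)
    latency_ms: Optional[int] = None


def consume_stream(events: Iterable[Dict[str, Any]],
                   on_text: Callable[[str], None]) -> StreamResult:
    """Emit text deltas immediately and collect the bookkeeping events."""
    result = StreamResult()
    for event in events:
        if "messageStart" in event:
            result.role = event["messageStart"]["role"]
        elif "contentBlockDelta" in event:
            delta = event["contentBlockDelta"]["delta"]
            if "text" in delta:
                on_text(delta["text"])          # render right away, never buffer
                result.text += delta["text"]
        elif "messageStop" in event:
            result.stop_reason = event["messageStop"]["stopReason"]
        elif "metadata" in event:
            meta = event["metadata"]
            result.usage = meta.get("usage", {})
            result.latency_ms = meta.get("metrics", {}).get("latencyMs")
        # contentBlockStart / contentBlockStop need no action for plain text
    return result
```

Passing `on_text=lambda chunk: print(chunk, end="", flush=True)` gives a terminal version of the token-by-token render.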

What is tool use (function calling) on Amazon Bedrock and how does the loop work?

Tool use lets a model request that your application run a named function with structured arguments, then answer using the result. You pass a toolConfig declaring each tool (name, description, JSON-Schema inputSchema). When the model wants a tool, the response has stopReason "tool_use" and a toolUse block with a toolUseId, name, and input. You append that assistant message to history, run the function yourself, append a user message with a toolResult block echoing the same toolUseId, then call Converse again so the model answers using the data. The model never runs your code — it only requests calls; your app executes them.
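
The loop above can be sketched in a few lines. This is a sketch under stated assumptions: `client` is a boto3 `bedrock-runtime` client (or a stand-in with the same `converse` method), `handlers` is a hypothetical dict mapping tool names to plain Python functions that return a JSON-serialisable dict, and `max_rounds` is an illustrative safety cap rather than a Bedrock setting.

```python
from typing import Any, Callable, Dict, List


def run_tool_loop(client: Any,
                  model_id: str,
                  messages: List[Dict[str, Any]],
                  tool_config: Dict[str, Any],
                  handlers: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]],
                  max_rounds: int = 5) -> str:
    """Call Converse until the model stops asking for tools; return its text.

    `messages` is extended in place so the caller keeps the full history.
    """
    for _ in range(max_rounds):
        response = client.converse(
            modelId=model_id, messages=messages, toolConfig=tool_config
        )
        assistant_message = response["output"]["message"]
        messages.append(assistant_message)  # keep the toolUse request in history

        if response["stopReason"] != "tool_use":
            return "".join(
                block["text"] for block in assistant_message["content"] if "text" in block
            )

        # The model only requests calls; this code is what actually runs them.
        results = []
        for block in assistant_message["content"]:
            if "toolUse" not in block:
                continue
            call = block["toolUse"]
            output = handlers[call["name"]](call["input"])
            results.append({
                "toolResult": {
                    "toolUseId": call["toolUseId"],  # must echo the request id
                    "content": [{"json": output}],
                }
            })
        messages.append({"role": "user", "content": results})

    raise RuntimeError("tool loop did not finish within max_rounds")
```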

Can I use multiple tools, and can the model call several at once?

Yes. Put multiple toolSpec entries in the tools array and the model chooses which fits each message based on the descriptions you write. A model can also return more than one toolUse block in a single turn when a request needs several independent lookups; your handler should execute all of them (concurrently if independent) and return one matching toolResult per request — each with its own toolUseId — in the next message. Returning a result for only some of the requests will desynchronise the turn.
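
One way to return exactly one toolResult per request is to fan the calls out to a thread pool and collect results in request order. This is a sketch that assumes the tool functions are independent and thread-safe; `execute_tool_calls` and the `handlers` mapping are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict


def execute_tool_calls(assistant_message: Dict[str, Any],
                       handlers: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]],
                       max_workers: int = 4) -> Dict[str, Any]:
    """Run every toolUse block concurrently and build the follow-up user message."""
    calls = [block["toolUse"] for block in assistant_message["content"] if "toolUse" in block]
    if not calls:
        raise ValueError("assistant message contains no toolUse blocks")

    def run_one(call: Dict[str, Any]) -> Dict[str, Any]:
        handler = handlers.get(call["name"])
        if handler is None:
            # Still answer the request so the turn stays in sync.
            return {"toolResult": {
                "toolUseId": call["toolUseId"],
                "content": [{"text": "Unknown tool: " + call["name"]}],
                "status": "error",
            }}
        return {"toolResult": {
            "toolUseId": call["toolUseId"],
            "content": [{"json": handler(call["input"])}],
        }}

    # Executor.map preserves input order, so results line up with requests.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_one, calls))

    return {"role": "user", "content": results}
```

The returned message is appended to history as-is before the next Converse call.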

How do I force the model to call a specific tool?

Set toolChoice in the toolConfig. The default is auto (the model decides whether to use a tool). {"any": {}} forces the model to call some tool from the set, and {"tool": {"name": "your_tool"}} forces a specific named tool. Forced choice is how you guarantee structured output — for example, always extracting fields into one schema. Note that forcing a tool suppresses free-text answers for that call, and not every model supports every toolChoice mode, so confirm support in the AWS docs.
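
As an illustration of the three modes, the sketch below builds a `toolConfig` and pulls the structured arguments out of a forced-tool response. `make_tool_config`, `first_tool_input`, and the `record_shipment_fields` tool used in the usage note are hypothetical names introduced for the example.

```python
from typing import Any, Dict, List, Optional


def make_tool_config(tool_specs: List[Dict[str, Any]],
                     mode: str = "auto",
                     tool_name: Optional[str] = None) -> Dict[str, Any]:
    """Build a Converse toolConfig with an explicit toolChoice.

    mode: "auto" (model decides), "any" (must call some tool),
          or "tool" (must call the tool named by tool_name).
    """
    if mode == "auto":
        choice: Dict[str, Any] = {"auto": {}}
    elif mode == "any":
        choice = {"any": {}}
    elif mode == "tool":
        if not tool_name:
            raise ValueError("tool_name is required when mode is 'tool'")
        choice = {"tool": {"name": tool_name}}
    else:
        raise ValueError("unknown toolChoice mode: " + mode)
    return {"tools": [{"toolSpec": spec} for spec in tool_specs], "toolChoice": choice}


def first_tool_input(response: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the arguments of the first toolUse block, or None if there is none."""
    for block in response["output"]["message"]["content"]:
        if "toolUse" in block:
            return block["toolUse"]["input"]
    return None
```

For structured extraction, a spec such as `{"name": "record_shipment_fields", "description": "...", "inputSchema": {"json": {...}}}` would be passed with `mode="tool"` and `tool_name="record_shipment_fields"`, and `first_tool_input(response)` would then hold the extracted fields. Validating that dict against the schema before using it is still advisable.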

Can I stream and use tools in the same turn?

Yes — this is the standard production pattern. You call converse_stream; if the model requests a tool, the toolUse block arrives across stream events (the tool input JSON comes as fragments in delta.toolUse.input that you concatenate and parse after the block stops), and the stream ends with stopReason "tool_use". You then append the assistant message, execute the tool, append a toolResult with the matching toolUseId, and call converse_stream again to stream the grounded final answer. To the user it reads as one response with a brief "looking that up" pause in the middle.
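
The fragment handling is the part that usually trips people up, so here is a minimal sketch of it. It consumes the events of one `converse_stream` call, streams any text through an optional callback, concatenates `delta.toolUse.input` fragments per content block, and parses the JSON only once the block stops. `assemble_streamed_message` is an illustrative name.

```python
import json
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple


def assemble_streamed_message(
    events: Iterable[Dict[str, Any]],
    on_text: Optional[Callable[[str], None]] = None,
) -> Tuple[Dict[str, Any], Optional[str]]:
    """Rebuild the assistant message from stream events; return (message, stopReason)."""
    text_parts: Dict[int, List[str]] = {}
    tool_calls: Dict[int, Dict[str, Any]] = {}
    stop_reason: Optional[str] = None

    for event in events:
        if "contentBlockStart" in event:
            start = event["contentBlockStart"]
            tool = start.get("start", {}).get("toolUse")
            if tool is not None:
                tool_calls[start["contentBlockIndex"]] = {
                    "toolUseId": tool["toolUseId"], "name": tool["name"], "raw": "",
                }
        elif "contentBlockDelta" in event:
            block = event["contentBlockDelta"]
            index, delta = block["contentBlockIndex"], block["delta"]
            if "text" in delta:
                text_parts.setdefault(index, []).append(delta["text"])
                if on_text is not None:
                    on_text(delta["text"])
            elif "toolUse" in delta:
                # Input JSON arrives as string fragments; just concatenate.
                tool_calls[index]["raw"] += delta["toolUse"].get("input", "")
        elif "contentBlockStop" in event:
            index = event["contentBlockStop"]["contentBlockIndex"]
            if index in tool_calls:
                call = tool_calls[index]
                call["input"] = json.loads(call.pop("raw") or "{}")
        elif "messageStop" in event:
            stop_reason = event["messageStop"]["stopReason"]

    content: List[Dict[str, Any]] = []
    for index in sorted(set(text_parts) | set(tool_calls)):
        if index in tool_calls:
            content.append({"toolUse": tool_calls[index]})
        else:
            content.append({"text": "".join(text_parts[index])})
    return {"role": "assistant", "content": content}, stop_reason
```

When the returned stop reason is `"tool_use"`, the message is appended to history, the tools run, and a second `converse_stream` call streams the final answer.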

How should I handle errors like throttling, dropped streams, and tool failures?

Treat them as expected. For ThrottlingException, retry with exponential backoff and jitter (capped), and request a quota increase or use provisioned throughput for steady load; never retry ValidationException or AccessDeniedException — fix the request or IAM instead. Wrap stream iteration in try/except so a mid-stream drop is caught, and decide whether to retry the turn or keep the partial answer. For tools, put a timeout on every call, return a toolResult with status "error" on failure so the model can recover gracefully, make write tools idempotent, and validate the model's arguments against your JSON Schema before acting.
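
Two of these practices can be sketched without any AWS dependency: capped exponential backoff with jitter, and a per-tool timeout that degrades to an error toolResult. The sketch assumes exceptions carry a botocore-style `response["Error"]["Code"]`, as `ClientError` does; the helper names and default delays are illustrative, and JSON Schema validation of arguments is not covered here.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Any, Callable, Dict, Optional

RETRYABLE_CODES = {"ThrottlingException"}  # never ValidationException / AccessDeniedException


def error_code(exc: BaseException) -> Optional[str]:
    """Read the botocore-style error code, if the exception has one."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code")


def call_with_backoff(fn: Callable[[], Any], max_attempts: int = 5,
                      base_delay: float = 0.5, max_delay: float = 8.0,
                      sleep: Callable[[float], None] = time.sleep) -> Any:
    """Retry fn on throttling with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if error_code(exc) not in RETRYABLE_CODES or attempt == max_attempts - 1:
                raise
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))


def run_tool_with_timeout(call: Dict[str, Any],
                          handler: Callable[[Dict[str, Any]], Dict[str, Any]],
                          timeout_s: float = 5.0) -> Dict[str, Any]:
    """Run one tool call; on timeout or failure return a status "error" toolResult."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(handler, call["input"])
    try:
        content, status = [{"json": future.result(timeout=timeout_s)}], "success"
    except FutureTimeout:
        content, status = [{"text": "Tool timed out"}], "error"
    except Exception as exc:
        content, status = [{"text": "Tool failed: " + str(exc)}], "error"
    finally:
        # Do not block on a hung handler. It may still finish in the background,
        # which is one reason write tools should be idempotent.
        pool.shutdown(wait=False)
    return {"toolResult": {"toolUseId": call["toolUseId"], "content": content, "status": status}}
```

A typical call site would be `call_with_backoff(lambda: client.converse(**request))`. For streams, the same wrapper can cover the initial `converse_stream` call, with a separate try/except around the iteration to catch a mid-stream drop.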

Does streaming or tool use change how much Bedrock costs, and when should I use Bedrock Agents instead?

Streaming is billed identically to non-streaming — same tokens, same price; it only changes perceived latency. Tool use does affect cost because a tool-using turn is two or more model calls (the initial request plus the post-toolResult answer), each billed for its tokens, and the system prompt and tool schema are resent every call in the loop — exactly the repeated context that prompt caching discounts, so caching the stable system prompt and tool definitions is the highest-leverage cost move for multi-tool agents. Hand-rolled Converse tool use (owning the loop) is right for a focused assistant calling a handful of functions with custom UX; graduate to Bedrock Agents when the orchestration itself is the hard part — many tools, multi-step plans, managed retrieval via a Knowledge Base. Both use the same toolUse/toolResult mechanic on the same foundation. AWS credits can also fund the inference bill outright — Activate Portfolio (up to $100K), Bedrock/GenAI POC ($10K–$50K), and the GenAI Accelerator (up to $1M); CloudRoute routes you to a partner who files them and you pay $0.

Build the streaming, tool-using assistant — and let AWS credits pay for the inference.

CloudRoute routes you to a vetted AWS partner who files your Bedrock/GenAI credit application (Activate Portfolio up to $100K, GenAI POC $10K–$50K, GenAI Accelerator up to $1M) and, if you need hands, builds the Converse streaming + tool-use workload with you. AWS funds the credits and the engagement. You pay $0.

Get matched in 24h →

→ see the data & AI persona detail

matched within: < 24h

GenAI credit ceiling: up to $1M

cost to you: $0