
bedrock api · the developer reference (2026)

The Amazon Bedrock API — a developer reference + quickstart.

Q: What is the Amazon Bedrock API?

It is the set of AWS API operations for using foundation models on Bedrock from code. The runtime (the bedrock-runtime service) has four operations for inference: InvokeModel and InvokeModelWithResponseStream (lower-level, provider-specific request bodies) and Converse and ConverseStream (one unified schema across all chat models). A separate control plane (the bedrock service) manages models, Guardrails, fine-tuning, and evaluation; Knowledge Bases and Agents are managed through the companion bedrock-agent service. No separate API key is required: Bedrock accepts standard AWS IAM credentials.

Q: What is the difference between the Converse API and InvokeModel?

InvokeModel sends a request body in each model provider's own JSON format and returns a provider-specific response, giving maximum control at the cost of provider-specific code and harder model swaps. The Converse API uses one consistent request/response schema across every chat model — with built-in multi-turn messages, system prompts, tool use, and (via ConverseStream) streaming — so switching models is usually just changing the modelId. Use Converse for chat and agents; use InvokeModel for non-conversational modalities like embeddings and images or a provider-specific parameter Converse does not expose.

Q: How do I make my first Bedrock API call?

Three steps. (1) In the Bedrock console enable the model under Model access (per-Region, per-model). (2) Attach an IAM policy allowing the runtime action, ideally scoped to the model ARN: bedrock:InvokeModel authorizes both InvokeModel and Converse, and bedrock:InvokeModelWithResponseStream authorizes the two streaming operations. (3) Call it: in Python use boto3 client("bedrock-runtime").converse(modelId=..., messages=[...], inferenceConfig={...}); in Node use @aws-sdk/client-bedrock-runtime with a BedrockRuntimeClient and ConverseCommand. The assistant text is at output.message.content[0].text and token counts at usage.

Q: Which SDKs and languages does the Bedrock API support?

All standard AWS SDKs, since Bedrock is a normal AWS service. The most common are boto3 (Python, the bedrock-runtime client) and the AWS SDK for JavaScript v3 (@aws-sdk/client-bedrock-runtime). SDKs also exist for Java, Go, .NET, Rust, and more, plus the AWS CLI. There is also a higher-level cross-provider abstraction in some AWS open-source tooling, but the SDKs calling Converse/InvokeModel directly are the canonical path.

Q: How does tool use (function calling) work in the Bedrock API?

You declare tools in the Converse request's toolConfig — each with a name, a description, and a JSON-Schema inputSchema. When the model decides to use one, the response comes back with stopReason "tool_use" and a toolUse content block ({ toolUseId, name, input }). Your code runs the actual function and sends a follow-up user message containing a toolResult block (echoing the same toolUseId) with the output; the model then produces its final answer. You own the loop; Amazon Bedrock Agents can manage it for you if you prefer not to write it.

Q: How do I handle throttling (429 / ThrottlingException) on Bedrock?

ThrottlingException (HTTP 429) means you exceeded the per-account/per-model requests-per-minute or tokens-per-minute limit. Handle it with exponential backoff and jitter (the AWS SDK retry modes give you a baseline; add your own backoff and a concurrency limiter for high throughput). For more headroom, request a Service Quotas increase for the model, use cross-region inference profiles to spread load, or buy Provisioned Throughput for guaranteed capacity. Retry 429 and 5xx errors with backoff; do not retry 400/403/404 — fix the request, IAM, or modelId instead.

Q: How do I stream responses from the Bedrock API?

Use ConverseStream (or InvokeModelWithResponseStream for the lower-level API). Instead of one response you iterate an event stream: messageStart, a series of contentBlockDelta events carrying text chunks you append to the UI, contentBlockStop/messageStop with the stopReason, and a final metadata event with usage token counts. Streaming changes only when output arrives — it does not change pricing or the request shape. It is the right choice for any human-facing chat or long-generation UI.

Q: What inference parameters can I set, and what are good defaults?

In Converse's inferenceConfig: maxTokens (caps response length and output cost), temperature (randomness — low for deterministic extraction/tool use, higher for creative work), topP (nucleus sampling, usually tuned instead of temperature), and stopSequences (strings that halt generation). The system prompt is set separately in the top-level system field. A solid production default is a low temperature (0.0–0.2) with an explicit maxTokens ceiling; provider-only knobs like topK go through additionalModelRequestFields.

Everything you need to call foundation models on AWS from code: the two runtime APIs (InvokeModel vs Converse, plus ConverseStream), the exact request and response shapes, tool use (function calling), system prompts, inference parameters, the boto3 and JavaScript SDKs, token streaming, and how to handle throttling and errors. Reference-grade, with copy-pasteable snippets — and the cost reality of running it in production.



TL;DR

The Amazon Bedrock API has two runtime operations for text/chat: InvokeModel (lower-level, provider-specific JSON body) and Converse (one consistent schema across every chat model, with multi-turn messages, system prompts, tool use, and inference config). For streaming you use InvokeModelWithResponseStream or ConverseStream. Pick Converse for almost all new chat and agent work; reach for InvokeModel only for image/embedding modalities or a provider-specific parameter Converse does not expose.
A Converse request is { modelId, messages, system, inferenceConfig, toolConfig }; the response is { output.message, stopReason, usage }. Tool use (function calling) is built in: you declare tools in toolConfig, the model replies with stopReason "tool_use", you run the tool and send the result back as a toolResult content block. Inference is controlled by maxTokens, temperature, topP, and stopSequences. The same call works in boto3 (bedrock-runtime) and the AWS SDK for JavaScript v3 (@aws-sdk/client-bedrock-runtime).
In production the two operational realities are throttling and cost. Bedrock returns ThrottlingException (HTTP 429) when you exceed account/model request-and-token rates — handle it with exponential backoff and jitter, request quota increases, or use Provisioned Throughput / cross-region inference for headroom. And GenAI bills scale fast: CloudRoute routes you to AWS credits (Activate Portfolio up to $100K, Bedrock/GenAI POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted partner to build the production setup — you pay $0.

the two runtime APIs

I. The Bedrock API surface: control plane vs runtime, InvokeModel vs Converse

Before any code, get the map straight. "The Bedrock API" is really two API families: a control plane for managing the service and a runtime for actually calling models. For sending a prompt and getting a completion, you live entirely in the runtime, where there are two operations to choose between — InvokeModel and Converse — each with a streaming variant.

The control plane (the bedrock service in the SDKs) is where you manage the service itself: list foundation models, create and manage Guardrails, fine-tuning jobs, model evaluation, and provisioned-throughput model units. Knowledge Bases and Agents are managed the same way, through the companion bedrock-agent service. You touch the control plane at setup and configuration time, not on every request. The runtime plane (the bedrock-runtime service) is the hot path: it is where inference happens, and it exposes exactly four operations that matter for day-to-day model calls.

Those four runtime operations are InvokeModel and InvokeModelWithResponseStream (the original, lower-level pair) and Converse and ConverseStream (the modern, unified pair). The non-streaming operations return the full completion in one response; the streaming operations return tokens incrementally as the model generates them. Choosing between the InvokeModel family and the Converse family is the single most consequential API decision you make, so it is worth being precise about the difference.

InvokeModel is provider-shaped. You send a request whose body is a JSON blob in the exact format that specific model provider defined — Anthropic's body looks different from Meta's, which looks different from Amazon Nova's — and you parse a provider-specific response body. It gives you maximum fidelity to each provider's native parameters, but it means your code branches per provider and a model swap can be a rewrite.

Converse is provider-agnostic from your side and provider-shaped only under the hood. You send one canonical request schema (messages, an optional system prompt, inference config, and an optional tool config) and Bedrock translates it to whatever the underlying model expects, returning one canonical response schema regardless of provider. Multi-turn conversation, system prompts, tool use (function calling), and streaming are all first-class. Switching from, say, Claude to Llama to Nova is usually a one-line change to the modelId string. For chat and agentic applications, Converse is the recommended default; the rest of this reference centers on it and notes where InvokeModel is still the right tool.

the four runtime operations at a glance

InvokeModel — single response, provider-specific body (max control, provider-specific code).
InvokeModelWithResponseStream — streamed response, provider-specific body.
Converse — single response, one unified schema across all chat models (recommended default).
ConverseStream — streamed response, same unified schema.
Control-plane operations (ListFoundationModels, Guardrails, fine-tuning, etc.) live in the separate bedrock service, not bedrock-runtime.
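To make the contrast concrete, here is a minimal sketch of the same prompt sent both ways. The Anthropic body layout (anthropic_version, max_tokens, messages) reflects my understanding of that provider's native format on Bedrock and should be checked against the provider documentation. The helper names are hypothetical, and the client is passed in so the functions can be exercised without AWS access.

```python
import json


def ask_with_invoke_model(client, model_id, prompt, max_tokens=256):
    """InvokeModel: the body follows one provider's native JSON format.

    Assumption: this is the Anthropic Messages layout; other providers
    expect different keys, so this function is not portable across models.
    """
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": [{"type": "text", "text": prompt}]}],
    }
    resp = client.invoke_model(
        modelId=model_id,
        body=json.dumps(body),
        contentType="application/json",
        accept="application/json",
    )
    payload = json.loads(resp["body"].read())  # the response is provider-shaped too
    return payload["content"][0]["text"]


def ask_with_converse(client, model_id, prompt, max_tokens=256):
    """Converse: one request/response schema regardless of provider."""
    resp = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": max_tokens},
    )
    return resp["output"]["message"]["content"][0]["text"]
```

With boto3 you would pass boto3.client("bedrock-runtime") as client. Only the Converse helper keeps working unchanged when model_id points at a different provider.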

choosing the operation

II. Converse vs InvokeModel: when to use which

The honest rule is short: use Converse for anything conversational, and use InvokeModel for the cases Converse does not cover. The table makes the trade-offs explicit so the choice is a five-second decision rather than a research project.

Reach for Converse / ConverseStream when you are building a chatbot, an assistant, an agent, or any multi-turn experience; when you want portability across models so you can A/B or fall back between providers; when you need tool use (function calling) with a consistent shape; or when you simply want the least provider-specific code. This is the overwhelming majority of new applications.

Reach for InvokeModel / InvokeModelWithResponseStream when you are calling a non-conversational modality that Converse does not model — image generation (Stability, Nova Canvas), video (Nova Reel), or text embeddings (Titan, Cohere Embed) — or when you need a niche provider-specific parameter that Converse intentionally does not expose. Embeddings and image endpoints are the most common legitimate reasons teams still use InvokeModel in 2026.

bedrock runtime operation chooser · Converse vs InvokeModel

If you need…	Use	Request body	Streaming variant	Notes
Chat / multi-turn / agents	Converse	Unified schema	ConverseStream	Recommended default for text/chat
Model portability (swap providers)	Converse	Unified schema	ConverseStream	Switch model = change modelId only
Tool use (function calling)	Converse	toolConfig in schema	ConverseStream	Consistent across providers
Text embeddings	InvokeModel	Provider-specific	n/a	Titan / Cohere Embed; no Converse equivalent
Image generation	InvokeModel	Provider-specific	n/a	Stability / Nova Canvas
Video generation	InvokeModel (async)	Provider-specific	n/a	Nova Reel; uses async invocation
A provider-only parameter	InvokeModel	Provider-specific	WithResponseStream	Only when Converse omits it

Practical default: write chat and agent code against Converse/ConverseStream and only drop to InvokeModel for embeddings, images, video, or a genuinely provider-specific knob. Mixing is normal — one app commonly uses Converse for chat and InvokeModel for the embeddings that feed its retrieval step.
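A sketch of that mixed pattern, under stated assumptions: the embedding call uses a Titan-style body ({"inputText": ...} in, "embedding" out), the default embedding model ID is illustrative, and both helper names are hypothetical. The client is injected so the code can be tried against a stub.

```python
import json


def embed_text(client, text, model_id="amazon.titan-embed-text-v2:0"):
    """Embed one string via InvokeModel (Titan-style body, assumed)."""
    resp = client.invoke_model(
        modelId=model_id,  # illustrative ID; copy the exact one from the console
        body=json.dumps({"inputText": text}),
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(resp["body"].read())["embedding"]


def answer_with_context(client, chat_model_id, question, passages):
    """Answer a question via Converse, grounded in already-retrieved passages."""
    context = "\n".join(f"- {p}" for p in passages)
    resp = client.converse(
        modelId=chat_model_id,
        system=[{"text": "Answer only from the provided context."}],
        messages=[{
            "role": "user",
            "content": [{"text": f"Context:\n{context}\n\nQuestion: {question}"}],
        }],
        inferenceConfig={"maxTokens": 300, "temperature": 0.1},
    )
    return resp["output"]["message"]["content"][0]["text"]
```

The vector search between the two calls is deliberately left out; it belongs to whatever store holds the embeddings.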

zero to a first completion

III. Quickstart: your first Converse call (boto3 and JavaScript)

Three prerequisites, then one call. The prerequisites are the same regardless of language: enable model access in the Bedrock console for the model you want, attach an IAM policy that allows the runtime action, and pick a Region the model is available in. Then the call itself is a handful of lines.

Prerequisite 1: model access. In the Bedrock console open Model access and enable the model you intend to call (for example a workhorse chat model); access is per-Region and per-model, and some models require accepting the provider's license first. Prerequisite 2: IAM. Grant the calling principal the runtime actions it needs, ideally scoped to specific model ARNs. bedrock:InvokeModel authorizes both InvokeModel and Converse, and bedrock:InvokeModelWithResponseStream authorizes both streaming operations; there are no separate Converse actions. Prerequisite 3: Region. Configure the SDK with a Region where the model is live; frontier models often land in US Regions first. With those in place, the call below returns a completion.
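For the IAM prerequisite, a least-privilege policy might look like the sketch below. It is built as a Python dict so it can be printed or passed to tooling. It assumes the Converse operations are authorized by the InvokeModel actions and that foundation-model ARNs have an empty account field. The function name and the model ID are illustrative.

```python
import json


def bedrock_invoke_policy(region, model_ids, allow_streaming=True):
    """Build a least-privilege IAM policy document for runtime calls."""
    actions = ["bedrock:InvokeModel"]
    if allow_streaming:
        actions.append("bedrock:InvokeModelWithResponseStream")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "InvokeSelectedModels",
            "Effect": "Allow",
            "Action": actions,
            # Foundation-model ARNs carry no account ID between the two colons.
            "Resource": [
                f"arn:aws:bedrock:{region}::foundation-model/{m}" for m in model_ids
            ],
        }],
    }


# Illustrative model ID; copy the exact one from the console.
print(json.dumps(bedrock_invoke_policy("us-east-1", ["anthropic.claude-sonnet"]), indent=2))
```

If you call through a cross-region inference profile, the policy will likely also need the profile's ARN and the model ARNs in each Region it can route to, so treat this as a starting point.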

In Python you use boto3 and the bedrock-runtime client. No separate API key is needed: Bedrock accepts standard AWS credentials (IAM role, environment, or profile), the same as every other AWS SDK call.

Python (boto3)

The converse method takes a modelId, a messages list, and an inferenceConfig; the assistant's text is at response["output"]["message"]["content"][0]["text"], and token counts come back in response["usage"]. Model IDs below are illustrative — copy the exact current ID from the Bedrock console.

JavaScript / TypeScript (AWS SDK v3)

In Node.js you use the modular v3 client @aws-sdk/client-bedrock-runtime: instantiate a BedrockRuntimeClient, build a ConverseCommand with the same logical fields (camelCase here — modelId, messages, inferenceConfig), and send it. The response shape mirrors the Python one: response.output.message.content[0].text and response.usage.

a minimal Converse call — python (boto3) and javascript (sdk v3)

# Python — boto3
import boto3
brt = boto3.client("bedrock-runtime", region_name="us-east-1")
resp = brt.converse(
  modelId="anthropic.claude-sonnet",  # illustrative — copy exact ID from console
  messages=[{"role": "user", "content": [{"text": "List three uses for embeddings."}]}],
  inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)
print(resp["output"]["message"]["content"][0]["text"])
print(resp["usage"])  # {inputTokens, outputTokens, totalTokens}

// JavaScript — @aws-sdk/client-bedrock-runtime (v3)
import { BedrockRuntimeClient, ConverseCommand } from "@aws-sdk/client-bedrock-runtime";
const client = new BedrockRuntimeClient({ region: "us-east-1" });
const resp = await client.send(new ConverseCommand({
  modelId: "anthropic.claude-sonnet",  // illustrative
  messages: [{ role: "user", content: [{ text: "List three uses for embeddings." }] }],
  inferenceConfig: { maxTokens: 512, temperature: 0.2 },
}));
console.log(resp.output.message.content[0].text);

the canonical shape

IV. The Converse request and response shape, in detail

Converse is worth learning once because the shape is stable across every model. A request has five core top-level fields plus two optional extras; a response has three fields you use constantly. Internalize these and you can read or write any Bedrock chat call without per-provider documentation.

A Converse request has these top-level fields. modelId (required) is the model or inference-profile identifier. messages (required) is the conversation: an ordered array of turns, each { role: "user" | "assistant", content: [...] }, where content is a list of content blocks — most commonly { text: "..." }, but also image blocks, document blocks, and (for tool use) toolUse / toolResult blocks. system (optional) is a separate array of system-prompt blocks — Converse keeps the system prompt out of the message list so it is unambiguous. inferenceConfig (optional) holds the sampling parameters. toolConfig (optional) declares the tools the model may call. Two further optional fields are guardrailConfig (attach a Bedrock Guardrail by ID) and additionalModelRequestFields (an escape hatch to pass a provider-specific parameter Converse does not model directly).

A Converse response has three fields you use constantly. output.message is the assistant turn — same { role, content: [...] } structure as an input message, so you can append it straight back onto messages to continue the conversation. stopReason tells you why generation stopped: "end_turn" (the model finished normally), "max_tokens" (it hit your maxTokens limit), "stop_sequence" (it emitted one of your stop sequences), "tool_use" (it wants to call a tool — see section VI), or "guardrail_intervened" (a Guardrail blocked the content). usage reports inputTokens, outputTokens, and totalTokens — the numbers your bill and your rate limits are based on, so log them.

Maintaining a conversation is therefore mechanical: send messages, take output.message from the response, append it to messages, append the next user turn, and call again. There is no server-side session — Bedrock is stateless, and you own the transcript. That statelessness is deliberate: it keeps the API simple, keeps your data in your control, and lets you trim, summarize, or cache history however you like.
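The transcript bookkeeping described above can be sketched in a few lines. The helper names (send_turn, trim_history) are hypothetical, and the client is injected so the logic can be tested against a stub rather than a live endpoint.

```python
def send_turn(client, model_id, messages, user_text, max_tokens=512):
    """Append a user turn, call Converse, append the assistant turn in place."""
    messages.append({"role": "user", "content": [{"text": user_text}]})
    resp = client.converse(
        modelId=model_id,
        messages=messages,
        inferenceConfig={"maxTokens": max_tokens},
    )
    assistant = resp["output"]["message"]
    messages.append(assistant)  # same {role, content} shape as an input message
    text = "".join(block.get("text", "") for block in assistant["content"])
    return text, resp["stopReason"], resp["usage"]


def trim_history(messages, max_turns=20):
    """Keep the most recent turns, making sure the window starts on a user turn."""
    recent = messages[-max_turns:]
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return recent
```

Because the service holds no session, trimming is entirely your call. This naive window ignores tool exchanges, so a real implementation would avoid cutting between a toolUse block and its toolResult.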

converse request & response fields · reference

Field	Where	Required	Purpose
modelId	request	Yes	Model or cross-region inference-profile ID to invoke
messages	request	Yes	Ordered turns; each { role, content[] } of typed blocks
system	request	No	System-prompt blocks, kept separate from messages
inferenceConfig	request	No	maxTokens, temperature, topP, stopSequences
toolConfig	request	No	Tool (function) declarations + tool-choice policy
guardrailConfig	request	No	Attach a Bedrock Guardrail by id/version
output.message	response	—	Assistant turn; append back onto messages to continue
stopReason	response	—	end_turn / max_tokens / stop_sequence / tool_use / guardrail_intervened
usage	response	—	inputTokens, outputTokens, totalTokens (billing + limits)

Field names shown are the Converse logical names; the SDKs keep them in camelCase (as keyword arguments in boto3, as object keys in JS). InvokeModel does not use this shape at all: its body and response are whatever the provider defines.

steering the model

V. System prompts and inference parameters

Two levers control behavior on every call: the system prompt (what the model is and how it should behave) and the inference parameters (how it samples tokens). Converse exposes both cleanly and identically across models.

The system prompt sets persona, rules, output format, and guardrails-in-prose. In Converse it is the top-level system field — an array of blocks such as [{ "text": "You are a precise support assistant. Answer only from provided context. If unsure, say so." }] — kept deliberately separate from the user/assistant messages so there is never ambiguity about what is instruction versus conversation. Keeping the system prompt stable across turns also makes it a prime candidate for prompt caching, which bills repeated context at a steep discount instead of re-charging full input price every call.

The inference parameters live in inferenceConfig and are consistent across providers: maxTokens caps the length of the response (and thus the most expensive half of the bill); temperature controls randomness (low — for example 0–0.3 — for deterministic extraction and tool use; higher for creative drafting); topP is nucleus sampling, an alternative randomness control you usually tune instead of, not together with, temperature; and stopSequences is a list of strings that, when generated, halt the model (useful for forcing structured output or delimiting sections). A small number of model-specific knobs (such as topK on some providers) are not part of the unified config — pass those through additionalModelRequestFields when a given model supports them.

A practical default for production reasoning and tool use is a low temperature (0.0–0.2) with a sensible maxTokens ceiling, which keeps outputs more repeatable, cheaper, and easier to test. Raise temperature only where variety is the point. Always set maxTokens explicitly: leaving it at the model's default is a common cause of surprise output-token costs and of responses truncating in ways you did not plan for.

system prompt + inference config in a Converse call

resp = brt.converse(
  modelId="anthropic.claude-sonnet",
  system=[{"text": "You are a precise support assistant. Answer only from the provided context; if unsure, say you don't know."}],
  messages=[{"role": "user", "content": [{"text": "What is our refund window?"}]}],
  inferenceConfig={"maxTokens": 400, "temperature": 0.1, "topP": 0.9, "stopSequences": ["\n\nUSER:"]},
)

Tip: hold the system prompt constant across turns and enable prompt caching so you are not billed full input price to re-process it on every request.
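Two of the points above, prompt caching for a stable system prompt and provider-only knobs via additionalModelRequestFields, can be folded into one request builder. This is a sketch under assumptions. The cachePoint block is my understanding of how Converse marks a cacheable prefix, and it only helps on models that support caching. The top_k key is an example of a provider-specific name, and build_converse_request is a hypothetical helper.

```python
def build_converse_request(model_id, system_text, user_text, *, max_tokens=400,
                           temperature=0.1, cache_system=False, extra_fields=None):
    """Assemble keyword arguments for client.converse(**request)."""
    system = [{"text": system_text}]
    if cache_system:
        # Assumption: the model supports prompt caching; the cachePoint block
        # marks everything before it as a reusable prefix.
        system.append({"cachePoint": {"type": "default"}})
    request = {
        "modelId": model_id,
        "system": system,
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
        "inferenceConfig": {"maxTokens": max_tokens, "temperature": temperature},
    }
    if extra_fields:
        # Provider-only knobs, e.g. {"top_k": 50} on a model that accepts it.
        request["additionalModelRequestFields"] = dict(extra_fields)
    return request
```

A call then reads brt.converse(**build_converse_request(...)). Keeping the builder separate from the client call makes the request easy to log and unit-test.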

function calling

VI. Tool use (function calling) with the Converse API

Tool use — also called function calling — is how a model goes from talking to doing: it asks your code to run a function, you run it, you hand back the result, and the model continues with that information in hand. Converse standardizes this loop across every model that supports it.

You declare tools in toolConfig.tools. Each tool is a toolSpec with a name, a natural-language description (the model uses this to decide when to call it — write it well), and an inputSchema expressed as a JSON Schema describing the parameters. You can also set toolConfig.toolChoice to influence whether the model may call a tool (auto), must call some tool (any), or must call one specific tool. The model never executes anything itself — it only emits a request to call a tool.

The loop has four steps. (1) You send the user message plus toolConfig. (2) If the model decides to use a tool, it returns stopReason: "tool_use" and an assistant message whose content includes a toolUse block — { toolUseId, name, input } — telling you which tool and with what arguments. (3) Your code runs the actual function (query a database, call an internal API, hit a pricing service) and sends a new user message containing a toolResult block that echoes the same toolUseId and carries the function's output. (4) The model incorporates the result and produces its final answer (or requests another tool). You append each turn to messages as you go, exactly as in any other multi-turn conversation — the only new content-block types are toolUse (from the model) and toolResult (from you).

This is the same primitive that Amazon Bedrock Agents build on top of: an Agent automates this plan-call-observe loop, including the orchestration and prompt construction, when you would rather configure tools than write the loop yourself. Use raw Converse tool use when you want full control of the loop in your own code; use Agents when you want the loop managed for you.

declaring a tool in toolConfig (the model then returns stopReason "tool_use")

tool_config = {
  "tools": [{
    "toolSpec": {
      "name": "get_order_status",
      "description": "Look up the current status of a customer order by its ID.",
      "inputSchema": {"json": {"type": "object", "properties": {"order_id": {"type": "string"}}, "required": ["order_id"]}}
    }
  }]
}
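MODEL = "anthropic.claude-sonnet"  # illustrative; copy the exact ID from the console
messages = [{"role": "user", "content": [{"text": "Where is order 1234?"}]}]
resp = brt.converse(modelId=MODEL, messages=messages, toolConfig=tool_config)
if resp["stopReason"] == "tool_use":
  # Find the toolUse block the model emitted.
  tool_use = next(b["toolUse"] for b in resp["output"]["message"]["content"] if "toolUse" in b)
  # get_order_status is your own function (hypothetical here); it should return a dict.
  result = get_order_status(tool_use["input"]["order_id"])
  # Append the assistant turn, then a user turn with a toolResult (same toolUseId), and call again.
  messages.append(resp["output"]["message"])
  messages.append({"role": "user", "content": [
    {"toolResult": {"toolUseId": tool_use["toolUseId"], "content": [{"json": result}]}}]})
  resp = brt.converse(modelId=MODEL, messages=messages, toolConfig=tool_config)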
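Since the model may request another tool after seeing a result, the four-step exchange generalizes to a loop. The sketch below is one way to write it, not a canonical implementation. run_tool_loop and the handlers mapping are hypothetical names, each handler is assumed to take the tool's input dict and return a JSON-serializable dict, and max_rounds is an arbitrary guard against runaway loops.

```python
def run_tool_loop(client, model_id, messages, tool_config, handlers, max_rounds=5):
    """Call Converse until the model stops asking for tools; return the final message."""
    for _ in range(max_rounds):
        resp = client.converse(modelId=model_id, messages=messages, toolConfig=tool_config)
        assistant = resp["output"]["message"]
        messages.append(assistant)
        if resp["stopReason"] != "tool_use":
            return assistant
        results = []
        for block in assistant["content"]:
            if "toolUse" not in block:
                continue  # skip any text the model emitted alongside the call
            call = block["toolUse"]
            try:
                output = handlers[call["name"]](call["input"])
                result = {"toolUseId": call["toolUseId"], "content": [{"json": output}]}
            except Exception as exc:  # report the failure to the model instead of crashing
                result = {
                    "toolUseId": call["toolUseId"],
                    "content": [{"text": f"tool failed: {exc}"}],
                    "status": "error",
                }
            results.append({"toolResult": result})
        # All results for this round go back in a single user turn.
        messages.append({"role": "user", "content": results})
    raise RuntimeError("tool loop did not finish within max_rounds")
```

Passing handlers as a plain dict keeps dispatch explicit: the model can only trigger functions you registered, and an unknown tool name surfaces as an error result rather than an exception in your service.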

tokens as they generate

VII. Streaming responses with ConverseStream

For anything a human watches — a chat UI, a coding assistant, a long answer — you want tokens to appear as they are generated rather than after the whole completion is done. That is what the streaming operations are for, and Converse makes the event shape uniform.

Call ConverseStream (or InvokeModelWithResponseStream for the lower-level API) and instead of a single response you get an event stream you iterate over. The stream is a sequence of typed events: messageStart (the assistant turn begins, with its role), a series of contentBlockDelta events (each carrying a small chunk of text — the part you append to the UI as it arrives), contentBlockStop and messageStop (the turn is complete, with the stopReason), and a final metadata event that carries the usage token counts and latency metrics. Tool use streams too: the model's toolUse arguments arrive incrementally across deltas, which you accumulate before running the tool.

Streaming is purely a transport choice — it does not change pricing (you pay for the same input and output tokens) or the logical request shape (the request is the same as a non-streaming Converse call). It only changes when you receive the output. The trade-off is in your code: streaming improves perceived latency dramatically but means you handle a stream of events and assemble the final message yourself, rather than reading one field off one response.

consuming a ConverseStream (python / boto3)

stream = brt.converse_stream(
  modelId="anthropic.claude-sonnet",
  messages=[{"role": "user", "content": [{"text": "Explain prompt caching in two sentences."}]}],
  inferenceConfig={"maxTokens": 300},
)
for event in stream["stream"]:
  if "contentBlockDelta" in event:
    print(event["contentBlockDelta"]["delta"].get("text", ""), end="")  # stream chunks to the UI; tool-use deltas carry no text
  elif "metadata" in event:
    print(event["metadata"]["usage"])  # final token counts arrive last
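Assembling the final message yourself, including tool arguments that arrive in fragments, could look like the sketch below. It assumes the event shapes described above (contentBlockStart announcing a toolUse, deltas carrying either text or a partial JSON string). assemble_stream is a hypothetical helper that accepts any iterable of event dicts, such as stream["stream"].

```python
import json


def assemble_stream(events):
    """Fold ConverseStream events into (message, stop_reason, usage)."""
    role, blocks, stop_reason, usage = "assistant", {}, None, None
    for event in events:
        if "messageStart" in event:
            role = event["messageStart"]["role"]
        elif "contentBlockStart" in event:
            start = event["contentBlockStart"]
            tool = start.get("start", {}).get("toolUse")
            if tool:  # a tool call announces its id and name up front
                blocks[start["contentBlockIndex"]] = {
                    "toolUse": {"toolUseId": tool["toolUseId"], "name": tool["name"], "input": ""}
                }
        elif "contentBlockDelta" in event:
            delta = event["contentBlockDelta"]
            index = delta["contentBlockIndex"]
            if "text" in delta["delta"]:
                block = blocks.setdefault(index, {"text": ""})
                block["text"] += delta["delta"]["text"]
            elif "toolUse" in delta["delta"]:
                # Tool arguments arrive as fragments of one JSON string.
                blocks[index]["toolUse"]["input"] += delta["delta"]["toolUse"]["input"]
        elif "messageStop" in event:
            stop_reason = event["messageStop"]["stopReason"]
        elif "metadata" in event:
            usage = event["metadata"].get("usage")
    content = []
    for index in sorted(blocks):
        block = blocks[index]
        if "toolUse" in block:
            raw = block["toolUse"]["input"]
            block["toolUse"]["input"] = json.loads(raw) if raw else {}
        content.append(block)
    return {"role": role, "content": content}, stop_reason, usage
```

The returned message has the same shape as a non-streaming output.message, so it can be appended to the transcript or handed to a tool loop unchanged. A UI would additionally render each text delta as it arrives.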

production hardening

VIII. Error handling, throttling, and quotas

The difference between a demo and a production integration is almost entirely error handling — and on Bedrock the error you will meet most is throttling. Know the exceptions, retry the right ones correctly, and design for the rate limits up front.

Bedrock rate-limits the runtime per account, per model, and per Region along two axes: requests per minute and tokens per minute. Exceed either and the API returns ThrottlingException with HTTP status 429. This is the normal signal to slow down, not a bug. The correct response is exponential backoff with jitter — wait, then retry with progressively longer, randomized delays so a fleet of clients does not retry in lockstep. The AWS SDKs implement adaptive/standard retry modes that handle a baseline of this automatically; for high-throughput services you typically add your own backoff and a concurrency limiter on top, and you queue or shed load rather than hammering the API.

Beyond retrying, there are three structural ways to get more headroom. Request a quota increase for the specific model in Service Quotas when your steady-state demand genuinely exceeds the defaults. Use cross-region inference profiles so a request can be served from one of several Regions in a geography, spreading load and improving availability without you managing the routing (see cross-region inference). Or buy Provisioned Throughput to reserve dedicated capacity with guaranteed throughput for high, steady volume. Which one fits depends on whether your problem is occasional spikes (backoff + cross-region), a higher steady ceiling (quota increase), or guaranteed latency at scale (provisioned throughput).

The other exceptions are ordinary AWS API errors and should be handled distinctly because most are not retryable. AccessDeniedException means the IAM principal lacks the action or the model is not enabled in Model access — fix the policy or enable access, do not retry. ValidationException means a malformed request (bad parameter, oversized payload, wrong content block) — fix the request, do not retry. ResourceNotFoundException usually means a wrong or unavailable modelId for that Region. ModelTimeoutException and ServiceUnavailableException / InternalServerException (5xx) are transient and are safe to retry with backoff. A clean integration branches on the exception type: retry throttling and 5xx with backoff; surface validation and access errors to the developer immediately.

common bedrock runtime exceptions · retry guidance

Exception	HTTP	Meaning	Retry?	Fix
ThrottlingException	429	Request/token rate exceeded	Yes (backoff + jitter)	Backoff; quota increase; cross-region; provisioned throughput
ModelTimeoutException	408	Model took too long	Yes (backoff)	Retry; shorten prompt/maxTokens
ServiceUnavailableException / InternalServerException	5xx	Transient server-side issue	Yes (backoff)	Retry with backoff
ValidationException	400	Malformed request / bad params	No	Fix the request body
AccessDeniedException	403	Missing IAM perms or model not enabled	No	Add IAM action; enable Model access
ResourceNotFoundException	404	Unknown/unavailable modelId in Region	No	Correct modelId or switch Region

Rule of thumb: retry 429 and 5xx with exponential backoff + jitter; never blind-retry 400/403/404 — they will fail identically until you fix the cause. Always log usage tokens and the request ID from the response metadata so throttling and errors are traceable.
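The retry rule can be captured in a small wrapper. This is a sketch, not a drop-in library. It assumes exceptions expose a botocore-style response["Error"]["Code"], and the helper names and delay constants are illustrative. The sleep and random sources are injectable so the behavior can be tested deterministically.

```python
import random
import time

RETRYABLE_CODES = {
    "ThrottlingException",
    "ModelTimeoutException",
    "ServiceUnavailableException",
    "InternalServerException",
}


def error_code(exc):
    """Read the AWS error code from a botocore-style exception, if present."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code")


def call_with_backoff(call, max_attempts=6, base_delay=0.5, max_delay=20.0,
                      sleep=time.sleep, rng=random.random):
    """Run call(); retry retryable errors with capped exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as exc:
            if error_code(exc) not in RETRYABLE_CODES or attempt == max_attempts:
                raise  # validation/access errors, or retries exhausted
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * ceiling)  # full jitter: uniform in [0, ceiling)
```

Typical use is call_with_backoff(lambda: brt.converse(**request)). The SDK's own retries can be tuned as well, for example by passing a botocore Config with retries={"max_attempts": ..., "mode": "adaptive"} when creating the client. If you layer both, keep the combined attempt count in mind.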

the two APIs, side by side

InvokeModel vs Converse vs their streaming variants

A single scannable map of the runtime operations so the right call is obvious. "Unified schema" means the same request/response shape across all chat models; "provider-specific" means the body matches each model vendor's native format.

Operation	Schema	Streaming?	Tool use	Best for	Avoid for
Converse	Unified	No	Yes (built-in)	Chat, agents, portable text apps	Embeddings, images, video
ConverseStream	Unified	Yes	Yes (streamed)	Chat UIs, coding assistants, long answers	Batch/offline jobs
InvokeModel	Provider-specific	No	Provider-defined	Embeddings, images, provider-only params	New portable chat apps
InvokeModelWithResponseStream	Provider-specific	Yes	Provider-defined	Streaming a provider-specific text body	When Converse would do

Default to the Converse family for text/chat — one schema, model portability, first-class tool use and streaming. Drop to the InvokeModel family only for modalities Converse does not cover (embeddings, image, video) or a genuinely provider-specific parameter. For latency-tolerant bulk work, neither real-time path is ideal — use Batch inference (see /aws-ai/amazon-bedrock-batch-inference) for ~50% lower cost.


a recent match

A Bedrock API integration, funded by AWS credits — anonymized

inquiry · series-a developer-tools startup, US

Series-A dev-tools SaaS, 22 engineers, embedding an AI assistant (chat + tool use) into their product; already on AWS at ~$6K/month

Situation: The team had a working Converse prototype but kept hitting ThrottlingException (429) under real traffic, had no consistent retry/backoff strategy, and were calling a frontier model for every request — including trivial classifications — so the projected inference bill at launch scale was alarming. They also wanted tool use (function calling) wired to their internal APIs without hand-rolling and maintaining the whole plan-call-observe loop, and needed streaming in the chat UI.

What CloudRoute did: Routed within 19 hours to a US-East AWS partner with a Bedrock + production-GenAI track record. The partner hardened the integration: exponential backoff with jitter plus a concurrency limiter around every runtime call, a Service Quotas increase on the workhorse model, and cross-region inference profiles for spike headroom. They added model routing (a small fast model for classification, the frontier model only for hard reasoning), turned on prompt caching for the large stable system prompt, moved tool use onto a clean Converse toolConfig loop, and switched the chat UI to ConverseStream. In parallel they filed a Bedrock/GenAI proof-of-concept credit application and an Activate Portfolio application.

Outcome: GenAI POC credits ($25K) approved in under 2 weeks and Portfolio ($100K) shortly after — the first several months of Bedrock inference were credit-funded. 429s dropped to negligible under load, and per-request cost fell sharply thanks to routing plus prompt caching. The assistant shipped to general availability in 4 weeks with streaming and tool use in production. CloudRoute's commission was paid by the partner from AWS engagement funding; the customer paid $0.

time-to-match: < 24h · credits secured: $125K · 429 rate: negligible post-fix · cost to customer: $0

faq

Common questions

What is the Amazon Bedrock API?

It is the set of AWS API operations for using foundation models on Bedrock from code. The runtime (the bedrock-runtime service) has four operations for inference — InvokeModel and InvokeModelWithResponseStream (lower-level, provider-specific request bodies) and Converse and ConverseStream (one unified schema across all chat models). A separate control plane (the bedrock service) manages models, Guardrails, fine-tuning, evaluation, Knowledge Bases, and Agents. There is no API key — Bedrock uses standard AWS IAM credentials.

What is the difference between the Converse API and InvokeModel?

InvokeModel sends a request body in each model provider's own JSON format and returns a provider-specific response, giving maximum control at the cost of provider-specific code and harder model swaps. The Converse API uses one consistent request/response schema across every chat model — with built-in multi-turn messages, system prompts, tool use, and (via ConverseStream) streaming — so switching models is usually just changing the modelId. Use Converse for chat and agents; use InvokeModel for non-conversational modalities like embeddings and images or a provider-specific parameter Converse does not expose.
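To make the contrast concrete, here is a minimal sketch of the two request shapes for the same prompt. The model IDs are placeholders, and the InvokeModel body assumes an Anthropic model (other providers use different body fields); nothing here calls AWS, it only builds the keyword arguments you would pass to the boto3 bedrock-runtime client.

```python
import json


def converse_request(model_id: str, prompt: str, max_tokens: int = 256) -> dict:
    """Keyword arguments for client.converse(); the shape is the same for every chat model."""
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": max_tokens},
    }


def invoke_model_request_anthropic(model_id: str, prompt: str, max_tokens: int = 256) -> dict:
    """Keyword arguments for client.invoke_model(); the body follows Anthropic's own format."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": [{"type": "text", "text": prompt}]}],
    }
    return {
        "modelId": model_id,
        "body": json.dumps(body),  # InvokeModel takes a serialized, provider-specific body
        "contentType": "application/json",
        "accept": "application/json",
    }


# Swapping models under Converse changes only modelId (both IDs are placeholders).
first = converse_request("example-model-a", "Summarize this ticket.")
second = converse_request("example-model-b", "Summarize this ticket.")
```

With Converse, `first` and `second` differ only in `modelId`; with InvokeModel, a move to another provider generally means rewriting the body builder and the response parser.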

How do I make my first Bedrock API call?

Three steps. (1) In the Bedrock console, enable the model under Model access (per-Region, per-model). (2) Attach an IAM policy allowing the runtime action, ideally scoped to the model ARN: bedrock:InvokeModel authorizes both InvokeModel and Converse, and bedrock:InvokeModelWithResponseStream authorizes the two streaming operations. (3) Call it: in Python use boto3 client("bedrock-runtime").converse(modelId=..., messages=[...], inferenceConfig={...}); in Node use @aws-sdk/client-bedrock-runtime with a BedrockRuntimeClient and ConverseCommand. The assistant text is at output.message.content[0].text and token counts at usage.
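Steps 2 and 3 can be sketched in Python as follows. This is an illustration under assumptions rather than a drop-in setup: the model ID, Region, and ARN are placeholders you would replace, and the policy uses bedrock:InvokeModel, which AWS documents as the action that also authorizes Converse. boto3 is imported inside the call function so the parsing helpers work without AWS configured.

```python
def invoke_policy(model_arn: str) -> dict:
    """IAM policy document scoped to one model ARN (the ARN is supplied by the caller)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "bedrock:InvokeModel",
                    "bedrock:InvokeModelWithResponseStream",
                ],
                "Resource": model_arn,
            }
        ],
    }


def first_call(model_id: str, prompt: str, region: str = "us-east-1") -> dict:
    """Send one user message through Converse using standard AWS credentials."""
    import boto3  # lazy import: needs boto3 installed and credentials configured

    client = boto3.client("bedrock-runtime", region_name=region)
    return client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 256, "temperature": 0.2},
    )


def assistant_text(response: dict) -> str:
    """The assistant text lives at output.message.content[0].text."""
    return response["output"]["message"]["content"][0]["text"]


def token_usage(response: dict) -> dict:
    """Token counts (inputTokens, outputTokens, totalTokens) live at usage."""
    return response["usage"]
```

A typical use would be `print(assistant_text(first_call("<your-model-id>", "Hello")))` once model access and the policy are in place.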

Which SDKs and languages does the Bedrock API support?

All standard AWS SDKs, since Bedrock is a normal AWS service. The most common are boto3 (Python, the bedrock-runtime client) and the AWS SDK for JavaScript v3 (@aws-sdk/client-bedrock-runtime). SDKs also exist for Java, Go, .NET, Rust, and more, plus the AWS CLI. There is also a higher-level cross-provider abstraction in some AWS open-source tooling, but the SDKs calling Converse/InvokeModel directly are the canonical path.

How does tool use (function calling) work in the Bedrock API?

You declare tools in the Converse request's toolConfig — each with a name, a description, and a JSON-Schema inputSchema. When the model decides to use one, the response comes back with stopReason "tool_use" and a toolUse content block ({ toolUseId, name, input }). Your code runs the actual function and sends a follow-up user message containing a toolResult block (echoing the same toolUseId) with the output; the model then produces its final answer. You own the loop; Amazon Bedrock Agents can manage it for you if you prefer not to write it.
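The loop described above can be sketched like this. It assumes `client` is a boto3 bedrock-runtime client (or anything with a compatible `converse` method); `tool_config`, `run_tool_loop`, the `handlers` mapping, and the `max_turns` guard are hypothetical names introduced for illustration, and each handler is assumed to return a JSON-serializable dict.

```python
from typing import Any, Callable, Dict, List


def tool_config(name: str, description: str, schema: Dict[str, Any]) -> Dict[str, Any]:
    """Build a toolConfig with one tool: name, description, JSON-Schema inputSchema."""
    return {
        "tools": [
            {
                "toolSpec": {
                    "name": name,
                    "description": description,
                    "inputSchema": {"json": schema},
                }
            }
        ]
    }


def run_tool_loop(
    client: Any,
    model_id: str,
    messages: List[Dict[str, Any]],
    config: Dict[str, Any],
    handlers: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]],
    max_turns: int = 5,
) -> str:
    """Call Converse until the model stops asking for tools, then return its text."""
    for _ in range(max_turns):
        response = client.converse(modelId=model_id, messages=messages, toolConfig=config)
        message = response["output"]["message"]
        messages.append(message)  # keep the assistant turn in the history
        if response["stopReason"] != "tool_use":
            return "".join(block.get("text", "") for block in message["content"])
        results = []
        for block in message["content"]:
            if "toolUse" in block:
                call = block["toolUse"]
                output = handlers[call["name"]](call["input"])  # run the real function
                results.append(
                    {
                        "toolResult": {
                            "toolUseId": call["toolUseId"],  # echo the same id back
                            "content": [{"json": output}],
                        }
                    }
                )
        messages.append({"role": "user", "content": results})
    raise RuntimeError("tool loop did not finish within max_turns")
```

The `max_turns` cap is a design choice, not part of the API: it keeps a misbehaving tool or prompt from looping (and billing) indefinitely.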

How do I handle throttling (429 / ThrottlingException) on Bedrock?

ThrottlingException (HTTP 429) means you exceeded the per-account/per-model requests-per-minute or tokens-per-minute limit. Handle it with exponential backoff and jitter (the AWS SDK retry modes give you a baseline; add your own backoff and a concurrency limiter for high throughput). For more headroom, request a Service Quotas increase for the model, use cross-region inference profiles to spread load, or buy Provisioned Throughput for guaranteed capacity. Retry 429 and 5xx errors with backoff; do not retry 400/403/404 — fix the request, IAM, or modelId instead.
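A minimal sketch of that retry policy, using exponential backoff with "full jitter" (a random delay between zero and the current cap). The helper names, attempt count, and delay values are illustrative assumptions; the error inspection relies on the `response` dict that botocore's ClientError carries. In practice you would wrap the runtime call, for example `call_with_backoff(lambda: client.converse(...))`.

```python
import random
import time
from typing import Any, Callable


def is_retryable(exc: Exception) -> bool:
    """Retry throttling (429) and server errors (5xx); never 400/403/404."""
    response = getattr(exc, "response", None) or {}
    code = response.get("Error", {}).get("Code", "")
    status = response.get("ResponseMetadata", {}).get("HTTPStatusCode", 0)
    return code == "ThrottlingException" or status == 429 or 500 <= status < 600


def call_with_backoff(
    fn: Callable[[], Any],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fn, retrying retryable errors with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if not is_retryable(exc) or attempt == max_attempts - 1:
                raise  # client error, or out of attempts: surface it to the caller
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))  # jitter spreads out competing retries
```

This sits on top of the SDK's own retry mode rather than replacing it; for high throughput you would still pair it with a concurrency limiter so retries do not pile up.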

How do I stream responses from the Bedrock API?

Use ConverseStream (or InvokeModelWithResponseStream for the lower-level API). Instead of one response you iterate an event stream: messageStart, then a series of contentBlockDelta events carrying text chunks you append to the UI, then contentBlockStop, messageStop (which carries the stopReason), and a final metadata event with usage token counts. Streaming changes only when output arrives; it does not change pricing or the request shape. It is the right choice for any human-facing chat or long-generation UI.
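Consuming that event stream might look like the sketch below. `consume_stream` and its `on_text` callback are hypothetical names; with boto3 the iterable would be `client.converse_stream(...)["stream"]`, but the function accepts any iterable of event dicts, which also makes it easy to exercise without AWS.

```python
from typing import Any, Callable, Dict, Iterable, Optional


def consume_stream(
    events: Iterable[Dict[str, Any]],
    on_text: Optional[Callable[[str], None]] = None,
) -> Dict[str, Any]:
    """Accumulate text deltas and capture the stop reason and token usage."""
    parts = []
    stop_reason = None
    usage: Dict[str, Any] = {}
    for event in events:
        if "contentBlockDelta" in event:
            chunk = event["contentBlockDelta"]["delta"].get("text", "")
            if chunk:
                parts.append(chunk)
                if on_text is not None:
                    on_text(chunk)  # e.g. push the chunk to the chat UI
        elif "messageStop" in event:
            stop_reason = event["messageStop"]["stopReason"]
        elif "metadata" in event:
            usage = event["metadata"].get("usage", {})
        # messageStart, contentBlockStart and contentBlockStop need no action here
    return {"text": "".join(parts), "stopReason": stop_reason, "usage": usage}
```

Deltas that carry no text (tool-use input fragments, for instance) are skipped in this sketch; a tool-using chat UI would need to handle them as well.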

What inference parameters can I set, and what are good defaults?

In Converse's inferenceConfig: maxTokens (caps response length and output cost), temperature (randomness: low for deterministic extraction and tool use, higher for creative work), topP (nucleus sampling; adjust either temperature or topP, not both at once), and stopSequences (strings that halt generation). The system prompt is set separately in the top-level system field. A solid production default is a low temperature (0.0–0.2) with an explicit maxTokens ceiling; provider-only knobs like topK go through additionalModelRequestFields.
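One way to keep those defaults in a single place is a small builder like the sketch below. The function name and default values are illustrative assumptions, and the `top_k` key in the usage line is an Anthropic-style field name; other providers may spell theirs differently.

```python
from typing import Any, Dict, Optional, Sequence


def build_inference_kwargs(
    max_tokens: int = 512,
    temperature: float = 0.1,
    top_p: Optional[float] = None,
    stop_sequences: Sequence[str] = (),
    system_prompt: Optional[str] = None,
    provider_fields: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build the inference-related keyword arguments for a Converse call."""
    config: Dict[str, Any] = {"maxTokens": max_tokens, "temperature": temperature}
    if top_p is not None:
        config["topP"] = top_p
    if stop_sequences:
        config["stopSequences"] = list(stop_sequences)
    kwargs: Dict[str, Any] = {"inferenceConfig": config}
    if system_prompt:
        kwargs["system"] = [{"text": system_prompt}]  # top-level field, not in inferenceConfig
    if provider_fields:
        # provider-only knobs are passed through untouched
        kwargs["additionalModelRequestFields"] = dict(provider_fields)
    return kwargs


# Low temperature, explicit ceiling, and one provider-specific knob (assumed key name).
extraction_kwargs = build_inference_kwargs(
    max_tokens=300,
    temperature=0.0,
    system_prompt="Extract the fields as JSON.",
    provider_fields={"top_k": 50},
)
```

The result is meant to be spread into the call, for example `client.converse(modelId=..., messages=[...], **extraction_kwargs)`.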

Ship on the Bedrock API — and let AWS credits pay for the inference.

CloudRoute routes you to a vetted AWS partner who files your Bedrock/GenAI credit application (Activate Portfolio up to $100K, GenAI POC $10K–$50K, GenAI Accelerator up to $1M) and, if you need hands, builds the production integration — Converse, tool use, streaming, retries, cost controls — with you. AWS funds the credits and the engagement. You pay $0.

Get matched in 24h →

→ see the data & AI persona detail

matched within: < 24h

GenAI credit ceiling: up to $1M

cost to you: $0