One managed API, every major foundation model — Anthropic Claude, Meta Llama, Mistral, Amazon Nova, Cohere, Stability AI, AI21, DeepSeek — with no servers to run and enterprise privacy by default. This is the full reference: what Bedrock is, the entire model catalog, how access and the Converse API work, the real pricing model, the feature suite (Agents, Knowledge Bases, Guardrails, fine-tuning, Flows, evaluation), security and data residency, and exactly when Bedrock beats SageMaker or calling a model vendor directly.
Amazon Bedrock is a fully managed service that lets you access a broad selection of high-performing foundation models (from Anthropic, Meta, Mistral, Amazon, Cohere, Stability AI, AI21, and DeepSeek) through a single API, without provisioning or managing any infrastructure. You send a prompt, you get a completion, you pay for the tokens. At its core, that is the whole surface area.
Before Bedrock, putting a large language model into production on your own meant one of two uncomfortable paths. Either you rented GPU instances, downloaded open weights, and took on the full operational burden of serving — autoscaling, sharding, quantization, the perpetual scarcity of accelerators — or you signed a contract with a single model vendor and routed your most sensitive data out to their API. Bedrock collapses both problems. There are no servers, GPUs, or clusters for you to run: AWS operates the inference fleet behind the API. And because Bedrock runs inside AWS, your data never leaves your control — prompts and completions are not used to train the underlying foundation models, are not shared with the model providers, and stay within your AWS account and the AWS Region you call.
The second reason Bedrock exists is choice without lock-in. No single foundation model is best at everything. Claude is exceptional at long-context reasoning, careful instruction-following, and agentic tool use; Llama gives you strong open-weight performance you can fine-tune freely; Mistral is fast and cost-efficient for high-throughput tasks; Amazon Nova is built for very low cost and latency; Stability and Amazon Nova Canvas generate images; Cohere and Titan produce the embeddings that power search and RAG. Bedrock lets you mix and match all of them behind one authentication model, one billing relationship, and — through the Converse API — one request schema. Switching from one model to another is often a one-line change to a model ID rather than a rewrite.
The third reason is that Bedrock is not just a model gateway — it is an application platform. On top of raw inference, AWS layers managed building blocks that most teams would otherwise have to assemble themselves: Agents that plan and call your APIs, Knowledge Bases that implement retrieval-augmented generation end to end, Guardrails that filter harmful content and redact PII, fine-tuning and distillation to specialize models, Flows to orchestrate multi-step generative workflows visually, Prompt Management to version prompts, and model evaluation to compare candidates on your own data. Each is covered in section V.
Put simply: Bedrock is the managed, secure, multi-model foundation for building generative-AI applications on AWS. If Amazon EC2 abstracted away the data center, Bedrock abstracts away the model-serving stack — you reason about prompts, tokens, and outcomes, not about GPUs.
Amazon Bedrock = a single, fully managed API for many foundation models (Claude, Llama, Mistral, Nova, Titan, Cohere, Stability, AI21, DeepSeek), with serverless inference, AWS-native security, and a built-in suite for RAG, agents, guardrails, and customization. You never touch a GPU, and your data is never used to train the base models.
Bedrock's defining feature is breadth. The catalog spans text and chat models, image and video generators, and embedding models, from eight-plus providers. The exact model versions available evolve continuously and vary by AWS Region — the table below is a representative map of the catalog as of 2026; always confirm the live list in the Bedrock console under Model access.
A few patterns are worth internalizing before reading the table. Anthropic's Claude family is the most widely deployed set of models on Bedrock for serious enterprise reasoning, coding, and agentic work, with a tiered lineup that trades cost against capability. Amazon's own families come in two lines: the newer Nova family (Micro, Lite, Pro, Premier for text; Canvas for images; Reel for video; plus the agentic Nova Act), engineered for very low price and latency, and the older Titan family, still useful for text and especially for embeddings. Meta Llama gives you strong open-weight models you are free to fine-tune. Mistral targets speed and price efficiency. Cohere is known for retrieval and embeddings; Stability AI for image generation; AI21 for its Jamba long-context models; and DeepSeek brings cost-efficient open reasoning models into the managed catalog.
The strategic point for builders: you do not have to pick one provider for your whole application. A common production pattern routes cheap, high-volume calls (classification, extraction, routing) to a small fast model like Nova Lite, Claude Haiku, or Mistral, and escalates only the hard reasoning steps to a frontier model like Claude Sonnet/Opus or Nova Premier — all through the same Converse API, all on one bill.
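The routing pattern can be sketched in a few lines. This is a minimal illustration, not a production router: the task labels and both model IDs are hypothetical placeholders, and a real system would likely route on richer signals than a single label.

```python
# Placeholder model IDs (hypothetical); copy real IDs from the Bedrock console.
SMALL_MODEL_ID = "example.small-fast-model"
FRONTIER_MODEL_ID = "example.frontier-model"

# Hypothetical labels for cheap, high-volume work.
CHEAP_TASKS = {"classification", "extraction", "routing"}


def pick_model(task_type: str) -> str:
    """Send cheap, high-volume tasks to the small model, everything else to the frontier model."""
    return SMALL_MODEL_ID if task_type in CHEAP_TASKS else FRONTIER_MODEL_ID


def build_request(task_type: str, prompt: str) -> dict:
    """Build keyword arguments for a Converse call; only modelId differs between tiers."""
    return {
        "modelId": pick_model(task_type),
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
    }


print(build_request("classification", "Is this ticket about billing?")["modelId"])
print(build_request("reasoning", "Draft a migration plan.")["modelId"])
```

Because both tiers share one request shape, the routing decision stays a single string lookup rather than a per-provider code path.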
| Provider | Representative models | Modality | Best-fit use |
|---|---|---|---|
| Anthropic | Claude Opus, Claude Sonnet, Claude Haiku | Text / chat / vision / tools | Frontier reasoning, coding, agents, long-context analysis |
| Meta | Llama (instruct + smaller variants) | Text / chat / vision | Open-weight, freely fine-tunable, self-hostable lineage |
| Mistral | Mistral Large + smaller models | Text / chat | Fast, cost-efficient high-throughput tasks |
| Amazon Nova | Nova Micro / Lite / Pro / Premier; Canvas; Reel; Act | Text / image / video / agentic | Lowest cost + latency; native image & video generation |
| Amazon Titan | Titan Text; Titan Text Embeddings; Titan Multimodal Embeddings | Text / embeddings | Embeddings for RAG & search; economical text |
| Cohere | Command (text); Embed (embeddings); Rerank | Text / embeddings / rerank | Enterprise search, retrieval, reranking |
| Stability AI | Stable Diffusion / Stable Image family | Image generation | Marketing, product, and creative imagery |
| AI21 Labs | Jamba family | Text / chat | Long-context, hybrid-architecture text generation |
| DeepSeek | DeepSeek reasoning models | Text / reasoning | Cost-efficient open reasoning models |
Bedrock is off by default. You do not get models the instant you open the console; you explicitly request access per model, govern it with IAM, choose a Region, and then call models with either the modern Converse API or the lower-level InvokeModel API. Here is the path from a fresh account to a first completion.
The reason access is opt-in is governance. AWS wants every model your organization can call to be a deliberate, auditable decision, and several models carry provider end-user license terms you must accept. So enabling Bedrock is a four-part setup.
In the Bedrock console, open Model access and request the specific models you intend to use (for example, Claude Sonnet and Titan Text Embeddings). For most models access is granted within seconds to a few minutes; some require accepting the provider's EULA first. You enable models per Region — granting access in us-east-1 does not grant it in eu-west-1.
Bedrock is a standard AWS service governed by IAM. You grant principals (users and roles, including the roles assumed by Lambda functions and ECS tasks) actions such as bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream, and you can scope policies down to specific model ARNs. The Converse and ConverseStream APIs have no separate IAM actions of their own; they are authorized by those same two actions. Least-privilege here means a service can call exactly the models it needs and nothing else. All API activity is logged to CloudTrail, and you can capture full request/response payloads with Bedrock model-invocation logging to S3 or CloudWatch.
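A least-privilege policy of this kind might look like the sketch below, which builds the policy document as a Python dict. The model IDs are placeholders and the helper names are hypothetical; the ARN layout shown (Region, empty account field, `foundation-model/` prefix) is the usual form for Bedrock foundation models, but confirm the exact ARNs for your models in the console.

```python
import json


def model_arn(region: str, model_id: str) -> str:
    """ARN of a Bedrock foundation model; the account field is empty for AWS-owned models."""
    return f"arn:aws:bedrock:{region}::foundation-model/{model_id}"


def least_privilege_policy(region: str, model_ids: list[str]) -> dict:
    """IAM policy document allowing invocation of only the listed models."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "InvokeApprovedModelsOnly",
                "Effect": "Allow",
                # Converse and ConverseStream calls are authorized by these two actions.
                "Action": [
                    "bedrock:InvokeModel",
                    "bedrock:InvokeModelWithResponseStream",
                ],
                "Resource": [model_arn(region, m) for m in model_ids],
            }
        ],
    }


# Placeholder model IDs for illustration only.
policy = least_privilege_policy("us-east-1", ["example.chat-model", "example.embedding-model"])
print(json.dumps(policy, indent=2))
```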
Choose a Region for data-residency, latency, and model-availability reasons — your prompts and completions are processed in the Region you call. Not every model lives in every Region, and frontier models often land first in US Regions. To smooth out capacity and availability, Bedrock offers cross-region inference profiles, which let a single request be served from one of several Regions within a geography (e.g. a US or EU profile) without you managing the routing. Pick the Region first; reach for cross-region inference when you hit availability or throughput limits.
Bedrock exposes two ways to call a model. InvokeModel is the original, lower-level API: you send a raw body whose JSON shape is specific to each provider, and you parse a provider-specific response. It gives maximum control but means provider-specific code. The Converse API is the modern, recommended path: one consistent request and response schema across every chat model, with first-class support for multi-turn conversations, system prompts, tool use (function calling), and streaming. With Converse, switching models is usually just changing the modelId string. Use Converse for chat and agentic applications; reach for InvokeModel only for non-conversational modalities (such as image or embedding endpoints) or when you need a provider-specific parameter Converse does not expose.
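To make "provider-specific code" concrete, here is a hedged sketch of an InvokeModel call for an Anthropic model. It assumes Anthropic's Messages body format on Bedrock (including the `anthropic_version` string shown); other providers expect different JSON, which is exactly the portability cost described above. The client is passed in, so you would supply `boto3.client("bedrock-runtime")` yourself, and the function names are illustrative.

```python
import json


def anthropic_invoke_body(prompt: str, max_tokens: int = 512) -> str:
    """Request body in Anthropic's Messages format (assumed shape; verify in the model docs)."""
    return json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "messages": [
                {"role": "user", "content": [{"type": "text", "text": prompt}]}
            ],
        }
    )


def invoke_claude(client, model_id: str, prompt: str) -> str:
    """Call InvokeModel and parse the Anthropic-specific response payload."""
    resp = client.invoke_model(modelId=model_id, body=anthropic_invoke_body(prompt))
    payload = json.loads(resp["body"].read())  # body is a stream of JSON bytes
    return payload["content"][0]["text"]
```

Both the body builder and the response parsing would need rewriting for a different provider, whereas a Converse call would only need a new modelId.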
A minimal Converse call in Python (using the AWS SDK, boto3) looks like this:
```python
import boto3

brt = boto3.client("bedrock-runtime", region_name="us-east-1")

resp = brt.converse(
    modelId="anthropic.claude-sonnet",  # swap this string to switch models
    messages=[{"role": "user", "content": [{"text": "Summarize our refund policy in 3 bullets."}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)
print(resp["output"]["message"]["content"][0]["text"])
```
The same call structure works for Claude, Llama, Mistral, Nova, Cohere, and more — only modelId changes. Model IDs shown are illustrative; copy the exact current IDs from the Bedrock console.
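Tool use follows the same request shape with a `toolConfig` added. The sketch below shows one plausible round trip: the model asks for a tool, your code runs it, and the result goes back as the next user turn. The `get_order_status` tool, its schema, and the system prompt are hypothetical; the client is injected so you can pass a real `bedrock-runtime` client.

```python
def get_order_status(order_id: str) -> dict:
    """Hypothetical local tool; a real one would query your order system."""
    return {"order_id": order_id, "status": "shipped"}


TOOL_CONFIG = {
    "tools": [
        {
            "toolSpec": {
                "name": "get_order_status",
                "description": "Look up the shipping status of an order.",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {"order_id": {"type": "string"}},
                        "required": ["order_id"],
                    }
                },
            }
        }
    ]
}


def ask_with_tools(client, model_id: str, question: str, max_rounds: int = 5) -> str:
    """Run a Converse loop, executing requested tool calls until the model answers."""
    messages = [{"role": "user", "content": [{"text": question}]}]
    system = [{"text": "You are a concise support assistant."}]
    for _ in range(max_rounds):
        resp = client.converse(
            modelId=model_id, system=system, messages=messages, toolConfig=TOOL_CONFIG
        )
        reply = resp["output"]["message"]
        messages.append(reply)
        if resp["stopReason"] != "tool_use":
            return "".join(b["text"] for b in reply["content"] if "text" in b)
        # Execute each requested tool and return the results as a user turn.
        results = []
        for block in reply["content"]:
            if "toolUse" in block:
                call = block["toolUse"]
                results.append(
                    {
                        "toolResult": {
                            "toolUseId": call["toolUseId"],
                            "content": [{"json": get_order_status(**call["input"])}],
                        }
                    }
                )
        messages.append({"role": "user", "content": results})
    raise RuntimeError("model kept requesting tools past max_rounds")
```

The `max_rounds` cap is a design choice of this sketch, added so a misbehaving loop fails loudly instead of running up tokens.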
Bedrock has no platform fee and no minimum. You pay for what you run, and there are four distinct pricing levers — getting them right is the difference between a GenAI bill that is trivial and one that is alarming. Token prices are per-model and per-Region; the figures below are representative as of 2026 and exist to show relative scale — always check the AWS Bedrock pricing page for current rates.
On-Demand is the default. You are billed per 1,000 input tokens and per 1,000 output tokens, at a rate set by the specific model, with no commitment. Output tokens almost always cost several times more than input tokens. Small models (Nova Micro/Lite, Claude Haiku, Mistral small) are an order of magnitude cheaper than frontier models (Claude Opus, Nova Premier), which is why model routing matters so much for cost. For image and video models, you pay per image or per second of generated video rather than per token.
Batch inference processes large sets of prompts asynchronously and is typically priced around 50% lower than on-demand. If your workload is not latency-sensitive — overnight document processing, bulk classification, dataset enrichment, embedding a corpus — batch is the single biggest cost lever available and should be the default for offline jobs.
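As a rough sketch of what an offline job involves: batch input is a JSONL file in S3, one record per prompt, and the job is created through the `bedrock` control-plane client. The record layout below (`recordId` plus a provider-specific `modelInput`, here in Anthropic's Messages format) reflects the documented format as best understood and should be checked against the current batch inference guide; the function names, job name, role ARN, and S3 URIs are all hypothetical.

```python
import json


def batch_records(prompts: list[str], max_tokens: int = 256) -> str:
    """Serialize prompts as JSONL, one batch record per line."""
    lines = []
    for i, prompt in enumerate(prompts):
        record = {
            "recordId": f"REC{i:08d}",  # illustrative 11-character ID
            "modelInput": {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": max_tokens,
                "messages": [
                    {"role": "user", "content": [{"type": "text", "text": prompt}]}
                ],
            },
        }
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"


def submit_batch(bedrock_client, job_name, role_arn, model_id, input_uri, output_uri):
    """Start a batch job over a JSONL file already uploaded to S3; returns the job ARN."""
    resp = bedrock_client.create_model_invocation_job(
        jobName=job_name,
        roleArn=role_arn,
        modelId=model_id,
        inputDataConfig={"s3InputDataConfig": {"s3Uri": input_uri}},
        outputDataConfig={"s3OutputDataConfig": {"s3Uri": output_uri}},
    )
    return resp["jobArn"]


print(batch_records(["Classify: 'refund please'", "Classify: 'love the app'"]), end="")
```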
Provisioned Throughput reserves dedicated model capacity, billed hourly (with deeper discounts on 1-month and 6-month commitments) and measured in model units. It guarantees consistent throughput and latency and is required for serving custom (fine-tuned) models. It only makes economic sense at high, steady volume; for spiky or low volume, on-demand is cheaper.
Prompt caching lets you mark stable, repeated context (a long system prompt, a large document, a tool schema) so Bedrock caches it across calls and you are not billed the full input-token price to re-process the same tokens on every request. For agents and chat apps that resend a big system prompt or document on every turn, cached input tokens are billed at a steep discount, which can cut total input cost substantially.

On top of these four levers, expect additional charges for model customization (fine-tuning/distillation training plus storage of the custom model) and for related services your app uses, such as the vector store behind a Knowledge Base.
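With the Converse API, prompt caching is typically expressed by placing a `cachePoint` block after the stable content. The sketch below builds such a request and reads the cache counters from the response usage. Treat the details as assumptions to verify: caching is supported only on some models, the usage field names are as best understood from the API reference, and the ratio assumes `inputTokens` counts only uncached tokens.

```python
def cached_request(model_id: str, system_prompt: str, user_text: str) -> dict:
    """Converse keyword arguments with a cache checkpoint after the system prompt."""
    return {
        "modelId": model_id,
        "system": [
            {"text": system_prompt},
            {"cachePoint": {"type": "default"}},  # everything above this point is cacheable
        ],
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
    }


def cache_read_ratio(usage: dict) -> float:
    """Share of input tokens served from cache (assumes inputTokens excludes cached tokens)."""
    read = usage.get("cacheReadInputTokens", 0)
    written = usage.get("cacheWriteInputTokens", 0)
    total = read + written + usage.get("inputTokens", 0)
    return read / total if total else 0.0


req = cached_request("example.chat-model", "Long, stable policy text...", "What is the refund window?")
print(req["system"][1])
```

You would pass the request as `client.converse(**req)` and feed `resp["usage"]` to `cache_read_ratio` to see whether repeated turns are actually hitting the cache.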
| Model tier (example) | Input / 1K tokens | Output / 1K tokens | Batch? | Typical role |
|---|---|---|---|---|
| Nova Micro (ultra-low-cost) | ~$0.000035 | ~$0.00014 | Yes (~50% off) | High-volume classification, routing |
| Nova Lite / Claude Haiku (small) | ~$0.0002–$0.0008 | ~$0.0008–$0.004 | Yes (~50% off) | Cheap chat, extraction, drafts |
| Mistral / mid-tier | ~$0.001–$0.003 | ~$0.003–$0.009 | Yes (~50% off) | Balanced throughput tasks |
| Claude Sonnet / Nova Pro (workhorse) | ~$0.003 | ~$0.015 | Yes (~50% off) | Production reasoning, coding, agents |
| Claude Opus / Nova Premier (frontier) | ~$0.015 | ~$0.075 | Yes (~50% off) | Hardest reasoning, escalation only |
| Titan / Cohere embeddings | ~$0.0001–$0.0002 | n/a (vectors) | Yes | RAG, semantic search, clustering |
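The arithmetic behind the table is simple enough to sketch. The rates below are the representative figures from the table above, not live prices, and the 50% batch discount is the approximate figure quoted for batch inference; the tier names and token counts are illustrative.

```python
# Representative (input, output) USD rates per 1K tokens, copied from the table above.
RATES = {
    "nova_micro": (0.000035, 0.00014),
    "workhorse": (0.003, 0.015),  # Claude Sonnet / Nova Pro row
    "frontier": (0.015, 0.075),   # Claude Opus / Nova Premier row
}
BATCH_DISCOUNT = 0.5  # "around 50% lower" than on-demand


def request_cost(tier: str, input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    """Cost in USD of one request at the given tier's representative rates."""
    in_rate, out_rate = RATES[tier]
    cost = input_tokens / 1000 * in_rate + output_tokens / 1000 * out_rate
    return cost * (1 - BATCH_DISCOUNT) if batch else cost


for tier in RATES:
    on_demand = request_cost(tier, 2000, 500)
    batched = request_cost(tier, 2000, 500, batch=True)
    print(f"{tier:>10}: on-demand ${on_demand:.5f}  batch ${batched:.5f}")
```

At these figures, a 2,000-token prompt with a 500-token answer costs about $0.0135 on the workhorse tier and about $0.0675 on the frontier tier, which is the gap that makes routing worthwhile.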
The reason teams standardize on Bedrock rather than a bare model API is the managed application layer around the models. Each capability (Agents, Knowledge Bases, Guardrails, fine-tuning and distillation, Flows, Prompt Management, and model evaluation) would otherwise be a project of its own; on Bedrock each is a managed feature you configure rather than build.
These features compose. A typical production assistant uses a Knowledge Base for grounding, wraps the model in Guardrails, exposes actions through an Agent, manages its system prompts via Prompt Management, and was chosen using model evaluation — all on Bedrock, all under one IAM and billing boundary. That composability, not any single model, is the platform's real moat.
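As one illustration of "configure rather than build", querying a Knowledge Base is a single call on the `bedrock-agent-runtime` client. This is a sketch under assumptions: the Knowledge Base already exists, its documents live in S3, and `kb_id` and `model_arn` are values you supply. The function name is hypothetical and the client is injected.

```python
def ask_knowledge_base(agent_runtime_client, kb_id: str, model_arn: str, question: str):
    """Retrieve relevant chunks and generate a grounded answer in one managed call."""
    resp = agent_runtime_client.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
            },
        },
    )
    # Collect the S3 URIs of the source documents cited in the answer.
    sources = [
        ref["location"]["s3Location"]["uri"]
        for citation in resp.get("citations", [])
        for ref in citation.get("retrievedReferences", [])
        if "s3Location" in ref.get("location", {})
    ]
    return resp["output"]["text"], sources
```

Chunking, embedding, vector search, and prompt assembly all happen behind that one call, which is the work a team would otherwise build and operate itself.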
For most enterprises the deciding factor in favor of Bedrock is not a benchmark — it is governance. Bedrock is designed so that adopting generative AI does not mean loosening your data controls.
Your data is yours. Prompts, completions, and any data you submit to Bedrock are not used to train the underlying foundation models and are not shared with the model providers. Anthropic, Meta, Mistral, Cohere and the rest do not see your traffic; AWS serves their models on your behalf inside AWS. Your content is encrypted in transit and at rest, and you can use your own AWS KMS keys for encryption.
It stays in your Region. A request to Bedrock is processed in the AWS Region you call, which is how you satisfy data-residency requirements — keep EU data in an EU Region, and so on. When you use cross-region inference, requests are routed only within a defined geography (for example, EU Regions for an EU profile), preserving the residency boundary you chose.
It plugs into your existing controls. Bedrock is reachable from inside your VPC through AWS PrivateLink (interface VPC endpoints), so traffic need never traverse the public internet. Access is governed by IAM, every call is recorded in CloudTrail, and you can log full invocation payloads for audit. Guardrails add a content- and PII-safety layer on top.
Compliance coverage is broad. Bedrock is included in AWS's major compliance programs and attestations — commonly SOC 1/2/3, ISO 27001, HIPAA eligibility, PCI DSS, FedRAMP and others depending on Region — which lets regulated industries (financial services, healthcare, public sector) build on it within their existing compliance posture. Always confirm the current scope for your specific Region and program in AWS Artifact, since coverage expands over time.
The combination of no training on your data, in-Region processing, private VPC networking, IAM + CloudTrail governance, and broad compliance attestations is precisely what lets a bank, hospital, or government agency ship generative AI at all. It is the most common reason teams choose Bedrock over sending sensitive data to a public model endpoint.
Bedrock is not the only way to run AI on AWS, and it is not always the right one. The honest decision rule comes down to how much of the ML stack you want to own and how much your data governance matters.
The three options answer three different questions. Bedrock answers "I want to use existing foundation models through a managed, secure API with the least operational overhead." Amazon SageMaker answers "I need to own the ML lifecycle — bring my own model or architecture, run custom training, control the serving infrastructure, or do classical (non-foundation-model) ML." Calling a model vendor's API directly (e.g. straight to a provider's cloud) answers "I want this specific vendor's newest model the moment it ships and I am comfortable with my data leaving AWS."
They are complementary, not mutually exclusive. A common architecture uses Bedrock for the GenAI application layer and SageMaker for a custom recommendation, forecasting, or vision model that no foundation model covers — both inside the same AWS account. The detailed head-to-head lives at Bedrock vs SageMaker; the cross-cloud comparisons at Bedrock vs OpenAI and Bedrock vs Azure OpenAI.
| Dimension | Amazon Bedrock | Amazon SageMaker | Direct vendor API |
|---|---|---|---|
| Best for | Using foundation models in apps, fast | Owning the full ML lifecycle / custom models | One vendor's newest model, day one |
| Infra you manage | None (serverless) | You configure training & endpoints | None (vendor-hosted) |
| Model choice | 8+ providers, one API | Any model you bring or build | That one vendor only |
| Data governance | Stays in your AWS account & Region | Stays in your AWS account & Region | Leaves AWS to the vendor |
| Custom / classical ML | Fine-tune FMs only | Full — any architecture, any ML | No |
| Time to first call | Minutes | Hours to days | Minutes |
| Typical buyer | App & product teams | ML / data-science teams | Teams wedded to one model |
Standing up a first Bedrock application is a day, not a quarter. The harder problem is not getting started — it is what happens to the bill once the application is real and traffic grows.
The fast path: enable access to one workhorse model (say Claude Sonnet) and one embeddings model (Titan or Cohere) in your Region, attach an IAM policy scoped to those model ARNs, make a first Converse call, then — if you need grounding — point a Knowledge Base at a folder of documents in S3 and wrap the whole thing in a Guardrail. You now have a grounded, governed assistant without having provisioned a single GPU.
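The last step of that fast path, wrapping a call in a Guardrail, might look like the sketch below. It assumes a Guardrail has already been created and that you pass its identifier and version; the function name is hypothetical and the client is injected, so a real `bedrock-runtime` client can be supplied.

```python
def guarded_converse(client, model_id: str, guardrail_id: str, guardrail_version: str, user_text: str):
    """Converse call with a Guardrail attached; returns (text, blocked)."""
    resp = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": user_text}]}],
        guardrailConfig={
            "guardrailIdentifier": guardrail_id,
            "guardrailVersion": guardrail_version,
        },
    )
    # When the Guardrail blocks or rewrites content, the stop reason says so.
    blocked = resp["stopReason"] == "guardrail_intervened"
    text = "".join(
        block["text"] for block in resp["output"]["message"]["content"] if "text" in block
    )
    return text, blocked
```

Returning the `blocked` flag alongside the text lets the calling application log interventions or show a different message, rather than treating a filtered reply as a normal answer.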
Then the cost reality arrives. GenAI is cheap per call and expensive in aggregate. A retrieval-augmented chat assistant that resends a large system prompt and retrieved context on every turn, serving thousands of users, can move from a rounding error to five or six figures a month faster than teams expect — especially if every call hits a frontier model. The levers from section IV are how you keep it sane: route cheap calls to small models, run offline work as batch (~50% off), turn on prompt caching for repeated context, and reserve Provisioned Throughput only once volume is steady and high. The companion pages Bedrock pricing, prompt caching, and batch inference go deep on each.
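A back-of-the-envelope projection shows how quickly this adds up. Everything in the scenario is hypothetical: 5,000 users, 10 turns per user per day, 6,000 input tokens per turn (system prompt plus retrieved context), 400 output tokens, and an 80/20 split between a small and a workhorse model. The per-1K rates are the representative figures from the pricing table earlier in this document, not live prices.

```python
def monthly_cost(turns: float, input_tokens: int, output_tokens: int,
                 in_per_1k: float, out_per_1k: float) -> float:
    """Monthly USD cost for a number of turns at fixed per-turn token counts."""
    per_turn = input_tokens / 1000 * in_per_1k + output_tokens / 1000 * out_per_1k
    return turns * per_turn


# Hypothetical traffic: users * turns per user per day * days per month.
TURNS = 5_000 * 10 * 30
IN_TOK, OUT_TOK = 6_000, 400

frontier_only = monthly_cost(TURNS, IN_TOK, OUT_TOK, 0.015, 0.075)
workhorse_only = monthly_cost(TURNS, IN_TOK, OUT_TOK, 0.003, 0.015)
# Routed: 80% of turns on a small model, 20% escalated to the workhorse model.
routed = (
    monthly_cost(TURNS * 0.8, IN_TOK, OUT_TOK, 0.0002, 0.0008)
    + monthly_cost(TURNS * 0.2, IN_TOK, OUT_TOK, 0.003, 0.015)
)

print(f"frontier only:  ${frontier_only:,.0f}/month")
print(f"workhorse only: ${workhorse_only:,.0f}/month")
print(f"routed 80/20:   ${routed:,.0f}/month")
```

Under these assumptions the same traffic costs roughly $180,000 a month on a frontier model, $36,000 on a workhorse model, and about $9,000 with routing, before batch or prompt caching are applied.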
The other lever is funding the bill with someone else's money — specifically AWS's. AWS runs credit programs designed precisely for teams building generative AI on Bedrock: Activate Portfolio (up to $100K) for institutionally-funded startups, dedicated Bedrock / GenAI proof-of-concept funding ($10K–$50K) for a defined GenAI build, and the competitive Generative AI Accelerator (up to $1M) for AI-first companies. These pools are largely partner-filed and invisible on the public Activate page. This is exactly what CloudRoute does: we route you to a vetted AWS partner who files the credit application and, if you need hands, who can build the Bedrock workload with you — and because AWS funds both the credits and the partner engagement, you pay $0. See AWS credits for generative-AI startups, AWS PoC / Bedrock POC funding, and $100K AWS credits.
The most common Bedrock decision is not "which platform" but "which model on the platform." This is a scannable map of the major families by where they sit on the capability/cost curve and what they are for. Cost is relative ($ cheapest → $$$$ frontier); exact rates live on the AWS pricing page.
| Model family | Provider | Relative cost | Strengths | Reach for it when |
|---|---|---|---|---|
| Claude (Haiku → Sonnet → Opus) | Anthropic | $ → $$$$ | Reasoning, coding, agents, long context, instruction-following | You need the most reliable reasoning or agentic tool use |
| Nova (Micro/Lite/Pro/Premier) | Amazon | $ → $$$ | Lowest cost & latency; native image (Canvas) & video (Reel) | You optimize for price/latency or need image/video generation |
| Llama | Meta | $ → $$ | Strong open weights, freely fine-tunable | You want open-weight flexibility and custom fine-tunes |
| Mistral | Mistral AI | $ → $$ | Fast, cost-efficient throughput | High-volume tasks where speed and price dominate |
| Titan / Cohere Embed | Amazon / Cohere | $ | Embeddings + retrieval/rerank | You are building RAG, semantic search, or clustering |
| Stable Diffusion / Stable Image | Stability AI | $$ (per image) | High-quality image generation | You need creative, product, or marketing imagery |
| Jamba / DeepSeek | AI21 / DeepSeek | $ → $$ | Long-context; cost-efficient open reasoning | You want long-context or budget reasoning alternatives |
Situation: The team wanted a grounded, governed support assistant — RAG over their own knowledge base, with PII redaction and EU data residency — but had no ML infrastructure, no GPU budget, and a hard requirement that customer data never leave the EU or get used to train a vendor's model. Calling a US-hosted model API directly was a non-starter with their compliance reviewer, and standing up self-hosted inference was out of scope for an 11-person team.
What CloudRoute did: Routed within 20 hours to an EU-Central AWS partner with a GenAI + data-residency track record. The partner architected the workload entirely on Amazon Bedrock: a Knowledge Base over the docs in S3 (vectors and processing kept in eu-central-1), Claude Sonnet via the Converse API for answers, a Guardrail for PII redaction and denied topics, and prompt caching plus model routing (Nova Lite for classification, Sonnet only for hard answers) to control cost. In parallel the partner filed a Bedrock/GenAI proof-of-concept credit application and an Activate Portfolio application.
Outcome: GenAI POC credits ($25K) approved in under 2 weeks, Portfolio ($100K) shortly after — the first ~6 months of Bedrock inference were fully credit-funded. Grounded assistant in production in 5 weeks, all data resident in the EU, no traffic to any model vendor. CloudRoute's commission was paid by the partner from AWS engagement funding; the customer paid $0.
time-to-match: < 24h · credits secured: $125K · data residency: EU-only · cost to customer: $0