A neutral, build-grade walkthrough of the four reference architectures (chatbot, RAG, agents, batch), the Bedrock-vs-SageMaker-vs-self-host decision, how to pick a model without guessing, the supporting stack (vector store, guardrails, evals, observability), what it actually costs, and the crawl-walk-run path from a weekend POC to a production system you can put your name on.
AWS does not sell you "generative AI" as a single product. It sells three layers, and almost every architecture decision is really a decision about which layer you build on. Get the mental model right and the rest of this guide reads as a series of obvious choices.
The bottom layer is infrastructure: the silicon and the clusters. NVIDIA GPUs (P5/P5e instances built on H100/H200), plus AWS's own accelerators — Trainium for training and Inferentia for inference — exposed through EC2, EKS, and SageMaker HyperPod. This is where you live if you are training or hosting a model yourself. It is the most powerful and the most operationally expensive layer.
The middle layer is the managed model platform: Amazon SageMaker AI (build, train, tune, and host your own models with the undifferentiated heavy lifting removed) and Amazon Bedrock (call a catalog of foundation models through one API, with nothing to provision). This is where most teams should build in 2026 because it converts "operate a model" into "call a model."
The top layer is applications and tools: Amazon Q (a managed assistant for business and for developers), Bedrock Agents, Bedrock Knowledge Bases (managed RAG), Bedrock Guardrails, and the SDKs you wire into your own app. This is where a product team spends most of its time once the platform choice is made.
Two design principles cut across all three layers, and they are worth internalizing before you write any code. First, start serverless and managed; earn your way down the stack. Every step toward raw infrastructure buys control and costs you undifferentiated operational work — capacity planning, drivers, autoscaling, patching — so take those steps only when a concrete requirement forces them. Second, the foundation model is a swappable component, not the architecture. Models are released and re-priced constantly; treat the model as configuration behind a thin interface so you can re-evaluate and switch without a rewrite.
Nearly every generative-AI product on AWS is one of four patterns, or a composition of them. Naming the pattern up front is the single highest-leverage decision you make, because it determines which services you need and which you can ignore. Build the simplest pattern that solves the problem; compose upward only when the use case demands it.
A useful rule of thumb: a chatbot becomes RAG the moment it needs to answer from your private data; RAG becomes an agent the moment it needs to act rather than only answer; and any of them spills into batch the moment the work is high-volume and not interactive. Most teams walk that exact ladder over their first year.
What it is: a request hits your backend, you assemble a prompt (system instructions + user message + maybe a few-shot example), you call a model on Bedrock, you stream tokens back. No private-data retrieval, no tools. Internal copilots, drafting assistants, classification, summarization, extraction, and "explain this" features all start here.
Shape on AWS: a frontend, an API tier (API Gateway + AWS Lambda, or a small container on ECS Fargate / App Runner), and a streaming call to the Bedrock Runtime Converse API. Conversation state, if any, lives in DynamoDB or ElastiCache, not in the model. Wrap the model call in a Guardrail and emit token/latency metrics.
Why start here: it is the cheapest pattern to stand up and the fastest to prove or kill an idea. If a stateless endpoint plus good prompting solves the problem, do not add retrieval or agents for their own sake.
What it is: ground the model in your data so it answers from your documents, policies, tickets, or catalog instead of its training data. The flow is two phases. Ingestion (offline): chunk source documents, embed each chunk into a vector with an embeddings model (Titan Text Embeddings or Cohere Embed), and store the vectors. Query (online): embed the user question, retrieve the most similar chunks, stuff them into the prompt as context, and ask the model to answer using only that context.
Shape on AWS: the fast path is Bedrock Knowledge Bases, which manages chunking, embedding, retrieval, and the optional generation step for you. Point it at an S3 bucket and a vector store and it handles the pipeline. The build-it-yourself path uses a vector store directly — OpenSearch Serverless (vector engine), Aurora PostgreSQL with pgvector, or a managed third party — with your own ingestion Lambda and retrieval logic when you need fine control over chunking, hybrid search, or re-ranking.
Why it matters: RAG is the default answer to "the model does not know about our stuff" and a strong answer to hallucination, because the model is told to answer from retrieved, attributable sources. It is the most common production GenAI pattern in the enterprise. Most quality problems in RAG are retrieval problems (bad chunking, weak embeddings, no re-ranking), not model problems — instrument retrieval before you blame the model.
What it is: the model is given a set of tools (functions, APIs, a knowledge base, code execution) and the autonomy to decide which to call, in what order, to accomplish a goal. Instead of only producing text, it acts — looks up an order, files a ticket, queries a database, calls your internal service — and reasons over the results across multiple turns.
Shape on AWS: Amazon Bedrock Agents orchestrates the reason-act loop, calls action groups (backed by Lambda or an OpenAPI schema), pulls from Knowledge Bases for grounding, and manages session state. For multi-agent systems and more code-first control, teams use the open-source Strands Agents SDK or frameworks like LangGraph and CrewAI running on AWS compute, often calling Bedrock underneath.
Honest tradeoff: agents are powerful and the highest-variance pattern. Every added tool and autonomous step multiplies the ways things go wrong, multiplies token cost (the model re-reads the growing transcript each turn), and complicates evaluation. Start with the fewest tools that work, constrain the action space tightly, and add autonomy only where it earns its keep. Many "agent" requirements are satisfied by RAG plus one or two well-defined tool calls.
What it is: generation over a large set of inputs where no human is waiting — classify a million support tickets, summarize a document corpus, extract structured fields from a back-catalog, generate embeddings for an entire knowledge base, enrich a data warehouse. Throughput and cost matter; per-request latency does not.
Shape on AWS: Bedrock Batch Inference takes a JSONL manifest in S3, processes it asynchronously, and writes results back to S3 — at a roughly 50% discount versus on-demand because you are trading latency for price. Orchestrate with Step Functions or an event-driven pipeline (S3 event → queue → workers). For self-hosted models, SageMaker Batch Transform or async endpoints play the same role.
Why it matters: moving non-interactive work to batch is one of the largest and most overlooked cost levers in a GenAI system. Teams routinely run interactive traffic on-demand and push everything offline to batch, cutting that slice of the bill in half with no quality change.
These are building blocks, not rival camps. A mature product is often RAG inside an agent (the agent uses retrieval as one of its tools) with a batch pipeline pre-computing embeddings and enrichment, and a plain chat endpoint for the simple features. Pick the smallest pattern per feature; let the system grow into a composition.
This is the fork that determines your cost structure, your operational burden, and how fast you ship. The good news: for most teams in 2026 the answer is clear, and the cases that justify the harder paths are specific and recognizable.
Choose Amazon Bedrock (the default for ~90% of teams) when you want to consume a foundation model through an API with zero infrastructure to manage. No GPUs, no endpoints to keep warm, no capacity planning. You get a catalog of 100+ models from Anthropic, Meta, Mistral, Amazon (Nova, Titan), Cohere, AI21, and others behind a single API, with Guardrails, Knowledge Bases, and Agents built in. Pricing is per-token on-demand (pay only for what you call), with Provisioned Throughput available when you need reserved capacity and predictable latency. If you are building a chatbot, RAG system, or agent on top of a strong general model, this is almost certainly your layer.
Choose Amazon SageMaker AI when you need to own the model itself. Deep fine-tuning or continued pre-training on your data; hosting an open-weight model that Bedrock does not carry (or a custom architecture); full control of the training loop, the serving container, and autoscaling behavior; or an ML platform that spans classical ML and generative AI for a data-science team. SageMaker removes the heavy lifting of training and hosting but leaves you owning endpoints, instance types, and scaling — more control, more responsibility. (Note: Bedrock also offers managed fine-tuning and Custom Model Import, which covers many customization needs without leaving the serverless world — try that before committing to SageMaker.)
Choose self-hosting on EC2 / EKS (with Trainium, Inferentia, or NVIDIA GPUs, often via SageMaker HyperPod for large clusters) only when a concrete requirement forces it: extreme scale where per-token economics beat managed pricing; ultra-low-latency or data-residency needs a managed endpoint cannot meet; a specific open-weight model with licensing or modification requirements; or you are training a foundation model from scratch. This path has the lowest unit cost at very high, steady utilization and the highest operational cost in every other respect. Most teams never need it; the ones that do, know exactly why.
The pattern across all three: control and unit-cost-at-scale increase as you move down the stack; speed-to-ship and operational simplicity increase as you move up. Start at the top and move down one level only when you can name the requirement the current level cannot meet. "We might need it later" is not that requirement — Bedrock makes most "later" needs cheap to satisfy when they actually arrive.
With 100+ models on Bedrock, "which model?" feels paralyzing. It is not, if you replace vibes with a short, structured process. The headline: there is no single best model — there is a best model for a given task at a given quality bar and a given cost ceiling, and that answer changes per workload.
Sort your workloads onto a quality/cost ladder and match each to a model tier rather than picking one model for everything. A frontier reasoning model (e.g., the largest Claude or comparable) for the hard tasks — complex reasoning, agentic tool use, nuanced writing, code. A balanced mid-tier model for the bulk of production traffic where it is plenty capable. A small, fast, cheap model (e.g., Claude Haiku, Amazon Nova Micro/Lite, Mistral small, Llama 8B-class) for high-volume, well-scoped tasks like classification, routing, extraction, and short summaries. Routing the easy work to a small model is one of the biggest cost wins available and usually costs nothing in quality.
Beyond the quality/cost axis, screen on the practical constraints that disqualify models fast: context window (does your RAG or document workload fit?), modality (text-only vs vision/image/multimodal), latency and streaming (interactive UX vs offline batch), tool-use and structured-output support (mandatory for agents and for JSON-mode pipelines), region and data-residency availability (not every model is in every region — check before you design around one), and licensing for open-weight models you might self-host.
1 — Build a representative eval set first. Collect 30–100 real inputs from your use case with known-good outputs or clear acceptance criteria. This is the most important artifact in the whole project and the one teams most often skip. Without it, model selection is taste; with it, it is measurement.
2 — Shortlist 2–4 candidates across tiers based on the constraints above (context, modality, latency, region, tools). Do not test all 100 — test one frontier, one or two mid, one small.
3 — Run the eval set through each. Score quality (LLM-as-judge plus human spot-checks), and record latency and per-request token cost for each. Use Bedrock Model Evaluation to run this systematically rather than eyeballing a handful of prompts.
4 — Pick the cheapest model that clears your quality bar for each workload tier — not the most capable model overall. Re-run this exact process every quarter as new models and prices land; the thin model interface from Section I is what makes that re-evaluation cheap.
A model call in a notebook is a demo. Production is the scaffolding around the call: where retrieval lives, how you keep it safe, how you know it works, and how you see what it is doing. Underinvesting here is the most common reason promising POCs never ship.
Five components carry most production GenAI systems. None of them are the model. All of them are where reliability, trust, and cost actually come from.
GenAI cost is dominated by tokens: input tokens plus output tokens, priced per model. Two systems with identical features can differ 5–10× in cost based purely on engineering discipline. The good news is that the biggest levers are simple, and most of them cost nothing in quality.
Think about cost in two buckets. First, reduce tokens — the number one driver. Trim bloated system prompts; retrieve fewer, better chunks instead of stuffing the whole document; cap conversation history with summarization instead of replaying the full transcript every turn; and constrain output length. Most teams are paying for tokens they do not need, especially on the input side of RAG and multi-turn chat. Second, price tokens lower for the tokens you do send.
Route to the right-sized model. Sending classification, routing, and extraction to a small model instead of a frontier model often cuts the cost of that traffic by 10–20× with no quality loss. Bedrock Intelligent Prompt Routing can do this automatically per request.
Use prompt caching. When many requests share a large, stable prefix — a long system prompt, a tool schema, a fixed knowledge block — Bedrock prompt caching charges that prefix at a steep discount on cache hits. For RAG and agents with big static contexts, this is a major, low-effort saving.
Move offline work to Batch. Anything non-interactive belongs in Bedrock Batch Inference at roughly half the on-demand price. Reclassifying interactive-priced work as batch is a frequent quick win.
Match the pricing model to the traffic shape. Spiky or low volume → on-demand (pay per token, $0 when idle). High, steady, latency-sensitive volume → Provisioned Throughput for reserved capacity and predictable cost. Self-hosting only beats both at very high, sustained utilization where you can keep accelerators busy.
Then optimize tokens at the prompt level — shorter prompts, tighter retrieval, capped history and output — and measure each change against the eval set so a cost cut never becomes a silent quality cut.
Security objections are the most common reason GenAI projects stall in regulated organizations — and most of those objections have clean, documented answers on AWS. Knowing them up front turns a six-week governance fight into a one-meeting sign-off.
The foundational fact, and the one that unblocks the most reviews: your prompts and completions on Amazon Bedrock are not used to train the base foundation models, and your data is not shared with model providers. Inference runs within your AWS account boundary; data is encrypted in transit and at rest; and you can keep all traffic on the AWS network with VPC endpoints (AWS PrivateLink) so nothing traverses the public internet. For most enterprise privacy reviews, those three facts are the crux.
The most reliable way to ship GenAI is not a big-bang build — it is a staged path that proves value cheaply, then hardens, then scales. Each stage has a clear goal and a clear exit criterion, so you fail fast on bad ideas and invest only behind validated ones.
The single biggest predictor of whether a GenAI project ships is whether the team builds the evaluation set early and treats the supporting stack — not the model — as the real work. Teams that prove value in a narrow POC and then harden methodically ship; teams that try to build the perfect general system on day one tend to stall in demo purgatory.
Pick one narrow, high-value use case. Build the simplest pattern that could work — usually a stateless endpoint or basic Knowledge-Bases RAG on Bedrock. Hand-build a small eval set from real inputs. Goal: a working prototype real users can react to. Exit criterion: the prototype clears a "is this useful?" bar with actual users. If it does not, kill it cheap and move on — the whole point of crawl is to make that decision fast.
Now build the production scaffolding: guardrails, the automated eval harness, observability and cost tracking, proper retrieval (chunking, hybrid search, re-ranking) if RAG, and IAM/VPC/encryption. Run real traffic at limited scale, watch the dashboards, and tune retrieval and prompts against the eval set. Exit criterion: quality, latency, cost, and safety all sit inside targets you can defend, and you can change the system without regressing it.
Optimize cost (model routing, prompt caching, batch for offline work), choose the right pricing model for your traffic shape (on-demand vs Provisioned Throughput), and add capabilities (agents, more tools, more data sources) only when the eval harness confirms they help. Re-evaluate models quarterly. Steady state: a system you can evolve confidently because every change is measured against a known bar — that is what "production" actually means for GenAI.
The same fork as Section III, as a scannable table. Read top to bottom: simplicity and speed-to-ship are highest on the left; control and unit-cost-at-scale are highest on the right. Most 2026 teams should be in the left column and move right only on a named requirement.
| Dimension | Amazon Bedrock | Amazon SageMaker AI | Self-host (EC2/EKS) |
|---|---|---|---|
| What you manage | Nothing — call an API | Endpoints, instances, scaling | Everything: cluster, drivers, serving, scaling |
| Model access | 100+ managed FMs, one API | Your own / open-weight / fine-tuned | Any model you can run |
| Customization | Prompting, RAG, managed fine-tune, Custom Model Import | Deep fine-tune + continued pre-training | Full — train from scratch if you want |
| Time to first call | Minutes | Hours–days | Days–weeks |
| Idle cost | $0 (on-demand, per token) | Endpoint runs = it costs | Instances run = they cost |
| Best unit cost at | Spiky / low / medium volume | Custom models at moderate scale | Very high, steady utilization |
| Ops burden | Minimal | Moderate | High |
| Right for ~ | 90% of teams (default) | Custom-model / data-science teams | Scale, residency, or licensing edge cases |
Situation: Strong product team but no in-house GenAI experience. A weekend prototype on Bedrock impressed leadership, but it hallucinated on edge cases, had no guardrails, no eval harness, and no cost visibility — and a privacy review was blocking any rollout to customer data. They needed someone who had shipped production RAG to harden it, and they wanted to avoid burning runway on an open-ended consulting engagement.
What CloudRoute did: Routed within a day to a vetted AWS Advanced partner with production RAG and Bedrock Guardrails experience in the EU region. The partner moved the prototype onto Bedrock Knowledge Bases with OpenSearch Serverless, added hybrid search + re-ranking to fix retrieval quality, wired Bedrock Guardrails for PII redaction and contextual grounding, stood up a Model-Evaluation harness on a 60-example eval set, and added token/latency/cost observability in CloudWatch. VPC endpoints and KMS closed the privacy review. The engagement was filed as an AWS-funded GenAI POC, so the build work was credit-covered.
Outcome: Hallucination rate on the eval set dropped sharply once retrieval was fixed; privacy review signed off in one meeting on the VPC/KMS/no-training-on-your-data posture; the assistant went to production in 7 weeks. With model routing (small model for classification, mid-tier for answers) and prompt caching on the static system prompt, per-conversation cost landed well under target. CloudRoute's commission was paid by the partner from AWS engagement funding — the customer paid $0.
POC → production: 7 weeks · founder/eng time: ~12 hours of oversight · privacy review: 1 meeting · cost to customer: $0
CloudRoute routes you to a vetted AWS partner who picks the architecture, builds the supporting stack, and ships it — often as an AWS-funded GenAI POC, so you pay $0. No procurement. No open-ended consulting bill.