Almost every generative-AI system on AWS is one of seven reference architectures, or a composition of them: a simple chatbot, managed RAG with Bedrock Knowledge Bases, a DIY RAG pipeline, an agentic workflow, batch document processing, a fine-tuned or self-hosted model, and the enterprise multi-account platform that hosts the rest. This page is the catalogue — for each pattern: the exact AWS services it uses, when to reach for it (and when not to), what it costs in shape, and a link to the deep build guide. Use it to pick the right architecture before you write a line of code.
A reference architecture is a named, reusable shape: a set of components wired in a known way to solve a recurring problem. It is not a product you buy or a single service you turn on — it is the blueprint you assemble from AWS services. The value of naming them is that it makes the build decision a selection, not an invention.
Most teams starting a generative-AI project on AWS face the same paralysis: the stack has dozens of relevant services (Amazon Bedrock, SageMaker, Q, OpenSearch, Lambda, Step Functions, ECS, and more) and it is not obvious which to combine. The good news is that the space of useful combinations is small. Across the production systems people actually ship, almost everything is one of seven patterns — or a composition where one pattern calls another.
These seven are ordered roughly by complexity, from a stateless chatbot you can stand up in an afternoon to a multi-account enterprise platform that takes a quarter. Reading them in order is also a decision tree: start at the top, and stop at the first pattern that meets your requirement. The single most common and most expensive early mistake is reaching for a heavier pattern than the problem needs — building a DIY RAG pipeline when managed RAG would have answered the same questions, or fine-tuning a model to memorise facts that RAG retrieves more cheaply and keeps current.
For each pattern below you get four things: the shape (what is wired to what), the AWS services that typically implement it, when to use it and when not to, and a pointer to the deep build guide. Section VIII collects every service into one matrix so you can see at a glance which patterns share which building blocks. Section IX compares all seven side by side on the dimensions that drive the choice — build effort, control, latency, and cost shape.
One framing to keep throughout: Amazon Bedrock is the spine. Six of the seven patterns use Bedrock to serve foundation models through one API, with your data kept private to your account and never used to train the base models. The only pattern that can live without Bedrock is the self-hosted-model architecture (pattern 6), and even that often pairs with Bedrock for the models you do not want to host yourself.
Read the seven patterns top to bottom and stop at the first one that satisfies your requirement. Heavier is not better — every step down the list adds engineering, operational surface, and cost. Compose patterns only when a real requirement (grounding, tool use, scale, isolation) forces it.
The first four patterns cover the overwhelming majority of generative-AI applications: answer a question, answer it from your data, answer it from your data with custom retrieval, or take an action. Each builds on the one before.
These four are the workhorses. If you are building a customer-facing assistant, an internal knowledge tool, or a workflow that uses an LLM to decide and act, you are almost certainly building one of these. They share the same generation layer — a foundation model on Bedrock — and differ in what surrounds it.
Shape: a thin application layer calls a foundation model on Amazon Bedrock through the Converse API, optionally streaming the response, with conversation history held in the request or a fast store. No retrieval, no tools — the model answers from its own training plus whatever you put in the system prompt. Add Bedrock Guardrails for input/output safety and you have a production-shaped assistant.
Services: Amazon Bedrock (Converse API; a model such as Claude or Amazon Nova), API Gateway + AWS Lambda (or AWS App Runner / ECS) for the endpoint, Amazon DynamoDB or ElastiCache for session/history, Bedrock Guardrails for safety, and Amazon CloudWatch for logs and metrics.
When to use it: general-purpose assistants, copilots over public or prompt-supplied knowledge, drafting and rewriting tools, classification and extraction, and any case where the model does not need your private documents to answer. When not to: the moment answers must come from your own corpus (move to pattern 2) or the assistant must take actions in other systems (move to pattern 4).
Deep build guide: see Build a chatbot on AWS for the full walkthrough — model choice, streaming, memory, guardrails, and cost.
Shape: ground the model in your own documents without building a pipeline. You point a Bedrock Knowledge Base at an Amazon S3 bucket (or a connector — web crawler, Confluence, Salesforce, SharePoint), and Bedrock handles ingestion, chunking, embedding, vector storage, retrieval, and optional re-ranking. Your app calls RetrieveAndGenerate and gets a cited answer in one call.
Services: Amazon Bedrock Knowledge Bases, Amazon S3 (source documents), an embedding model (Amazon Titan Text Embeddings v2 or Cohere Embed), a vector store (Amazon OpenSearch Serverless by default; Aurora pgvector, Pinecone, or Redis selectable), a generation model on Bedrock, Bedrock Guardrails, and CloudWatch.
When to use it: internal knowledge assistants, support automation, policy and documentation Q&A — any case where the answer must come from your documents and standard fixed/semantic/hierarchical chunking is good enough. This is the default RAG choice and covers most use cases. When not to: you need custom document-aware chunking, hybrid retrieval with your own score fusion, or row-level multi-tenant isolation the managed path cannot express — then move to pattern 3.
Deep build guide: see How to build RAG on AWS, which covers managed vs DIY, vector stores, embeddings, re-ranking, and evaluation in depth.
Shape: the same logical pipeline as pattern 2 — ingest → chunk → embed → store → retrieve → re-rank → generate — but you own every stage. Your own parser, your own chunker (Lambda, AWS Glue, or Step Functions), direct writes to a vector store you control, your own hybrid (vector + keyword) retrieval and score fusion, your own re-ranking call, and your own prompt assembly. Bedrock still serves the embeddings and generation; you orchestrate everything in between, often with LangChain or LlamaIndex.
Services: Amazon Bedrock (embeddings + generation + Rerank), Amazon S3, AWS Lambda / AWS Glue / AWS Step Functions for the indexing workflow, a vector store you operate (OpenSearch Serverless, Aurora PostgreSQL with pgvector, Pinecone, or Amazon MemoryDB/Redis), and CloudWatch.
When to use it: custom or document-aware chunking; hybrid search with your own fusion; strict multi-tenant or row-level access control; reuse of a vector store you already run; or aggressive cost/latency tuning where the managed convenience premium matters. When not to: you do not yet have a concrete requirement the managed path fails — most teams overbuild here and pay for a pipeline they then maintain.
Deep build guide: the DIY path is covered alongside managed in How to build RAG on AWS — including the exact line where DIY starts to pay for itself.
Shape: instead of only answering, the model acts. An agent takes a goal, reasons about steps, calls tools (APIs, databases, code, other models) to gather information and make changes, observes the results, and loops until the task is done. On AWS this is implemented with Amazon Bedrock Agents — you define action groups (tools, typically backed by Lambda or an OpenAPI schema), optionally attach a Knowledge Base for grounding, and Bedrock runs the reason-act loop. For complex multi-step or multi-agent orchestration, AWS Step Functions coordinates the flow.
Services: Amazon Bedrock Agents, AWS Lambda (action-group implementations), Amazon Bedrock Knowledge Bases (optional grounding), AWS Step Functions (multi-step / multi-agent orchestration), Amazon API Gateway (tool endpoints), Bedrock Guardrails, and CloudWatch for tracing the agent's steps.
When to use it: tasks that require taking actions, not just answering — booking, updating records, querying live systems, multi-step research, running code, or coordinating several specialised sub-agents. When not to: a single retrieval-and-answer is enough (use pattern 2) — agents add latency, cost, and a much larger surface to test and secure, so do not reach for one until the task genuinely needs tools.
Deep build guide: see Build an AI agent on AWS for action groups, orchestration, tool design, and guardrails for agentic systems.
The last three patterns cover the heavier end: processing documents at volume offline, owning a model's weights or hosting, and the platform layer that lets a whole organisation build the first six safely. These are where SageMaker, batch inference, and multi-account governance enter.
Patterns 5 through 7 are not "advanced" in the sense of better — they solve different problems. Pattern 5 is about throughput and cost on offline work; pattern 6 is about control over the model itself; pattern 7 is about letting many teams build patterns 1–6 without each reinventing security and governance.
Shape: run a foundation model over a large set of documents offline, where latency does not matter and throughput and cost do. Documents land in S3, a workflow fans them out, each is parsed and sent to a model for extraction, classification, summarisation, or enrichment, and results are written back to S3 or a database. Because the work is asynchronous, you use Bedrock batch inference, which processes large jobs at roughly half the on-demand token price.
Services: Amazon S3 (input + output), Amazon Textract (parse scanned PDFs, forms, tables), Amazon Bedrock batch inference (the model calls), AWS Step Functions or Amazon SQS for the fan-out workflow, AWS Lambda for per-document handling, and optionally Amazon Bedrock Data Automation for managed document-to-structured-output pipelines.
When to use it: invoice and contract extraction, bulk document classification, transcript or report summarisation, dataset enrichment, and any high-volume back-office task that runs on a schedule rather than per user request. When not to: the result is needed interactively in real time — then it belongs in pattern 1, 2, or 4 with on-demand (not batch) inference.
Deep build guide: the batch and cost mechanics (batch inference, prompt caching, model choice) are covered across the AWS AI cluster — start from the AWS AI & Bedrock hub for the relevant service pages.
Shape: when a prompt and retrieval are not enough, you adapt or own the model. Two sub-shapes: fine-tune a model on Bedrock (custom model) to teach a consistent format, tone, or narrow skill on your labelled data; or self-host an open-weights model on Amazon SageMaker (real-time, serverless, or asynchronous endpoints) when you need full control over the model, its version, its hardware, or its data path. SageMaker JumpStart provides ready open models; AWS Trainium and Inferentia provide cheaper-than-GPU silicon via the Neuron SDK for training and inference.
Services: Amazon Bedrock fine-tuning / custom models (managed adaptation), Amazon SageMaker (training jobs, JumpStart, endpoints) for self-hosting, AWS Trainium / Inferentia instances + the Neuron SDK for cost-efficient compute, Amazon S3 for training data and artifacts, and CloudWatch for monitoring.
When to use it: a consistent output style or domain behaviour a base model will not reliably produce; a narrow specialised task where a smaller fine-tuned model is cheaper and faster than a frontier one; strict requirements to host a specific open model in your own VPC; or research workloads needing full training control. When not to: the goal is to add knowledge — RAG (patterns 2–3) keeps facts current and citable far more cheaply than baking them into weights.
Related guides: Bedrock vs SageMaker, fine-tuning, and the AI-silicon pages live in the AWS AI & Bedrock hub.
Shape: the platform layer that lets many teams build patterns 1–6 safely on shared rails. A multi-account AWS Organizations landing zone separates workloads; a centralised model-access and gateway layer fronts Bedrock so every team uses approved models with consistent guardrails, logging, quotas, and cost attribution; private networking (VPC endpoints / PrivateLink) keeps traffic off the public internet; and identity, audit, and FinOps are built in from the start. This is less a single app than the foundation the rest run on.
Services: AWS Organizations + AWS Control Tower (landing zone), Amazon Bedrock with cross-region inference and provisioned throughput, an internal LLM gateway (often API Gateway + Lambda in front of Bedrock), Bedrock Guardrails as an org standard, AWS IAM Identity Center (SSO), AWS PrivateLink / VPC endpoints for private access, AWS CloudTrail + CloudWatch for audit, and AWS Budgets / Cost Explorer for per-team attribution.
When to use it: a mid-size or larger organisation with several teams shipping GenAI, where central security, compliance, cost control, and reuse matter more than any single app's speed. When not to: you are one team shipping one product — build the relevant pattern (1–6) directly and add the platform later when the second and third teams arrive.
Deep build guide: see Generative AI on AWS for enterprises for the landing zone, gateway, governance, and FinOps detail.
The seven patterns are building blocks, and most production systems use more than one. Knowing the common compositions saves you from treating them as mutually exclusive choices when they are really layers.
The patterns nest naturally because they share the same Bedrock generation layer. An agent (pattern 4) routinely calls a Knowledge Base (pattern 2) as one of its tools, so it can both answer from your documents and act. A managed RAG assistant (pattern 2) may sit behind the chatbot interface of pattern 1. A fine-tuned model (pattern 6) can be the generation model inside a RAG pipeline (patterns 2–3) — RAG for the facts, fine-tuning for the behaviour. And in a mature organisation, every one of these runs on top of the enterprise platform (pattern 7), which supplies the model access, guardrails, networking, and cost controls they all need.
Two compositions are worth calling out because they are so common. First, agentic RAG: an agent whose primary tool is a Knowledge Base, giving you grounded answers and the ability to take follow-up actions — the shape behind most "do something with my data" assistants. Second, RAG over a fine-tuned model: keep volatile knowledge in the retrieval layer where it stays current and citable, and use a lightly fine-tuned model only to lock in a house style or output format. Reaching for fine-tuning to store facts that change is the classic anti-pattern; this composition is the right way to combine the two.
The practical implication is the same as the selection rule: start at the simplest pattern that works, and add a layer only when a concrete requirement appears. A chatbot becomes managed RAG when answers must be grounded; managed RAG becomes an agent when the assistant must act; any of them moves onto the enterprise platform when a second team needs the same rails. Each step is a deliberate addition, not a rewrite — which is exactly why naming the patterns is useful.
Agentic RAG (pattern 4 + pattern 2): an agent whose main tool is a Bedrock Knowledge Base — grounded answers plus the ability to act. RAG over a fine-tuned model (patterns 2/3 + pattern 6): retrieval supplies current, citable facts; light fine-tuning supplies a consistent style. Knowledge in retrieval, behaviour in the model.
Almost every pattern question eventually reduces to one decision: managed foundation models through Amazon Bedrock, or full model control through Amazon SageMaker. They are complementary, not competing, and the line between them is clean.
Amazon Bedrock is the managed foundation-model API. You get many models — Anthropic (Claude), Meta (Llama), Mistral, Amazon (Nova + Titan), Cohere, Stability AI, AI21, DeepSeek — through one interface, with enterprise security and privacy (your prompts and data are not used to train the base models and stay in your account and Region). On top of the raw models, Bedrock layers the capabilities the patterns above lean on: the Converse API, Agents, Knowledge Bases (managed RAG), Guardrails, fine-tuning and custom models, model distillation, Flows, Prompt Management, batch inference, provisioned throughput, prompt caching, and model evaluation. For patterns 1–5 and the managed half of pattern 6, Bedrock is the whole serving layer — you never touch infrastructure.
Amazon SageMaker is the end-to-end ML platform for when you need to build, train, and host models yourself: the Studio IDE, JumpStart's catalogue of open models, training jobs, and endpoints (real-time, serverless, asynchronous). You reach for SageMaker in the self-hosting half of pattern 6 — running a specific open-weights model in your own VPC, training or heavily fine-tuning a model with full control, or serving on custom hardware such as AWS Trainium and Inferentia via the Neuron SDK. SageMaker gives maximum control at the cost of owning the operational surface that Bedrock hides.
The rule of thumb across all seven patterns: default to Bedrock; move to SageMaker only when you need to own the model or the hardware. Bedrock is the spine because it serves the models without infrastructure; SageMaker is the workbench you pull out when managed serving cannot give you the control a specific requirement demands. Many enterprise platforms (pattern 7) standardise on Bedrock for general use and keep SageMaker available for the teams that genuinely need it. The deeper Bedrock-vs-SageMaker comparison lives on its own page in the AWS AI & Bedrock hub.
Each pattern has a characteristic cost shape — what you pay for, when, and which lever controls it. Knowing the shape before you build prevents the most common budget surprises, almost all of which come from a baseline you did not expect or generation tokens you did not meter.
The figures and ranges in this section are representative as of 2026 to convey shape, not quotes — always check the AWS pricing page (and the third-party vendor for Pinecone) for current rates. Two cost ideas recur across every pattern. Generation tokens — input plus output, priced per model — are usually the largest line in any interactive pattern, which is why model choice, fewer/tighter chunks, prompt caching, and tight max-token limits are the highest-leverage levers. And a baseline you pay regardless of traffic — chiefly an always-on vector store (patterns 2–3) or a hosted SageMaker endpoint (pattern 6) — is what surprises teams that budgeted only for per-call costs.
Bedrock's pricing modes map onto the patterns directly. On-Demand (per 1K input/output tokens) suits interactive patterns 1, 2, 4. Batch (~50% cheaper) is the right mode for pattern 5's offline volume. Provisioned Throughput (reserved capacity) fits steady high-volume production and pattern 7's shared platform. Prompt caching cuts the cost of repeated context across all of them. Customisation and storage fees apply when you fine-tune (pattern 6). The table below summarises where the money goes per pattern.
| Pattern | Dominant cost | Baseline (pay regardless of traffic) | Bedrock pricing mode | Top lever to control it |
|---|---|---|---|---|
| 1. Chatbot | Generation tokens | Minimal (serverless compute) | On-Demand (+ caching) | Model choice; prompt caching; max-tokens |
| 2. Managed RAG | Generation tokens | Vector store (e.g. OpenSearch OCUs) | On-Demand | Re-rank to fewer chunks; right-size vector store |
| 3. DIY RAG | Generation + engineering time | Vector store + your compute | On-Demand | Tune every stage; cheaper/embedded dims; caching |
| 4. Agentic workflow | Generation tokens (multi-turn) | Vector store if grounded | On-Demand | Cap loop steps; cheaper model for routing; caching |
| 5. Batch processing | Generation tokens (high volume) | Minimal (offline) | Batch (~50% off) | Batch mode; model choice; tight prompts |
| 6. Fine-tuned / self-hosted | Training + hosted endpoint | Endpoint / custom-model storage | Custom + Provisioned | Right-size endpoint; Trainium/Inferentia; serverless |
| 7. Enterprise platform | Aggregate of all hosted patterns | Provisioned throughput + networking | Provisioned + On-Demand | Per-team budgets; provisioned commit; central caching |
Run your use case through these questions in order. The first one whose answer is "yes" points you at the pattern to build — and tells you which deep guide to open next.
Every pattern is assembled from a shared set of AWS services. This matrix shows, for each core service, which of the seven patterns relies on it — so you can see the common spine (Bedrock everywhere) and where the heavier services (SageMaker, Trainium, Organizations) only appear at the deep end.
Read each row as "this service is used by these patterns." A dot in a cell means the service is a typical building block of that pattern; it is not an exhaustive dependency list, but it captures the components you would actually name on an architecture diagram. The pattern numbers map to the catalogue above: 1 chatbot · 2 managed RAG · 3 DIY RAG · 4 agentic · 5 batch · 6 fine-tuned/self-hosted · 7 enterprise platform.
| AWS service | Role | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Amazon Bedrock (models / Converse) | Foundation-model serving | ● | ● | ● | ● | ● | ◐ | ● |
| Bedrock Knowledge Bases | Managed RAG | ● | ◐ | ◐ | ||||
| Bedrock Agents | Tool-using agent loop | ● | ◐ | |||||
| Bedrock Guardrails | Input/output safety | ● | ● | ● | ● | ● | ||
| Bedrock batch inference | Offline high-volume calls | ● | ◐ | |||||
| Amazon SageMaker | Train / host your own model | ● | ◐ | |||||
| Trainium / Inferentia (Neuron) | Cheaper-than-GPU silicon | ● | ◐ | |||||
| Amazon S3 | Source data + artifacts | ● | ● | ◐ | ● | ● | ● | |
| Vector store (OpenSearch / pgvector / Pinecone / Redis) | Embedding storage + ANN search | ● | ● | ◐ | ◐ | |||
| Amazon Textract | Parse PDFs / forms / tables | ◐ | ◐ | ● | ||||
| AWS Lambda | Glue / tools / endpoints | ● | ● | ● | ● | ● | ||
| AWS Step Functions | Workflow / multi-step orchestration | ◐ | ● | ● | ||||
| API Gateway | HTTP front door / tool endpoints | ● | ● | ● | ||||
| AWS Organizations + Control Tower | Multi-account landing zone | ● | ||||||
| PrivateLink / VPC endpoints | Private network access | ◐ | ◐ | ● | ||||
| CloudWatch / CloudTrail | Logging / metrics / audit | ● | ● | ● | ● | ● | ● | ● |
One table to choose from. Read across each row for the pattern's fit, then down the columns to compare build effort, control, and cost. The deep build guides are linked from the catalogue sections above.
| Pattern | What it does | Core AWS services | Build effort | Best for | Avoid when |
|---|---|---|---|---|---|
| 1. Simple chatbot | Answers from model + prompt | Bedrock Converse, Lambda/API GW, Guardrails | Hours | Assistants, copilots, drafting, extraction | Answers must come from your docs |
| 2. Managed RAG | Answers from your docs (cited) | Bedrock Knowledge Bases, S3, OpenSearch | Hours–days | Internal knowledge, support Q&A | Need custom chunking / hybrid / row-level ACLs |
| 3. DIY RAG | RAG with full pipeline control | Bedrock, vector store, Lambda/Glue/Step Fns | Days–weeks | Custom chunking, hybrid search, multi-tenancy | No concrete requirement managed RAG fails |
| 4. Agentic workflow | Reasons + acts via tools | Bedrock Agents, Lambda, Step Functions, KB | Days–weeks | Tasks needing actions / multi-step / tools | A single retrieve-and-answer suffices |
| 5. Batch processing | Model over many docs, offline | Bedrock batch, Textract, Step Functions, S3 | Days | Bulk extract / classify / summarise | Result needed interactively in real time |
| 6. Fine-tuned / self-hosted | Adapt or own the model | Bedrock custom model OR SageMaker + Neuron | Weeks | Specific style/skill; host own open model | Goal is knowledge (use RAG instead) |
| 7. Enterprise platform | Shared rails for all patterns | Organizations, Bedrock gateway, Guardrails, PrivateLink | Weeks–quarter | Many teams; central security + FinOps | You are one team shipping one product |
Situation: The team had pitched investors on "AI" and started hand-building a DIY RAG pipeline plus an early attempt at fine-tuning a model on their documents — two of the heaviest patterns at once, before validating either. Answers were unreliable, the fine-tune had baked in stale policy facts, and the bulk back-office extraction they actually needed was being run one document at a time through on-demand inference, running up the bill. The two ML-capable engineers were fully committed to the core product, and the projected Bedrock + hosting cost had the founder hesitating to continue.
What CloudRoute did: Routed within 24 hours to a US-East AWS partner with a GenAI/ML track record. The partner ran the catalogue decision tree and re-scoped the work onto the right patterns: <strong>managed RAG (pattern 2)</strong> on Bedrock Knowledge Bases for the cited policy assistant (S3 + hierarchical chunking + Titan v2 + OpenSearch Serverless + Cohere Rerank + Claude), <strong>batch document processing (pattern 5)</strong> with Bedrock batch inference + Textract + Step Functions for the high-volume claims extraction at ~half the token cost, and they dropped the fine-tune entirely — knowledge belonged in retrieval, not weights. The whole engagement was funded by AWS credits the partner filed for: Activate Portfolio plus a Bedrock POC allocation.
Outcome: A cited policy assistant and a batch extraction pipeline both in production in about six weeks — two correct patterns instead of two overbuilt ones. The abandoned fine-tune saved ongoing training and hosting cost; moving extraction to batch cut that line roughly in half. The build and the first months of inference ran on AWS credits — the customer paid $0. CloudRoute's commission was paid by the partner from AWS engagement funding.
engagement window: ~6 weeks · founder time: ~8 hours · patterns built: 2 (managed RAG + batch) · cost to customer: $0
Tell CloudRoute the problem; we route you to a vetted AWS GenAI/ML partner who picks the right reference architecture and ships it — chatbot, managed or DIY RAG, agentic workflow, batch processing, a fine-tuned/self-hosted model, or the enterprise platform. AWS credits fund the build and the inference. You pay $0.