aws genai reference architectures · 2026

GenAI reference architectures on AWS — the seven canonical patterns.

Almost every generative-AI system on AWS is one of seven reference architectures, or a composition of them: a simple chatbot, managed RAG with Bedrock Knowledge Bases, a DIY RAG pipeline, an agentic workflow, batch document processing, a fine-tuned or self-hosted model, and the enterprise multi-account platform that hosts the rest. This page is the catalogue — for each pattern: the exact AWS services it uses, when to reach for it (and when not to), what it costs in shape, and a link to the deep build guide. Use it to pick the right architecture before you write a line of code.

reference patterns
7
core service
Amazon Bedrock
deep build guides
4 linked
credits to fund it
up to $1M
TL;DR
  • Generative AI on AWS collapses into seven canonical reference architectures: (1) simple chatbot, (2) managed RAG on Bedrock Knowledge Bases, (3) DIY RAG, (4) agentic workflow, (5) batch document processing, (6) fine-tuned / self-hosted model, and (7) the enterprise multi-account platform. Real systems are usually one of these or a composition of two or three.
  • Amazon Bedrock is the common spine across all seven — it serves the foundation models (Claude, Amazon Nova, Llama, Mistral, Titan), the managed RAG (Knowledge Bases), the agents, the guardrails, and batch inference through one API. SageMaker enters only when you need full training/hosting control; the enterprise pattern wraps everything in Organizations, networking, and an LLM gateway.
  • Choosing the wrong pattern is the most expensive early mistake — teams hand-roll DIY RAG when managed would have shipped in a day, or fine-tune when RAG was the right tool. This page maps each pattern to its services, fit, and cost shape. GenAI bills scale fast; CloudRoute routes you to AWS credits (Activate Portfolio up to $100K, Bedrock/GenAI POC $10K–$50K, GenAI Accelerator up to $1M) and vetted ML partners who build any pattern — you pay $0.
orientation

IHow to read this catalogue — patterns, not products

A reference architecture is a named, reusable shape: a set of components wired in a known way to solve a recurring problem. It is not a product you buy or a single service you turn on — it is the blueprint you assemble from AWS services. The value of naming them is that it makes the build decision a selection, not an invention.

Most teams starting a generative-AI project on AWS face the same paralysis: the stack has dozens of relevant services (Amazon Bedrock, SageMaker, Q, OpenSearch, Lambda, Step Functions, ECS, and more) and it is not obvious which to combine. The good news is that the space of useful combinations is small. Across the production systems people actually ship, almost everything is one of seven patterns — or a composition where one pattern calls another.

These seven are ordered roughly by complexity, from a stateless chatbot you can stand up in an afternoon to a multi-account enterprise platform that takes a quarter. Reading them in order is also a decision tree: start at the top, and stop at the first pattern that meets your requirement. The single most common and most expensive early mistake is reaching for a heavier pattern than the problem needs — building a DIY RAG pipeline when managed RAG would have answered the same questions, or fine-tuning a model to memorise facts that RAG retrieves more cheaply and keeps current.

For each pattern below you get four things: the shape (what is wired to what), the AWS services that typically implement it, when to use it and when not to, and a pointer to the deep build guide. Section VIII collects every service into one matrix so you can see at a glance which patterns share which building blocks. Section IX compares all seven side by side on the dimensions that drive the choice — build effort, control, latency, and cost shape.

One framing to keep throughout: Amazon Bedrock is the spine. Six of the seven patterns use Bedrock to serve foundation models through one API, with your data kept private to your account and never used to train the base models. The only pattern that can live without Bedrock is the self-hosted-model architecture (pattern 6), and even that often pairs with Bedrock for the models you do not want to host yourself.

the one rule of pattern selection

Read the seven patterns top to bottom and stop at the first one that satisfies your requirement. Heavier is not better — every step down the list adds engineering, operational surface, and cost. Compose patterns only when a real requirement (grounding, tool use, scale, isolation) forces it.

patterns 1–4

IIPatterns 1–4 — chatbot, managed RAG, DIY RAG, agentic workflow

The first four patterns cover the overwhelming majority of generative-AI applications: answer a question, answer it from your data, answer it from your data with custom retrieval, or take an action. Each builds on the one before.

These four are the workhorses. If you are building a customer-facing assistant, an internal knowledge tool, or a workflow that uses an LLM to decide and act, you are almost certainly building one of these. They share the same generation layer — a foundation model on Bedrock — and differ in what surrounds it.

Pattern 1 — Simple chatbot (stateless / conversational)

Shape: a thin application layer calls a foundation model on Amazon Bedrock through the Converse API, optionally streaming the response, with conversation history held in the request or a fast store. No retrieval, no tools — the model answers from its own training plus whatever you put in the system prompt. Add Bedrock Guardrails for input/output safety and you have a production-shaped assistant.

Services: Amazon Bedrock (Converse API; a model such as Claude or Amazon Nova), API Gateway + AWS Lambda (or AWS App Runner / ECS) for the endpoint, Amazon DynamoDB or ElastiCache for session/history, Bedrock Guardrails for safety, and Amazon CloudWatch for logs and metrics.

When to use it: general-purpose assistants, copilots over public or prompt-supplied knowledge, drafting and rewriting tools, classification and extraction, and any case where the model does not need your private documents to answer. When not to: the moment answers must come from your own corpus (move to pattern 2) or the assistant must take actions in other systems (move to pattern 4).

Deep build guide: see Build a chatbot on AWS for the full walkthrough — model choice, streaming, memory, guardrails, and cost.

Pattern 2 — Managed RAG (Amazon Bedrock Knowledge Bases)

Shape: ground the model in your own documents without building a pipeline. You point a Bedrock Knowledge Base at an Amazon S3 bucket (or a connector — web crawler, Confluence, Salesforce, SharePoint), and Bedrock handles ingestion, chunking, embedding, vector storage, retrieval, and optional re-ranking. Your app calls RetrieveAndGenerate and gets a cited answer in one call.

Services: Amazon Bedrock Knowledge Bases, Amazon S3 (source documents), an embedding model (Amazon Titan Text Embeddings v2 or Cohere Embed), a vector store (Amazon OpenSearch Serverless by default; Aurora pgvector, Pinecone, or Redis selectable), a generation model on Bedrock, Bedrock Guardrails, and CloudWatch.

When to use it: internal knowledge assistants, support automation, policy and documentation Q&A — any case where the answer must come from your documents and standard fixed/semantic/hierarchical chunking is good enough. This is the default RAG choice and covers most use cases. When not to: you need custom document-aware chunking, hybrid retrieval with your own score fusion, or row-level multi-tenant isolation the managed path cannot express — then move to pattern 3.

Deep build guide: see How to build RAG on AWS, which covers managed vs DIY, vector stores, embeddings, re-ranking, and evaluation in depth.

Pattern 3 — DIY RAG pipeline (Bedrock + your own stack)

Shape: the same logical pipeline as pattern 2 — ingest → chunk → embed → store → retrieve → re-rank → generate — but you own every stage. Your own parser, your own chunker (Lambda, AWS Glue, or Step Functions), direct writes to a vector store you control, your own hybrid (vector + keyword) retrieval and score fusion, your own re-ranking call, and your own prompt assembly. Bedrock still serves the embeddings and generation; you orchestrate everything in between, often with LangChain or LlamaIndex.

Services: Amazon Bedrock (embeddings + generation + Rerank), Amazon S3, AWS Lambda / AWS Glue / AWS Step Functions for the indexing workflow, a vector store you operate (OpenSearch Serverless, Aurora PostgreSQL with pgvector, Pinecone, or Amazon MemoryDB/Redis), and CloudWatch.

When to use it: custom or document-aware chunking; hybrid search with your own fusion; strict multi-tenant or row-level access control; reuse of a vector store you already run; or aggressive cost/latency tuning where the managed convenience premium matters. When not to: you do not yet have a concrete requirement the managed path fails — most teams overbuild here and pay for a pipeline they then maintain.

Deep build guide: the DIY path is covered alongside managed in How to build RAG on AWS — including the exact line where DIY starts to pay for itself.

Pattern 4 — Agentic workflow (tool-using AI)

Shape: instead of only answering, the model acts. An agent takes a goal, reasons about steps, calls tools (APIs, databases, code, other models) to gather information and make changes, observes the results, and loops until the task is done. On AWS this is implemented with Amazon Bedrock Agents — you define action groups (tools, typically backed by Lambda or an OpenAPI schema), optionally attach a Knowledge Base for grounding, and Bedrock runs the reason-act loop. For complex multi-step or multi-agent orchestration, AWS Step Functions coordinates the flow.

Services: Amazon Bedrock Agents, AWS Lambda (action-group implementations), Amazon Bedrock Knowledge Bases (optional grounding), AWS Step Functions (multi-step / multi-agent orchestration), Amazon API Gateway (tool endpoints), Bedrock Guardrails, and CloudWatch for tracing the agent's steps.

When to use it: tasks that require taking actions, not just answering — booking, updating records, querying live systems, multi-step research, running code, or coordinating several specialised sub-agents. When not to: a single retrieval-and-answer is enough (use pattern 2) — agents add latency, cost, and a much larger surface to test and secure, so do not reach for one until the task genuinely needs tools.

Deep build guide: see Build an AI agent on AWS for action groups, orchestration, tool design, and guardrails for agentic systems.

patterns 5–7

IIIPatterns 5–7 — batch document processing, fine-tuned / self-hosted model, enterprise platform

The last three patterns cover the heavier end: processing documents at volume offline, owning a model's weights or hosting, and the platform layer that lets a whole organisation build the first six safely. These are where SageMaker, batch inference, and multi-account governance enter.

Patterns 5 through 7 are not "advanced" in the sense of better — they solve different problems. Pattern 5 is about throughput and cost on offline work; pattern 6 is about control over the model itself; pattern 7 is about letting many teams build patterns 1–6 without each reinventing security and governance.

Pattern 5 — Batch document processing (extract / classify / summarise at scale)

Shape: run a foundation model over a large set of documents offline, where latency does not matter and throughput and cost do. Documents land in S3, a workflow fans them out, each is parsed and sent to a model for extraction, classification, summarisation, or enrichment, and results are written back to S3 or a database. Because the work is asynchronous, you use Bedrock batch inference, which processes large jobs at roughly half the on-demand token price.

Services: Amazon S3 (input + output), Amazon Textract (parse scanned PDFs, forms, tables), Amazon Bedrock batch inference (the model calls), AWS Step Functions or Amazon SQS for the fan-out workflow, AWS Lambda for per-document handling, and optionally Amazon Bedrock Data Automation for managed document-to-structured-output pipelines.

When to use it: invoice and contract extraction, bulk document classification, transcript or report summarisation, dataset enrichment, and any high-volume back-office task that runs on a schedule rather than per user request. When not to: the result is needed interactively in real time — then it belongs in pattern 1, 2, or 4 with on-demand (not batch) inference.

Deep build guide: the batch and cost mechanics (batch inference, prompt caching, model choice) are covered across the AWS AI cluster — start from the AWS AI & Bedrock hub for the relevant service pages.

Pattern 6 — Fine-tuned or self-hosted model

Shape: when a prompt and retrieval are not enough, you adapt or own the model. Two sub-shapes: fine-tune a model on Bedrock (custom model) to teach a consistent format, tone, or narrow skill on your labelled data; or self-host an open-weights model on Amazon SageMaker (real-time, serverless, or asynchronous endpoints) when you need full control over the model, its version, its hardware, or its data path. SageMaker JumpStart provides ready open models; AWS Trainium and Inferentia provide cheaper-than-GPU silicon via the Neuron SDK for training and inference.

Services: Amazon Bedrock fine-tuning / custom models (managed adaptation), Amazon SageMaker (training jobs, JumpStart, endpoints) for self-hosting, AWS Trainium / Inferentia instances + the Neuron SDK for cost-efficient compute, Amazon S3 for training data and artifacts, and CloudWatch for monitoring.

When to use it: a consistent output style or domain behaviour a base model will not reliably produce; a narrow specialised task where a smaller fine-tuned model is cheaper and faster than a frontier one; strict requirements to host a specific open model in your own VPC; or research workloads needing full training control. When not to: the goal is to add knowledge — RAG (patterns 2–3) keeps facts current and citable far more cheaply than baking them into weights.

Related guides: Bedrock vs SageMaker, fine-tuning, and the AI-silicon pages live in the AWS AI & Bedrock hub.

Pattern 7 — Enterprise multi-account GenAI platform

Shape: the platform layer that lets many teams build patterns 1–6 safely on shared rails. A multi-account AWS Organizations landing zone separates workloads; a centralised model-access and gateway layer fronts Bedrock so every team uses approved models with consistent guardrails, logging, quotas, and cost attribution; private networking (VPC endpoints / PrivateLink) keeps traffic off the public internet; and identity, audit, and FinOps are built in from the start. This is less a single app than the foundation the rest run on.

Services: AWS Organizations + AWS Control Tower (landing zone), Amazon Bedrock with cross-region inference and provisioned throughput, an internal LLM gateway (often API Gateway + Lambda in front of Bedrock), Bedrock Guardrails as an org standard, AWS IAM Identity Center (SSO), AWS PrivateLink / VPC endpoints for private access, AWS CloudTrail + CloudWatch for audit, and AWS Budgets / Cost Explorer for per-team attribution.

When to use it: a mid-size or larger organisation with several teams shipping GenAI, where central security, compliance, cost control, and reuse matter more than any single app's speed. When not to: you are one team shipping one product — build the relevant pattern (1–6) directly and add the platform later when the second and third teams arrive.

Deep build guide: see Generative AI on AWS for enterprises for the landing zone, gateway, governance, and FinOps detail.

how they fit together

IVComposing the patterns — real systems are combinations

The seven patterns are building blocks, and most production systems use more than one. Knowing the common compositions saves you from treating them as mutually exclusive choices when they are really layers.

The patterns nest naturally because they share the same Bedrock generation layer. An agent (pattern 4) routinely calls a Knowledge Base (pattern 2) as one of its tools, so it can both answer from your documents and act. A managed RAG assistant (pattern 2) may sit behind the chatbot interface of pattern 1. A fine-tuned model (pattern 6) can be the generation model inside a RAG pipeline (patterns 2–3) — RAG for the facts, fine-tuning for the behaviour. And in a mature organisation, every one of these runs on top of the enterprise platform (pattern 7), which supplies the model access, guardrails, networking, and cost controls they all need.

Two compositions are worth calling out because they are so common. First, agentic RAG: an agent whose primary tool is a Knowledge Base, giving you grounded answers and the ability to take follow-up actions — the shape behind most "do something with my data" assistants. Second, RAG over a fine-tuned model: keep volatile knowledge in the retrieval layer where it stays current and citable, and use a lightly fine-tuned model only to lock in a house style or output format. Reaching for fine-tuning to store facts that change is the classic anti-pattern; this composition is the right way to combine the two.

The practical implication is the same as the selection rule: start at the simplest pattern that works, and add a layer only when a concrete requirement appears. A chatbot becomes managed RAG when answers must be grounded; managed RAG becomes an agent when the assistant must act; any of them moves onto the enterprise platform when a second team needs the same rails. Each step is a deliberate addition, not a rewrite — which is exactly why naming the patterns is useful.

the two most common compositions

Agentic RAG (pattern 4 + pattern 2): an agent whose main tool is a Bedrock Knowledge Base — grounded answers plus the ability to act. RAG over a fine-tuned model (patterns 2/3 + pattern 6): retrieval supplies current, citable facts; light fine-tuning supplies a consistent style. Knowledge in retrieval, behaviour in the model.

the spine vs the workbench

VWhere Bedrock ends and SageMaker begins

Almost every pattern question eventually reduces to one decision: managed foundation models through Amazon Bedrock, or full model control through Amazon SageMaker. They are complementary, not competing, and the line between them is clean.

Amazon Bedrock is the managed foundation-model API. You get many models — Anthropic (Claude), Meta (Llama), Mistral, Amazon (Nova + Titan), Cohere, Stability AI, AI21, DeepSeek — through one interface, with enterprise security and privacy (your prompts and data are not used to train the base models and stay in your account and Region). On top of the raw models, Bedrock layers the capabilities the patterns above lean on: the Converse API, Agents, Knowledge Bases (managed RAG), Guardrails, fine-tuning and custom models, model distillation, Flows, Prompt Management, batch inference, provisioned throughput, prompt caching, and model evaluation. For patterns 1–5 and the managed half of pattern 6, Bedrock is the whole serving layer — you never touch infrastructure.

Amazon SageMaker is the end-to-end ML platform for when you need to build, train, and host models yourself: the Studio IDE, JumpStart's catalogue of open models, training jobs, and endpoints (real-time, serverless, asynchronous). You reach for SageMaker in the self-hosting half of pattern 6 — running a specific open-weights model in your own VPC, training or heavily fine-tuning a model with full control, or serving on custom hardware such as AWS Trainium and Inferentia via the Neuron SDK. SageMaker gives maximum control at the cost of owning the operational surface that Bedrock hides.

The rule of thumb across all seven patterns: default to Bedrock; move to SageMaker only when you need to own the model or the hardware. Bedrock is the spine because it serves the models without infrastructure; SageMaker is the workbench you pull out when managed serving cannot give you the control a specific requirement demands. Many enterprise platforms (pattern 7) standardise on Bedrock for general use and keep SageMaker available for the teams that genuinely need it. The deeper Bedrock-vs-SageMaker comparison lives on its own page in the AWS AI & Bedrock hub.

what each pattern costs

VIThe cost shape of each pattern

Each pattern has a characteristic cost shape — what you pay for, when, and which lever controls it. Knowing the shape before you build prevents the most common budget surprises, almost all of which come from a baseline you did not expect or generation tokens you did not meter.

The figures and ranges in this section are representative as of 2026 to convey shape, not quotes — always check the AWS pricing page (and the third-party vendor for Pinecone) for current rates. Two cost ideas recur across every pattern. Generation tokens — input plus output, priced per model — are usually the largest line in any interactive pattern, which is why model choice, fewer/tighter chunks, prompt caching, and tight max-token limits are the highest-leverage levers. And a baseline you pay regardless of traffic — chiefly an always-on vector store (patterns 2–3) or a hosted SageMaker endpoint (pattern 6) — is what surprises teams that budgeted only for per-call costs.

Bedrock's pricing modes map onto the patterns directly. On-Demand (per 1K input/output tokens) suits interactive patterns 1, 2, 4. Batch (~50% cheaper) is the right mode for pattern 5's offline volume. Provisioned Throughput (reserved capacity) fits steady high-volume production and pattern 7's shared platform. Prompt caching cuts the cost of repeated context across all of them. Customisation and storage fees apply when you fine-tune (pattern 6). The table below summarises where the money goes per pattern.

cost shape by pattern · representative as of 2026 — check the AWS pricing page for current rates
PatternDominant costBaseline (pay regardless of traffic)Bedrock pricing modeTop lever to control it
1. ChatbotGeneration tokensMinimal (serverless compute)On-Demand (+ caching)Model choice; prompt caching; max-tokens
2. Managed RAGGeneration tokensVector store (e.g. OpenSearch OCUs)On-DemandRe-rank to fewer chunks; right-size vector store
3. DIY RAGGeneration + engineering timeVector store + your computeOn-DemandTune every stage; cheaper/embedded dims; caching
4. Agentic workflowGeneration tokens (multi-turn)Vector store if groundedOn-DemandCap loop steps; cheaper model for routing; caching
5. Batch processingGeneration tokens (high volume)Minimal (offline)Batch (~50% off)Batch mode; model choice; tight prompts
6. Fine-tuned / self-hostedTraining + hosted endpointEndpoint / custom-model storageCustom + ProvisionedRight-size endpoint; Trainium/Inferentia; serverless
7. Enterprise platformAggregate of all hosted patternsProvisioned throughput + networkingProvisioned + On-DemandPer-team budgets; provisioned commit; central caching
In every interactive pattern, generation tokens dominate and prompt caching + re-ranking are the biggest levers. In patterns 2, 3, and 6 an always-on baseline (vector store or hosted endpoint) is the cost teams forget — right-size it to actual corpus/traffic, not peak imagination. Pattern 5 should always use batch inference for ~50% off.
the decision tree

VIIA decision tree for picking your pattern

Run your use case through these questions in order. The first one whose answer is "yes" points you at the pattern to build — and tells you which deep guide to open next.

  • Does the assistant need to take actions (call APIs, update systems, run multi-step tasks), not just answer? — Yes → Pattern 4, agentic workflow (Bedrock Agents). If it also needs your documents, that is agentic RAG (pattern 4 + 2). Open Build an AI agent on AWS.
  • Must answers come from your own documents / knowledge? — Yes, and standard chunking is fine → Pattern 2, managed RAG (Bedrock Knowledge Bases). Yes, but you need custom chunking, hybrid search, or row-level isolation → Pattern 3, DIY RAG. Open How to build RAG on AWS.
  • Is the work high-volume and offline (no real-time user waiting)? — Yes → Pattern 5, batch document processing (Bedrock batch inference + Textract + Step Functions). Use batch mode for ~50% off.
  • Do you need a specific output style/skill a base model won't reliably produce, or must you host a specific open model in your VPC? — Style/skill on labelled data → fine-tune (pattern 6, Bedrock custom model). Own the model/hardware/data path → self-host (pattern 6, SageMaker + Trainium/Inferentia). But if the goal is knowledge, use RAG instead.
  • Are several teams going to build GenAI and need shared security, guardrails, networking, and cost control? — Yes → Pattern 7, enterprise multi-account platform (Organizations + Bedrock gateway + Guardrails + PrivateLink + FinOps). Open Generative AI on AWS for enterprises.
  • None of the above — you just need a model to answer from its own knowledge or a supplied prompt? — Pattern 1, simple chatbot (Bedrock Converse API + Guardrails). The fastest pattern to ship. Open Build a chatbot on AWS.
the building blocks

VIIIServices-used matrix — which pattern uses what

Every pattern is assembled from a shared set of AWS services. This matrix shows, for each core service, which of the seven patterns relies on it — so you can see the common spine (Bedrock everywhere) and where the heavier services (SageMaker, Trainium, Organizations) only appear at the deep end.

Read each row as "this service is used by these patterns." A dot in a cell means the service is a typical building block of that pattern; it is not an exhaustive dependency list, but it captures the components you would actually name on an architecture diagram. The pattern numbers map to the catalogue above: 1 chatbot · 2 managed RAG · 3 DIY RAG · 4 agentic · 5 batch · 6 fine-tuned/self-hosted · 7 enterprise platform.

aws services used across the seven genai reference architectures · representative as of 2026
AWS serviceRole1234567
Amazon Bedrock (models / Converse)Foundation-model serving
Bedrock Knowledge BasesManaged RAG
Bedrock AgentsTool-using agent loop
Bedrock GuardrailsInput/output safety
Bedrock batch inferenceOffline high-volume calls
Amazon SageMakerTrain / host your own model
Trainium / Inferentia (Neuron)Cheaper-than-GPU silicon
Amazon S3Source data + artifacts
Vector store (OpenSearch / pgvector / Pinecone / Redis)Embedding storage + ANN search
Amazon TextractParse PDFs / forms / tables
AWS LambdaGlue / tools / endpoints
AWS Step FunctionsWorkflow / multi-step orchestration
API GatewayHTTP front door / tool endpoints
AWS Organizations + Control TowerMulti-account landing zone
PrivateLink / VPC endpointsPrivate network access
CloudWatch / CloudTrailLogging / metrics / audit
● = typical building block · ◐ = used in some variants / when composed. Amazon Bedrock and CloudWatch appear in essentially every pattern (the spine). SageMaker, Trainium/Inferentia, and Organizations/Control Tower appear only at the deep end (patterns 6–7). The vector-store row is the dividing line between "answer from the model" (pattern 1) and "answer from your data" (patterns 2–3).
all seven, side by side

The seven GenAI reference architectures compared

One table to choose from. Read across each row for the pattern's fit, then down the columns to compare build effort, control, and cost. The deep build guides are linked from the catalogue sections above.

PatternWhat it doesCore AWS servicesBuild effortBest forAvoid when
1. Simple chatbotAnswers from model + promptBedrock Converse, Lambda/API GW, GuardrailsHoursAssistants, copilots, drafting, extractionAnswers must come from your docs
2. Managed RAGAnswers from your docs (cited)Bedrock Knowledge Bases, S3, OpenSearchHours–daysInternal knowledge, support Q&ANeed custom chunking / hybrid / row-level ACLs
3. DIY RAGRAG with full pipeline controlBedrock, vector store, Lambda/Glue/Step FnsDays–weeksCustom chunking, hybrid search, multi-tenancyNo concrete requirement managed RAG fails
4. Agentic workflowReasons + acts via toolsBedrock Agents, Lambda, Step Functions, KBDays–weeksTasks needing actions / multi-step / toolsA single retrieve-and-answer suffices
5. Batch processingModel over many docs, offlineBedrock batch, Textract, Step Functions, S3DaysBulk extract / classify / summariseResult needed interactively in real time
6. Fine-tuned / self-hostedAdapt or own the modelBedrock custom model OR SageMaker + NeuronWeeksSpecific style/skill; host own open modelGoal is knowledge (use RAG instead)
7. Enterprise platformShared rails for all patternsOrganizations, Bedrock gateway, Guardrails, PrivateLinkWeeks–quarterMany teams; central security + FinOpsYou are one team shipping one product
Build effort assumes a competent team and clean data; data preparation, not AWS wiring, is usually the long pole. Patterns compose — agentic RAG is 4 + 2; RAG over a fine-tuned model is 2/3 + 6; in a mature org everything runs on 7. Start at the simplest pattern that meets the requirement and add a layer only when forced.
know which pattern you need?
Have a vetted AWS partner build your reference architecture — and let AWS credits pay for it
Start in 3 minutes →
a recent match

Picking the right pattern, then building it — anonymized

inquiry · Series-A insurtech, document-heavy workflow, US
Series-A insurtech, 22 people, large volume of policy + claims PDFs, wanted "an AI assistant" but had not settled on an architecture

Situation: The team had pitched investors on "AI" and started hand-building a DIY RAG pipeline plus an early attempt at fine-tuning a model on their documents — two of the heaviest patterns at once, before validating either. Answers were unreliable, the fine-tune had baked in stale policy facts, and the bulk back-office extraction they actually needed was being run one document at a time through on-demand inference, running up the bill. The two ML-capable engineers were fully committed to the core product, and the projected Bedrock + hosting cost had the founder hesitating to continue.

What CloudRoute did: Routed within 24 hours to a US-East AWS partner with a GenAI/ML track record. The partner ran the catalogue decision tree and re-scoped the work onto the right patterns: <strong>managed RAG (pattern 2)</strong> on Bedrock Knowledge Bases for the cited policy assistant (S3 + hierarchical chunking + Titan v2 + OpenSearch Serverless + Cohere Rerank + Claude), <strong>batch document processing (pattern 5)</strong> with Bedrock batch inference + Textract + Step Functions for the high-volume claims extraction at ~half the token cost, and they dropped the fine-tune entirely — knowledge belonged in retrieval, not weights. The whole engagement was funded by AWS credits the partner filed for: Activate Portfolio plus a Bedrock POC allocation.

Outcome: A cited policy assistant and a batch extraction pipeline both in production in about six weeks — two correct patterns instead of two overbuilt ones. The abandoned fine-tune saved ongoing training and hosting cost; moving extraction to batch cut that line roughly in half. The build and the first months of inference ran on AWS credits — the customer paid $0. CloudRoute's commission was paid by the partner from AWS engagement funding.

engagement window: ~6 weeks · founder time: ~8 hours · patterns built: 2 (managed RAG + batch) · cost to customer: $0

faq

Common questions

What are the main generative-AI reference architectures on AWS?
Seven canonical patterns cover almost everything: (1) a simple chatbot (Bedrock Converse API answering from the model + prompt); (2) managed RAG using Amazon Bedrock Knowledge Bases to answer from your documents with citations; (3) DIY RAG, the same pipeline hand-built on Bedrock plus a vector store you control; (4) an agentic workflow using Bedrock Agents to reason and take actions via tools; (5) batch document processing with Bedrock batch inference + Textract for high-volume offline extraction/classification/summarisation; (6) a fine-tuned or self-hosted model (Bedrock custom models or SageMaker for full control); and (7) an enterprise multi-account platform that provides shared security, guardrails, networking, and cost control for all the others. Real systems are usually one of these or a composition of two or three.
Which AWS service is common to most GenAI architectures?
Amazon Bedrock is the spine of six of the seven patterns. It serves many foundation models — Claude, Amazon Nova, Llama, Mistral, Titan, Cohere, and more — through one API with enterprise privacy (your data is not used to train the base models and stays in your account/Region), and it provides the higher-level capabilities the patterns lean on: the Converse API, Agents, Knowledge Bases (managed RAG), Guardrails, fine-tuning, batch inference, provisioned throughput, and prompt caching. Only the self-hosted-model pattern can live without Bedrock, and it often still pairs with it.
When should I use Amazon Bedrock vs Amazon SageMaker for GenAI?
Default to Bedrock; move to SageMaker only when you need to own the model or the hardware. Bedrock is the managed foundation-model API — you call models without touching infrastructure, which covers patterns 1–5 and the managed (fine-tuning) half of pattern 6. SageMaker is the full ML platform for building, training, and self-hosting models — real-time/serverless/async endpoints, JumpStart open models, training jobs, and custom hardware (Trainium/Inferentia via the Neuron SDK). You reach for SageMaker in the self-hosting half of pattern 6: running a specific open-weights model in your own VPC, heavy fine-tuning with full control, or research workloads. They are complementary, not competing.
How do I choose between managed RAG and a DIY RAG pipeline?
Start with managed RAG — Amazon Bedrock Knowledge Bases handles ingestion, chunking, embedding, vector storage, retrieval, and re-ranking, and returns cited answers from a single RetrieveAndGenerate call, so you can ship in hours. Move to a DIY pipeline (Bedrock + a vector store you control, orchestrated with Lambda/Glue/Step Functions) only when a concrete requirement forces it: custom document-aware chunking, hybrid vector+keyword search with your own score fusion, strict multi-tenant or row-level access control, reuse of an existing vector store, or aggressive cost/latency tuning. Most teams overbuild here; managed covers the majority of use cases. Both paths use the same Bedrock embedding and generation models.
When is an agentic architecture the right choice over plain RAG?
Use an agent (pattern 4) when the assistant must take actions, not just answer — booking, updating records, querying live systems, running code, multi-step research, or coordinating specialised sub-agents. On AWS that is Amazon Bedrock Agents: you define action groups (tools, usually Lambda-backed), optionally attach a Knowledge Base for grounding, and Bedrock runs the reason-act loop; Step Functions coordinates complex multi-step or multi-agent flows. If a single retrieve-and-answer is enough, stay with RAG (pattern 2) — agents add latency, cost, and a much larger surface to test and secure. A very common middle ground is agentic RAG: an agent whose main tool is a Knowledge Base, giving grounded answers plus the ability to act.
Can I combine these patterns in one system?
Yes — most production systems do, because the patterns share the same Bedrock generation layer. Common compositions: agentic RAG (an agent whose primary tool is a Bedrock Knowledge Base); RAG over a fine-tuned model (retrieval supplies current, citable facts while light fine-tuning supplies a consistent style — knowledge in retrieval, behaviour in the model); and a managed RAG assistant behind a chatbot interface. In a mature organisation, every pattern runs on top of the enterprise platform (pattern 7), which supplies model access, guardrails, networking, and cost controls. The guiding rule is to start at the simplest pattern that meets the requirement and add a layer only when a concrete need appears — each step is an addition, not a rewrite.
What does it cost to run these architectures on AWS?
Each pattern has a characteristic cost shape. In every interactive pattern (1, 2, 4) generation tokens — input + output, priced per model — are the largest line, so model choice, fewer/tighter chunks, prompt caching, and tight max-token limits are the biggest levers. Patterns 2, 3, and 6 add a baseline you pay regardless of traffic — an always-on vector store or a hosted SageMaker endpoint — which is the cost teams most often forget. Pattern 5 should always use Bedrock batch inference for roughly half the on-demand price. Bedrock pricing modes map directly: On-Demand for interactive, Batch for offline volume, Provisioned Throughput for steady high-volume and shared platforms, plus prompt caching across all of them. Figures are representative as of 2026 — check the AWS pricing page for current rates.
How long does it take to build one of these on AWS, and how is it funded?
It varies by pattern: a simple chatbot or a managed-RAG prototype can be standing in hours to a day; a production agentic workflow or DIY RAG pipeline is typically a few weeks; a fine-tuned/self-hosted model or an enterprise multi-account platform runs several weeks to a quarter. The slowest part is almost always data preparation (clean parsing and chunking), not the AWS wiring. On funding: generative-AI inference and training bills scale fast, and CloudRoute routes you to AWS credits — Activate Portfolio up to $100K, a Bedrock/GenAI POC allocation of $10K–$50K, and the GenAI Accelerator up to $1M — plus vetted ML partners who design and build whichever pattern fits. The credits fund the build and the inference, so the customer pays $0.

Build any of these architectures on AWS — funded by AWS credits

Tell CloudRoute the problem; we route you to a vetted AWS GenAI/ML partner who picks the right reference architecture and ships it — chatbot, managed or DIY RAG, agentic workflow, batch processing, a fine-tuned/self-hosted model, or the enterprise platform. AWS credits fund the build and the inference. You pay $0.

matched within< 24h
credits to fund itup to $1M
cost to you$0
AWS GenAI Reference Architectures — the 7 canonical patterns (2026) · CloudRoute