Retrieval-augmented generation grounds a language model in your own documents so it answers from your data instead of hallucinating. This is the full build guide: the reference architecture end to end (ingest → chunk → embed → store → retrieve → re-rank → generate), the two paths — managed with Amazon Bedrock Knowledge Bases vs a DIY pipeline on Bedrock plus your own vector store — every vector-store option on AWS, how to choose embeddings and a re-ranker, how to evaluate faithfulness and relevance, and what production actually costs.
Retrieval-augmented generation is a pattern, not a product: before a language model answers a question, you retrieve the most relevant passages from your own corpus and inject them into the prompt as grounding context. The model then answers from those passages instead of from parametric memory alone.
A foundation model knows only what was in its training data, frozen at a cutoff date, and it has no idea what is inside your wiki, your support tickets, your contracts, or last night's product changes. Ask it about any of that and it will either refuse or — worse — confidently invent an answer. RAG fixes this by separating knowledge from reasoning. The model keeps doing what it is good at (language and reasoning); the facts come from a retrieval step over data you control and can update at any time.
The mechanics are straightforward. Offline, you split your documents into chunks, convert each chunk into a numeric vector with an embedding model, and store those vectors in a vector database. At query time, you embed the user's question with the same embedding model, find the chunks whose vectors are nearest to the question vector (semantic similarity, not keyword match), optionally re-rank them for precision, and paste the top results into the prompt alongside an instruction like "answer using only the context below, and cite which passage you used."
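The offline and query-time steps above can be sketched in a few lines. This is a toy illustration only: `embed` here is a bag-of-words stand-in, not a real embedding model, and the in-memory list stands in for a vector database. All function names are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector (NOT a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Offline step: embed every chunk once and keep text + vector together."""
    return [(c, embed(c)) for c in chunks]

def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Query step: embed the question the same way and return the k nearest chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question: str, passages: list[str]) -> str:
    """Paste the retrieved passages into the prompt as numbered grounding context."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return ("Answer using only the context below, and cite which passage you used.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

chunks = [
    "Refunds are issued within 14 days of purchase.",
    "The API rate limit is 100 requests per minute.",
    "Support is available Monday to Friday.",
]
index = build_index(chunks)
top = retrieve(index, "What is the API rate limit?", k=1)
prompt = build_prompt("What is the API rate limit?", top)
```

In a real system the `embed` call would go to an embedding model and the sorted scan would be an approximate-nearest-neighbour query, but the shape of the flow is the same.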
The payoff is three things at once. Freshness: update a document and the next answer reflects it — no retraining. Attribution: because the answer is built from retrieved passages, you can show citations and let users verify. Governance: you decide what is in the corpus and who can retrieve which parts, which matters enormously for enterprise data. The alternative to RAG — fine-tuning the model on your data — bakes knowledge into weights, is expensive to update, cannot easily enforce per-user access control, and does not give you citations. For knowledge that changes or must be access-controlled, RAG wins; fine-tuning is for teaching a model a style or a narrow skill, not for memorizing a knowledge base.
On AWS, every stage of this pattern maps to a managed service, which is why AWS is one of the most common places to build RAG. The next section walks the full reference architecture stage by stage.
RAG = retrieve the most relevant chunks of your data for a question, then have a foundation model answer using those chunks as grounding — so the answer is current, cited, and access-controlled instead of hallucinated.
Every RAG system — managed or DIY — runs the same seven logical stages. Understanding each one is what lets you debug a system that returns irrelevant answers, because almost every quality problem traces back to a specific stage.
It helps to split the seven stages into two phases. The indexing phase (ingest → chunk → embed → store) runs offline whenever your data changes. The query phase (retrieve → re-rank → generate) runs in real time on every user question. The table below maps each stage to the AWS service that typically implements it.
1. Ingest. Get raw documents into a staging location — almost always an Amazon S3 bucket, populated from your wiki, CRM, ticketing system, or a database export. PDFs, HTML, Word, and Markdown all need a parsing step to extract clean text; messy extraction (broken tables, headers mixed into body text) is the single most common silent cause of bad answers downstream.
2. Chunk. Split each document into passages small enough to embed precisely but large enough to carry meaning — typically 300–800 tokens with a 10–20% overlap so a sentence split across a boundary still appears whole in one chunk. Chunking is covered in depth in section V; it matters more than almost any other knob.
3. Embed. Convert each chunk into a vector with an embedding model — on AWS, Amazon Titan Text Embeddings v2 or Cohere Embed (both served through Amazon Bedrock). The model and its output dimensionality are fixed for the life of the index: change the embedding model and you must re-embed everything.
4. Store. Write each vector plus its source text and metadata (document ID, title, ACL tags, timestamps) into a vector store with an approximate-nearest-neighbour index. Options on AWS — OpenSearch Serverless, Aurora pgvector, Pinecone, Redis — are compared in section IV.
5. Retrieve. Embed the incoming question with the same embedding model, then query the vector store for the top-K nearest chunks (K is commonly 10–50 at this stage). Hybrid retrieval — combining vector similarity with a keyword/BM25 search and fusing the scores — almost always beats pure-vector retrieval on real corpora, because it catches exact terms, product names, and acronyms that embeddings sometimes blur.
6. Re-rank. Pass the top-K candidates through a cross-encoder re-ranker (e.g. Amazon Rerank or Cohere Rerank on Bedrock) that scores each chunk against the question directly and keeps only the best 3–8. This is the highest-leverage precision step in the whole pipeline — re-ranking routinely turns a mediocre retriever into a good one. It is covered in section V.
7. Generate. Build the final prompt — system instruction + the re-ranked chunks + the user question — and send it to a generation model on Bedrock (Claude, Amazon Nova, Llama, Mistral). Instruct the model to answer only from the supplied context and to cite the passages it used; return those citations to the user so the answer is verifiable.
| Stage | Phase | What it does | Typical AWS service |
|---|---|---|---|
| 1. Ingest | Indexing (offline) | Land + parse raw documents | Amazon S3 (+ Textract / parsing) |
| 2. Chunk | Indexing (offline) | Split documents into passages | Bedrock KB built-in, or Lambda/Glue (DIY) |
| 3. Embed | Indexing (offline) | Text → vectors | Titan Text Embeddings v2 / Cohere Embed (Bedrock) |
| 4. Store | Indexing (offline) | Index vectors + metadata for ANN search | OpenSearch Serverless / Aurora pgvector / Pinecone / Redis |
| 5. Retrieve | Query (real time) | Find top-K nearest chunks | Vector store query (+ hybrid BM25) |
| 6. Re-rank | Query (real time) | Score + keep the best 3–8 chunks | Amazon Rerank / Cohere Rerank (Bedrock) |
| 7. Generate | Query (real time) | Answer from context, with citations | Claude / Nova / Llama / Mistral (Bedrock) |
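Stage 5 mentions fusing vector and keyword results. One common way to do that is reciprocal rank fusion (RRF), which combines ranks rather than raw scores so the two result lists do not need comparable score scales. The sketch below is an assumption about how one might implement it, not the method any particular AWS service uses; the chunk IDs and the constant `k = 60` are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one list.

    Each list contributes 1 / (k + rank) to a chunk's score, so chunks that
    rank well in more than one list rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

vector_hits = ["c7", "c2", "c9"]    # hypothetical top results from vector search
keyword_hits = ["c2", "c4", "c7"]   # hypothetical top results from BM25 search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Here `c2` and `c7` appear in both lists, so they end up ahead of the chunks that only one retriever found.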
The first real decision is not which vector store or which model — it is whether to let Amazon Bedrock Knowledge Bases run the pipeline for you, or to assemble it yourself. This single choice determines how much you build, how much you control, and how fast you ship.
The honest framing: start managed, move to DIY only when a specific requirement forces it. Most teams overbuild RAG on day one, hand-rolling a pipeline they then have to maintain, when Bedrock Knowledge Bases would have shipped the same answers in an afternoon. Conversely, some teams force everything into the managed path and then fight it when they hit a hard requirement it does not cover. Knowing where the line is saves weeks.
Bedrock Knowledge Bases is AWS's fully-managed RAG service. You point a knowledge base at an S3 bucket (or a connected source like a web crawler, Confluence, Salesforce, or SharePoint), pick an embedding model and a vector store, and Bedrock handles ingestion, chunking, embedding, indexing, retrieval, and optional re-ranking. You then call Retrieve (get relevant chunks) or RetrieveAndGenerate (get a cited answer in one call). It manages incremental sync when documents change, returns source citations automatically, and integrates with Bedrock Guardrails for safety filtering.
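As a minimal sketch of the one-call path, the helper below wraps `RetrieveAndGenerate` as exposed by the boto3 `bedrock-agent-runtime` client. The helper name, the returned dictionary shape, and the placeholder IDs are assumptions for illustration; the client is passed in so the function can be exercised without AWS credentials.

```python
def ask_knowledge_base(client, kb_id: str, model_arn: str, question: str) -> dict:
    """Ask a Bedrock knowledge base a question and return the answer plus source locations.

    `client` is expected to behave like boto3.client("bedrock-agent-runtime").
    """
    response = client.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,   # placeholder: your knowledge base ID
                "modelArn": model_arn,      # placeholder: ARN of the generation model
            },
        },
    )
    # Flatten the citation structure into a simple list of source locations.
    sources = []
    for citation in response.get("citations", []):
        for ref in citation.get("retrievedReferences", []):
            sources.append(ref.get("location", {}))
    return {"answer": response["output"]["text"], "sources": sources}
```

With credentials configured, a call would look roughly like `ask_knowledge_base(boto3.client("bedrock-agent-runtime"), "YOUR_KB_ID", "YOUR_MODEL_ARN", "What is our refund policy?")`.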
Choose managed when: you want to ship in hours not weeks; your documents live in S3 or a supported connector; standard fixed/semantic/hierarchical chunking is good enough; and you do not need exotic retrieval logic. This covers the large majority of internal-knowledge and support use cases.
In the DIY path you still call Bedrock for embeddings and generation, but you own every stage: your own parser, your own chunker (Lambda, AWS Glue, or a Step Functions workflow), direct writes to your chosen vector store, your own retrieval and hybrid-search logic, your own re-ranking call, and your own prompt assembly. Orchestration frameworks like LangChain or LlamaIndex are common here but optional.
Choose DIY when: you need custom or document-aware chunking; you require hybrid (vector + keyword) search with your own score fusion; you have strict multi-tenant isolation or row-level access control the managed path can't express; you want to reuse a vector store you already operate; or you are squeezing cost/latency hard enough that the managed convenience premium matters. DIY costs more engineering time and ongoing maintenance — that is the trade.
Prototype on Bedrock Knowledge Bases to prove the use case and get a baseline answer quality fast. Graduate specific stages to DIY only when a concrete requirement — custom chunking, hybrid search, multi-tenant ACLs, or aggressive cost control — actually forces it. Many production systems are a hybrid: managed KB for the bulk corpus, a DIY path for one demanding source.
The vector store holds your embeddings and answers nearest-neighbour queries. All four common options work with both Bedrock Knowledge Bases and DIY pipelines; they differ on operational model, cost shape, scale, and whether you already run the underlying engine.
Amazon OpenSearch Serverless is the most common default for Bedrock Knowledge Bases. It is fully managed, scales automatically, supports both vector (k-NN) and keyword search in one engine (so hybrid retrieval is native), and integrates tightly with the rest of AWS. The trade is that it bills by OpenSearch Compute Units (OCUs) with a minimum baseline, so it can feel expensive for very small corpora even when idle.
Aurora PostgreSQL with the pgvector extension is the pragmatic choice when you already run Postgres. Your vectors live in the same database as your relational data, so you can filter by SQL predicates and join to business tables in one query, and there is no new system to operate. It scales well into the millions of vectors with HNSW indexing; beyond that, or for very high query concurrency, a purpose-built vector engine pulls ahead.
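To make the "filter and join in one query" point concrete, here is a sketch that builds a parameterized pgvector query in Python. The table and column names (`chunks`, `documents`, `tenant_id`, `embedding`) are hypothetical, and the `%s` placeholders assume a psycopg-style driver; `<=>` is pgvector's cosine-distance operator.

```python
def to_vector_literal(embedding: list[float]) -> str:
    """Format a Python list as a pgvector text literal, e.g. '[0.100000,0.250000]'."""
    return "[" + ",".join(f"{x:.6f}" for x in embedding) + "]"

def nearest_chunks_query(tenant_id: str, embedding: list[float], k: int = 10) -> tuple[str, tuple]:
    """Return (sql, params) for a tenant-filtered nearest-neighbour search.

    The vector search, the relational filter, and the join to the documents
    table all happen in a single SQL statement.
    """
    vec = to_vector_literal(embedding)
    sql = (
        "SELECT c.id, c.body, d.title, c.embedding <=> %s::vector AS distance "
        "FROM chunks c JOIN documents d ON d.id = c.document_id "
        "WHERE d.tenant_id = %s "
        "ORDER BY c.embedding <=> %s::vector "
        "LIMIT %s"
    )
    return sql, (vec, tenant_id, vec, k)

sql, params = nearest_chunks_query("acme", [0.1, 0.25], k=5)
```

The returned pair would be passed to a cursor's `execute(sql, params)`; an HNSW index created with `vector_cosine_ops` on the `embedding` column is what keeps the `ORDER BY` fast.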
Pinecone is a managed, vector-native database available through the AWS Marketplace (and selectable in Bedrock Knowledge Bases). It is built only for vector search, so it offers strong performance, serverless scaling, and rich metadata filtering with minimal tuning — attractive when vector search is your core workload and you want a specialist rather than a general engine. It is a third-party service billed separately (Marketplace billing can route through your AWS invoice).
Amazon MemoryDB / Redis (in-memory, Redis-compatible vector search) is the option to reach for when retrieval latency is critical: because the index lives in memory, single-digit-millisecond queries are realistic, which suits real-time chat and agent loops. The trade is cost at large scale (RAM is pricier than disk), so it is best for hot, latency-sensitive indexes rather than enormous archival corpora.
A few more AWS-native choices are worth knowing: Amazon DocumentDB and Amazon Neptune Analytics both support vector search, and the newer Amazon S3 Vectors capability targets very large, cost-optimized vector sets that are queried infrequently. For most builds, though, the decision is among the four in the table below.
| Vector store | Managed? | Hybrid search | Best fit | Cost shape | Watch-out |
|---|---|---|---|---|---|
| OpenSearch Serverless | Fully managed | Native (vector + BM25) | Default for Bedrock KB; teams wanting one engine for both | Per OCU + storage; baseline minimum | Baseline cost stings tiny corpora |
| Aurora PostgreSQL (pgvector) | Managed DB | Vector + SQL filters | Teams already on Postgres; SQL-joined metadata | Per Aurora instance / ACU + storage | Specialist engines win at extreme scale/QPS |
| Pinecone | Fully managed (3rd-party) | Vector + metadata filters | Vector-native workloads wanting zero tuning | Per usage / pod (Marketplace) | Separate vendor; data leaves AWS-native services |
| Amazon MemoryDB / Redis | Managed | Vector + filters | Ultra-low-latency real-time retrieval | Per node (RAM-priced) | Expensive for very large indexes |
Retrieval quality — not the generation model — is what makes or breaks a RAG system. If the right chunk never makes it into the prompt, no model can answer well. Three knobs dominate: which embedding model you use, how you chunk, and whether you re-rank.
On Bedrock the two mainstream embedding families are Amazon Titan Text Embeddings v2 and Cohere Embed. Titan v2 is AWS-native, inexpensive, and supports configurable output dimensions (e.g. 256 / 512 / 1024) — smaller dimensions cut storage and speed up search at a modest recall cost, which is a real lever at scale. Cohere Embed (English and multilingual variants) is strong on retrieval benchmarks and a frequent pick for multilingual corpora and search-heavy products.
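The storage side of the dimensionality lever is simple arithmetic. The sketch below assumes vectors are stored as 4-byte floats and counts raw vector bytes only; index overhead (for example an HNSW graph) and metadata come on top and vary by engine.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone, assuming float32 values by default."""
    return num_vectors * dimensions * bytes_per_value

# Compare the three example output sizes for a hypothetical 1M-chunk corpus.
for dims in (256, 512, 1024):
    gib = raw_vector_bytes(1_000_000, dims) / 2**30
    print(f"{dims:>4} dims: {gib:.2f} GiB for 1M float32 vectors")
```

Halving the dimension halves the raw vector footprint; whether the recall loss is acceptable is something only a benchmark on your own questions can answer.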
Two rules matter more than the specific winner. First, you must embed queries and documents with the same model and version — mixing them produces incomparable vectors and silently destroys recall. Second, changing the embedding model means re-embedding the entire corpus, so treat this choice as semi-permanent and benchmark on your own data before committing. Pick based on language coverage, dimensionality/cost, and measured recall on a sample of your real questions — not on a leaderboard.
Chunking is the highest-variance decision in RAG. Chunks that are too large dilute the embedding (one vector trying to represent five topics) and waste prompt tokens; chunks that are too small lose the context needed to answer. A sensible default is 300–800 tokens with 10–20% overlap, then tune on your corpus. Bedrock Knowledge Bases offers fixed-size, semantic (split on meaning shifts), and hierarchical (parent/child) chunking out of the box.
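A minimal sketch of fixed-size chunking with overlap follows. It works on a pre-tokenized list and, in the example, uses whitespace-separated words as a rough proxy for model tokens, which is an assumption; a real pipeline would count tokens with the embedding model's tokenizer.

```python
def chunk_tokens(tokens: list[str], size: int = 500, overlap: int = 75) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

words = [f"w{i}" for i in range(12)]
pieces = chunk_tokens(words, size=5, overlap=1)
```

With `size=5` and `overlap=1`, the last token of each chunk is repeated as the first token of the next, so text that straddles a boundary appears whole in at least one chunk when the overlap is large enough.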
Structure-aware chunking beats naive splitting whenever your documents have structure: split on Markdown headings, respect table and code-block boundaries, and keep a section title in each chunk's metadata so retrieval has a breadcrumb. For long technical or legal documents, hierarchical chunking — retrieve on small precise child chunks but feed the larger parent chunk to the model — gives both precision and enough surrounding context, and is one of the most reliable quality upgrades.
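As one illustration of structure-aware splitting, the sketch below cuts a Markdown document at headings, leaves fenced code blocks intact, and records each section's title as metadata. It is deliberately simple (tables and nested heading paths are not handled), and the function and field names are hypothetical.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker

def split_markdown_sections(markdown: str) -> list[dict]:
    """Split on Markdown headings, ignoring '#' lines inside fenced code blocks."""
    sections: list[dict] = []
    title: str = ""
    body: list[str] = []
    in_code = False

    def flush() -> None:
        # Only emit a section if it has non-blank content.
        if any(line.strip() for line in body):
            sections.append({"title": title, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        if line.strip().startswith(FENCE):
            in_code = not in_code
        is_heading = (not in_code) and re.match(r"^#{1,6}\s+\S", line) is not None
        if is_heading:
            flush()
            title, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    flush()
    return sections

doc = ("# Setup\nInstall it.\n" + FENCE + "\n# comment, not a heading\n" + FENCE
       + "\n## Limits\n100 rpm.")
sections = split_markdown_sections(doc)
```

Each returned section could then be passed through a size-based splitter if it is still too long, carrying the `title` along as the breadcrumb.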
Embedding-based retrieval is fast but approximate; it returns chunks that are roughly relevant. A re-ranker is a cross-encoder that reads the question and each candidate chunk together and scores true relevance, which is far more accurate than vector distance. The standard pattern: retrieve a wide net (top-30 to top-50) cheaply with vectors, then re-rank and keep only the best 3–8 to put in the prompt.
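The retrieve-wide-then-re-rank pattern can be sketched with a pluggable scoring function. In the example, `toy_score` is a word-overlap stand-in and not a cross-encoder; in practice the `score` argument would wrap a call to a real re-ranking model.

```python
from typing import Callable

def rerank(question: str, candidates: list[str],
           score: Callable[[str, str], float], keep: int = 5) -> list[str]:
    """Score every (question, chunk) pair and keep only the best `keep` chunks."""
    ranked = sorted(candidates, key=lambda chunk: score(question, chunk), reverse=True)
    return ranked[:keep]

def toy_score(question: str, chunk: str) -> float:
    """Stand-in for a cross-encoder: fraction of question words found in the chunk."""
    q = set(question.lower().split())
    return len(q & set(chunk.lower().split())) / len(q) if q else 0.0

candidates = [
    "billing runs monthly",
    "password policy requires twelve characters",
    "reset your password from the login page",
]
best = rerank("how do i reset my password", candidates, toy_score, keep=2)
```

The key design point is that the re-ranker only ever sees the small candidate set, so its per-pair cost stays bounded regardless of corpus size.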
On AWS, Amazon Rerank and Cohere Rerank are available through Bedrock, and Bedrock Knowledge Bases can apply re-ranking automatically. The win is twofold: better answers (the model sees only high-precision context) and lower generation cost (fewer, tighter chunks means fewer input tokens). If you do one thing to improve a struggling RAG system, add re-ranking before you touch the generation model.
When answers are bad, debug retrieval before generation. Check in order: (1) is the right chunk even in the index (parsing/chunking)? (2) does it come back in the top-50 (embedding model / hybrid search)? (3) does it survive to the top-5 (re-ranking)? (4) only then, is the prompt/model the problem? As a rule of thumb, roughly 80% of RAG quality issues are retrieval problems, not generation problems.
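The four checks above translate directly into a small triage helper. This is a sketch under the assumption that you can look up, for a failing question, the ID of the chunk that should have answered it; all names are hypothetical.

```python
def diagnose(gold_id: str, indexed_ids: set[str],
             retrieved_ids: list[str], reranked_ids: list[str]) -> str:
    """Return the first pipeline stage at which the correct chunk went missing."""
    if gold_id not in indexed_ids:
        return "indexing: fix parsing or chunking"
    if gold_id not in retrieved_ids[:50]:
        return "retrieval: check the embedding model or add hybrid search"
    if gold_id not in reranked_ids[:5]:
        return "re-ranking: tune or replace the re-ranker"
    return "generation: inspect the prompt or model"

verdict = diagnose("c42", {"c1", "c42"}, ["c1", "c42"], ["c1"])
```

Running this over every failing question in an evaluation set gives a per-stage failure count, which shows where engineering effort is best spent.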
"It looks good in the demo" is not evaluation. A RAG system has two failure surfaces — retrieval and generation — and you need metrics that isolate each so you know which stage to fix. The industry-standard RAG metrics break down cleanly along those lines.
Build a fixed evaluation set first: 50–200 real questions paired with the correct answer and (ideally) the source passage that contains it. Run it on every change so you can tell whether a new chunk size or a new re-ranker actually helped instead of guessing. The four metrics below are the standard "RAG triad plus relevance," and an LLM-as-a-judge model on Bedrock can score most of them automatically.
Amazon Bedrock includes RAG evaluation in its model-evaluation suite: you supply a dataset of prompts (and references) and Bedrock runs an LLM-as-a-judge to score retrieval and response quality, including faithfulness and relevance, with results you can compare across configurations. For DIY pipelines, open-source frameworks such as Ragas implement the same metrics and run anywhere. Either way, the discipline is identical: a fixed golden set, automated scoring, and a number that moves when you change a knob.
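Alongside LLM-judged scores, two simple retrieval-side numbers can be computed from a golden set without any model call: hit rate at k and mean reciprocal rank. These are an addition chosen for illustration rather than the metrics named above, and they assume each golden question is labelled with the ID of the one chunk that contains its answer.

```python
def hit_rate_at_k(results: list[list[str]], gold: list[str], k: int) -> float:
    """Fraction of questions whose gold chunk appears in the top-k retrieved IDs."""
    hits = sum(1 for ranked, g in zip(results, gold) if g in ranked[:k])
    return hits / len(gold)

def mean_reciprocal_rank(results: list[list[str]], gold: list[str]) -> float:
    """Average of 1/rank of the gold chunk (0 when it was not retrieved at all)."""
    total = 0.0
    for ranked, g in zip(results, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)

# Three hypothetical questions: retrieved IDs per question, and the gold ID for each.
results = [["a", "b"], ["c", "d"], ["e", "f"]]
gold = ["a", "d", "x"]
hr1 = hit_rate_at_k(results, gold, k=1)
hr2 = hit_rate_at_k(results, gold, k=2)
mrr = mean_reciprocal_rank(results, gold)
```

Tracking these per change makes it obvious whether a new chunk size or embedding model moved retrieval at all, before any generation-quality scoring is run.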
Two non-negotiables for production: log every query with its retrieved chunks and the final answer (so you can reproduce and audit any response), and add a small human-review loop on a sample of traffic, because automated judges miss domain-specific errors a subject-matter expert catches instantly.
A RAG demo and a production RAG system differ on four axes that rarely show up in a prototype: keeping the index fresh, enforcing who can see what, hitting a latency budget, and controlling a bill that scales with usage. Each has a concrete AWS answer.
Your index is only as current as your last sync. Bedrock Knowledge Bases supports incremental ingestion so changed documents re-embed automatically; in a DIY pipeline you trigger re-embedding on an S3 event (new/changed object) via Lambda or a scheduled Glue job. Decide a freshness SLA per source — a status page may need minutes, a policy archive can sync nightly — and store a timestamp in each chunk's metadata so you can filter out or down-weight stale content.
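For the DIY trigger, a Lambda handler receiving S3 event notifications might look like the sketch below. The `reindex` callback is a hypothetical hook for your own parse, chunk, and embed step; object keys in S3 events arrive URL-encoded, which is why they are decoded before use.

```python
from urllib.parse import unquote_plus

def changed_objects(event: dict) -> list[tuple[str, str]]:
    """Extract (bucket, key) pairs from an S3 event notification payload."""
    out = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        key = s3.get("object", {}).get("key")
        if bucket and key:
            out.append((bucket, unquote_plus(key)))  # keys are URL-encoded in events
    return out

def handler(event, context, reindex=print):
    """Lambda entry point: hand every new or changed object to the re-indexing hook."""
    objects = changed_objects(event)
    for bucket, key in objects:
        reindex(f"s3://{bucket}/{key}")  # hypothetical hook: parse, chunk, embed, upsert
    return {"reindexed": len(objects)}
```

Because only the changed objects are passed on, embedding cost scales with the size of the update rather than the size of the corpus.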
If different users may see different documents, access control must live in retrieval, not in a post-filter on the answer — by the time the model has written the answer, the data has already leaked. The pattern: tag every chunk with ACL metadata (user, group, tenant, classification) at index time, and apply a metadata filter on every query so a user only ever retrieves chunks they are entitled to. Both Bedrock Knowledge Bases (metadata filtering) and DIY stores (OpenSearch filters, pgvector SQL predicates, Pinecone metadata filters) support this. For multi-tenant SaaS, isolate tenants with a per-tenant filter at minimum, or separate indexes for hard isolation.
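On the managed path, the per-query filter is passed in the retrieval configuration of the `Retrieve` call. The sketch below assumes each chunk was indexed with a metadata attribute named `tenant_id`, which is a hypothetical key; the helper name is also illustrative, and the client is injected so the function can be tested without AWS access.

```python
def retrieve_for_tenant(client, kb_id: str, question: str,
                        tenant_id: str, top_k: int = 10) -> list[str]:
    """Retrieve chunks from a Bedrock knowledge base, restricted to one tenant.

    `client` is expected to behave like boto3.client("bedrock-agent-runtime").
    """
    response = client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={
            "vectorSearchConfiguration": {
                "numberOfResults": top_k,
                # The filter is applied at retrieval time, so chunks belonging
                # to other tenants never reach the prompt.
                "filter": {"equals": {"key": "tenant_id", "value": tenant_id}},
            }
        },
    )
    return [r["content"]["text"] for r in response.get("retrievalResults", [])]
```

The important property is that `tenant_id` comes from the authenticated session on the server side, never from anything the end user can type.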
End-to-end latency is retrieval + re-rank + generation; generation usually dominates, so streaming the response token-by-token is the cheapest perceived-latency win. For cost, the levers are: a smaller/cheaper generation model for easy questions (route hard ones to a frontier model); fewer, re-ranked chunks (fewer input tokens); Bedrock prompt caching to avoid re-paying for a static system prompt or repeated context; and a sensible embedding dimensionality. Vector-store cost is a baseline you pay regardless of traffic — right-size it to corpus size, not peak imagination.
Before launch: golden evaluation set wired into CI · citations returned with every answer · ACL metadata filtering on every retrieval · freshness SLA per source with auto-sync · Guardrails on inputs and outputs · full query/context/answer logging · streaming responses · a cost ceiling with billing alarms. Miss any one and the gap shows up in production, not the demo.
Here is the fastest credible path from zero to a cited, production-leaning RAG system on AWS using Bedrock Knowledge Bases. The DIY path follows the same logical order with each stage hand-built.
A RAG bill has five line items. None is exotic, but together they surprise teams that budgeted only for the generation model. Here is the full stack and the lever on each.
The figures below are representative as of 2026 to show the shape of the bill, not a quote — always check the AWS pricing page (and the third-party vendor for Pinecone) for current rates. The dominant cost in almost every production RAG system is generation tokens, followed by the always-on vector-store baseline.
| Cost line | When you pay | Driver | Main lever to control it |
|---|---|---|---|
| Embeddings (indexing) | One-time per corpus + on updates | Total tokens embedded | Chunk size; smaller embedding dimensions; only re-embed changed docs |
| Vector store | Continuous (baseline) | Corpus size + index type + engine | Right-size the engine; pick pgvector if Postgres already runs; tune dimensions |
| Query embeddings | Per query | Question volume | Negligible per call; cache embeddings for repeated queries |
| Re-ranking | Per query | Candidates re-ranked × queries | Re-rank top-30/50, not top-500; skip on trivial queries |
| Generation | Per query (usually the largest) | Input + output tokens × model price | Cheaper model for easy queries; fewer chunks; prompt caching; tight max-tokens |
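The table's drivers can be combined into a rough monthly estimate. The sketch below models only the recurring lines (generation, re-ranking, vector-store baseline) and treats query embeddings as negligible, as the table suggests. Every rate in `Prices` is a placeholder you must fill in from current pricing pages; none of the numbers in the example are real AWS prices.

```python
from dataclasses import dataclass

@dataclass
class Prices:
    """Placeholder rates in USD. These are NOT real AWS prices."""
    input_per_1k_tokens: float
    output_per_1k_tokens: float
    rerank_per_1k_queries: float
    vector_store_per_month: float

def monthly_cost(p: Prices, queries: int, chunks_per_query: int, tokens_per_chunk: int,
                 prompt_overhead_tokens: int, output_tokens: int) -> dict:
    """Estimate recurring monthly cost from query volume and prompt shape."""
    input_tokens = queries * (chunks_per_query * tokens_per_chunk + prompt_overhead_tokens)
    generation = (input_tokens / 1000 * p.input_per_1k_tokens
                  + queries * output_tokens / 1000 * p.output_per_1k_tokens)
    reranking = queries / 1000 * p.rerank_per_1k_queries
    total = generation + reranking + p.vector_store_per_month
    return {"generation": generation, "reranking": reranking,
            "vector_store": p.vector_store_per_month, "total": total}

# Made-up round numbers, chosen only to make the arithmetic easy to follow.
example = monthly_cost(Prices(1.0, 2.0, 1.0, 100.0), queries=1000,
                       chunks_per_query=5, tokens_per_chunk=100,
                       prompt_overhead_tokens=500, output_tokens=200)
```

Varying `chunks_per_query` in a model like this shows directly how much re-ranking down to fewer chunks saves on the generation line.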
This is the comparison that decides your architecture. Read it as "default to managed; move a stage to DIY only when a row in the right column is a hard requirement for you."
| Dimension | Bedrock Knowledge Bases (managed) | DIY (Bedrock + your stack) |
|---|---|---|
| Time to first answer | Hours — point at S3, sync, call an API | Days to weeks — build every stage |
| Pipeline you maintain | Almost none — AWS runs ingest→retrieve→re-rank | All of it — parser, chunker, retriever, re-ranker |
| Chunking control | Fixed / semantic / hierarchical presets | Anything — fully document-aware, custom |
| Retrieval logic | Managed vector + optional re-rank + metadata filters | Custom hybrid search, score fusion, your own logic |
| Access control | Metadata filtering | Anything — row-level, per-tenant, external policy engine |
| Vector store choice | OpenSearch Serverless / Aurora pgvector / Pinecone / Redis | Any store, including ones KB doesn't support |
| Cost control | Less granular; convenience premium | Maximum — tune every stage |
| Best for | Most internal-knowledge + support use cases | Custom chunking, hybrid search, strict multi-tenancy, cost-squeeze |
Situation: The team wanted an AI support assistant that answered strictly from their own docs, with citations, and that never leaked one customer's data to another. A first in-house attempt hallucinated, returned irrelevant passages, and had no access-control story. The two engineers who could build it were fully committed to the core product, and the projected Bedrock + OpenSearch bill made the founder hesitate to even start.
What CloudRoute did: Routed within 24 hours to an EU-region AWS partner with a GenAI/ML track record. The partner scoped a Bedrock Knowledge Bases build in eu-central-1: S3 ingestion, hierarchical chunking, Titan v2 embeddings, OpenSearch Serverless as the vector store, Cohere Rerank for precision, Claude for generation with strict grounded-answer prompting and citations, per-tenant metadata filtering for isolation, and a 120-question golden set scored with Bedrock RAG evaluation. The whole engagement was funded by AWS credits the partner filed for — Activate Portfolio plus a Bedrock POC allocation.
Outcome: Cited, grounded assistant in production in under 5 weeks. Faithfulness and context-precision scores cleared the team's bar on the golden set; per-tenant isolation enforced at retrieval. The build and the first months of inference ran on AWS credits — the customer paid $0. CloudRoute's commission was paid by the partner from AWS engagement funding.
engagement window: ~5 weeks · founder time: ~6 hours · stack: Bedrock KB + OpenSearch Serverless + Titan v2 + Cohere Rerank + Claude · cost to customer: $0
CloudRoute routes you to a vetted AWS GenAI/ML partner who designs and ships the pipeline — Bedrock Knowledge Bases or a custom DIY stack, the right vector store, embeddings, re-ranking, access control, and evaluation. AWS credits fund the build and the inference. You pay $0.