rag on aws · the 2026 build guide

How to build a RAG system on AWS (2026).

Retrieval-augmented generation grounds a language model in your own documents so it answers from your data instead of hallucinating. This is the full build guide: the reference architecture end to end (ingest → chunk → embed → store → retrieve → re-rank → generate), the two paths — managed with Amazon Bedrock Knowledge Bases vs a DIY pipeline on Bedrock plus your own vector store — every vector-store option on AWS, how to choose embeddings and a re-ranker, how to evaluate faithfulness and relevance, and what production actually costs.

pipeline stages: 7 · build paths: 2 · managed RAG service: Bedrock KB · credits to fund it: up to $100K
TL;DR
  • RAG (retrieval-augmented generation) grounds a model in your own data: you embed your documents into a vector store, retrieve the most relevant chunks for each question, and pass them to the model as context so it answers from sources instead of guessing. On AWS the canonical pipeline is ingest → chunk → embed → store → retrieve → re-rank → generate.
  • There are two ways to build it. Managed: Amazon Bedrock Knowledge Bases handles chunking, embedding, storage, retrieval, and re-ranking for you — point it at an S3 bucket and you get a RetrieveAndGenerate API in hours. DIY: you orchestrate the same stages yourself on Bedrock plus a vector store you control (OpenSearch Serverless, Aurora PostgreSQL with pgvector, Pinecone, or Redis) when you need custom chunking, hybrid search, multi-tenancy, or tighter cost control.
  • The hard parts are not the wiring — they are chunking strategy, embedding-model choice (Titan v2 vs Cohere Embed), re-ranking, freshness, access control, and evaluation (faithfulness + answer relevance + context precision). GenAI inference and vector storage bills add up fast; CloudRoute routes you to AWS credits (Activate Portfolio up to $100K, Bedrock/GenAI POC $10K–$50K, GenAI Accelerator up to $1M) and vetted ML partners who build the pipeline — you pay $0.
the core idea

I. What RAG is, and the specific problem it solves

Retrieval-augmented generation is a pattern, not a product: before a language model answers a question, you retrieve the most relevant passages from your own corpus and inject them into the prompt as grounding context. The model then answers from those passages instead of from parametric memory alone.

A foundation model knows only what was in its training data, frozen at a cutoff date, and it has no idea what is inside your wiki, your support tickets, your contracts, or last night's product changes. Ask it about any of that and it will either refuse or — worse — confidently invent an answer. RAG fixes this by separating knowledge from reasoning. The model keeps doing what it is good at (language and reasoning); the facts come from a retrieval step over data you control and can update at any time.

The mechanics are straightforward. Offline, you split your documents into chunks, convert each chunk into a numeric vector with an embedding model, and store those vectors in a vector database. At query time, you embed the user's question with the same embedding model, find the chunks whose vectors are nearest to the question vector (semantic similarity, not keyword match), optionally re-rank them for precision, and paste the top results into the prompt alongside an instruction like "answer using only the context below, and cite which passage you used."
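The query-time lookup described above can be sketched in a few lines. This is a minimal illustration, assuming tiny hand-written three-dimensional vectors in place of real embeddings; `cosine` and `top_k` are hypothetical helper names, and a real system would use an embedding model and an approximate-nearest-neighbour index rather than a linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=2):
    """Return the k chunk IDs whose vectors are most similar to the query."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in index.items()]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]

# Toy "embeddings" (assumption: real vectors come from an embedding model).
index = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.0],
    "api-rate-limits": [0.0, 0.1, 0.9],
}
question_vec = [0.8, 0.2, 0.0]  # pretend this is the embedded question
print(top_k(question_vec, index, k=2))
```

The question vector points in nearly the same direction as the refund chunk, so that chunk ranks first even though no keywords were compared.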

The payoff is three things at once. Freshness: update a document and the next answer reflects it — no retraining. Attribution: because the answer is built from retrieved passages, you can show citations and let users verify. Governance: you decide what is in the corpus and who can retrieve which parts, which matters enormously for enterprise data. The alternative to RAG — fine-tuning the model on your data — bakes knowledge into weights, is expensive to update, cannot easily enforce per-user access control, and does not give you citations. For knowledge that changes or must be access-controlled, RAG wins; fine-tuning is for teaching a model a style or a narrow skill, not for memorizing a knowledge base.

On AWS, every stage of this pattern maps to a managed service, which is why AWS is one of the most common places to build RAG. The next section walks the full reference architecture stage by stage.

the one-sentence definition

RAG = retrieve the most relevant chunks of your data for a question, then have a foundation model answer using those chunks as grounding — so the answer is current, cited, and access-controlled instead of hallucinated.

end to end

II. The reference RAG architecture on AWS, stage by stage

Every RAG system — managed or DIY — runs the same seven logical stages. Understanding each one is what lets you debug a system that returns irrelevant answers, because almost every quality problem traces back to a specific stage.

It helps to split the seven stages into two phases. The indexing phase (ingest → chunk → embed → store) runs offline whenever your data changes. The query phase (retrieve → re-rank → generate) runs in real time on every user question. The table below maps each stage to the AWS service that typically implements it.

Indexing phase — ingest, chunk, embed, store

1. Ingest. Get raw documents into a staging location — almost always an Amazon S3 bucket, populated from your wiki, CRM, ticketing system, or a database export. PDFs, HTML, Word, and Markdown all need a parsing step to extract clean text; messy extraction (broken tables, headers mixed into body text) is the single most common silent cause of bad answers downstream.

2. Chunk. Split each document into passages small enough to embed precisely but large enough to carry meaning — typically 300–800 tokens with a 10–20% overlap so a sentence split across a boundary still appears whole in one chunk. Chunking is covered in depth in section V; it matters more than almost any other knob.

3. Embed. Convert each chunk into a vector with an embedding model — on AWS, Amazon Titan Text Embeddings v2 or Cohere Embed (both served through Amazon Bedrock). The model and its output dimensionality are fixed for the life of the index: change the embedding model and you must re-embed everything.

4. Store. Write each vector plus its source text and metadata (document ID, title, ACL tags, timestamps) into a vector store with an approximate-nearest-neighbour index. Options on AWS — OpenSearch Serverless, Aurora pgvector, Pinecone, Redis — are compared in section IV.

Query phase — retrieve, re-rank, generate

5. Retrieve. Embed the incoming question with the same embedding model, then query the vector store for the top-K nearest chunks (K is commonly 10–50 at this stage). Hybrid retrieval — combining vector similarity with a keyword/BM25 search and fusing the scores — almost always beats pure-vector retrieval on real corpora, because it catches exact terms, product names, and acronyms that embeddings sometimes blur.

6. Re-rank. Pass the top-K candidates through a cross-encoder re-ranker (e.g. Amazon Rerank or Cohere Rerank on Bedrock) that scores each chunk against the question directly and keeps only the best 3–8. This is the highest-leverage precision step in the whole pipeline — re-ranking routinely turns a mediocre retriever into a good one. It is covered in section V.

7. Generate. Build the final prompt — system instruction + the re-ranked chunks + the user question — and send it to a generation model on Bedrock (Claude, Amazon Nova, Llama, Mistral). Instruct the model to answer only from the supplied context and to cite the passages it used; return those citations to the user so the answer is verifiable.
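The prompt assembly in stage 7 might look like the sketch below. It is one possible layout, not a prescribed format: the instruction wording, the `build_prompt` helper, and the `source`/`text` keys are all assumptions made for illustration.

```python
SYSTEM_INSTRUCTION = (
    "Answer using only the numbered context passages. "
    "Cite the passages you used like [1]. "
    "If the context does not contain the answer, say you do not know."
)

def build_prompt(question, chunks):
    """Return (system, user) strings for a grounded, citation-friendly prompt."""
    context = "\n\n".join(
        f"[{i}] (source: {c['source']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return SYSTEM_INSTRUCTION, user

# Hypothetical re-ranked chunks, best first.
chunks = [
    {"source": "refunds.md", "text": "Refunds are issued within 5 business days."},
    {"source": "billing.md", "text": "Invoices are sent on the first of the month."},
]
system, user = build_prompt("How long do refunds take?", chunks)
print(user)
```

Numbering the passages gives the model something concrete to cite, and the two returned strings map naturally onto the system and user messages of a chat-style generation API.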

the seven RAG stages mapped to AWS services · representative as of 2026
| Stage | Phase | What it does | Typical AWS service |
|---|---|---|---|
| 1. Ingest | Indexing (offline) | Land + parse raw documents | Amazon S3 (+ Textract / parsing) |
| 2. Chunk | Indexing (offline) | Split documents into passages | Bedrock KB built-in, or Lambda/Glue (DIY) |
| 3. Embed | Indexing (offline) | Text → vectors | Titan Text Embeddings v2 / Cohere Embed (Bedrock) |
| 4. Store | Indexing (offline) | Index vectors + metadata for ANN search | OpenSearch Serverless / Aurora pgvector / Pinecone / Redis |
| 5. Retrieve | Query (real time) | Find top-K nearest chunks | Vector store query (+ hybrid BM25) |
| 6. Re-rank | Query (real time) | Score + keep the best 3–8 chunks | Amazon Rerank / Cohere Rerank (Bedrock) |
| 7. Generate | Query (real time) | Answer from context, with citations | Claude / Nova / Llama / Mistral (Bedrock) |
Bedrock Knowledge Bases collapses stages 2–6 behind an ingestion sync and two query-time API calls: Retrieve returns the relevant chunks, and RetrieveAndGenerate also runs stage 7 and returns a cited answer. A DIY pipeline implements each stage yourself for control. Both paths use the same Bedrock embedding + generation models.
the central decision

III. Two paths: managed (Bedrock Knowledge Bases) vs DIY

The first real decision is not which vector store or which model — it is whether to let Amazon Bedrock Knowledge Bases run the pipeline for you, or to assemble it yourself. This single choice determines how much you build, how much you control, and how fast you ship.

The honest framing: start managed, move to DIY only when a specific requirement forces it. Most teams overbuild RAG on day one, hand-rolling a pipeline they then have to maintain, when Bedrock Knowledge Bases would have shipped the same answers in an afternoon. Conversely, some teams force everything into the managed path and then fight it when they hit a hard requirement it does not cover. Knowing where the line is saves weeks.

Path A — Amazon Bedrock Knowledge Bases (managed)

Bedrock Knowledge Bases is AWS's fully managed RAG service. You point a knowledge base at an S3 bucket (or a connected source like a web crawler, Confluence, Salesforce, or SharePoint), pick an embedding model and a vector store, and Bedrock handles ingestion, chunking, embedding, indexing, retrieval, and optional re-ranking. You then call Retrieve (get relevant chunks) or RetrieveAndGenerate (get a cited answer in one call). It manages incremental sync when documents change, returns source citations automatically, and integrates with Bedrock Guardrails for safety filtering.
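As a rough sketch of the Retrieve call, the helper below wraps the boto3 `bedrock-agent-runtime` client's `retrieve` method. The client is passed in so the function can be exercised without AWS access; `retrieve_chunks` is a hypothetical name and the knowledge base ID in the usage comment is a placeholder, not a real resource.

```python
def retrieve_chunks(client, knowledge_base_id, question, top_k=10):
    """Call the Knowledge Bases Retrieve API and return (score, text) pairs."""
    response = client.retrieve(
        knowledgeBaseId=knowledge_base_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )
    return [
        (result.get("score", 0.0), result["content"]["text"])
        for result in response["retrievalResults"]
    ]

# Usage (requires AWS credentials; "KB_ID_PLACEHOLDER" is not a real ID):
#   import boto3
#   client = boto3.client("bedrock-agent-runtime")
#   for score, text in retrieve_chunks(client, "KB_ID_PLACEHOLDER", "How do refunds work?"):
#       print(round(score, 3), text[:80])
```

Calling Retrieve on its own is useful for inspecting what the retriever returns before any generation model is involved.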

Choose managed when: you want to ship in hours not weeks; your documents live in S3 or a supported connector; standard fixed/semantic/hierarchical chunking is good enough; and you do not need exotic retrieval logic. This covers the large majority of internal-knowledge and support use cases.

Path B — DIY pipeline (Bedrock + your own vector store)

In the DIY path you still call Bedrock for embeddings and generation, but you own every stage: your own parser, your own chunker (Lambda, AWS Glue, or a Step Functions workflow), direct writes to your chosen vector store, your own retrieval and hybrid-search logic, your own re-ranking call, and your own prompt assembly. Orchestration frameworks like LangChain or LlamaIndex are common here but optional.

Choose DIY when: you need custom or document-aware chunking; you require hybrid (vector + keyword) search with your own score fusion; you have strict multi-tenant isolation or row-level access control the managed path can't express; you want to reuse a vector store you already operate; or you are squeezing cost/latency hard enough that the managed convenience premium matters. DIY costs more engineering time and ongoing maintenance — that is the trade.
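For the "own score fusion" requirement, one widely used option is reciprocal rank fusion (RRF), which merges ranked lists without needing the vector and keyword scores to be on the same scale. The sketch below is an illustration of that technique, not necessarily what any given store does internally; the chunk IDs are made up and `k=60` is the conventional default constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

vector_hits = ["c3", "c1", "c7"]   # nearest by embedding similarity
keyword_hits = ["c1", "c9", "c3"]  # best keyword/BM25 matches
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

Chunks that appear in both lists (`c1`, `c3`) rise above chunks that only one retriever found, which is the behaviour hybrid search is after.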

the pragmatic rule

Prototype on Bedrock Knowledge Bases to prove the use case and get a baseline answer quality fast. Graduate specific stages to DIY only when a concrete requirement — custom chunking, hybrid search, multi-tenant ACLs, or aggressive cost control — actually forces it. Many production systems are a hybrid: managed KB for the bulk corpus, a DIY path for one demanding source.

where the vectors live

IV. Vector store options on AWS: OpenSearch, Aurora pgvector, Pinecone, Redis

The vector store holds your embeddings and answers nearest-neighbour queries. All four common options work with both Bedrock Knowledge Bases and DIY pipelines; they differ on operational model, cost shape, scale, and whether you already run the underlying engine.

Amazon OpenSearch Serverless is the most common default for Bedrock Knowledge Bases. It is fully managed, scales automatically, supports both vector (k-NN) and keyword search in one engine (so hybrid retrieval is native), and integrates tightly with the rest of AWS. The trade is that it bills by OpenSearch Compute Units (OCUs) with a minimum baseline, so it can feel expensive for very small corpora even when idle.

Aurora PostgreSQL with the pgvector extension is the pragmatic choice when you already run Postgres. Your vectors live in the same database as your relational data, so you can filter by SQL predicates and join to business tables in one query, and there is no new system to operate. It scales well into the millions of vectors with HNSW indexing; beyond that, or for very high query concurrency, a purpose-built vector engine pulls ahead.
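A filtered, joined vector query of the kind described above could be built as follows. This is a sketch under assumptions: the `chunks` and `documents` tables and their columns are hypothetical, and the `%s` placeholders follow the psycopg parameter style. The `<=>` operator is pgvector's cosine-distance operator.

```python
def pgvector_query(question_vec, tenant_id, top_k=10):
    """Build a parameterised pgvector search with a SQL filter and a join."""
    sql = (
        "SELECT c.id, c.body, d.title, c.embedding <=> %s::vector AS distance "
        "FROM chunks c JOIN documents d ON d.id = c.document_id "
        "WHERE d.tenant_id = %s "
        "ORDER BY c.embedding <=> %s::vector "
        "LIMIT %s"
    )
    # pgvector accepts a vector as a bracketed, comma-separated text literal.
    vec_literal = "[" + ",".join(str(x) for x in question_vec) + "]"
    return sql, (vec_literal, tenant_id, vec_literal, top_k)

sql, params = pgvector_query([0.1, 0.2, 0.3], tenant_id="acme", top_k=5)
print(sql)
print(params)
```

The tuple would be passed to a database cursor's `execute(sql, params)`; keeping the values as parameters rather than formatting them into the string avoids SQL injection.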

Pinecone is a managed, vector-native database available through the AWS Marketplace (and selectable in Bedrock Knowledge Bases). It is built only for vector search, so it offers strong performance, serverless scaling, and rich metadata filtering with minimal tuning — attractive when vector search is your core workload and you want a specialist rather than a general engine. It is a third-party service billed separately (Marketplace billing can route through your AWS invoice).

Amazon MemoryDB / Redis (Redis with vector search) is the option to reach for when retrieval latency is critical: it is in-memory, so single-digit-millisecond queries are realistic, which suits real-time chat and agent loops. The trade is cost at large scale (RAM is pricier than disk) and that it is best for hot, latency-sensitive indexes rather than enormous archival corpora. Note that the Redis option selectable inside Bedrock Knowledge Bases is Redis Enterprise Cloud; check the current list of supported stores before planning a managed build around MemoryDB.

A few more AWS-native choices are worth knowing: Amazon DocumentDB and Amazon Neptune Analytics both support vector search, and the newer Amazon S3 Vectors capability targets very large, cost-optimized vector sets with infrequent queries. For most builds, though, the decision is among the four in the table below.

aws vector store options for rag · representative as of 2026 — check the AWS pricing page for current rates
| Vector store | Managed? | Hybrid search | Best fit | Cost shape | Watch-out |
|---|---|---|---|---|---|
| OpenSearch Serverless | Fully managed | Native (vector + BM25) | Default for Bedrock KB; teams wanting one engine for both | Per OCU + storage; baseline minimum | Baseline cost stings tiny corpora |
| Aurora PostgreSQL (pgvector) | Managed DB | Vector + SQL filters | Teams already on Postgres; SQL-joined metadata | Per Aurora instance / ACU + storage | Specialist engines win at extreme scale/QPS |
| Pinecone | Fully managed (3rd-party) | Vector + metadata filters | Vector-native workloads wanting zero tuning | Per usage / pod (Marketplace) | Separate vendor; data leaves AWS-native services |
| Amazon MemoryDB / Redis | Managed | Vector + filters | Ultra-low-latency real-time retrieval | Per node (RAM-priced) | Expensive for very large indexes |
All four are selectable as the vector store behind Bedrock Knowledge Bases and usable directly in DIY pipelines. OpenSearch Serverless is the path of least resistance on AWS; pgvector is the path of least new infrastructure if you already run Postgres.
the quality knobs

V. Embeddings, chunking, and re-ranking: where answer quality is won or lost

Retrieval quality — not the generation model — is what makes or breaks a RAG system. If the right chunk never makes it into the prompt, no model can answer well. Three knobs dominate: which embedding model you use, how you chunk, and whether you re-rank.

Embedding model choice — Titan v2 vs Cohere

On Bedrock the two mainstream embedding families are Amazon Titan Text Embeddings v2 and Cohere Embed. Titan v2 is AWS-native, inexpensive, and supports configurable output dimensions (e.g. 256 / 512 / 1024) — smaller dimensions cut storage and speed up search at a modest recall cost, which is a real lever at scale. Cohere Embed (English and multilingual variants) is strong on retrieval benchmarks and a frequent pick for multilingual corpora and search-heavy products.
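To make the configurable-dimensions lever concrete, here is a hedged sketch of calling Titan Text Embeddings v2 through the boto3 `bedrock-runtime` client's `invoke_model` method. The client is injected so the function can be tested offline; `embed_text` is a hypothetical helper, and the Region in the usage comment is only an example.

```python
import json

TITAN_V2 = "amazon.titan-embed-text-v2:0"

def embed_text(client, text, dimensions=512):
    """Embed one string with Titan Text Embeddings v2 via bedrock-runtime."""
    body = json.dumps({"inputText": text, "dimensions": dimensions, "normalize": True})
    response = client.invoke_model(modelId=TITAN_V2, body=body)
    payload = json.loads(response["body"].read())
    return payload["embedding"]

# Usage (requires AWS credentials and Bedrock model access):
#   import boto3
#   client = boto3.client("bedrock-runtime", region_name="us-east-1")
#   vector = embed_text(client, "How do refunds work?", dimensions=256)
#   print(len(vector))
```

Whatever value is chosen for `dimensions` must be used for both documents and queries, since vectors of different sizes cannot be compared.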

Two rules matter more than the specific winner. First, you must embed queries and documents with the same model and version — mixing them produces incomparable vectors and silently destroys recall. Second, changing the embedding model means re-embedding the entire corpus, so treat this choice as semi-permanent and benchmark on your own data before committing. Pick based on language coverage, dimensionality/cost, and measured recall on a sample of your real questions — not on a leaderboard.

Chunking strategy

Chunking is the highest-variance decision in RAG. Chunks that are too large dilute the embedding (one vector trying to represent five topics) and waste prompt tokens; chunks that are too small lose the context needed to answer. A sensible default is 300–800 tokens with 10–20% overlap, then tune on your corpus. Bedrock Knowledge Bases offers fixed-size, semantic (split on meaning shifts), and hierarchical (parent/child) chunking out of the box.
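A fixed-size chunker with overlap is short enough to show in full. This sketch assumes whitespace-separated words as a stand-in for model tokens; `chunk_tokens` and the 400/60 defaults are illustrative choices inside the range given above, not recommended values.

```python
def chunk_tokens(tokens, size=400, overlap=60):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

# Whitespace words stand in for tokens (assumption: a real pipeline would
# count tokens with the embedding model's own tokenizer).
words = [f"w{i}" for i in range(1000)]
chunks = chunk_tokens(words, size=400, overlap=60)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk starts 340 tokens after the previous one, so the last 60 tokens of one chunk reappear as the first 60 of the next (15% overlap).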

Structure-aware chunking beats naive splitting whenever your documents have structure: split on Markdown headings, respect table and code-block boundaries, and keep a section title in each chunk's metadata so retrieval has a breadcrumb. For long technical or legal documents, hierarchical chunking — retrieve on small precise child chunks but feed the larger parent chunk to the model — gives both precision and enough surrounding context, and is one of the most reliable quality upgrades.
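A minimal structure-aware splitter, assuming Markdown input, could look like this. It splits on headings, leaves fenced code blocks intact, and stores the section title alongside each chunk; `split_by_headings` and the output keys are hypothetical, and tables or nested sections would need more handling than this sketch provides.

```python
import re

FENCE = "`" * 3  # a Markdown code-fence marker, built this way to keep the example fence-safe

def split_by_headings(markdown_text):
    """Split Markdown on headings; keep each section's title as metadata."""
    chunks = []
    title, lines = "(no heading)", []
    in_code = False
    for line in markdown_text.splitlines():
        if line.strip().startswith(FENCE):
            in_code = not in_code  # toggle on every fence line
        heading = None if in_code else re.match(r"^(#{1,6})\s+(.*)", line)
        if heading:
            if any(l.strip() for l in lines):
                chunks.append({"title": title, "text": "\n".join(lines).strip()})
            title, lines = heading.group(2).strip(), []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):
        chunks.append({"title": title, "text": "\n".join(lines).strip()})
    return chunks

doc = "\n".join([
    "# Refunds",
    "Refunds take 5 days.",
    "## Exceptions",
    FENCE,
    "# not a heading, just a comment inside code",
    FENCE,
    "Gift cards are final.",
])
for chunk in split_by_headings(doc):
    print(chunk["title"], "->", len(chunk["text"]), "chars")
```

The `#` line inside the code fence stays in the "Exceptions" chunk instead of starting a new section, which is the boundary-respecting behaviour described above.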

Re-ranking

Embedding-based retrieval is fast but approximate; it returns chunks that are roughly relevant. A re-ranker is a cross-encoder that reads the question and each candidate chunk together and scores true relevance, which is far more accurate than vector distance. The standard pattern: retrieve a wide net (top-30 to top-50) cheaply with vectors, then re-rank and keep only the best 3–8 to put in the prompt.
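The retrieve-wide, keep-few pattern can be sketched with a pluggable scoring function. The `overlap_score` function below is only a toy stand-in so the example runs offline; it is not a cross-encoder, and in practice `score_fn` would call a real re-ranking model.

```python
def rerank(question, candidates, score_fn, keep=3):
    """Score every candidate against the question and keep the best `keep`."""
    ranked = sorted(candidates, key=lambda text: score_fn(question, text), reverse=True)
    return ranked[:keep]

def overlap_score(question, text):
    """Toy stand-in for a cross-encoder: fraction of question words in the text."""
    q_words = set(question.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0

# Pretend these came back from a wide vector search.
candidates = [
    "Shipping takes three days within the EU.",
    "Refunds take five business days to process.",
    "How long refunds take depends on the payment method.",
]
print(rerank("how long do refunds take", candidates, overlap_score, keep=2))
```

Swapping `overlap_score` for a call to a hosted re-ranker changes the quality of the scores but not the shape of the pipeline.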

On AWS, Amazon Rerank and Cohere Rerank are available through Bedrock, and Bedrock Knowledge Bases can apply re-ranking automatically. The win is twofold: better answers (the model sees only high-precision context) and lower generation cost (fewer, tighter chunks means fewer input tokens). If you do one thing to improve a struggling RAG system, add re-ranking before you touch the generation model.

the debugging order

When answers are bad, debug retrieval before generation. Check in order: (1) is the right chunk even in the index (parsing/chunking)? (2) does it come back in the top-50 (embedding model / hybrid search)? (3) does it survive to the top-5 (re-ranking)? (4) only then, is the prompt/model the problem? As a rule of thumb, roughly 80% of RAG quality issues are retrieval, not generation.
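The checklist translates directly into a small triage function. This is a sketch that assumes you know which chunk should have answered the question and have logged the IDs at each stage; `diagnose` and its arguments are hypothetical names.

```python
def diagnose(expected_id, indexed_ids, retrieved_ids, reranked_ids):
    """Return the first pipeline stage at which the expected chunk goes missing."""
    if expected_id not in indexed_ids:
        return "indexing: chunk never made it into the index (parsing/chunking)"
    if expected_id not in retrieved_ids:
        return "retrieval: chunk is indexed but not in the top-K (embeddings/hybrid search)"
    if expected_id not in reranked_ids:
        return "re-ranking: chunk was retrieved but dropped before the prompt"
    return "generation: the chunk reached the prompt, so inspect the prompt/model"

print(diagnose(
    expected_id="refunds#2",
    indexed_ids={"refunds#1", "refunds#2", "billing#1"},
    retrieved_ids=["billing#1", "refunds#2"],
    reranked_ids=["billing#1"],
))
```

Running this over every failed question in an evaluation set shows which stage accounts for most failures, which tells you where to spend tuning effort.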

measuring it

VI. Evaluating a RAG system: faithfulness, relevance, and context quality

"It looks good in the demo" is not evaluation. A RAG system has two failure surfaces — retrieval and generation — and you need metrics that isolate each so you know which stage to fix. The industry-standard RAG metrics break down cleanly along those lines.

Build a fixed evaluation set first: 50–200 real questions paired with the correct answer and (ideally) the source passage that contains it. Run it on every change so you can tell whether a new chunk size or a new re-ranker actually helped instead of guessing. The four metrics below are a widely used core set, two for generation and two for retrieval, and an LLM-as-a-judge model on Bedrock can score most of them automatically.

  • Faithfulness (groundedness) — Does the answer follow from the retrieved context, or did the model add unsupported claims? This is the anti-hallucination metric. Low faithfulness with good context means a prompting/generation problem — tighten the "answer only from context" instruction or switch generation models.
  • Answer relevance — Does the answer actually address the question asked (not a related-but-different one)? Catches the model wandering off or being evasive even when the facts are right.
  • Context precision — Of the chunks you retrieved, how many were actually relevant? Low precision means your retriever is noisy — fix with re-ranking or hybrid search. Noisy context also raises cost and can distract the model.
  • Context recall — Did retrieval surface all the chunks needed to answer fully? Low recall means the right information never reached the model — fix with chunking, a better embedding model, a larger top-K, or hybrid search.
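The two retrieval metrics have simple set-based forms when the golden set records which chunks are relevant to each question. The sketch below uses those simplified definitions as an assumption; evaluation frameworks typically use LLM-judged or rank-aware variants, so absolute numbers will differ.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Share of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for cid in retrieved_ids if cid in relevant_ids)
    return hits / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Share of the relevant chunks that retrieval managed to surface."""
    if not relevant_ids:
        return 1.0  # nothing was needed, so nothing was missed
    hits = sum(1 for cid in relevant_ids if cid in retrieved_ids)
    return hits / len(relevant_ids)

# Hypothetical golden-set entry: labelled relevant chunks vs what came back.
retrieved = ["c1", "c4", "c7", "c9"]
relevant = {"c1", "c7", "c8"}
print(context_precision(retrieved, relevant))  # 2 of 4 retrieved are relevant
print(context_recall(retrieved, relevant))     # 2 of 3 relevant were retrieved
```

Here precision is 0.5 (half the context is noise) and recall is about 0.67 (one needed chunk, `c8`, never reached the model), which point to different fixes.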

How to run it on AWS

Amazon Bedrock includes RAG evaluation in its model-evaluation suite: you supply a dataset of prompts (and references) and Bedrock runs an LLM-as-a-judge to score retrieval and response quality, including faithfulness and relevance, with results you can compare across configurations. For DIY pipelines, open-source frameworks such as Ragas implement the same metrics and run anywhere. Either way, the discipline is identical: a fixed golden set, automated scoring, and a number that moves when you change a knob.

Two non-negotiables for production: log every query with its retrieved chunks and the final answer (so you can reproduce and audit any response), and add a small human-review loop on a sample of traffic, because automated judges miss domain-specific errors a subject-matter expert catches instantly.

shipping it for real

VII. Production concerns: freshness, access control, latency, and cost

A RAG demo and a production RAG system differ on four axes that rarely show up in a prototype: keeping the index fresh, enforcing who can see what, hitting a latency budget, and controlling a bill that scales with usage. Each has a concrete AWS answer.

Freshness

Your index is only as current as your last sync. Bedrock Knowledge Bases supports incremental ingestion so changed documents re-embed automatically; in a DIY pipeline you trigger re-embedding on an S3 event (new/changed object) via Lambda or a scheduled Glue job. Decide a freshness SLA per source — a status page may need minutes, a policy archive can sync nightly — and store a timestamp in each chunk's metadata so you can filter out or down-weight stale content.
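For the DIY trigger, a Lambda-style handler mainly has to pull the changed objects out of the S3 event notification. The sketch below assumes the standard S3 event shape; `reembed` is a hypothetical hook standing in for whatever re-chunks and re-embeds a document in your pipeline, and deletes are deliberately ignored.

```python
from urllib.parse import unquote_plus

def changed_objects(event):
    """Extract (bucket, key) pairs for created/updated objects in an S3 event."""
    pairs = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue  # this sketch ignores deletes and other event types
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        pairs.append((bucket, key))
    return pairs

def handler(event, context, reembed=print):
    """Lambda-style entry point; `reembed` is a hypothetical re-indexing hook."""
    pairs = changed_objects(event)
    for bucket, key in pairs:
        reembed(bucket, key)
    return {"processed": len(pairs)}

sample_event = {"Records": [
    {"eventName": "ObjectCreated:Put",
     "s3": {"bucket": {"name": "docs"}, "object": {"key": "policies/refund+policy.md"}}},
    {"eventName": "ObjectRemoved:Delete",
     "s3": {"bucket": {"name": "docs"}, "object": {"key": "old.md"}}},
]}
print(handler(sample_event, None))
```

A production version would also handle deletions by removing the corresponding vectors, so stale chunks do not linger in the index.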

Access control (the one most teams underestimate)

If different users may see different documents, access control must live in retrieval, not in a post-filter on the answer — by the time the model has written the answer, the data has already leaked. The pattern: tag every chunk with ACL metadata (user, group, tenant, classification) at index time, and apply a metadata filter on every query so a user only ever retrieves chunks they are entitled to. Both Bedrock Knowledge Bases (metadata filtering) and DIY stores (OpenSearch filters, pgvector SQL predicates, Pinecone metadata filters) support this. For multi-tenant SaaS, isolate tenants with a per-tenant filter at minimum, or separate indexes for hard isolation.
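On the managed path, the per-query filter is a structure passed in the retrieval configuration. The sketch below builds one using the Knowledge Bases filter operators `andAll`, `equals`, and `in`; the metadata keys `tenant_id` and `group` are assumptions and must match whatever you attached to your documents at index time.

```python
def acl_filter(tenant_id, groups):
    """Build a metadata filter: same tenant AND one of the caller's groups."""
    return {
        "andAll": [
            {"equals": {"key": "tenant_id", "value": tenant_id}},
            {"in": {"key": "group", "value": sorted(groups)}},
        ]
    }

def retrieval_config(tenant_id, groups, top_k=10):
    """Retrieval configuration that applies the ACL filter on every query."""
    return {
        "vectorSearchConfiguration": {
            "numberOfResults": top_k,
            "filter": acl_filter(tenant_id, groups),
        }
    }

# Hypothetical caller identity, resolved by your auth layer before retrieval.
config = retrieval_config("acme", {"support", "billing"}, top_k=5)
print(config)
```

The important design point is that the tenant and groups come from the authenticated session on the server, never from user-supplied request fields, so a caller cannot widen their own filter.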

Latency and cost

End-to-end latency is retrieval + re-rank + generation; generation usually dominates, so streaming the response token-by-token is the cheapest perceived-latency win. For cost, the levers are: a smaller/cheaper generation model for easy questions (route hard ones to a frontier model); fewer, re-ranked chunks (fewer input tokens); Bedrock prompt caching to avoid re-paying for a static system prompt or repeated context; and a sensible embedding dimensionality. Vector-store cost is a baseline you pay regardless of traffic — right-size it to corpus size, not peak imagination.

production readiness checklist

Before launch: golden evaluation set wired into CI · citations returned with every answer · ACL metadata filtering on every retrieval · freshness SLA per source with auto-sync · Guardrails on inputs and outputs · full query/context/answer logging · streaming responses · a cost ceiling with billing alarms. Miss any one and the gap shows up in production, not the demo.

the build, in order

VIII. A step-by-step build outline (managed path)

Here is the fastest credible path from zero to a cited, production-leaning RAG system on AWS using Bedrock Knowledge Bases. The DIY path follows the same logical order with each stage hand-built.

  • Step 1 — Stage your corpus in S3 — Land your documents in an S3 bucket. Parse non-text formats (PDF/Word/HTML) to clean text first — Amazon Textract for scanned PDFs and tables. Garbage in, garbage out: clean extraction here pays off at every later stage.
  • Step 2 — Enable Bedrock model access — In the Bedrock console, request access to an embedding model (Titan Text Embeddings v2 or Cohere Embed) and a generation model (Claude, Nova, Llama, or Mistral), in your chosen Region.
  • Step 3 — Create the Knowledge Base — Create a Bedrock Knowledge Base, point it at the S3 bucket, pick the embedding model, and pick a vector store (OpenSearch Serverless is the default; pgvector/Pinecone/Redis are selectable). Choose a chunking strategy — start with semantic or hierarchical chunking.
  • Step 4 — Sync and inspect — Run the initial ingestion job. Spot-check that documents chunked sensibly and that retrieval returns the right passages for a handful of known questions using the Retrieve API. Fix parsing/chunking now, before anyone sees an answer.
  • Step 5 — Wire RetrieveAndGenerate — Call RetrieveAndGenerate with a system prompt that instructs the model to answer only from the retrieved context and to cite sources. Enable re-ranking. Attach a Bedrock Guardrail for input/output safety. Return citations to the UI.
  • Step 6 — Add access control + freshness — Tag chunks with ACL metadata and apply per-query metadata filters. Wire incremental sync so changed documents re-embed automatically. Set a freshness SLA per source.
  • Step 7 — Evaluate, then iterate — Build a 50–200 question golden set and score faithfulness, answer relevance, and context precision/recall with Bedrock RAG evaluation. Tune chunk size, top-K, and re-ranking against the numbers. Add logging and a human-review sample before scaling traffic.
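Step 5 can be sketched as a thin wrapper around the boto3 `bedrock-agent-runtime` client's `retrieve_and_generate` method. The client is injected so the function can be tested offline; `ask` is a hypothetical helper, the IDs in the usage comment are placeholders, and re-ranking, guardrails, and a custom prompt template are left out to keep the example short.

```python
def ask(client, knowledge_base_id, model_arn, question):
    """Call RetrieveAndGenerate and return the answer plus cited S3 URIs."""
    response = client.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
            },
        },
    )
    sources = []
    for citation in response.get("citations", []):
        for ref in citation.get("retrievedReferences", []):
            uri = ref.get("location", {}).get("s3Location", {}).get("uri")
            if uri and uri not in sources:
                sources.append(uri)  # de-duplicate while keeping order
    return response["output"]["text"], sources

# Usage (requires AWS credentials; both identifiers are placeholders):
#   import boto3
#   client = boto3.client("bedrock-agent-runtime")
#   answer, sources = ask(client, "KB_ID_PLACEHOLDER", "MODEL_ARN_PLACEHOLDER",
#                         "How long do refunds take?")
```

Returning the source URIs next to the answer is what lets the UI show citations the user can open and verify.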
what it costs

IX. The RAG cost stack on AWS: where the money goes

A RAG bill has five line items. None is exotic, but together they surprise teams that budgeted only for the generation model. Here is the full stack and the lever on each.

The figures below are representative as of 2026 to show the shape of the bill, not a quote — always check the AWS pricing page (and the third-party vendor for Pinecone) for current rates. The dominant cost in almost every production RAG system is generation tokens, followed by the always-on vector-store baseline.

rag cost stack on aws · representative shape as of 2026 — check the AWS pricing page for current rates
| Cost line | When you pay | Driver | Main lever to control it |
|---|---|---|---|
| Embeddings (indexing) | One-time per corpus + on updates | Total tokens embedded | Chunk size; smaller embedding dimensions; only re-embed changed docs |
| Vector store | Continuous (baseline) | Corpus size + index type + engine | Right-size the engine; pick pgvector if Postgres already runs; tune dimensions |
| Query embeddings | Per query | Question volume | Negligible per call; cache embeddings for repeated queries |
| Re-ranking | Per query | Candidates re-ranked × queries | Re-rank top-30/50, not top-500; skip on trivial queries |
| Generation | Per query (usually the largest) | Input + output tokens × model price | Cheaper model for easy queries; fewer chunks; prompt caching; tight max-tokens |
Two levers cut the biggest line — generation — the most: prompt caching (stop re-paying for a static system prompt and repeated context) and re-ranking down to a few tight chunks (fewer input tokens per call). Batch any offline generation for roughly half price.
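The effect of passing fewer chunks is easy to see with back-of-envelope arithmetic. In the sketch below every number is a placeholder chosen for illustration, including the per-1K-token prices, which are not real AWS rates; substitute current figures from the pricing page.

```python
def generation_cost(queries, chunks_per_query, tokens_per_chunk,
                    overhead_tokens, output_tokens,
                    input_price_per_1k, output_price_per_1k):
    """Estimate monthly generation spend from token counts and unit prices."""
    input_total = queries * (chunks_per_query * tokens_per_chunk + overhead_tokens)
    output_total = queries * output_tokens
    return (input_total / 1000) * input_price_per_1k + \
           (output_total / 1000) * output_price_per_1k

# Placeholder workload and prices, NOT real AWS rates.
common = dict(queries=100_000, tokens_per_chunk=500, overhead_tokens=300,
              output_tokens=300, input_price_per_1k=0.003, output_price_per_1k=0.015)
wide = generation_cost(chunks_per_query=20, **common)   # no re-ranking
tight = generation_cost(chunks_per_query=5, **common)   # re-ranked to 5 chunks
print(f"20 chunks: ${wide:,.0f}   5 chunks: ${tight:,.0f}")
```

Under these assumed numbers, trimming the context from 20 chunks to 5 cuts the estimate from about $3,540 to about $1,290 per month, because input tokens dominate the bill.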
the central decision, side by side

Managed (Bedrock Knowledge Bases) vs DIY RAG — which to build

This is the comparison that decides your architecture. Read it as "default to managed; move a stage to DIY only when a row in the right column is a hard requirement for you."

| Dimension | Bedrock Knowledge Bases (managed) | DIY (Bedrock + your stack) |
|---|---|---|
| Time to first answer | Hours: point at S3, sync, call an API | Days to weeks: build every stage |
| Pipeline you maintain | Almost none: AWS runs ingest→retrieve→re-rank | All of it: parser, chunker, retriever, re-ranker |
| Chunking control | Fixed / semantic / hierarchical presets | Anything: fully document-aware, custom |
| Retrieval logic | Managed vector + optional re-rank + metadata filters | Custom hybrid search, score fusion, your own logic |
| Access control | Metadata filtering | Anything: row-level, per-tenant, external policy engine |
| Vector store choice | OpenSearch Serverless / Aurora pgvector / Pinecone / Redis | Any store, including ones KB doesn't support |
| Cost control | Less granular; convenience premium | Maximum: tune every stage |
| Best for | Most internal-knowledge + support use cases | Custom chunking, hybrid search, strict multi-tenancy, cost-squeeze |
Both paths call the same Bedrock embedding and generation models — the difference is who orchestrates the stages in between. A common production shape is a hybrid: managed KB for the bulk corpus, DIY for one demanding source.
a recent match

A grounded support assistant — anonymized

inquiry · seed-stage b2b SaaS, support automation, EU
Seed-stage B2B SaaS, 14 people, ~12k help-centre articles + 3 years of resolved tickets, EU data-residency requirement

Situation: Wanted an AI support assistant that answered strictly from their own docs with citations — and that never leaked one customer's data to another. A first in-house attempt hallucinated, returned irrelevant passages, and had no access-control story. The two engineers who could build it were fully committed to the core product, and the projected Bedrock + OpenSearch bill made the founder hesitate to even start.

What CloudRoute did: Routed within 24 hours to an EU-region AWS partner with a GenAI/ML track record. The partner scoped a Bedrock Knowledge Bases build in eu-central-1: S3 ingestion, hierarchical chunking, Titan v2 embeddings, OpenSearch Serverless as the vector store, Cohere Rerank for precision, Claude for generation with strict grounded-answer prompting and citations, per-tenant metadata filtering for isolation, and a 120-question golden set scored with Bedrock RAG evaluation. The whole engagement was funded by AWS credits the partner filed for — Activate Portfolio plus a Bedrock POC allocation.

Outcome: Cited, grounded assistant in production in under 5 weeks. Faithfulness and context-precision scores cleared the team's bar on the golden set; per-tenant isolation enforced at retrieval. The build and the first months of inference ran on AWS credits — the customer paid $0. CloudRoute's commission was paid by the partner from AWS engagement funding.

engagement window: ~5 weeks · founder time: ~6 hours · stack: Bedrock KB + OpenSearch Serverless + Titan v2 + Cohere Rerank + Claude · cost to customer: $0

faq

Common questions

What is the difference between RAG and fine-tuning on AWS?
RAG retrieves your data at query time and feeds it to the model as context — knowledge lives outside the model, so you update it by editing documents (no retraining), you get citations, and you can enforce per-user access control. Fine-tuning bakes knowledge or style into the model's weights — it is good for teaching a consistent format, tone, or a narrow skill, but it is expensive to update, gives no citations, and cannot easily do access control. For knowledge that changes or must be access-controlled, use RAG. Many production systems use both: RAG for facts, light fine-tuning for behaviour.
Should I use Amazon Bedrock Knowledge Bases or build my own RAG pipeline?
Start with Bedrock Knowledge Bases. It manages ingestion, chunking, embedding, storage, retrieval, and re-ranking, and exposes a RetrieveAndGenerate API that returns cited answers — you can ship in hours. Move a stage to a DIY pipeline (Bedrock + your own vector store and orchestration) only when a concrete requirement forces it: custom document-aware chunking, hybrid vector+keyword search with your own score fusion, strict multi-tenant or row-level access control, reuse of an existing vector store, or aggressive cost/latency tuning. Most teams overbuild here — managed covers the majority of use cases.
Which vector store should I use for RAG on AWS?
OpenSearch Serverless is the path of least resistance: it is the default behind Bedrock Knowledge Bases and supports vector and keyword search in one engine (native hybrid), though its baseline cost is hard to justify for tiny corpora. Aurora PostgreSQL with pgvector is the path of least new infrastructure if you already run Postgres, and lets you filter vectors with SQL and join to business tables. Pinecone is a vector-native managed option (via Marketplace) for zero-tuning vector workloads. Amazon MemoryDB/Redis is for ultra-low-latency real-time retrieval. All four work in DIY pipelines; OpenSearch Serverless, Aurora PostgreSQL, and Pinecone also plug directly into Bedrock KB, while for Redis you should confirm which offering Knowledge Bases currently supports (its documented integration has been Redis Enterprise Cloud).
Which embedding model is better — Amazon Titan or Cohere?
Both run on Bedrock and both are good — benchmark on your own data rather than a leaderboard. Amazon Titan Text Embeddings v2 is AWS-native, inexpensive, and supports configurable output dimensions (smaller dimensions cut storage/latency at a small recall cost). Cohere Embed (English + multilingual) scores strongly on retrieval benchmarks and is a common pick for multilingual corpora. Two hard rules regardless of choice: embed queries and documents with the same model/version, and remember that changing the embedding model later means re-embedding your entire corpus — so treat it as semi-permanent.
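To make the dimension trade-off concrete, here is a back-of-the-envelope sketch of raw vector storage at the three output sizes Titan Text Embeddings v2 offers (1024, 512, 256). It assumes float32 vectors and a hypothetical corpus of one million chunks, and it ignores index overhead and metadata, which vary by vector store.

```python
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(num_chunks: int, dimensions: int) -> int:
    """Raw size of the embeddings alone, excluding index structures and metadata."""
    return num_chunks * dimensions * BYTES_PER_FLOAT32


NUM_CHUNKS = 1_000_000  # hypothetical corpus size, for illustration only

for dims in (1024, 512, 256):
    gb = raw_vector_bytes(NUM_CHUNKS, dims) / 1e9
    print(f"{dims:>4} dims: {gb:.3f} GB of raw vectors")
# 1024 dims: 4.096 GB of raw vectors
#  512 dims: 2.048 GB of raw vectors
#  256 dims: 1.024 GB of raw vectors
```

Storage falls linearly with dimensions; whether the recall cost is acceptable is something only a benchmark on your own data can answer.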
How do I stop a RAG system from hallucinating?
Most hallucination in RAG is actually a retrieval failure: the right chunk never reached the model. Debug retrieval first (is the chunk in the index? does it come back in the top-50? does it survive re-ranking to the top-5?). Then, on the generation side: instruct the model to answer only from the supplied context and to say "I don't know" when the context lacks the answer, add a re-ranker so only high-precision chunks are passed, return citations so answers are verifiable, attach Bedrock Guardrails, and measure faithfulness on a golden set so you catch regressions. As a rule of thumb, roughly 80% of RAG quality problems trace back to retrieval, not the model.
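One way to run the retrieval check is a hit rate at k over a golden set: for each question, does the chunk that contains the answer appear in the top k results? The sketch below is a minimal version with made-up question and chunk IDs; `hit_rate_at_k` is a hypothetical helper, not part of any AWS SDK.

```python
def hit_rate_at_k(retrieved: dict[str, list[str]],
                  gold: dict[str, str],
                  k: int) -> float:
    """Fraction of questions whose gold chunk ID appears in the top-k retrieved IDs."""
    if not gold:
        return 0.0
    hits = sum(
        1 for question, chunk_id in gold.items()
        if chunk_id in retrieved.get(question, [])[:k]
    )
    return hits / len(gold)


# Hypothetical golden set: question ID -> ID of the chunk that holds the answer.
gold = {"q1": "c7", "q2": "c3", "q3": "c9"}

# Hypothetical ranked retrieval results per question, best match first.
retrieved = {
    "q1": ["c7", "c2", "c5"],  # gold chunk at rank 1
    "q2": ["c1", "c4", "c3"],  # gold chunk at rank 3: a re-ranker could promote it
    "q3": ["c2", "c8", "c6"],  # gold chunk missing: check indexing and chunking
}

rate_at_1 = hit_rate_at_k(retrieved, gold, k=1)  # 1 of 3 questions
rate_at_3 = hit_rate_at_k(retrieved, gold, k=3)  # 2 of 3 questions
```

A large gap between the wide cutoff and the narrow one points at ranking; a low score even at the wide cutoff points at ingestion, chunking, or the embedding model.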
How do I enforce access control so users only see documents they're allowed to?
Enforce it in retrieval, never as a post-filter on the generated answer — by the time the model has written the answer, restricted data has already been used. Tag every chunk with ACL metadata (user, group, tenant, classification) at index time, and apply a metadata filter on every query so a user can only retrieve chunks they are entitled to. Bedrock Knowledge Bases supports metadata filtering; DIY stores support it via OpenSearch filters, pgvector SQL predicates, or Pinecone metadata filters. For multi-tenant SaaS, filter by tenant at minimum, or use separate indexes for hard isolation.
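The in-memory sketch below illustrates the ordering that matters: the ACL filter runs before similarity ranking, so a restricted chunk can never become context. The `Chunk` fields, tenant names, and groups are hypothetical, and a real vector store would apply the same predicate inside the query rather than in Python.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    tenant: str        # ACL metadata attached at index time
    groups: frozenset  # groups allowed to read this chunk
    vector: tuple      # embedding (toy 2-dimensional values here)


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, chunks, tenant, user_groups, top_k=3):
    """Filter by ACL metadata first, then rank only the permitted chunks."""
    allowed = [
        c for c in chunks
        if c.tenant == tenant and c.groups & user_groups
    ]
    ranked = sorted(allowed, key=lambda c: cosine(query_vec, c.vector), reverse=True)
    return [c.chunk_id for c in ranked[:top_k]]


index = [
    Chunk("a1", "acme", frozenset({"staff"}), (1.0, 0.0)),
    Chunk("a2", "acme", frozenset({"finance"}), (0.9, 0.1)),
    Chunk("b1", "globex", frozenset({"staff"}), (1.0, 0.0)),
]

# A staff user at tenant "acme" sees neither the finance chunk nor the other tenant's chunk.
result = retrieve((1.0, 0.0), index, tenant="acme", user_groups=frozenset({"staff"}))
```

In Bedrock Knowledge Bases the equivalent predicate is passed as a retrieval filter, for example `{"equals": {"key": "tenant", "value": "acme"}}`, where the key name is whatever metadata field you attached at ingestion.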
What does RAG on AWS actually cost?
Five line items: one-time embedding of the corpus (re-embed only on updates), a continuous vector-store baseline, per-query question embeddings (negligible), per-query re-ranking, and per-query generation — which is usually the largest cost. Generation scales with input+output tokens, so the biggest levers are routing easy questions to a cheaper model, passing fewer re-ranked chunks, using Bedrock prompt caching for static context, and batching any offline generation (~50% cheaper). Vector-store cost is a baseline you pay regardless of traffic, so right-size it to corpus size. Figures are representative as of 2026 — check the AWS pricing page for current rates.
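The recurring line items can be turned into a small monthly model to see which lever moves the bill. Every price below is a made-up placeholder, not an AWS rate, and the traffic and token counts are hypothetical; substitute current figures from the AWS pricing page. The one-time corpus embedding is left out because it is not a monthly cost.

```python
from dataclasses import dataclass


@dataclass
class Prices:
    """All values are illustrative placeholders in USD, not real AWS prices."""
    embed_per_1k_tokens: float
    rerank_per_query: float
    gen_input_per_1k_tokens: float
    gen_output_per_1k_tokens: float
    vector_store_per_month: float


def monthly_cost(p: Prices, queries: int, question_tokens: int,
                 context_tokens: int, output_tokens: int) -> dict:
    """Recurring monthly cost per line item, plus the total."""
    items = {
        "vector_store": p.vector_store_per_month,
        "query_embedding": queries * question_tokens / 1000 * p.embed_per_1k_tokens,
        "rerank": queries * p.rerank_per_query,
        "generation": queries * (
            (question_tokens + context_tokens) / 1000 * p.gen_input_per_1k_tokens
            + output_tokens / 1000 * p.gen_output_per_1k_tokens
        ),
    }
    items["total"] = sum(items.values())
    return items


PLACEHOLDER = Prices(
    embed_per_1k_tokens=0.0001,
    rerank_per_query=0.002,
    gen_input_per_1k_tokens=0.003,
    gen_output_per_1k_tokens=0.015,
    vector_store_per_month=300.0,
)

# Same traffic, but the second scenario passes half as many context tokens.
base = monthly_cost(PLACEHOLDER, queries=100_000, question_tokens=50,
                    context_tokens=4000, output_tokens=300)
lean = monthly_cost(PLACEHOLDER, queries=100_000, question_tokens=50,
                    context_tokens=2000, output_tokens=300)

for name in ("vector_store", "query_embedding", "rerank", "generation", "total"):
    print(f"{name:>16}: base ${base[name]:,.2f}  lean ${lean[name]:,.2f}")
```

Under these placeholder numbers, generation dominates and halving the context passed to the model is the single largest saving, while query embedding stays negligible and the vector-store baseline does not move with traffic.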
How long does it take to build a production RAG system on AWS?
A managed Bedrock Knowledge Bases prototype that returns cited answers can be up and running in hours to a day. Getting to genuinely production-ready (access control, freshness sync, a golden evaluation set, Guardrails, logging, and cost controls) typically takes 2–6 weeks, depending on data cleanliness and requirements. A fully custom DIY pipeline takes longer. The slowest part is almost always data preparation (clean parsing and chunking), not the AWS wiring. A specialist ML partner shortens this materially; that is the engagement CloudRoute routes, funded by AWS credits, so the customer pays $0.

Build your RAG on AWS — funded by AWS credits

CloudRoute routes you to a vetted AWS GenAI/ML partner who designs and ships the pipeline — Bedrock Knowledge Bases or a custom DIY stack, the right vector store, embeddings, re-ranking, access control, and evaluation. AWS credits fund the build and the inference. You pay $0.

matched within: < 24h
credits to fund it: up to $100K
cost to you: $0