A complete, neutral reference for Amazon Bedrock Knowledge Bases in 2026: what they are (a fully-managed retrieval-augmented-generation pipeline), the data sources they connect (S3, web crawler, Confluence, SharePoint, Salesforce), how ingestion works (chunking strategies, parsing including FM parsing for complex docs), which embeddings model and vector store to pick, the Retrieve and RetrieveAndGenerate APIs, metadata filtering, when managed beats DIY — and how AWS credits make the whole build $0.
Knowledge Bases is the part of Amazon Bedrock that turns "ask questions over my own documents" from a multi-service engineering project into a managed feature. The clearest one-line definition: it is a fully-managed retrieval-augmented-generation (RAG) pipeline.
To see why that matters, it helps to know what RAG is and why teams build it. A foundation model only knows what was in its training data; it has never seen your internal wiki, your product manuals, your support tickets, or last quarter's contracts. Retrieval-augmented generation fixes that by retrieving the relevant snippets of your data at question time and putting them into the model's context, so the answer is grounded in your facts rather than the model's memory. RAG is how most enterprise "chat with your documents" and "answer from our knowledge base" products work, because it is cheaper, faster to update, and more auditable than fine-tuning a model on the same data.
Building RAG by hand means stitching together at least six moving parts: a way to load documents from wherever they live; a parser to extract clean text (and tables and images) from messy formats like PDF; a chunker to split that text into retrieval-sized pieces; an embeddings model to turn each chunk into a vector; a vector store to hold those vectors and search them by similarity; and then the query-time logic to embed the question, retrieve the closest chunks, assemble a prompt, call a model, and return a cited answer. Each piece is a service to deploy, secure, scale, and keep in sync as the source data changes.
Knowledge Bases collapses all of that into one managed service. You declare a data source and a vector store, choose an embeddings model and a chunking strategy, and Bedrock runs the ingestion pipeline for you — parsing, chunking, embedding, and writing the vectors — and keeps it in sync when the underlying data changes. At query time you call one of two APIs and get back either the relevant chunks or a fully-grounded, cited answer. You never write the retrieval loop, and you never operate the embedding or sync infrastructure.
It is worth being precise about what is and is not managed. Bedrock manages the pipeline — the orchestration of parse → chunk → embed → store → retrieve. It does not hide the vector store: you bring (or let it create) a real vector database that you can see and pay for. And it does not remove your decisions — chunking strategy, embeddings model, and vector store are all yours to choose, and they materially change quality and cost. Knowledge Bases removes the undifferentiated heavy lifting, not the architecture decisions.
Amazon Bedrock Knowledge Bases is a fully-managed RAG pipeline: point it at your data, pick an embeddings model and a vector store, and it handles parsing, chunking, embedding, syncing, and retrieval — exposed through the Retrieve and RetrieveAndGenerate APIs. You get grounded, cited answers over your own data without building the plumbing.
A Knowledge Base is only as useful as the data it can reach. Bedrock supports a growing set of first-party data-source connectors so you can index data where it already lives rather than copying it into a bucket first.
The foundational source is Amazon S3: you put documents (PDF, plain text, HTML, Markdown, Word, CSV, and more) in a bucket, point the Knowledge Base at the prefix, and it ingests them. S3 is the path most teams start with because almost any pipeline can drop files into a bucket. Beyond S3, Bedrock offers connectors for a web crawler, Confluence, SharePoint, and Salesforce, which crawl or sync from systems of record so the source content stays where its owners maintain it.
A few practical notes. First, you can attach multiple data sources to one Knowledge Base, so a single assistant can answer across S3 documents, a Confluence wiki, and a website at once. Second, each connector has its own sync model — you trigger or schedule an ingestion job, and the Knowledge Base reflects additions, changes, and deletions from the source on the next sync. Third, connectors honor the scope you configure (which spaces, sites, URL patterns, or prefixes), which is the first line of access control — though sensitive deployments should also lean on metadata filtering and Bedrock Guardrails. The exact connector list and capabilities expand over time, so confirm current support in the AWS Bedrock documentation when you scope a build.
Ingestion is where a Knowledge Base earns its keep, and it is also where the two highest-leverage quality decisions live: how documents are parsed, and how they are chunked. Get these right and retrieval is sharp; get them wrong and the model retrieves noise no matter how good it is.
When you run a sync, the pipeline executes four steps for every document: parse (extract clean text and structure from the source format), chunk (split that text into retrieval-sized pieces), embed (turn each chunk into a vector with the embeddings model you chose), and store (write the vectors plus their source text and metadata into the vector store). The two steps you actively configure are parsing and chunking.
Standard parsing extracts the text from a document and works well for clean, text-first files. It struggles with complex documents — PDFs full of tables, multi-column layouts, scanned pages, charts, or images that carry meaning. For those, Bedrock offers foundation-model (FM) parsing: instead of naive text extraction, a multimodal foundation model reads the page and produces a faithful structured representation — preserving tables as tables, capturing the content of figures, and respecting layout. FM parsing costs more per document (you are paying a model to read each page) but is often the difference between a financial-report or engineering-spec corpus being usable or useless. The honest guidance: use standard parsing by default, and turn on FM parsing for sources where layout and tables carry the meaning.
Chunking decides how text is cut into the pieces that get embedded and retrieved, and it is the single biggest lever on retrieval quality. Bedrock supports several strategies. Fixed-size chunking splits text into chunks of a set token length with a configurable overlap between neighbours: simple, predictable, and a fine default. Semantic chunking uses embeddings to find natural topic boundaries and splits there, so each chunk is a coherent idea rather than an arbitrary span, which suits prose where a fixed cut might slice a thought in half. Hierarchical chunking builds parent/child chunks: small child chunks are embedded and searched for precision, but the larger parent chunk is what gets returned to the model for context, combining sharp retrieval with enough surrounding text to answer well. You can also choose no chunking (treat each file as one chunk) when documents are already short and self-contained, or use a custom transformation (e.g. via a Lambda) for bespoke logic.
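To make the fixed-size strategy concrete, here is a minimal sketch of windowed splitting with overlap. It is an illustration only, not Bedrock's implementation: the function name `fixed_size_chunks` is hypothetical, and whitespace-separated words stand in for real model tokens.

```python
def fixed_size_chunks(text: str, max_tokens: int = 8, overlap: int = 2) -> list:
    """Split text into windows of max_tokens words that share `overlap` words
    with the previous window. Words are a rough stand-in for model tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    tokens = text.split()
    step = max_tokens - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


print(fixed_size_chunks("a b c d e f g h i j", max_tokens=4, overlap=1))
```

The overlap is what keeps a sentence that straddles a boundary retrievable from at least one chunk; the price is that the overlapping tokens are embedded and stored twice.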
Each chunk is then passed to the embeddings model, which returns a vector — a list of numbers that captures the chunk's meaning so that semantically similar text lands near it in vector space. Those vectors, along with the original chunk text and any metadata, are written to the vector store. From that point the corpus is queryable: a question is embedded the same way, and the store returns the chunks whose vectors are closest. Re-running a sync after the source changes updates only what changed, keeping the index current.
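The embed-then-search idea can be sketched in a few lines. This is a toy under stated assumptions: `embed` here is a hypothetical word-count stand-in for a real embeddings model (which returns a dense float vector), and the "vector store" is just a Python list searched by cosine similarity.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (NOT a real embeddings model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list, k: int = 2) -> list:
    """Embed the query the same way as the chunks and return the k closest."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]


chunks = [
    "refund policy for annual plans",
    "how to reset your password",
    "shipping times by region",
]
print(top_k("reset password", chunks, k=1))
```

A production store replaces the linear scan with an approximate nearest-neighbour index, but the contract is the same: the query and the chunks must be embedded by the same model for the distances to mean anything.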
| Strategy | How it splits | Strength | Watch out for | Good for |
|---|---|---|---|---|
| Fixed-size | Set token length + overlap | Simple, predictable, cheap | Can cut mid-idea | General default, uniform docs |
| Semantic | At meaning boundaries (via embeddings) | Coherent, self-contained chunks | Extra embedding cost to find boundaries | Long prose, mixed-topic docs |
| Hierarchical | Small child chunks + larger parents | Precise retrieval, rich context returned | More config + storage | Technical docs, long manuals |
| None (per-file) | One chunk per document | Keeps short docs whole | Poor for long files | FAQs, short articles |
| Custom (Lambda) | Your own transformation | Full control | You own the logic | Bespoke formats / rules |
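As a rough sketch of how the parsing and chunking choices are expressed in practice, the helper below builds the ingestion configuration passed when a data source is created. The field names reflect my understanding of the boto3 `bedrock-agent` `create_data_source` request shape and should be checked against the current API reference; the helper name `ingestion_config`, the token sizes, and the model ARN are illustrative assumptions.

```python
from typing import Optional


def ingestion_config(strategy: str = "FIXED_SIZE", fm_parser_arn: Optional[str] = None) -> dict:
    """Build a vectorIngestionConfiguration dict (assumed request shape).
    Token sizes below are illustrative, not recommendations."""
    if strategy == "FIXED_SIZE":
        chunking = {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {"maxTokens": 300, "overlapPercentage": 20},
        }
    elif strategy == "HIERARCHICAL":
        chunking = {
            "chunkingStrategy": "HIERARCHICAL",
            "hierarchicalChunkingConfiguration": {
                # parent level first, then the smaller child level
                "levelConfigurations": [{"maxTokens": 1500}, {"maxTokens": 300}],
                "overlapTokens": 60,
            },
        }
    elif strategy == "NONE":
        chunking = {"chunkingStrategy": "NONE"}
    else:
        raise ValueError(f"unsupported strategy in this sketch: {strategy}")

    config = {"chunkingConfiguration": chunking}
    if fm_parser_arn:
        # Opt in to FM parsing only for layout-heavy sources.
        config["parsingConfiguration"] = {
            "parsingStrategy": "BEDROCK_FOUNDATION_MODEL",
            "bedrockFoundationModelConfiguration": {"modelArn": fm_parser_arn},
        }
    return config


# Hypothetical ARN used purely as a placeholder.
print(ingestion_config("HIERARCHICAL", fm_parser_arn="arn:aws:bedrock:REGION::foundation-model/MODEL_ID"))
```

Keeping the configuration in a small builder like this makes it easy to create one data source per content type, for example FM parsing plus hierarchical chunking for PDFs and plain fixed-size chunking for wiki pages.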
The embeddings model turns text into the vectors that power retrieval. Bedrock lets you choose which one a Knowledge Base uses, and the choice is a quiet but real lever on both quality and cost — and it is effectively permanent for a given index.
The two main families on Bedrock are Amazon Titan Text Embeddings and Cohere Embed. Titan Text Embeddings is Amazon's own embeddings model, available in versions that trade off vector dimensionality and cost; it is the common default and is well-integrated and inexpensive. Cohere Embed is a strong alternative, offered in English and multilingual variants — the multilingual model is the usual pick when your corpus or your users span many languages. Both are billed per input token (the output vector is not charged), at the very low embeddings rates covered in §VII.
Two technical points matter when you choose. First, dimensionality: embeddings models output vectors of a fixed size (a few hundred to a couple of thousand numbers). Larger vectors can capture more nuance but cost more to store and search; some models let you pick a smaller dimension to save on storage and latency. Second — and this is the one teams forget — the embeddings model and the index are bound together. Vectors from one model are not comparable to vectors from another, so you cannot swap embeddings models without re-embedding the entire corpus into a fresh index. Choose deliberately up front, because changing later means a full re-ingestion.
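A little arithmetic shows why dimensionality matters for storage. The sketch below assumes 4-byte float32 values and counts only the raw vector payload; real stores add index structures, chunk text, and metadata on top. Both function names are hypothetical.

```python
def index_size_bytes(num_chunks: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw vector payload only: chunks x dimensions x bytes per value (float32 assumed)."""
    return num_chunks * dimensions * bytes_per_value


def check_compatible(index_dimensions: int, query_vector: list) -> None:
    """Reject a query vector whose size does not match the index.
    Matching size is necessary but not sufficient: two different models
    with the same dimension still produce incomparable vectors."""
    if len(query_vector) != index_dimensions:
        raise ValueError(
            f"query has {len(query_vector)} dimensions, index expects {index_dimensions}"
        )


# One million chunks at two illustrative vector sizes.
print(index_size_bytes(1_000_000, 1024))  # 4096000000 bytes, about 4.1 GB
print(index_size_bytes(1_000_000, 256))   # 1024000000 bytes, about 1.0 GB
```

Dropping from 1024 to 256 dimensions cuts the raw payload by a factor of four, which is the storage and latency saving the smaller-dimension option is trading against nuance.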
For most English-language corpora, Titan Text Embeddings is a sensible, low-cost default. Reach for Cohere's multilingual model when language coverage is a first-class requirement. Either way, the embeddings model is a smaller quality lever than chunking and parsing — pick a reasonable one and spend your tuning effort on the ingestion pipeline first.
A Knowledge Base needs a vector store to hold and search the embeddings. This is the one piece Bedrock does not abstract away — you choose the store, you can see it, and you pay for it directly. The choice affects cost, latency, operational model, and whether you reuse infrastructure you already run.
Bedrock can create and manage a vector store for you (the quickstart path) or connect to one you already operate. The default and fastest way to get started is Amazon OpenSearch Serverless; the alternatives matter when you have an existing database investment, specific cost targets, or a vendor preference.
The default decision tree is simple. If you just want it to work, let Bedrock provision OpenSearch Serverless. If you already run Aurora/Postgres and want to minimize new infrastructure (and cost at low volume), use pgvector. If your organization already standardizes on Pinecone or Redis, reuse it. If your data is graph-shaped and relationships matter, evaluate Neptune Analytics. The supported-store list grows over time, so confirm current options in the AWS docs — but for the large majority of builds, OpenSearch Serverless or Aurora pgvector is the right answer.
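The decision tree above can be written down as a small function, which is sometimes handy in an architecture checklist. The function name, the parameter names, and especially the order in which the conditions are checked are my assumptions; the text does not rank the branches against each other.

```python
from typing import Optional


def pick_vector_store(
    runs_postgres: bool = False,
    existing_vendor: Optional[str] = None,
    graph_shaped: bool = False,
) -> str:
    """Return a starting-point vector store for a Knowledge Base.
    Branch priority is an assumption made for this sketch."""
    if existing_vendor in ("Pinecone", "Redis Enterprise Cloud"):
        return existing_vendor  # reuse what the organization already standardizes on
    if graph_shaped:
        return "Neptune Analytics"  # relationships matter: evaluate vector + graph
    if runs_postgres:
        return "Aurora PostgreSQL (pgvector)"  # minimize new infrastructure
    return "OpenSearch Serverless"  # the just-make-it-work default


print(pick_vector_store())
print(pick_vector_store(runs_postgres=True))
```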
Once a Knowledge Base is built and synced, you query it through two APIs. Which one you call decides how much of the RAG loop Bedrock runs versus how much you keep control of — and metadata filtering is the feature that makes retrieval precise and access-aware.
The Retrieve API takes a query, embeds it, searches the vector store, and returns the top-matching chunks — the source text, a relevance score, and the location/metadata of each. Crucially, it does not call a generation model. You get the raw retrieved context and do whatever you want with it: assemble your own prompt, mix in other context, route to a specific model, run your own re-ranking, or feed it into an agent or a Bedrock Flow. Retrieve is the right call when you want the quality of managed retrieval but full control over the generation step.
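A minimal sketch of the Retrieve pattern follows: build the request, then assemble your own prompt from the returned chunks. The request and response field names follow my understanding of the boto3 `bedrock-agent-runtime` `retrieve` call and should be verified against the current reference; the helper names, the knowledge base ID, and the prompt wording are hypothetical. The live call is left as a comment because it needs AWS credentials.

```python
def build_retrieve_request(kb_id: str, question: str, top_k: int = 5) -> dict:
    """Keyword arguments for the Retrieve call (assumed request shape)."""
    return {
        "knowledgeBaseId": kb_id,
        "retrievalQuery": {"text": question},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    }


def build_prompt(question: str, results: list) -> str:
    """Assemble a prompt from retrieved chunks; generation stays under your control."""
    context = "\n\n".join(
        f"[{i}] {r['content']['text']}" for i, r in enumerate(results, 1)
    )
    return f"Answer using only the numbered sources.\n\n{context}\n\nQuestion: {question}"


# Live usage (requires boto3 and AWS credentials; the ID is a placeholder):
#   import boto3
#   client = boto3.client("bedrock-agent-runtime")
#   response = client.retrieve(**build_retrieve_request("KB_ID_PLACEHOLDER", "How do I rotate keys?"))
#   prompt = build_prompt("How do I rotate keys?", response["retrievalResults"])

sample_results = [{"content": {"text": "Keys rotate every 90 days."}, "score": 0.82}]
print(build_prompt("How do I rotate keys?", sample_results))
```

Because the function returns plain chunks, the same results can be re-ranked, merged with other context, or routed to whichever model you prefer before any generation happens.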
The RetrieveAndGenerate API does the whole RAG loop in a single request: it retrieves the relevant chunks, constructs the prompt, calls a foundation model you specify, and returns a natural-language answer with citations back to the source chunks. It also supports multi-turn conversations, carrying session context so follow-up questions work. This is the fastest path to a working "chat with your docs" experience — one API call, grounded answer, citations included — and it is what most teams use until they need the finer control of Retrieve. The citations are not a nice-to-have: they are what make the answer auditable and let your UI link users back to the source.
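For comparison, here is a sketch of the single-call path and of pulling source links out of the citations. Field names follow my understanding of the boto3 `bedrock-agent-runtime` `retrieve_and_generate` request and response and should be confirmed in the current reference; the helper names and all IDs and ARNs are placeholders. The sample response is hand-written for illustration, not captured output.

```python
from typing import Optional


def build_rag_request(kb_id: str, model_arn: str, question: str,
                      session_id: Optional[str] = None) -> dict:
    """Keyword arguments for RetrieveAndGenerate (assumed request shape)."""
    request = {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
            },
        },
    }
    if session_id:
        request["sessionId"] = session_id  # carries multi-turn context
    return request


def cited_sources(response: dict) -> list:
    """Unique S3 URIs referenced by the answer, in first-seen order.
    Non-S3 locations (web, Confluence, ...) are skipped in this sketch."""
    seen = []
    for citation in response.get("citations", []):
        for ref in citation.get("retrievedReferences", []):
            uri = ref.get("location", {}).get("s3Location", {}).get("uri")
            if uri and uri not in seen:
                seen.append(uri)
    return seen


# Hand-written sample in the assumed response shape.
sample = {
    "output": {"text": "Keys rotate every 90 days."},
    "citations": [
        {"retrievedReferences": [{"location": {"s3Location": {"uri": "s3://docs/security.pdf"}}}]}
    ],
}
print(cited_sources(sample))
```

Passing the `sessionId` returned by one call into the next is what lets follow-up questions resolve against the earlier turns.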
Every chunk can carry metadata — fields like document type, author, date, department, product line, or tenant ID — supplied via a sidecar metadata file in S3 or pulled from the source connector. At query time you can apply metadata filters so retrieval only considers chunks matching a condition (e.g. department = "finance", year >= 2024, or tenant = "acme"). This does two things: it sharpens relevance by excluding irrelevant chunks before similarity search, and it is a key building block for access control and multi-tenancy — you scope each user's queries to the data they are allowed to see. Combined with Bedrock Guardrails for content safety, metadata filtering is how a single Knowledge Base safely serves many users or tenants.
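The two halves of metadata filtering, attaching attributes at ingestion and filtering at query time, can be sketched as follows. The sidecar layout (`<file>.metadata.json` with a `metadataAttributes` object) and the filter operators reflect my understanding of the Bedrock documentation and should be verified; the helper names and the `tenant` and `year` keys are illustrative.

```python
import json


def sidecar_metadata(attributes: dict) -> str:
    """Body of a sidecar file stored next to the document in S3,
    e.g. manual.pdf.metadata.json alongside manual.pdf (assumed layout)."""
    return json.dumps({"metadataAttributes": attributes}, indent=2)


def scoped_retrieval_config(tenant: str, min_year: int, top_k: int = 5) -> dict:
    """retrievalConfiguration that only considers one tenant's recent chunks."""
    return {
        "vectorSearchConfiguration": {
            "numberOfResults": top_k,
            "filter": {
                "andAll": [
                    {"equals": {"key": "tenant", "value": tenant}},
                    {"greaterThanOrEquals": {"key": "year", "value": min_year}},
                ]
            },
        }
    }


print(sidecar_metadata({"tenant": "acme", "year": 2024, "department": "finance"}))
print(scoped_retrieval_config("acme", 2024))
```

When the filter is used for access control, the tenant value should come from the authenticated session on the server side, never from anything the end user can type, or the scoping can be bypassed.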
Use RetrieveAndGenerate to ship a cited "chat with your data" experience fast — one call does retrieval + generation + citations. Use Retrieve when you need control of the generation step — your own prompt, model routing, re-ranking, or feeding an agent/Flow. Apply metadata filters on either for precision and per-user/tenant access scoping.
Knowledge Bases is not the only way to do RAG on AWS. You can build it yourself — with your own loaders, a framework like LangChain or LlamaIndex, an embeddings call, and a vector store — and for some teams that is the right call. Here is the honest trade-off.
Managed Knowledge Bases wins on time-to-value and operational burden: you get a synced, parsed, chunked, embedded, queryable corpus with citations in hours, not weeks, and AWS operates the pipeline. DIY RAG wins on control and flexibility: you can use any embeddings model (including ones not on Bedrock), implement custom retrieval logic like hybrid keyword+vector search or sophisticated re-ranking, do unusual chunking, or integrate retrieval steps that the managed pipeline does not expose. The middle path is common too — use Retrieve for managed ingestion and retrieval, but own the generation and orchestration yourself.
A useful way to decide: if your RAG needs are standard (index documents, retrieve relevant chunks, generate cited answers), managed Knowledge Bases will be faster, cheaper to operate, and good enough, and you should reach for DIY only when you hit a specific wall. If you already know you need exotic retrieval (custom re-rankers, hybrid search tuned a particular way, an embeddings model only available elsewhere) or you are building a RAG platform rather than a RAG feature, DIY gives you the control. Most product teams should start managed and graduate specific pieces to custom as concrete requirements emerge; see the companion rag-on-aws guide for the full architectural picture.
| Dimension | Managed Knowledge Bases | DIY RAG (your own stack) |
|---|---|---|
| Time to first answer | Hours — declare source + store, sync | Days–weeks — build loaders, pipeline, query loop |
| Who operates it | AWS manages the pipeline + sync | You operate every component |
| Parsing + chunking | Built-in (incl. FM parsing, semantic/hierarchical) | You implement or wire a framework |
| Embeddings model | Titan or Cohere on Bedrock | Any model, anywhere |
| Vector store | OpenSearch Serverless / Aurora / Pinecone / Redis / Neptune | Any store you choose and run |
| Retrieval control | Retrieve / RetrieveAndGenerate + metadata filters | Fully custom (hybrid search, re-rankers, etc.) |
| Citations + sync | Built-in | You build them |
| Best for | RAG as a feature; standard needs; fast launch | RAG as a platform; exotic retrieval; max control |
A Knowledge Base does not have a single price; it has a cost stack of three layers plus parsing. Understanding the layers tells you where the money goes, how to keep it small, and why AWS credits cover all of it during the build.
The three recurring cost layers are: (1) embeddings — you pay the embeddings model per input token to embed your corpus at ingest and to embed every query (very cheap per token, but it scales with corpus size and re-ingestion); (2) the vector store — this is usually the largest standing cost, because OpenSearch Serverless and the managed alternatives carry an ongoing capacity charge whether or not you are querying (Aurora Serverless v2 with pgvector is often the cheapest at low volume); and (3) inference — when you use RetrieveAndGenerate, you pay the normal Bedrock token cost for the model that writes the answer, including the retrieved context as input tokens. On top of those, FM parsing adds a per-page model charge at ingestion time for complex documents, and you pay for the underlying S3 storage and any data-source connector costs.
Two cost patterns are worth internalizing. First, the retrieved context dominates inference input cost — every answer ships several chunks of your documents into the model as input tokens, so retrieval tuning (returning fewer, better chunks) is a cost lever, not just a quality lever; prompt caching on stable instructions helps too. Second, the vector store is the cost you pay even when idle — for a small or bursty workload, a serverless Postgres/pgvector store is frequently cheaper than always-on managed search. At prototype scale the whole stack is typically single-digit to low-tens of dollars a month; it grows with corpus size and query volume.
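The cost stack lends itself to a back-of-the-envelope calculator. Everything numeric below is a PLACEHOLDER chosen to make the arithmetic easy to follow, not real AWS pricing; the class, function, and key names are hypothetical. The point of the sketch is the shape of the formula, in particular how the number of retrieved chunks multiplies into inference input cost.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    corpus_tokens: int        # tokens embedded at ingestion
    queries_per_month: int
    chunks_per_query: int     # chunks shipped into the model per answer
    tokens_per_chunk: int
    answer_tokens: int
    query_tokens: int = 20    # tokens embedded per question


# PLACEHOLDER prices in dollars; substitute values from the current pricing pages.
PRICES = {
    "embed_per_1k": 0.0001,
    "input_per_1k": 0.001,
    "output_per_1k": 0.005,
    "vector_store_monthly": 50.0,
}


def estimate(w: Workload, p: dict = PRICES) -> dict:
    """Split a month's cost into the layers described above (FM parsing and S3 omitted)."""
    embedded = w.corpus_tokens + w.queries_per_month * w.query_tokens
    context = w.queries_per_month * w.chunks_per_query * w.tokens_per_chunk
    return {
        "embeddings": embedded / 1000 * p["embed_per_1k"],
        "vector_store": p["vector_store_monthly"],
        "inference_input": context / 1000 * p["input_per_1k"],
        "inference_output": w.queries_per_month * w.answer_tokens / 1000 * p["output_per_1k"],
    }


base = Workload(10_000_000, 10_000, chunks_per_query=5, tokens_per_chunk=300, answer_tokens=200)
print(estimate(base))
```

Under these placeholder numbers the standing vector-store charge is the largest line, and changing `chunks_per_query` scales the inference input line proportionally, which is the retrieval-tuning lever described above.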
Which is exactly why so many teams build this on AWS credits and pay nothing out of pocket. Every layer here — embeddings, the vector store, FM parsing, and the generation inference — is credit-eligible and draws down your AWS credits automatically. The relevant pools are AWS Activate (commonly up to $100K for institutionally-funded startups), a dedicated Bedrock / generative-AI POC pool ($10K–$50K) aimed squarely at proving out a use case exactly like a RAG assistant, and the competitive Generative AI Accelerator (up to $1M). Most of these pools are partner-filed through the AWS Partner Network rather than a public form — which is the gap CloudRoute fills: we match you to the right pool for your stage and to a vetted AWS DevOps/ML partner who files the credit application and builds the Knowledge Base (data-source wiring, chunking and parsing tuning, vector-store selection, the Retrieve/RetrieveAndGenerate integration). The customer pays $0 — AWS funds the credits, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. See AWS credits for generative-AI startups and Bedrock POC funding for the full mechanics.
Cost stack: embeddings (per token) + vector store (standing capacity, usually the biggest line) + inference (per token on RetrieveAndGenerate) + FM parsing (per page, if used) + S3. All of it is AWS-credit-eligible, which is why the build can be $0 while you prove the workload out.
The vector store is the one component you pick and pay for directly, and it is the most consequential infrastructure decision in a Knowledge Base. Here is how the five supported options compare on the dimensions that actually drive the choice. Cost notes are representative as of 2026 — confirm current pricing on the relevant AWS or vendor pricing page.
| Vector store | Managed by | Setup effort | Cost shape | Standout strength | Pick it when |
|---|---|---|---|---|---|
| OpenSearch Serverless | AWS (Bedrock can auto-create) | Lowest — one click | Serverless capacity, baseline minimum | Zero-setup, auto-scaling | You want it to just work / most production |
| Aurora PostgreSQL (pgvector) | AWS (you run Aurora) | Low–medium | Aurora Serverless v2 scales low | Reuse your Postgres; cheap at low volume | You already run Postgres/Aurora |
| Pinecone | Pinecone (third-party) | Medium — external account | Pinecone pricing (pods/serverless) | Purpose-built vector DB at scale | You already use / prefer Pinecone |
| Redis Enterprise Cloud | Redis (third-party) | Medium — external account | Redis Enterprise pricing | Very low query latency | Latency-critical / existing Redis |
| Neptune Analytics | AWS | Medium | Neptune Analytics capacity | Vector + graph (GraphRAG) | Relationship-rich, connected data |
Situation: The team wanted a customer-facing support assistant that answered from their own documentation with citations — not a generic chatbot. Their content was split between a Confluence wiki and a pile of layout-heavy product PDFs in S3 (tables, diagrams, multi-column specs). An earlier DIY attempt with a hand-built pipeline had stalled: parsing the PDFs was producing garbage, retrieval was returning irrelevant chunks, and nobody owned operating the embedding/sync infrastructure. They also did not want to spend runway on inference and a vector database while still proving the feature out.
What CloudRoute did: CloudRoute matched them in under 24 hours to an EU AWS partner with RAG experience. The partner built it on managed Knowledge Bases: connected both the Confluence and S3 data sources to a single Knowledge Base; turned on FM parsing for the complex PDFs and hierarchical chunking so retrieval stayed precise while returning enough context; used Titan Text Embeddings into an Aurora pgvector store (the team already ran Postgres, keeping the standing cost low); shipped the assistant on RetrieveAndGenerate for one-call cited answers, with metadata filtering to scope answers by product line. In parallel, the partner filed a Bedrock POC credit application plus an Activate Portfolio application to fund the build.
Outcome: A cited, grounded support assistant was live in under three weeks, answering from the real corpus with source links — and the entire cost stack (embeddings, the Aurora vector store, RetrieveAndGenerate inference, FM parsing) was covered by the approved credits, so the team paid $0 during the build and early rollout. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.
corpus: ~40k docs across Confluence + S3 · time to live: < 3 weeks · credits secured: POC + Activate · out-of-pocket during build: $0
Whatever a Knowledge Base would cost — embeddings, the vector store, FM parsing, inference — AWS credits can cover it. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS partner to wire the data sources, tune chunking and parsing, pick the vector store, and ship the Retrieve/RetrieveAndGenerate integration. Customer pays $0.