A "chat with your documents" system answers questions from your own PDFs, contracts, manuals, and reports — with citations, so every answer points back to the page it came from. This is the full build guide for document Q&A on AWS: the document-centric pipeline end to end (ingest → parse → chunk → embed → store → retrieve → answer with citations), the parsing layer that decides everything (Amazon Bedrock Data Automation and Amazon Textract for scanned pages and tables), the two build paths — managed with Amazon Bedrock Knowledge Bases vs a DIY pipeline — how to keep answers accurate and grounded, how to enforce who can read which document, and what it actually costs.
A document Q&A system lets a person ask a plain-language question and get an answer drawn from a specific body of documents — a contract set, a policy library, a stack of research reports, a product manual — with a citation showing exactly where the answer came from. Under the hood it is retrieval-augmented generation (RAG), but the document corpus is what makes it distinct, and what makes it hard.
The promise is simple: instead of a human scrolling through 400 pages of supplier agreements to find a termination clause, they ask "what is the notice period to terminate the Acme contract?" and get a one-line answer with a link to clause 14.2 on page 37. A foundation model on its own cannot do this — it has never seen your contracts, and if you paste all 400 pages into a prompt you blow the context window, pay enormously, and still get vague answers. Document Q&A solves it by retrieving only the few passages that matter and asking the model to answer from those, with a citation.
What separates document Q&A from a generic chatbot is the corpus. Your documents are not clean Markdown. They are PDFs exported from Word, decades-old scanned agreements, spreadsheets with merged cells, slide decks, forms, and tables that span pages. The single biggest reason a document Q&A system gives bad answers is not the language model — it is that the document was parsed badly before it ever reached the model. A table read top-to-bottom as one run of text, a two-column page interleaved line by line, a scanned page that was never OCR-ed: each of these poisons retrieval at the source.
The second hard part is trust. For document Q&A to be useful in a business — legal, finance, compliance, support — every answer has to be verifiable. A confident-but-wrong answer about a contract term is worse than no answer. That is why citations are not a nice-to-have here; they are the product. The system must return the source document, page, and ideally the exact passage, so a human can confirm before acting.
The third is access control. The same system that answers "what does the Acme MSA say about liability?" must never surface a document the asker is not cleared to read. In a multi-tenant product, one customer must never retrieve another customer's files. This has to be enforced at retrieval time, not bolted on after the answer is written.
On AWS, each of these — parsing, retrieval, citations, access control — maps to a managed service, which is why AWS is a common home for document Q&A. The rest of this guide walks the pipeline stage by stage, starting with the architecture.
Document Q&A on AWS = parse your documents (including scans and tables) into clean, layout-aware text, retrieve the passages relevant to a question, and have a foundation model on Amazon Bedrock answer from them with citations back to the source page — so the answer is grounded, verifiable, and access-controlled.
Every document Q&A system — managed or DIY — runs the same six logical stages. The difference from a generic RAG pipeline is the weight that falls on stage two: parsing. Get the document-handling stages right and the rest is well-trodden retrieval engineering.
Split the six stages into two phases. The indexing phase (ingest → parse → chunk → embed → store) runs offline whenever documents are added or change. The query phase (retrieve → answer) runs in real time on every question. The table below maps each stage to the AWS service that typically implements it.
1. Ingest. Land raw documents in an Amazon S3 bucket — uploaded by users, synced from a document management system or SharePoint, or dropped by an upstream process. S3 is the staging area and the durable system of record for the originals, which you keep so citations can deep-link back to the real file.
2. Parse. Convert each document into clean, layout-aware text. This is the stage that decides document Q&A quality. Amazon Bedrock Data Automation extracts text, tables, and structure from PDFs, images, and documents in one managed step; Amazon Textract handles OCR for scanned pages and pulls tables and form key-value pairs out of documents that have no text layer. Parsing is covered in depth in section IV — it is the highest-leverage stage in the entire system.
3. Chunk. Split the parsed text into passages small enough to embed precisely but large enough to carry meaning — typically 300–800 tokens with 10–20% overlap. For documents, chunk on structure: respect section headings, keep a table with its caption, and carry the document title and page number in each chunk's metadata so the citation can name them later.
4. Embed. Convert each chunk into a vector with an embedding model on Bedrock — Amazon Titan Text Embeddings v2 or Cohere Embed. The model and its dimensionality are fixed for the life of the index; changing the embedding model means re-embedding every document.
5. Store. Write each vector plus its source text and metadata (document ID, title, page, ACL tags, timestamps) into a vector store with an approximate-nearest-neighbour index — most commonly Amazon OpenSearch Serverless, or Aurora PostgreSQL with pgvector if you already run Postgres.
6a. Retrieve. Embed the incoming question with the same embedding model and query the vector store for the most relevant chunks, applying an access-control metadata filter so the user only ever retrieves documents they are entitled to. Re-ranking the top candidates with Amazon Rerank or Cohere Rerank sharpens precision before the model sees them.
6b. Answer with citations. Build the prompt — a system instruction that says "answer only from the supplied passages, and say you do not know if they do not contain the answer" + the retrieved chunks + the question — and send it to a generation model on Bedrock (Claude, Amazon Nova, Llama, Mistral). Return the answer together with the citations carried in each chunk's metadata, so the UI can show "Source: Acme_MSA.pdf, p.37" and link to the original. A Bedrock Guardrail screens both the question and the answer.
| Stage | Phase | What it does | Typical AWS service |
|---|---|---|---|
| 1. Ingest | Indexing (offline) | Land + store original documents | Amazon S3 |
| 2. Parse | Indexing (offline) | Documents → clean, layout-aware text + tables | Bedrock Data Automation / Amazon Textract |
| 3. Chunk | Indexing (offline) | Split into passages, keep page + title metadata | Bedrock KB built-in, or Lambda/Glue (DIY) |
| 4. Embed | Indexing (offline) | Text → vectors | Titan Text Embeddings v2 / Cohere Embed (Bedrock) |
| 5. Store | Indexing (offline) | Index vectors + metadata for ANN search | OpenSearch Serverless / Aurora pgvector |
| 6. Retrieve + answer | Query (real time) | Find relevant chunks, answer with citations | Bedrock Retrieve / RetrieveAndGenerate (+ Guardrails) |
The first real decision for document Q&A is not which model or vector store — it is whether to let Amazon Bedrock Knowledge Bases run the pipeline, or to assemble it yourself. For document Q&A specifically, the managed path is unusually strong because it now handles the hardest stage — parsing — and returns citations natively.
The honest framing: start managed, move to DIY only when a specific document requirement forces it. Bedrock Knowledge Bases was practically built for document Q&A — point it at an S3 bucket of PDFs, and it parses (with Bedrock Data Automation as the parser), chunks, embeds, indexes, retrieves, and returns cited answers. Most internal-knowledge, policy, and contract-lookup use cases never need more than that. Teams typically reach for DIY when their documents are unusually structured (dense financial tables, scientific papers, forms) and they want bespoke parsing and chunking, or when access control is more granular than metadata filtering can express.
You point a knowledge base at an S3 bucket (or a connector like SharePoint, Confluence, Salesforce, or a web crawler), choose how documents are parsed — default parsing, Bedrock Data Automation, or a foundation-model parser for complex layouts — pick an embedding model and a vector store, and Bedrock handles ingestion, chunking, embedding, indexing, retrieval, and optional re-ranking. You then call Retrieve (get relevant passages) or RetrieveAndGenerate (get a cited answer in one call). It syncs incrementally when documents change, returns source citations automatically, and integrates with Bedrock Guardrails.
Choose managed when: you want a cited answer engine in hours not weeks; your documents live in S3 or a supported connector; built-in parsing (including Data Automation for tables and scans) and fixed/semantic/hierarchical chunking are good enough; and access control fits metadata filtering. This covers the large majority of document Q&A use cases.
In the DIY path you still call Bedrock for embeddings and generation, but you own every stage: your own parsing (calling Textract or Bedrock Data Automation directly, with custom post-processing of tables and layout), your own chunker (Lambda, AWS Glue, or Step Functions), direct writes to your chosen vector store, your own retrieval and re-ranking, and your own prompt assembly and citation formatting. Orchestration frameworks like LangChain or LlamaIndex are common but optional.
Choose DIY when: your documents need parsing logic beyond the presets (e.g. extracting line items from invoices, preserving deeply nested tables, or splitting scientific papers by section and figure); you need table-aware or document-structure-aware chunking the managed path doesn't express; you require strict per-document or row-level access control beyond metadata filters; or you are squeezing cost and latency hard enough that the managed convenience premium matters. DIY costs more engineering time and ongoing maintenance — that is the trade.
Prototype on Bedrock Knowledge Bases with Data Automation parsing to get cited answers over your real documents fast, and find out where parsing struggles on your hardest files. Graduate only the stages that need it to DIY — usually parsing and chunking for unusual document types — when a concrete requirement forces it. Many production document Q&A systems are a hybrid: managed KB for the bulk corpus, a DIY parsing path for one demanding document class.
If the right text never makes it cleanly out of the document, no embedding model, vector store, or LLM can recover it. Parsing is to document Q&A what retrieval is to general RAG: the dominant source of quality, and the first place to look when answers are wrong. AWS gives you two purpose-built tools — Bedrock Data Automation and Amazon Textract — plus foundation-model parsing for the hardest layouts.
The failure modes are specific and predictable. A scanned PDF (an image of a page, no text layer) returns nothing at all from a naive text extractor — it must be OCR-ed first. A table read as a flat run of text loses the row/column relationships, so "the 2025 figure for EMEA" becomes unanswerable because the number is no longer associated with its row and column. A multi-column page read left-to-right across both columns interleaves two unrelated streams of text into nonsense. A form loses the link between a field label and its value. Each of these is invisible until someone asks a question that depends on the mangled content and gets a confidently wrong answer.
Amazon Bedrock Data Automation is a managed service that turns unstructured content — documents, images, audio, and video — into structured output with a single API, and it is the parser Bedrock Knowledge Bases can use for ingestion. For documents it extracts text while preserving layout, pulls out tables and figures, and can return structured fields, which is exactly what document Q&A needs: clean, layout-aware text where a table stays a table and a heading stays a heading. Because it is managed and integrated, it is the right default — you get good parsing without building or operating an OCR and layout-analysis pipeline yourself.
Amazon Textract is AWS's document-text-and-data extraction service. It performs OCR on scanned pages and images (recovering text from documents that have no text layer at all), and crucially it has dedicated capabilities for tables (preserving cell, row, and column structure) and forms (extracting key-value pairs like "Invoice number: 10432"). Reach for Textract directly in a DIY pipeline when you need that structured table and form output, when your corpus is heavily scanned, or when you want fine-grained control over how extracted tables are serialized into the text that gets embedded.
For documents where structure carries the meaning and standard extraction still struggles — dense financial statements, complex scientific papers, intricate forms — a multimodal foundation model can parse the page directly, "reading" the layout the way a person would. Bedrock Knowledge Bases offers an FM-based parsing option for exactly these cases, and in a DIY pipeline you can call a vision-capable model on Bedrock to convert a page image into clean structured text or Markdown. It costs more per page than Textract or Data Automation, so use it selectively for the document classes that need it rather than across the whole corpus.
Extraction is only half the job; how you turn an extracted table into text the model can reason over matters just as much. Two patterns work well. Markdown tables keep rows and columns aligned in the chunk so the model can read across them. Row-as-sentence serialization — rendering each row as "For EMEA in 2025, revenue was $4.2M" — embeds far better for lookup-style questions because each fact becomes its own retrievable statement. Keep the table's caption and surrounding heading in the same chunk so a question like "what were EMEA sales?" can find it.
| Tool | Best for | Tables | Scans / OCR | Where it fits |
|---|---|---|---|---|
| Bedrock Data Automation | General document parsing, managed | Yes (layout-aware) | Yes | Default parser in Bedrock KB; managed pipelines |
| Amazon Textract | OCR, tables, forms with structured output | Yes (cell/row/column) | Yes (core strength) | DIY pipelines; scan-heavy or form-heavy corpora |
| FM parsing (multimodal model) | Complex layouts where structure is meaning | Yes (reads visually) | Yes | Hardest document classes; use selectively (higher cost) |
| Naive text extraction | Clean, born-digital text-only docs | Poor | No | Only simple text PDFs — avoid for real corpora |
For document Q&A, an uncited answer is a liability. The whole point is that a person can check the source before acting on a contract term, a policy, or a financial figure. Two things make the system trustworthy: citations that point to the real source, and grounding discipline that stops the model from inventing answers the documents do not support.
Citations come almost for free if the pipeline carries the right metadata. Because each chunk stored at index time includes its document ID, title, and page number, the answer can name and link them. Bedrock Knowledge Bases returns citations natively from RetrieveAndGenerate — each cited span maps back to the source chunk and document — and a DIY pipeline produces the same result by attaching the metadata of every chunk it passed into the prompt to the response. The UI then shows "Source: Acme_MSA.pdf, p.37" and deep-links to the original in S3.
The generation prompt is where accuracy is enforced. Instruct the model explicitly to answer only from the supplied passages and to say "I could not find this in the documents" when the retrieved context does not contain the answer — an empty-but-honest answer is vastly better than a confident fabrication about a legal clause. Pass only re-ranked, high-precision chunks so the model is not distracted by loosely related passages. Where exact wording matters (contracts, policies), ask the model to quote the relevant sentence and cite it, rather than paraphrase.
Attach a Bedrock Guardrail to screen both the question and the answer. Guardrails can filter harmful content, block disallowed topics, and redact or block sensitive data (PII), and they include contextual-grounding checks that flag answers not supported by the retrieved source — a second line of defence against hallucination on top of the prompt discipline. For document Q&A over sensitive corpora (HR files, medical records, financial documents) the PII and grounding controls are especially valuable.
Build a fixed evaluation set: 50–200 real questions paired with the correct answer and the source passage that contains it. Score faithfulness (does the answer follow from the cited passages, or did the model add unsupported claims?), answer relevance (does it address the question asked?), and context precision/recall (did retrieval surface the right passages?). Amazon Bedrock includes RAG evaluation in its model-evaluation suite — supply a dataset and it runs an LLM-as-a-judge to score retrieval and response quality — and open-source frameworks like Ragas do the same for DIY pipelines. Run the set on every change so a new parser or chunk size proves itself with a number instead of a demo.
A trustworthy document Q&A answer: cites its source document and page · is grounded (the model was told to answer only from context and to admit when it cannot) · passes through a Guardrail with grounding + PII checks · and is backed by a golden evaluation set scoring faithfulness and relevance. Citations without grounding still hallucinate; grounding without citations cannot be verified. You need both.
Document Q&A almost always runs over documents with different audiences: HR files only HR may read, a customer's contracts only that customer's team, board materials only executives. The cardinal rule is that access control lives in retrieval, not in a post-filter on the answer — by the time the model has written an answer, restricted content has already been used.
The pattern is the same on both build paths. Tag every chunk with ACL metadata at index time — user, group, role, tenant, document classification — derived from the source document's permissions. Then apply a metadata filter on every query so a user can only ever retrieve chunks they are entitled to. Bedrock Knowledge Bases supports metadata filtering on retrieval; DIY stores express the same with OpenSearch filters, pgvector SQL predicates, or Pinecone metadata filters. The filter must be derived from the authenticated user's identity on the server, never from anything the client can set.
For multi-tenant SaaS, isolate tenants at minimum with a per-tenant filter on every query so one customer can never retrieve another's documents; for hard isolation, give each tenant a separate index or knowledge base so there is no shared surface to misconfigure. When document permissions change — a person leaves a team, a file is reclassified — the ACL metadata in the index must be updated too, so re-sync permission changes, not just content changes. A document that was readable yesterday and restricted today must drop out of that user's retrievable set immediately.
Two more controls round it out. Keep the original documents in S3 behind IAM and bucket policies so the deep-link in a citation is itself authorization-checked — a user who is shown a citation should still be blocked at S3 if they are not entitled to open the file. And log every query with the user identity, the filter applied, and the documents retrieved, so access can be audited after the fact. In regulated settings this audit trail is often a hard requirement, not an option.
Enforce access at retrieval, never as a post-filter on the answer. Tag chunks with ACL metadata at index time, derive the query filter from the authenticated user on the server, isolate tenants (per-tenant filter or separate index), re-sync permission changes as well as content changes, and protect the original files in S3 with IAM so even the citation link is authorization-checked.
Here is the fastest credible path from zero to a cited, production-leaning document Q&A system on AWS using Bedrock Knowledge Bases. The DIY path follows the same logical order with each stage hand-built — most often diverging at parsing and chunking.
A document Q&A bill has six line items. The one that surprises teams coming from generic RAG is parsing — paid per page at index time — which can dominate the upfront cost for a large or scan-heavy corpus. Here is the full stack and the lever on each.
The figures below are representative as of 2026 to show the shape of the bill, not a quote — always check the AWS pricing page (and any third-party vendor) for current rates. Upfront, parsing and embedding dominate (both scale with corpus size and run mostly once); at steady state, generation tokens and the always-on vector-store baseline dominate.
| Cost line | When you pay | Driver | Main lever to control it |
|---|---|---|---|
| Parsing | One-time per document + on updates | Pages parsed × method | Use Data Automation/Textract by default; reserve costly FM parsing for hard docs; parse changed pages only |
| Embeddings (indexing) | One-time per corpus + on updates | Total tokens embedded | Chunk size; smaller embedding dimensions; only re-embed changed documents |
| Vector store | Continuous (baseline) | Corpus size + index type + engine | Right-size the engine; pgvector if Postgres already runs; tune dimensions |
| Query embeddings | Per query | Question volume | Negligible per call; cache embeddings for repeated questions |
| Re-ranking | Per query | Candidates re-ranked × queries | Re-rank top-30/50, not top-500; skip on trivial queries |
| Generation | Per query (usually the largest at steady state) | Input + output tokens × model price | Cheaper model for easy questions; fewer chunks; prompt caching; tight max-tokens |
This is the comparison that decides your architecture. Read it as "default to managed; move a stage to DIY only when a row in the right column is a hard requirement for your documents." For document Q&A the managed path is strong because it now owns parsing and citations.
| Dimension | Bedrock Knowledge Bases (managed) | DIY (Bedrock + your stack) |
|---|---|---|
| Time to first cited answer | Hours — point at S3, sync, call an API | Days to weeks — build every stage |
| Document parsing | Built-in: default / Data Automation / FM parsing | Your own Textract / Data Automation calls + custom post-processing |
| Pipeline you maintain | Almost none — AWS runs parse→retrieve | All of it — parser, chunker, retriever, re-ranker |
| Chunking control | Fixed / semantic / hierarchical presets | Anything — table-aware, document-structure-aware |
| Citations | Returned natively from RetrieveAndGenerate | You assemble from passed-chunk metadata |
| Access control | Metadata filtering | Anything — row-level, per-tenant, external policy engine |
| Cost control | Less granular; convenience premium | Maximum — tune parsing and every stage |
| Best for | Most contract / policy / knowledge-lookup use cases | Unusual document types, bespoke parsing, strict multi-tenancy, cost-squeeze |
Situation: The operations team spent hours hunting through PDF agreements for specific clauses — notice periods, fee schedules, liability caps — and many of the oldest contracts were scanned images with no text layer. They wanted a "chat with our contracts" tool that answered with a citation to the exact page, never mixed one counterparty's documents with another's, and could prove where every answer came from for compliance. A first internal attempt returned garbage on scanned files and tables and had no access-control story, and the two engineers who could fix it were committed to the core lending product.
What CloudRoute did: Routed within 24 hours to a US-region AWS partner with a GenAI/ML and document-processing track record. The partner scoped a Bedrock Knowledge Bases build in us-east-1: S3 ingestion of the contract corpus, Bedrock Data Automation plus Amazon Textract for parsing scanned pages and preserving fee and rate tables, hierarchical chunking that kept clause numbers and page numbers in metadata, Titan v2 embeddings, OpenSearch Serverless as the vector store, Cohere Rerank for precision, Claude for grounded generation with quote-and-cite prompting, per-counterparty metadata filtering for isolation, a Bedrock Guardrail with grounding and PII checks, and a 150-question golden set scored with Bedrock RAG evaluation. The whole engagement was funded by AWS credits the partner filed for — Activate Portfolio plus a Bedrock POC allocation.
Outcome: A cited contract-Q&A assistant in production in about 6 weeks. Scanned agreements and fee tables parsed cleanly; faithfulness and context-precision scores cleared the team's bar on the golden set; per-counterparty isolation was enforced at retrieval and every answer deep-linked to the source page for audit. The build and the first months of inference ran on AWS credits — the customer paid $0. CloudRoute's commission was paid by the partner from AWS engagement funding.
engagement window: ~6 weeks · founder time: ~7 hours · stack: Bedrock KB + Data Automation + Textract + OpenSearch Serverless + Titan v2 + Cohere Rerank + Claude · cost to customer: $0
CloudRoute routes you to a vetted AWS GenAI/ML partner who designs and ships the pipeline — Bedrock Knowledge Bases or a custom DIY stack, document parsing for scans and tables, the right vector store, embeddings, re-ranking, citations, access control, Guardrails, and evaluation. AWS credits fund the build and the inference. You pay $0.