document q&a on aws · the 2026 build guide

How to build a document Q&A system on AWS (2026).

A "chat with your documents" system answers questions from your own PDFs, contracts, manuals, and reports — with citations, so every answer points back to the page it came from. This is the full build guide for document Q&A on AWS: the document-centric pipeline end to end (ingest → parse → chunk → embed → store → retrieve → answer with citations), the parsing layer that decides everything (Amazon Bedrock Data Automation and Amazon Textract for scanned pages and tables), the two build paths — managed with Amazon Bedrock Knowledge Bases vs a DIY pipeline — how to keep answers accurate and grounded, how to enforce who can read which document, and what it actually costs.

pipeline stages
6
build paths
2
managed RAG service
Bedrock KB
credits to fund it
up to $100K
TL;DR
  • A document Q&A system is retrieval-augmented generation pointed at a document corpus: you parse each file into clean text, embed it into a vector store, retrieve the passages most relevant to a question, and have a foundation model answer from those passages with citations back to the source page. On AWS the pipeline is ingest → parse → chunk → embed → store → retrieve → answer.
  • The make-or-break stage is parsing, not the model. Real documents are PDFs, scans, and spreadsheets full of tables, multi-column layouts, and images — naive text extraction silently mangles them. Amazon Bedrock Data Automation and Amazon Textract turn messy documents (including scanned pages and tables) into structured, layout-aware text that retrieval can actually use; get this wrong and every downstream answer is wrong.
  • Two ways to build it. Managed: Amazon Bedrock Knowledge Bases runs chunking, embedding, storage, retrieval, and citations for you — point it at an S3 bucket of documents and you get a cited RetrieveAndGenerate API in hours, with Bedrock Guardrails for safety. DIY: orchestrate the stages yourself when you need custom parsing, table-aware chunking, or strict per-document access control. GenAI inference and document parsing bills add up fast; CloudRoute routes you to AWS credits (Activate Portfolio up to $100K, Bedrock/GenAI POC $10K–$50K, GenAI Accelerator up to $1M) and vetted ML partners who build it — you pay $0.
the use case

IWhat a document Q&A system is — and why it is harder than it looks

A document Q&A system lets a person ask a plain-language question and get an answer drawn from a specific body of documents — a contract set, a policy library, a stack of research reports, a product manual — with a citation showing exactly where the answer came from. Under the hood it is retrieval-augmented generation (RAG), but the document corpus is what makes it distinct, and what makes it hard.

The promise is simple: instead of a human scrolling through 400 pages of supplier agreements to find a termination clause, they ask "what is the notice period to terminate the Acme contract?" and get a one-line answer with a link to clause 14.2 on page 37. A foundation model on its own cannot do this — it has never seen your contracts, and if you paste all 400 pages into a prompt you blow the context window, pay enormously, and still get vague answers. Document Q&A solves it by retrieving only the few passages that matter and asking the model to answer from those, with a citation.

What separates document Q&A from a generic chatbot is the corpus. Your documents are not clean Markdown. They are PDFs exported from Word, decades-old scanned agreements, spreadsheets with merged cells, slide decks, forms, and tables that span pages. The single biggest reason a document Q&A system gives bad answers is not the language model — it is that the document was parsed badly before it ever reached the model. A table read top-to-bottom as one run of text, a two-column page interleaved line by line, a scanned page that was never OCR-ed: each of these poisons retrieval at the source.

The second hard part is trust. For document Q&A to be useful in a business — legal, finance, compliance, support — every answer has to be verifiable. A confident-but-wrong answer about a contract term is worse than no answer. That is why citations are not a nice-to-have here; they are the product. The system must return the source document, page, and ideally the exact passage, so a human can confirm before acting.

The third is access control. The same system that answers "what does the Acme MSA say about liability?" must never surface a document the asker is not cleared to read. In a multi-tenant product, one customer must never retrieve another customer's files. This has to be enforced at retrieval time, not bolted on after the answer is written.

On AWS, each of these — parsing, retrieval, citations, access control — maps to a managed service, which is why AWS is a common home for document Q&A. The rest of this guide walks the pipeline stage by stage, starting with the architecture.

the one-sentence definition

Document Q&A on AWS = parse your documents (including scans and tables) into clean, layout-aware text, retrieve the passages relevant to a question, and have a foundation model on Amazon Bedrock answer from them with citations back to the source page — so the answer is grounded, verifiable, and access-controlled.

end to end

IIThe document Q&A architecture on AWS, stage by stage

Every document Q&A system — managed or DIY — runs the same six logical stages. The difference from a generic RAG pipeline is the weight that falls on stage two: parsing. Get the document-handling stages right and the rest is well-trodden retrieval engineering.

Split the six stages into two phases. The indexing phase (ingest → parse → chunk → embed → store) runs offline whenever documents are added or change. The query phase (retrieve → answer) runs in real time on every question. The table below maps each stage to the AWS service that typically implements it.

Indexing phase — ingest, parse, chunk, embed, store

1. Ingest. Land raw documents in an Amazon S3 bucket — uploaded by users, synced from a document management system or SharePoint, or dropped by an upstream process. S3 is the staging area and the durable system of record for the originals, which you keep so citations can deep-link back to the real file.

2. Parse. Convert each document into clean, layout-aware text. This is the stage that decides document Q&A quality. Amazon Bedrock Data Automation extracts text, tables, and structure from PDFs, images, and documents in one managed step; Amazon Textract handles OCR for scanned pages and pulls tables and form key-value pairs out of documents that have no text layer. Parsing is covered in depth in section IV — it is the highest-leverage stage in the entire system.

3. Chunk. Split the parsed text into passages small enough to embed precisely but large enough to carry meaning — typically 300–800 tokens with 10–20% overlap. For documents, chunk on structure: respect section headings, keep a table with its caption, and carry the document title and page number in each chunk's metadata so the citation can name them later.

4. Embed. Convert each chunk into a vector with an embedding model on Bedrock — Amazon Titan Text Embeddings v2 or Cohere Embed. The model and its dimensionality are fixed for the life of the index; changing the embedding model means re-embedding every document.

5. Store. Write each vector plus its source text and metadata (document ID, title, page, ACL tags, timestamps) into a vector store with an approximate-nearest-neighbour index — most commonly Amazon OpenSearch Serverless, or Aurora PostgreSQL with pgvector if you already run Postgres.

Query phase — retrieve, answer with citations

6a. Retrieve. Embed the incoming question with the same embedding model and query the vector store for the most relevant chunks, applying an access-control metadata filter so the user only ever retrieves documents they are entitled to. Re-ranking the top candidates with Amazon Rerank or Cohere Rerank sharpens precision before the model sees them.

6b. Answer with citations. Build the prompt — a system instruction that says "answer only from the supplied passages, and say you do not know if they do not contain the answer" + the retrieved chunks + the question — and send it to a generation model on Bedrock (Claude, Amazon Nova, Llama, Mistral). Return the answer together with the citations carried in each chunk's metadata, so the UI can show "Source: Acme_MSA.pdf, p.37" and link to the original. A Bedrock Guardrail screens both the question and the answer.

the document Q&A stages mapped to AWS services · representative as of 2026
StagePhaseWhat it doesTypical AWS service
1. IngestIndexing (offline)Land + store original documentsAmazon S3
2. ParseIndexing (offline)Documents → clean, layout-aware text + tablesBedrock Data Automation / Amazon Textract
3. ChunkIndexing (offline)Split into passages, keep page + title metadataBedrock KB built-in, or Lambda/Glue (DIY)
4. EmbedIndexing (offline)Text → vectorsTitan Text Embeddings v2 / Cohere Embed (Bedrock)
5. StoreIndexing (offline)Index vectors + metadata for ANN searchOpenSearch Serverless / Aurora pgvector
6. Retrieve + answerQuery (real time)Find relevant chunks, answer with citationsBedrock Retrieve / RetrieveAndGenerate (+ Guardrails)
Bedrock Knowledge Bases collapses stages 3–6 behind two API calls (Retrieve and RetrieveAndGenerate) and can run the parsing in stage 2 via Bedrock Data Automation. A DIY pipeline implements each stage yourself for control. Both paths use the same Bedrock embedding + generation models and both can return citations.
the central decision

IIITwo paths: managed (Bedrock Knowledge Bases) vs DIY

The first real decision for document Q&A is not which model or vector store — it is whether to let Amazon Bedrock Knowledge Bases run the pipeline, or to assemble it yourself. For document Q&A specifically, the managed path is unusually strong because it now handles the hardest stage — parsing — and returns citations natively.

The honest framing: start managed, move to DIY only when a specific document requirement forces it. Bedrock Knowledge Bases was practically built for document Q&A — point it at an S3 bucket of PDFs, and it parses (with Bedrock Data Automation as the parser), chunks, embeds, indexes, retrieves, and returns cited answers. Most internal-knowledge, policy, and contract-lookup use cases never need more than that. Teams typically reach for DIY when their documents are unusually structured (dense financial tables, scientific papers, forms) and they want bespoke parsing and chunking, or when access control is more granular than metadata filtering can express.

Path A — Amazon Bedrock Knowledge Bases (managed)

You point a knowledge base at an S3 bucket (or a connector like SharePoint, Confluence, Salesforce, or a web crawler), choose how documents are parsed — default parsing, Bedrock Data Automation, or a foundation-model parser for complex layouts — pick an embedding model and a vector store, and Bedrock handles ingestion, chunking, embedding, indexing, retrieval, and optional re-ranking. You then call Retrieve (get relevant passages) or RetrieveAndGenerate (get a cited answer in one call). It syncs incrementally when documents change, returns source citations automatically, and integrates with Bedrock Guardrails.

Choose managed when: you want a cited answer engine in hours not weeks; your documents live in S3 or a supported connector; built-in parsing (including Data Automation for tables and scans) and fixed/semantic/hierarchical chunking are good enough; and access control fits metadata filtering. This covers the large majority of document Q&A use cases.

Path B — DIY pipeline (Bedrock + your own stack)

In the DIY path you still call Bedrock for embeddings and generation, but you own every stage: your own parsing (calling Textract or Bedrock Data Automation directly, with custom post-processing of tables and layout), your own chunker (Lambda, AWS Glue, or Step Functions), direct writes to your chosen vector store, your own retrieval and re-ranking, and your own prompt assembly and citation formatting. Orchestration frameworks like LangChain or LlamaIndex are common but optional.

Choose DIY when: your documents need parsing logic beyond the presets (e.g. extracting line items from invoices, preserving deeply nested tables, or splitting scientific papers by section and figure); you need table-aware or document-structure-aware chunking the managed path doesn't express; you require strict per-document or row-level access control beyond metadata filters; or you are squeezing cost and latency hard enough that the managed convenience premium matters. DIY costs more engineering time and ongoing maintenance — that is the trade.

the pragmatic rule

Prototype on Bedrock Knowledge Bases with Data Automation parsing to get cited answers over your real documents fast, and find out where parsing struggles on your hardest files. Graduate only the stages that need it to DIY — usually parsing and chunking for unusual document types — when a concrete requirement forces it. Many production document Q&A systems are a hybrid: managed KB for the bulk corpus, a DIY parsing path for one demanding document class.

where it is won or lost

IVParsing documents — tables, scans, and layout (the stage that decides everything)

If the right text never makes it cleanly out of the document, no embedding model, vector store, or LLM can recover it. Parsing is to document Q&A what retrieval is to general RAG: the dominant source of quality, and the first place to look when answers are wrong. AWS gives you two purpose-built tools — Bedrock Data Automation and Amazon Textract — plus foundation-model parsing for the hardest layouts.

The failure modes are specific and predictable. A scanned PDF (an image of a page, no text layer) returns nothing at all from a naive text extractor — it must be OCR-ed first. A table read as a flat run of text loses the row/column relationships, so "the 2025 figure for EMEA" becomes unanswerable because the number is no longer associated with its row and column. A multi-column page read left-to-right across both columns interleaves two unrelated streams of text into nonsense. A form loses the link between a field label and its value. Each of these is invisible until someone asks a question that depends on the mangled content and gets a confidently wrong answer.

Amazon Bedrock Data Automation — the managed default

Amazon Bedrock Data Automation is a managed service that turns unstructured content — documents, images, audio, and video — into structured output with a single API, and it is the parser Bedrock Knowledge Bases can use for ingestion. For documents it extracts text while preserving layout, pulls out tables and figures, and can return structured fields, which is exactly what document Q&A needs: clean, layout-aware text where a table stays a table and a heading stays a heading. Because it is managed and integrated, it is the right default — you get good parsing without building or operating an OCR and layout-analysis pipeline yourself.

Amazon Textract — OCR, tables, and forms

Amazon Textract is AWS's document-text-and-data extraction service. It performs OCR on scanned pages and images (recovering text from documents that have no text layer at all), and crucially it has dedicated capabilities for tables (preserving cell, row, and column structure) and forms (extracting key-value pairs like "Invoice number: 10432"). Reach for Textract directly in a DIY pipeline when you need that structured table and form output, when your corpus is heavily scanned, or when you want fine-grained control over how extracted tables are serialized into the text that gets embedded.

Foundation-model parsing for the hardest layouts

For documents where structure carries the meaning and standard extraction still struggles — dense financial statements, complex scientific papers, intricate forms — a multimodal foundation model can parse the page directly, "reading" the layout the way a person would. Bedrock Knowledge Bases offers an FM-based parsing option for exactly these cases, and in a DIY pipeline you can call a vision-capable model on Bedrock to convert a page image into clean structured text or Markdown. It costs more per page than Textract or Data Automation, so use it selectively for the document classes that need it rather than across the whole corpus.

Serializing tables so retrieval can use them

Extraction is only half the job; how you turn an extracted table into text the model can reason over matters just as much. Two patterns work well. Markdown tables keep rows and columns aligned in the chunk so the model can read across them. Row-as-sentence serialization — rendering each row as "For EMEA in 2025, revenue was $4.2M" — embeds far better for lookup-style questions because each fact becomes its own retrievable statement. Keep the table's caption and surrounding heading in the same chunk so a question like "what were EMEA sales?" can find it.

document parsing options on AWS · representative as of 2026
ToolBest forTablesScans / OCRWhere it fits
Bedrock Data AutomationGeneral document parsing, managedYes (layout-aware)YesDefault parser in Bedrock KB; managed pipelines
Amazon TextractOCR, tables, forms with structured outputYes (cell/row/column)Yes (core strength)DIY pipelines; scan-heavy or form-heavy corpora
FM parsing (multimodal model)Complex layouts where structure is meaningYes (reads visually)YesHardest document classes; use selectively (higher cost)
Naive text extractionClean, born-digital text-only docsPoorNoOnly simple text PDFs — avoid for real corpora
Default to Bedrock Data Automation for managed builds; use Textract directly when you need structured table/form output or heavy OCR; reserve FM-based parsing for the document classes where standard extraction still fails. Naive extraction is the most common silent cause of bad document Q&A — do not ship it for scans or tables.
making it trustworthy

VCitations and accuracy — making every answer verifiable

For document Q&A, an uncited answer is a liability. The whole point is that a person can check the source before acting on a contract term, a policy, or a financial figure. Two things make the system trustworthy: citations that point to the real source, and grounding discipline that stops the model from inventing answers the documents do not support.

Citations come almost for free if the pipeline carries the right metadata. Because each chunk stored at index time includes its document ID, title, and page number, the answer can name and link them. Bedrock Knowledge Bases returns citations natively from RetrieveAndGenerate — each cited span maps back to the source chunk and document — and a DIY pipeline produces the same result by attaching the metadata of every chunk it passed into the prompt to the response. The UI then shows "Source: Acme_MSA.pdf, p.37" and deep-links to the original in S3.

Grounding the model so it does not invent answers

The generation prompt is where accuracy is enforced. Instruct the model explicitly to answer only from the supplied passages and to say "I could not find this in the documents" when the retrieved context does not contain the answer — an empty-but-honest answer is vastly better than a confident fabrication about a legal clause. Pass only re-ranked, high-precision chunks so the model is not distracted by loosely related passages. Where exact wording matters (contracts, policies), ask the model to quote the relevant sentence and cite it, rather than paraphrase.

Bedrock Guardrails

Attach a Bedrock Guardrail to screen both the question and the answer. Guardrails can filter harmful content, block disallowed topics, and redact or block sensitive data (PII), and they include contextual-grounding checks that flag answers not supported by the retrieved source — a second line of defence against hallucination on top of the prompt discipline. For document Q&A over sensitive corpora (HR files, medical records, financial documents) the PII and grounding controls are especially valuable.

Measuring accuracy before you trust it

Build a fixed evaluation set: 50–200 real questions paired with the correct answer and the source passage that contains it. Score faithfulness (does the answer follow from the cited passages, or did the model add unsupported claims?), answer relevance (does it address the question asked?), and context precision/recall (did retrieval surface the right passages?). Amazon Bedrock includes RAG evaluation in its model-evaluation suite — supply a dataset and it runs an LLM-as-a-judge to score retrieval and response quality — and open-source frameworks like Ragas do the same for DIY pipelines. Run the set on every change so a new parser or chunk size proves itself with a number instead of a demo.

the trust checklist

A trustworthy document Q&A answer: cites its source document and page · is grounded (the model was told to answer only from context and to admit when it cannot) · passes through a Guardrail with grounding + PII checks · and is backed by a golden evaluation set scoring faithfulness and relevance. Citations without grounding still hallucinate; grounding without citations cannot be verified. You need both.

who can read what

VIAccess control — making sure users only see documents they are allowed to

Document Q&A almost always runs over documents with different audiences: HR files only HR may read, a customer's contracts only that customer's team, board materials only executives. The cardinal rule is that access control lives in retrieval, not in a post-filter on the answer — by the time the model has written an answer, restricted content has already been used.

The pattern is the same on both build paths. Tag every chunk with ACL metadata at index time — user, group, role, tenant, document classification — derived from the source document's permissions. Then apply a metadata filter on every query so a user can only ever retrieve chunks they are entitled to. Bedrock Knowledge Bases supports metadata filtering on retrieval; DIY stores express the same with OpenSearch filters, pgvector SQL predicates, or Pinecone metadata filters. The filter must be derived from the authenticated user's identity on the server, never from anything the client can set.

For multi-tenant SaaS, isolate tenants at minimum with a per-tenant filter on every query so one customer can never retrieve another's documents; for hard isolation, give each tenant a separate index or knowledge base so there is no shared surface to misconfigure. When document permissions change — a person leaves a team, a file is reclassified — the ACL metadata in the index must be updated too, so re-sync permission changes, not just content changes. A document that was readable yesterday and restricted today must drop out of that user's retrievable set immediately.

Two more controls round it out. Keep the original documents in S3 behind IAM and bucket policies so the deep-link in a citation is itself authorization-checked — a user who is shown a citation should still be blocked at S3 if they are not entitled to open the file. And log every query with the user identity, the filter applied, and the documents retrieved, so access can be audited after the fact. In regulated settings this audit trail is often a hard requirement, not an option.

the access-control rule

Enforce access at retrieval, never as a post-filter on the answer. Tag chunks with ACL metadata at index time, derive the query filter from the authenticated user on the server, isolate tenants (per-tenant filter or separate index), re-sync permission changes as well as content changes, and protect the original files in S3 with IAM so even the citation link is authorization-checked.

the build, in order

VIIA step-by-step build outline (managed path)

Here is the fastest credible path from zero to a cited, production-leaning document Q&A system on AWS using Bedrock Knowledge Bases. The DIY path follows the same logical order with each stage hand-built — most often diverging at parsing and chunking.

  • Step 1 — Stage your documents in S3 — Land your PDFs, scans, and office files in an S3 bucket and keep the originals — citations will deep-link back to them. Note which documents are scanned or table-heavy; those will lean on parsing the hardest.
  • Step 2 — Enable Bedrock model access — In the Bedrock console, request access to an embedding model (Titan Text Embeddings v2 or Cohere Embed) and a generation model (Claude, Nova, Llama, or Mistral) in your chosen Region.
  • Step 3 — Create the Knowledge Base with the right parser — Create a Bedrock Knowledge Base, point it at the S3 bucket, and choose the parsing option — Bedrock Data Automation for tables and scans, or FM-based parsing for the most complex layouts. Pick the embedding model and a vector store (OpenSearch Serverless is the default). Choose semantic or hierarchical chunking.
  • Step 4 — Sync and inspect parsing first — Run the initial ingestion job, then spot-check the hardest documents: did tables survive, did scanned pages get OCR-ed, did multi-column pages parse in the right order? Use the Retrieve API on a handful of known questions. Fix parsing and chunking now, before anyone sees an answer — this is where most quality is won.
  • Step 5 — Wire RetrieveAndGenerate with citations — Call RetrieveAndGenerate with a system prompt that instructs the model to answer only from the retrieved passages and to admit when the answer is not in the documents. Enable re-ranking. Attach a Bedrock Guardrail (grounding + PII checks). Surface the returned citations in the UI with deep-links to the source page.
  • Step 6 — Add access control + freshness — Tag chunks with ACL metadata from each document's permissions and apply per-query metadata filters derived from the authenticated user. Protect the originals in S3 with IAM. Wire incremental sync so new and changed documents — and permission changes — re-index automatically.
  • Step 7 — Evaluate, then iterate — Build a 50–200 question golden set with source passages and score faithfulness, answer relevance, and context precision/recall with Bedrock RAG evaluation. Tune the parser, chunk size, top-K, and re-ranking against the numbers. Add full query/answer logging and a human-review sample before scaling traffic.
what it costs

VIIIThe document Q&A cost stack on AWS — where the money goes

A document Q&A bill has six line items. The one that surprises teams coming from generic RAG is parsing — paid per page at index time — which can dominate the upfront cost for a large or scan-heavy corpus. Here is the full stack and the lever on each.

The figures below are representative as of 2026 to show the shape of the bill, not a quote — always check the AWS pricing page (and any third-party vendor) for current rates. Upfront, parsing and embedding dominate (both scale with corpus size and run mostly once); at steady state, generation tokens and the always-on vector-store baseline dominate.

document Q&A cost stack on AWS · representative shape as of 2026 — check the AWS pricing page for current rates
Cost lineWhen you payDriverMain lever to control it
ParsingOne-time per document + on updatesPages parsed × methodUse Data Automation/Textract by default; reserve costly FM parsing for hard docs; parse changed pages only
Embeddings (indexing)One-time per corpus + on updatesTotal tokens embeddedChunk size; smaller embedding dimensions; only re-embed changed documents
Vector storeContinuous (baseline)Corpus size + index type + engineRight-size the engine; pgvector if Postgres already runs; tune dimensions
Query embeddingsPer queryQuestion volumeNegligible per call; cache embeddings for repeated questions
Re-rankingPer queryCandidates re-ranked × queriesRe-rank top-30/50, not top-500; skip on trivial queries
GenerationPer query (usually the largest at steady state)Input + output tokens × model priceCheaper model for easy questions; fewer chunks; prompt caching; tight max-tokens
The document-specific line is parsing — it is paid per page and can be the biggest upfront cost for a scan-heavy corpus, so match the method to the document (cheap for clean PDFs, FM parsing only where needed). At steady state, prompt caching and re-ranking to a few tight chunks cut the largest running line — generation. Batch any offline parsing or generation for roughly half price.
the central decision, side by side

Managed (Bedrock Knowledge Bases) vs DIY document Q&A — which to build

This is the comparison that decides your architecture. Read it as "default to managed; move a stage to DIY only when a row in the right column is a hard requirement for your documents." For document Q&A the managed path is strong because it now owns parsing and citations.

DimensionBedrock Knowledge Bases (managed)DIY (Bedrock + your stack)
Time to first cited answerHours — point at S3, sync, call an APIDays to weeks — build every stage
Document parsingBuilt-in: default / Data Automation / FM parsingYour own Textract / Data Automation calls + custom post-processing
Pipeline you maintainAlmost none — AWS runs parse→retrieveAll of it — parser, chunker, retriever, re-ranker
Chunking controlFixed / semantic / hierarchical presetsAnything — table-aware, document-structure-aware
CitationsReturned natively from RetrieveAndGenerateYou assemble from passed-chunk metadata
Access controlMetadata filteringAnything — row-level, per-tenant, external policy engine
Cost controlLess granular; convenience premiumMaximum — tune parsing and every stage
Best forMost contract / policy / knowledge-lookup use casesUnusual document types, bespoke parsing, strict multi-tenancy, cost-squeeze
Both paths call the same Bedrock embedding and generation models and both can return citations — the difference is who orchestrates parsing, chunking, and retrieval in between. A common production shape is a hybrid: managed KB for the bulk corpus, a DIY parsing path for one demanding document class (e.g. financial statements or forms).
building this for real?
Have a vetted AWS partner build your document Q&A — and let AWS credits pay for it
Start in 3 minutes →
a recent match

A contract-Q&A assistant for a lending team — anonymized

inquiry · series-a fintech lender, contract intelligence, US
Series-A fintech lender, 22 people, ~40k loan agreements and supplier contracts as PDFs (a large share scanned), US data-residency requirement

Situation: The operations team spent hours hunting through PDF agreements for specific clauses — notice periods, fee schedules, liability caps — and many of the oldest contracts were scanned images with no text layer. They wanted a "chat with our contracts" tool that answered with a citation to the exact page, never mixed one counterparty's documents with another's, and could prove where every answer came from for compliance. A first internal attempt returned garbage on scanned files and tables and had no access-control story, and the two engineers who could fix it were committed to the core lending product.

What CloudRoute did: Routed within 24 hours to a US-region AWS partner with a GenAI/ML and document-processing track record. The partner scoped a Bedrock Knowledge Bases build in us-east-1: S3 ingestion of the contract corpus, Bedrock Data Automation plus Amazon Textract for parsing scanned pages and preserving fee and rate tables, hierarchical chunking that kept clause numbers and page numbers in metadata, Titan v2 embeddings, OpenSearch Serverless as the vector store, Cohere Rerank for precision, Claude for grounded generation with quote-and-cite prompting, per-counterparty metadata filtering for isolation, a Bedrock Guardrail with grounding and PII checks, and a 150-question golden set scored with Bedrock RAG evaluation. The whole engagement was funded by AWS credits the partner filed for — Activate Portfolio plus a Bedrock POC allocation.

Outcome: A cited contract-Q&A assistant in production in about 6 weeks. Scanned agreements and fee tables parsed cleanly; faithfulness and context-precision scores cleared the team's bar on the golden set; per-counterparty isolation was enforced at retrieval and every answer deep-linked to the source page for audit. The build and the first months of inference ran on AWS credits — the customer paid $0. CloudRoute's commission was paid by the partner from AWS engagement funding.

engagement window: ~6 weeks · founder time: ~7 hours · stack: Bedrock KB + Data Automation + Textract + OpenSearch Serverless + Titan v2 + Cohere Rerank + Claude · cost to customer: $0

faq

Common questions

What is a document Q&A system, and how is it different from generic RAG?
A document Q&A system is retrieval-augmented generation (RAG) pointed at a document corpus: it answers plain-language questions from your PDFs, contracts, manuals, or reports with citations back to the source page. The mechanics are the same as any RAG system (embed, retrieve, generate), but document Q&A is defined by its hardest stage — parsing real documents (scans, tables, multi-column layouts) into clean text — and by its non-negotiable need for citations and per-document access control. On AWS the pipeline is ingest → parse → chunk → embed → store → retrieve → answer with citations.
Should I use Amazon Bedrock Knowledge Bases or build my own document Q&A pipeline?
Start with Bedrock Knowledge Bases. It now handles the hardest stage — parsing — via Bedrock Data Automation (and an FM-based parser for complex layouts), then chunks, embeds, retrieves, and returns cited answers from a RetrieveAndGenerate call, so you can ship a cited document Q&A prototype in hours. Move a stage to a DIY pipeline only when a concrete document requirement forces it: bespoke parsing for unusual document types (invoices, scientific papers, dense forms), table-aware or structure-aware chunking, strict per-document or multi-tenant access control, or aggressive cost/latency tuning. Most contract, policy, and knowledge-lookup use cases never need to leave the managed path.
How do I handle scanned PDFs and tables so the system can answer from them?
This is the make-or-break part of document Q&A. For scanned PDFs (page images with no text layer) you need OCR — Amazon Textract performs OCR and Amazon Bedrock Data Automation includes it, so both recover text that a naive extractor would miss entirely. For tables, use a parser that preserves cell/row/column structure (Textract has dedicated table extraction; Data Automation is layout-aware), then serialize each table into the chunk as a Markdown table or as one sentence per row so a lookup question like "what was the 2025 EMEA figure?" can retrieve the right fact. For the most complex layouts, a multimodal foundation model on Bedrock can parse the page visually. Never ship naive text extraction for a corpus with scans or tables — it is the most common cause of wrong answers.
How does the system show where each answer came from (citations)?
Citations come from metadata carried through the pipeline. At index time each chunk stores its document ID, title, and page number; at answer time those are returned alongside the response. Amazon Bedrock Knowledge Bases returns citations natively from RetrieveAndGenerate — each cited span maps back to the source chunk and document — and a DIY pipeline produces the same by attaching the metadata of every chunk it passed into the prompt. The UI then displays "Source: filename.pdf, p.37" and can deep-link to the original document in S3. For exact-wording use cases like contracts, instruct the model to quote and cite the relevant sentence rather than paraphrase.
How do I stop a document Q&A system from giving wrong or made-up answers?
Accuracy is enforced in two places. First, retrieval: most wrong answers are a retrieval or parsing failure (the right passage never reached the model), so fix parsing and chunking first and add re-ranking so only high-precision passages are passed. Second, generation: instruct the model to answer only from the supplied passages and to say "I could not find this in the documents" when the context lacks the answer, return citations so answers are verifiable, and attach a Bedrock Guardrail with contextual-grounding checks that flag unsupported answers. Then measure faithfulness and relevance on a golden set with Bedrock RAG evaluation so regressions are caught by a number, not a complaint.
How do I make sure users only see documents they are allowed to read?
Enforce access control in retrieval, never as a post-filter on the generated answer — by the time the model has written the answer, restricted content has already been used. Tag every chunk with ACL metadata (user, group, role, tenant, classification) at index time from the document's permissions, and apply a metadata filter on every query derived from the authenticated user on the server. Bedrock Knowledge Bases supports metadata filtering; DIY stores use OpenSearch filters, pgvector SQL predicates, or Pinecone metadata filters. For multi-tenant SaaS, filter by tenant at minimum or use a separate index per tenant for hard isolation, protect the original files in S3 with IAM so even citation links are authorization-checked, and re-sync permission changes as well as content changes.
What does a document Q&A system cost to run on AWS?
Six line items: parsing (per page, at index time — the document-specific cost, and often the biggest upfront for scan-heavy corpora), one-time embedding of the corpus, a continuous vector-store baseline, per-query question embeddings (negligible), per-query re-ranking, and per-query generation (usually the largest at steady state). The biggest levers: match the parsing method to the document (cheap extraction for clean PDFs, reserve costly FM parsing for hard layouts), route easy questions to a cheaper generation model, pass fewer re-ranked chunks, use Bedrock prompt caching for static context, and batch offline parsing/generation (~50% cheaper). Figures are representative as of 2026 — check the AWS pricing page for current rates.
How long does it take to build a document Q&A system on AWS?
A managed Bedrock Knowledge Bases prototype that returns cited answers over your documents can stand up in hours to a day. Getting to genuinely production-ready — clean parsing of your hardest documents, citations wired into the UI, access control, freshness sync, Guardrails, a golden evaluation set, logging, and cost controls — is typically 4–6 weeks, driven mostly by document complexity (scans, tables, and unusual layouts) rather than AWS wiring. A fully custom DIY pipeline takes longer. The slowest part is almost always parsing the real corpus well. A specialist ML/document-processing partner compresses this materially, which is the engagement CloudRoute routes — funded by AWS credits, so the customer pays $0.

Build your document Q&A on AWS — funded by AWS credits

CloudRoute routes you to a vetted AWS GenAI/ML partner who designs and ships the pipeline — Bedrock Knowledge Bases or a custom DIY stack, document parsing for scans and tables, the right vector store, embeddings, re-ranking, citations, access control, Guardrails, and evaluation. AWS credits fund the build and the inference. You pay $0.

matched within< 24h
credits to fund itup to $100K
cost to you$0
How to build a document Q&A system on AWS — 2026 guide · CloudRoute