Intelligent document processing (IDP) takes the invoices, contracts, claims, statements, and forms a business receives and turns them into clean structured data — a JSON record per document that flows straight into a system. This is the full build guide for data extraction on AWS in 2026: the three engines that do the extracting and exactly when each one wins (Amazon Textract, Amazon Bedrock Data Automation, and an LLM-with-schema using Bedrock structured output), the end-to-end pipeline (ingest → parse → extract to a JSON schema → validate → human review for low-confidence fields), how to handle tables, forms, and handwriting, how confidence scores and accuracy actually work, and what it costs at scale once batch processing is in play.
Intelligent document processing (IDP) is the discipline of turning documents a business receives — invoices, purchase orders, contracts, ID documents, bank statements, insurance claims, medical forms, shipping paperwork — into structured data that software can act on. The deliverable is not text; it is a record. The clearest definition: AI data extraction takes an unstructured document and returns a structured object (almost always JSON) whose fields match a schema you defined in advance.
Most business data arrives unstructured, and most business systems can only consume structured data. An accounts-payable system does not want a PDF of an invoice; it wants { "invoice_number": "INV-10432", "vendor": "Acme Ltd", "total": 4820.00, "currency": "USD", "due_date": "2026-07-15", "line_items": [...] }. The gap between "a PDF a human can read" and "a JSON record a database can ingest" is exactly the gap IDP closes — reliably, at volume, from documents that vary wildly in layout and quality.
The word that does the work is schema. A generic "read this document" call returns text or a loose summary — useful for search, useless for posting a transaction. A schema-driven extraction call returns named, typed fields in a fixed shape: this string is the invoice number, this number is the total, this array is the line items. Because the output is shaped to a contract you control, the record drops into a database, ERP, or ledger without a human re-keying anything. The schema is what turns a document reader into a data-extraction system.
IDP is distinct from two neighbours that also start with documents. It is not document Q&A / RAG (asking a corpus questions and getting cited answers) — that retrieves and answers; IDP extracts a complete structured record from each individual document. And it is not generic OCR — OCR gives you the characters on the page; IDP gives you the meaning, mapped to fields, validated, and ready to commit. OCR is one possible step inside an IDP pipeline, not the whole thing.
What makes IDP hard is the long tail of real documents: the same invoice "type" in hundreds of vendor layouts, tables that span pages with merged cells, forms that put the label above the value on one document and beside it on another, pages that are crisp born-digital PDFs or faxed, photographed, skewed, or handwritten. A system that works on a clean sample and falls apart on the messy 20% is not a system. The rest of this guide is about building one that survives the tail — starting with the most consequential decision: which engine does the extracting.
AI data extraction on AWS = take an unstructured document, parse it into layout-aware text, and extract its contents into a structured JSON record that matches a schema you defined — validated, confidence-scored, and routed to a human when the model is unsure — so the record commits straight into your systems without manual data entry.
AWS gives you three distinct ways to perform the extract step, and the architecture of your whole pipeline follows from which one (or which combination) you pick. They are not interchangeable: they differ in how they read a document, how they generalize across layouts, how predictable their cost is, and how much reasoning they can do. The honest framing is that each wins a different region of the problem, and mature systems route different documents to different engines.
Picture the three on a spectrum from deterministic and specialized to flexible and generative. Textract sits at the specialized end — a purpose-built model for OCR, forms, and tables. An LLM-with-schema sits at the flexible end — a general foundation model that reasons over the document and emits your schema. Bedrock Data Automation sits in between — a managed generative extractor you configure with a blueprint, getting much of the LLM's layout-robustness through a managed, repeatable interface. Where a document falls on the difficulty curve tells you which engine to reach for.
Amazon Textract is AWS's dedicated document-text-and-data extraction service. It does OCR (recovering characters from scanned and photographed pages with no text layer) and has dedicated capabilities for forms (key-value pairs like "Invoice number: 10432"), tables (preserving cell/row/column structure), signatures, and queries (ask "what is the total?" and get the value plus a confidence score). Its specialized analyzers — purpose-built extraction for invoices and receipts, IDs, and lending documents — return common fields out of the box for those classes.
Its strengths are speed, low and predictable per-page cost, and deterministic behaviour — the same document yields the same output, which auditors and high-volume operations value. Its limit is generalization: excellent on the structures it knows, weaker on highly variable, free-form, or reasoning-heavy documents where the meaning is not a labelled field on the page. Reach for Textract when you have high volume on a relatively known document type (forms, IDs, standardized invoices), need cheap deterministic OCR/table extraction, or want a fast, well-understood building block to feed the next stage.
Amazon Bedrock Data Automation (BDA) is a managed, generative-AI-powered service that turns unstructured content into structured output through one API. For extraction you define a blueprint — a schema declaring the fields you want (invoice_number, vendor_name, line_items, total, due_date), each described in natural language with a type, plus optional normalization and validation. Because BDA reads the document with a foundation model rather than matching pixel positions, one blueprint generalizes across hundreds of layouts of the same document type — exactly where Textract's template-style approach strains. AWS ships sample blueprints for common documents (invoices, receipts, IDs, bank statements) you can clone and adapt.
BDA's sweet spot is production IDP where formats vary across senders and you want named fields without operating your own model pipeline — managed, consistent structured JSON, with normalization and validation inside the blueprint. It costs more per page than raw Textract and is less open-ended than a hand-driven LLM. Reach for BDA when you need specific fields reliably across variable layouts, you want a managed service rather than a DIY model harness, and your documents are recognizably "the same kind of thing" even though no two look alike. (The dedicated BDA page in this cluster covers blueprints and projects in depth.)
The third engine is a general foundation model on Amazon Bedrock — Claude, Amazon Nova, Llama, Mistral — prompted to extract data and constrained to emit your schema. The mechanism that makes it reliable is structured output: instead of hoping the model returns clean JSON in its prose, you use its tool-use / function-calling interface (or a JSON-schema-constrained response) so the output conforms to a schema you supply. You pass the document text (often parsed first by Textract or BDA so the model reads clean, layout-aware input) and a tool definition describing your fields, and the model returns a populated, schema-valid object.
This engine is the most flexible by a wide margin. Because the model reasons, it handles documents the other two struggle with: free-form contracts where a clause must be interpreted, edge cases needing judgement ("is this a credit note or an invoice?"), derived fields ("compute the total tax across line items"), rare document types not worth a blueprint, and extraction that combines information from different parts of the page. The trade-offs are higher per-document cost (input + output tokens), latency, and the need to guard against hallucinated fields — which is why grounding on parsed text, validating the output, and confidence-gating matter even more here. Reach for an LLM-with-schema when the document needs reasoning or interpretation, the layouts are too varied or rare to template, or you need derived/computed fields rather than values printed verbatim on the page.
Use Textract for high-volume, known document types where you want cheap, fast, deterministic OCR/forms/tables. Use Bedrock Data Automation for production extraction of named fields across many varying layouts of the same document type, fully managed. Use an LLM-with-schema for documents that need reasoning, interpretation, derived fields, or one-off types too varied to template. Most real systems route documents to different engines — and often chain them (Textract/BDA to parse, an LLM to reason over the result).
Whatever engine you pick, a production data-extraction system runs the same five logical stages: ingest → parse → extract → validate → human review for low-confidence. The engine choice from the previous section slots into stages two and three; the stages on either side — getting documents in, validating what comes out, and catching the cases the model got wrong — are what separate a demo from a system you can trust with payments or compliance.
The table below maps each stage to the AWS services that implement it. Read the pipeline as a conveyor: a document lands, is converted to clean text, has its fields pulled into your schema, is checked against rules, and either commits automatically (high confidence, passes validation) or diverts to a person (low confidence or a failed rule). The art is setting the thresholds so the right share auto-commits and the review queue stays small.
1. Ingest. Land raw documents in Amazon S3 — uploaded by users, emailed in (S3 + SES), dropped by an upstream system, or synced from a document store. S3 is the durable record of the originals; you keep them so any extracted field can be traced to its exact source page during review or audit. An S3 event triggers the pipeline (EventBridge / Lambda / Step Functions) the moment a document arrives.
2. Parse. Convert each document into clean, layout-aware text and structure before extraction — OCR the scanned pages, recover tables with their row/column structure, read multi-column layouts in the right order. This is where Textract or BDA earns its place even when an LLM does the final field mapping: feeding an LLM raw, badly-ordered text is the most common cause of bad extraction, and clean parsed text is the single biggest quality lever. For BDA blueprints and Textract analyzers, parse and extract collapse into one call; when an LLM extracts, parse is a separate, cheaper first step.
3. Extract. Map the parsed content to your target schema — the engine step from section II: a Textract analyzer or query, a BDA blueprint, or an LLM-with-schema call, each producing a structured object whose fields match the contract your downstream system expects. The schema is defined once and reused: field names, types, required-vs-optional, formats (ISO dates, decimal amounts, enums). Crucially, every field should come back with a confidence score — all three engines provide one — because that score drives validation and review in the next two stages.
4. Validate. Run the extracted record through rules before trusting it. Two layers: schema validation (every required field present, correctly typed, in the right format — a date that parses, a total that is a number) and business validation (the math adds up — line items sum to the subtotal, the invoice date is not in the future, the vendor exists in your master data, the currency is one you accept). A record that fails validation is never silently committed.
5. Human review for low-confidence. Route low-confidence or failed-validation records to a person; auto-commit the rest. Amazon Augmented AI (A2I) is the managed way to do this on AWS: it builds human-review workflows with a worker UI, integrates directly with Textract (and wraps any model's output), and lets you set the confidence threshold below which a document is sent for review. A reviewer confirms or corrects the flagged fields; the corrected record commits and becomes labelled data you can use to improve the system. This is what makes IDP safe for high-stakes work — you are not betting the model is always right, you are catching the cases where it is not.
| Stage | What it does | Typical AWS service(s) | Output |
|---|---|---|---|
| 1. Ingest | Land + store original documents, trigger the pipeline | Amazon S3, EventBridge, Lambda / Step Functions | Document in S3 + an event |
| 2. Parse | Document → clean, layout-aware text + tables (OCR scans) | Amazon Textract / Bedrock Data Automation | Structured text, tables, key-values |
| 3. Extract | Map content to your JSON schema, with per-field confidence | Textract analyzers/queries · BDA blueprints · Bedrock LLM + structured output | A schema-shaped JSON record |
| 4. Validate | Schema + business rules; flag failures | Lambda / Step Functions (your rules) | Pass/fail + flagged fields |
| 5. Human review | Route low-confidence / failed records to a person; auto-commit the rest | Amazon Augmented AI (A2I) | Confirmed/corrected record |
The three input types that break naive extraction are tables, forms, and handwriting. Each has a specific failure mode and a specific way to handle it on AWS. Getting these right is most of what separates an IDP system that works on the easy 80% from one that survives the messy tail.
The failures are predictable. A table read as a flat run of text loses the row/column relationships, so a value is no longer associated with its row label and column header — "$4.2M" stops meaning "EMEA revenue, 2025". A form read without structure loses the link between a label and its value, so "Invoice number" and "10432" become two unrelated tokens. Handwriting returns nothing useful from a plain text extractor, and even with OCR it is the lowest-confidence content on the page. Each of these is invisible until a downstream consumer trusts a mangled value.
Use a parser that preserves table structure. Amazon Textract has dedicated table extraction that returns cells with their row and column positions; Bedrock Data Automation is layout-aware and preserves tables as it reads. Once extracted, how you serialize the table for the extract step matters: rendering it as a Markdown table keeps rows and columns aligned so an LLM can read across them, while row-as-record serialization (each row turned into a small object or "for EMEA in 2025, revenue was $4.2M") is often easier to extract specific values from. For line-item extraction (invoices, statements), map each table row to an object in a line_items array in your schema and validate that the rows sum to the stated subtotal — a cheap, powerful correctness check.
Forms are Textract's home turf: its forms analysis returns key-value pairs directly, and its Queries feature lets you ask for a specific value ("what is the policy number?") and get the answer with a confidence score, which is ideal when you know exactly which fields you need. BDA blueprints handle forms by declaring each field in the schema. For unusual or semi-structured forms where the label-value relationship is ambiguous, an LLM-with-schema reading the parsed text can disambiguate using context the deterministic parsers miss. The general rule: known, standardized forms → Textract; variable forms → BDA blueprint; ambiguous or reasoning-heavy forms → LLM.
Handwriting is the hardest input and should always be treated as low-confidence. Textract recognizes handwritten text (handprint) alongside typed text, and BDA and multimodal models can read it too, but accuracy is inherently lower and varies with legibility. The right posture is not to chase perfect handwriting OCR but to lean on the confidence-and-review machinery: extract the handwritten fields, expect lower confidence scores on them, and let the threshold route those records to a human far more often than typed ones. For forms that are mostly typed with a few handwritten fields (a signature, a handwritten amount, a checkbox), this means only the handwritten fields trigger review — not the whole document. Designing the schema so handwritten fields are isolated keeps the review burden proportional.
Preserve table structure with Textract or BDA and validate that rows sum; pull form key-values with Textract Queries (known forms) or a blueprint/LLM (variable forms); treat handwriting as inherently low-confidence and let the review threshold catch it rather than expecting perfect OCR. Isolate handwritten fields in the schema so only they trigger review, not the whole document.
No extraction engine is perfect, so a production IDP system is not built on the assumption of perfection — it is built on knowing, per field, when it might be wrong, and doing something about it. Three mechanisms make extraction trustworthy: per-field confidence scores, validation rules, and a human-in-the-loop path. Together they let you auto-commit the majority of documents safely while catching the minority that need a person.
Start with confidence scores. Every AWS extraction engine returns a confidence signal per field: Textract attaches a numeric confidence to each detected value; BDA returns confidence-style metadata; an LLM can be asked for one or have it derived from agreement across passes. The score is not a guarantee of correctness, but it is a strong sorting signal — a field at 0.99 confidence is almost always right; a field at 0.55 needs a human. The core design move is to set a threshold per field (or per field-importance class) and route everything below it to review.
Confidence catches fields the model is unsure about; validation catches fields the model is confidently wrong about. They are independent and you need both. Schema validation enforces structure (required fields present, correct types, valid formats); business validation enforces meaning — line items sum to the subtotal, subtotal plus tax equals the total, the IBAN passes its checksum, the date is plausible, the vendor matches master data, the amount is in an expected range. A record can be high-confidence and still fail validation (the model read the number correctly but the document is internally inconsistent) — precisely the error a human should see. Cross-field "do the numbers reconcile?" checks are some of the highest-value validation you can write.
Amazon Augmented AI (A2I) is the managed service for the review stage. You define a workflow: a worker task template (the reviewer UI), a workforce (your own team, a vendor, or Mechanical Turk), and the conditions that send a document for review — most commonly a confidence threshold, plus failed validation rules and a random sample for ongoing quality measurement. A2I integrates natively with Textract and wraps any model's output. Reviewers confirm or correct flagged fields, the corrected record proceeds, and the corrections become a dataset for measuring real-world accuracy and improving prompts, blueprints, or fine-tuning. Reviewing a sample of even the auto-committed records — not just the flagged ones — is how you discover whether the thresholds are actually catching errors.
You cannot tune thresholds without measuring. Build a golden set of a few hundred real documents with hand-verified extractions and score every engine and configuration against it: field-level accuracy (fraction of fields exactly right), document-level accuracy (fraction of documents entirely correct — the metric for straight-through processing), and the auto-commit rate at a given error budget (how many documents you can process without review while keeping errors under tolerance). These numbers turn "which engine is better for our documents?" into an experiment with an answer, and let you set the confidence threshold deliberately — high enough that auto-committed errors stay under budget, low enough that the review queue stays affordable.
| Mechanism | What it catches | How it works | AWS service |
|---|---|---|---|
| Confidence scores | Fields the model is unsure about | Per-field score; threshold routes low scores to review | Built into Textract / BDA; derived for LLM |
| Schema validation | Wrong type, missing required field, bad format | Validate the JSON against the schema contract | Your code (Lambda / Step Functions) |
| Business validation | Confidently-wrong but internally-inconsistent records | Cross-field rules: sums reconcile, checksums, master-data lookups | Your code (Lambda / Step Functions) |
| Human-in-the-loop | Everything the above flag — the final safety net | Reviewer confirms/corrects flagged fields; corrections feed improvement | Amazon Augmented AI (A2I) |
IDP economics are about volume. A pilot on a few hundred documents costs almost nothing; a production system processing hundreds of thousands of documents a month is where the per-document cost of your engine choice, the share you can auto-commit, and the decision to run in batch start to dominate the bill. Understanding the shape lets you architect for cost from the start instead of being surprised by it.
The bill has four parts. Extraction is the headline line and it scales with volume and engine: Textract is priced per page by feature (text, forms, tables, queries, specialized analyzers), BDA per page with a premium for custom-output blueprints, and an LLM by input + output tokens — which is why a reasoning-heavy LLM extraction can cost meaningfully more per document than a Textract call, and why routing only the documents that need reasoning to the LLM matters. Parsing (if separate from extraction) adds a per-page OCR cost. Human review is a real and often-underestimated cost — reviewer time per flagged document — which is why your auto-commit rate is an economic lever, not just a quality one. And the usual storage and orchestration (S3, Lambda, Step Functions) rounds it out. (Figures here describe the shape; rates are representative as of 2026 — check the AWS pricing page for current numbers, since they differ by feature and change over time.)
The lever that changes the math at scale is batch (asynchronous) processing. Most IDP is not real-time — invoices, claims, and statements can be processed minutes or hours after they arrive — and AWS prices asynchronous work substantially cheaper. Amazon Bedrock batch inference runs at roughly 50% of on-demand token pricing, so an LLM-with-schema extraction job submitted as a batch costs about half what the same volume would cost called synchronously. Textract's asynchronous APIs are likewise built for processing large volumes of multi-page documents without holding a connection open per file. Unless a document genuinely needs an answer in seconds, run extraction as batch — it cuts both the unit cost and the operational load of managing a high-throughput synchronous service. Two more levers compound the savings: route each document to the cheapest engine that handles it (Textract for the standardized bulk, LLM only for the reasoning tail), and push the auto-commit rate up by investing in validation, so human-review cost grows slower than volume.
| Cost line | Driver | Engine sensitivity | Main lever to control it |
|---|---|---|---|
| Extraction | Documents/pages × engine | High — Textract (per page) < BDA (per page + blueprint premium) < LLM (per token) | Route to the cheapest engine that handles the doc; run as batch (~50% off for Bedrock) |
| Parsing (if separate) | Pages OCR-ed | Medium — per page by feature | Parse changed pages only; reuse parsed text across passes |
| Human review | Flagged documents × reviewer time | Indirect — driven by auto-commit rate | Raise auto-commit rate via validation + tuned thresholds; isolate handwritten fields |
| Storage + orchestration | Document volume + pipeline runs | Low | S3 lifecycle policies; efficient Step Functions / Lambda; right-size retries |
Here is the fastest credible path from zero to a production-leaning data-extraction pipeline on AWS. The order is deliberate: define the contract first, prove the engine on real documents second, and only then wire validation, review, and scale. Most teams that struggle did the stages in the wrong order — they built orchestration before they knew their documents.
Most IDP projects fail in a handful of predictable ways, and nearly all of them are avoidable with the structure this guide describes. Knowing the failure modes up front is the cheapest way to skip them.
This is the comparison that decides your architecture. Read it as "match each document type to the engine whose row fits it best" — and remember that production systems routinely combine them: Textract or BDA to parse and pull the standardized fields, an LLM to reason over the hard tail. Cost notes describe the shape and are representative as of 2026 — confirm current rates on the AWS pricing page.
| Dimension | Amazon Textract | Bedrock Data Automation (BDA) | LLM-with-schema (Bedrock) |
|---|---|---|---|
| What it is | Specialized OCR / forms / tables service | Managed generative extractor (blueprints) | Foundation model + structured output |
| How it reads | Purpose-built document ML | Foundation model, configured by blueprint | General model reasoning over parsed text |
| Generalizes across layouts | Good on known structures; strains on high variation | Strong — one blueprint spans many layouts | Strongest — handles novel and free-form docs |
| Reasoning / derived fields | No — extracts what is on the page | Limited — schema-driven extraction | Yes — interpret, compute, disambiguate |
| Output | Text, key-values, tables, query answers + confidence | Schema-shaped JSON (blueprint) + confidence | Your JSON schema (tool-use constrained) + derived confidence |
| Relative cost / doc | Lowest (per page by feature) | Medium (per page + blueprint premium) | Highest (per input + output token) |
| Determinism | High — same in, same out | Medium | Lower — guard with validation + grounding |
| Best for | High-volume known forms, IDs, standardized invoices, cheap OCR/tables | Production named-field extraction across varying layouts, managed | Reasoning-heavy, free-form, rare, or derived-field documents |
Situation: Their operations team was manually keying invoice and remittance data into the ledger, and volume was outgrowing headcount. The documents came in every conceivable format — clean PDFs, scans, photos, a long tail of one-off layouts, and some with handwritten amounts and notes — and an earlier template-based OCR attempt broke whenever a new format appeared, so people re-keyed the exceptions anyway. They needed extracted records to land in their schema with line items that reconciled, they could not afford to auto-post a wrong amount, and the two engineers who could build it were committed to the core payments product. They also did not want to burn runway on per-document processing fees while still proving the workflow out.
What CloudRoute did: CloudRoute matched them in under 24 hours to a US-region AWS partner with a document-processing and GenAI/ML track record. The partner built a routed IDP pipeline on AWS: S3 ingestion with an EventBridge + Step Functions orchestrator; Amazon Textract for OCR and table extraction on the standardized bulk and for the handwritten-amount fields; a Bedrock Data Automation blueprint for the common invoice and remittance formats so one schema spanned hundreds of layouts; and a Claude-on-Bedrock structured-output call (tool-use constrained to the target JSON schema) reserved for the reasoning-heavy long-tail documents and for computing derived fields. Validation in Lambda enforced the schema and the business rules — line items summing to the subtotal, subtotal-plus-tax equalling the total, vendor matched to master data — and Amazon A2I routed low-confidence and failed-validation records (including most handwritten ones) to a human-review queue while the rest auto-committed. Extraction ran as Bedrock batch and async Textract to halve the unit cost. The partner filed a Bedrock POC credit application plus an Activate Portfolio application to fund the build.
Outcome: Within about five weeks, the majority of invoices and remittances flowed straight through as validated structured records into the ledger, manual keying dropped to the low-confidence share that hit the A2I queue, and reconciliation rules caught inconsistent documents before they posted. Routing the standardized bulk to Textract/BDA and reserving Claude for the tail — all in batch — kept the processing cost low, and the entire build and early production run was covered by the approved AWS credits, so the team paid $0. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.
volume: ~120k docs/mo across hundreds of formats · time to live: ~5 weeks · stack: S3 + Step Functions + Textract + Bedrock Data Automation + Claude (structured output) + A2I · run mode: batch · cost to customer: $0
CloudRoute routes you to a vetted AWS GenAI/ML partner who designs and ships the IDP pipeline — the right engine mix (Amazon Textract, Bedrock Data Automation, and an LLM-with-schema), the S3 ingest and Step Functions orchestration, structured-output extraction to your JSON schema, schema and business validation, an Amazon A2I human-review path, and batch processing for cost at scale. AWS credits fund the build and the inference. You pay $0.