A complete, neutral cost breakdown for retrieval-augmented generation on Amazon Bedrock in 2026: the five line items that make up a RAG bill, the one-time cost to embed your corpus, the always-on vector store (the part that surprises everyone — OpenSearch Serverless has a minimum), per-query retrieval and generation tokens, the cost to re-embed when data changes, and a master table that prices real systems across corpus sizes and query volumes. Plus managed Knowledge Bases vs DIY cost, and how AWS credits make all of it $0 to build.
Almost every team budgets a RAG system by pricing the generation model and stops there. For a small system, that can miss most of the bill. A RAG system has five distinct cost lines, each priced on a different basis, and two costs catch people out: the always-on vector store and the re-embedding that recurs whenever data changes. Get all five on the table before you build.
It helps to separate the five lines by when you pay them, because the timing is what makes the bill hard to reason about. One is tied to your data (you pay when you index, and again whenever you re-index, regardless of traffic), one is a continuous baseline (you pay every hour the system exists, even at zero queries), and three are per-query (they scale with how much the system is used). A RAG bill that looks wrong is almost always a confusion between these three timing buckets.
Line 1 — embeddings (indexing). Before anyone asks a question, every chunk of your corpus is turned into a vector by an embedding model (Amazon Titan Text Embeddings v2 or Cohere Embed on Bedrock). You pay per input token embedded, once, when you first index the corpus. Embedding is among the cheapest things on Bedrock — representatively cents-to-low-dollars per million tokens — so even a large corpus is usually a one-time charge in single-to-low-double-digit dollars. This line is almost never the problem; people just forget it exists.
Line 2 — the vector store (continuous baseline). The vectors have to live somewhere that can answer nearest-neighbour queries, and that store runs 24/7 whether or not anyone uses it. This is the line that breaks budgets, because the common default on AWS — Amazon OpenSearch Serverless — has a minimum capacity floor (see §III). It is a fixed monthly cost decoupled from both corpus size at small scale and from traffic. For a small system it is frequently the largest line on the entire bill.
Line 3 — query embeddings (per query). Every incoming question is embedded with the same model before retrieval. This is one short embedding call per query — genuinely negligible (fractions of a cent), even at high volume. Mentioned for completeness; it never moves the total.
Line 4 — re-ranking (per query). If you re-rank (and you usually should — it is the highest-leverage retrieval-quality step), each query sends its top-K candidate chunks through a cross-encoder re-ranker (Amazon Rerank or Cohere Rerank on Bedrock), billed per query and per the volume of text re-ranked. Modest per query, but it scales with query volume and with how wide a net you re-rank.
Line 5 — generation (per query, usually the largest at scale). The final step sends the system prompt plus the re-ranked chunks plus the question to a generation model (Claude, Amazon Nova, Llama, Mistral on Bedrock) and pays for input and output tokens. In RAG, the input is inflated by the retrieved context — often several thousand tokens of chunks per question — so generation is typically the dominant cost once query volume is non-trivial. It is also the line with the most levers (model choice, chunk count, prompt caching, max-output).
Data-tied (Line 1, plus re-embedding): paid when you index or re-index, regardless of traffic. Continuous baseline (Line 2, vector store): paid every hour the system exists, even at zero queries; this is the one that surprises people. Per-query (Lines 3, 4, 5): scale with usage; generation dominates. Confusing these three is why RAG bills feel unpredictable.
The first real number to compute is what it costs to embed your corpus once. It is almost always small, but it is the cleanest place to start because the math is exact: corpus size in tokens × the embedding price per token. The subtler cost is re-embedding, which turns a one-time charge into a recurring one whenever your data changes.
Start by sizing the corpus in tokens, not documents. A rough rule: 1,000 tokens ≈ 750 words, and a typical text-heavy page is 500–800 words, so call it ~1,000 tokens per page as a planning figure. A 10,000-document corpus averaging 5 pages each is therefore on the order of 50 million tokens. Chunking adds a little overhead (chunk boundaries and any repeated titles in metadata), but for budgeting, total corpus tokens is the right input.
Now multiply by the embedding price. Representatively in 2026, Amazon Titan Text Embeddings v2 lands around $0.00002 per 1K tokens (i.e. ~$0.02 per million); Cohere Embed is in a similar low band. So embedding that 50M-token corpus costs roughly 50 × $0.02 ≈ $1 — a rounding error. Even a very large corpus of 1 billion tokens is on the order of $20 as a one-time charge. This is why embedding is never the line to optimise for cost; it is cheap by design because AWS wants your data indexed and resident.
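The sizing and pricing arithmetic above can be sketched in a few lines. This is a planning aid under the stated assumptions, not a quote: the constants are the representative figures from the text, and the function names are hypothetical.

```python
# Planning figures taken from the text. The price is a representative
# 2026 value, not a quote: confirm it on the AWS Bedrock pricing page.
TOKENS_PER_PAGE = 1_000               # ~750 words, roughly one text-heavy page
EMBED_PRICE_PER_1K_TOKENS = 0.00002   # USD, Titan Text Embeddings v2 (representative)


def corpus_tokens(documents: int, pages_per_document: float) -> int:
    """Estimate total corpus tokens from document count and average length."""
    return round(documents * pages_per_document * TOKENS_PER_PAGE)


def embedding_cost(tokens: int, price_per_1k: float = EMBED_PRICE_PER_1K_TOKENS) -> float:
    """One-time cost in USD to embed `tokens` input tokens."""
    return tokens / 1_000 * price_per_1k


tokens = corpus_tokens(documents=10_000, pages_per_document=5)
print(f"{tokens:,} tokens -> ${embedding_cost(tokens):.2f}")
print(f"1B tokens -> ${embedding_cost(1_000_000_000):.2f}")
```

Swapping in your own page count or the current price per 1K tokens is the only change needed to re-run the estimate for a different corpus.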
Embedding dimensionality is a lever, but it is a storage and latency lever, not an embedding-cost one. Titan v2 lets you choose output dimensions (e.g. 256 / 512 / 1024); smaller vectors cost the same to generate but take less room in the vector store and search faster. So dimensionality affects Line 2 (the vector store) and retrieval latency — not the one-time embedding bill in any meaningful way.
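To see why dimensionality is a storage lever, a rough sketch of raw vector size helps. It assumes vectors are stored as 32-bit floats and that chunks average 500 tokens; both are illustrative assumptions, and real stores add index overhead and metadata on top.

```python
BYTES_PER_FLOAT32 = 4  # ASSUMPTION: vectors stored as 32-bit floats


def raw_vector_bytes(corpus_tokens: int, tokens_per_chunk: int, dimensions: int) -> int:
    """Raw vector payload in bytes, excluding index overhead and metadata."""
    chunks = -(-corpus_tokens // tokens_per_chunk)  # ceiling division
    return chunks * dimensions * BYTES_PER_FLOAT32


# ASSUMPTION: 500 tokens per chunk, applied to the 50M-token corpus from the text.
for dims in (256, 512, 1024):
    size = raw_vector_bytes(50_000_000, tokens_per_chunk=500, dimensions=dims)
    print(f"{dims:>4} dims: {size / 1e6:,.1f} MB")
```

Under those assumptions the payload grows linearly with dimensions, from about 102 MB at 256 to about 410 MB at 1024, while the embedding bill is unchanged.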
A corpus is never static. Documents are added, edited, and removed, and every changed chunk must be re-embedded to stay searchable. Incremental re-embedding — only re-embedding what changed — keeps this trivial: if 5% of a 50M-token corpus changes each month, that is 2.5M tokens, or about 5 cents. With incremental sync (Bedrock Knowledge Bases does this automatically; a DIY pipeline triggers it on an S3 change event), re-embedding is a non-event.
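One common way a DIY pipeline decides what to re-embed is to fingerprint each chunk and compare against the fingerprints recorded at the last sync. The sketch below illustrates that idea only; it is not how Bedrock Knowledge Bases implements its sync, and `chunk_hash` and `plan_sync` are hypothetical names.

```python
import hashlib


def chunk_hash(text: str) -> str:
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(indexed: dict[str, str], current: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare stored hashes against the current chunk text.

    indexed: chunk_id -> hash recorded at the last sync
    current: chunk_id -> chunk text as it exists now
    Returns (ids to embed or re-embed, ids to delete from the index).
    """
    to_embed = [cid for cid, text in current.items() if indexed.get(cid) != chunk_hash(text)]
    to_delete = [cid for cid in indexed if cid not in current]
    return to_embed, to_delete


indexed = {"a": chunk_hash("alpha"), "b": chunk_hash("beta"), "c": chunk_hash("gamma")}
current = {"a": "alpha", "b": "beta v2", "d": "delta"}
print(plan_sync(indexed, current))  # (['b', 'd'], ['c'])
```

Only the edited chunk and the new chunk are sent to the embedding model; the unchanged chunk costs nothing, which is what keeps the monthly re-embedding line at cents.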
The expensive version is a full re-embed, and there is exactly one situation that forces it: changing the embedding model or its version. Because queries and documents must be embedded by the same model, switching from Titan v2 to Cohere (or to a new Titan version) means re-embedding the entire corpus from scratch. For a 1B-token corpus that is a ~$20 one-time hit — still cheap in absolute terms, but it also means rebuilding the whole index and re-validating retrieval quality, which is the real cost. Treat the embedding-model choice as semi-permanent for this reason, and benchmark before you commit.
One-time embedding ≈ (corpus tokens ÷ 1,000) × ~$0.00002. A 50M-token corpus ≈ $1; a 1B-token corpus ≈ $20. Incremental re-embedding on updates is a few cents. A full re-embed (only forced by changing the embedding model) repeats the one-time cost and rebuilds the index. Representative 2026 figures — confirm on the AWS Bedrock pricing page.
This is the section that changes how teams budget RAG. Unlike embeddings and inference, the vector store is a continuous cost you pay every hour the system exists — and the most common option on AWS, OpenSearch Serverless, has a minimum floor that makes a tiny, idle RAG system cost far more than its token usage suggests. Understanding the cost shape of each store is the difference between a $30/month surprise and a $700/month one.
There are four mainstream vector stores on AWS, and they have fundamentally different cost shapes. Two bill on reserved/baseline capacity (you pay for headroom whether you use it or not), and two bill closer to actual usage. The store you pick is the single biggest swing on a small RAG system's monthly bill — bigger than the model choice — because at low traffic the store baseline dominates everything else.
Amazon OpenSearch Serverless is the default vector store behind Bedrock Knowledge Bases, and the one most teams land on. It bills by OpenSearch Compute Units (OCUs) — separate units for indexing and for search — plus storage. The catch is the minimum: a collection requires a baseline of OCUs to exist, and that baseline is representatively on the order of a few hundred dollars per month even for a small corpus serving little traffic. (As of 2026, AWS lowered the entry point for dev/test collections, but a production-grade, redundant collection still carries a meaningful floor — check the current OpenSearch Serverless pricing page for the exact OCU minimum and hourly rate.)
The practical consequence: for a small internal knowledge base — a few thousand documents, a few hundred queries a day — the OpenSearch Serverless baseline is frequently the largest single line on the bill, dwarfing the dollar or two of embeddings and the handful of dollars of monthly inference. This is the number one "why is my RAG POC so expensive?" answer. It does not mean OpenSearch is wrong — it scales beautifully and gives you native hybrid search — but it means a tiny RAG system is paying for capacity it is not using.
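The floor is easy to model once you have the two inputs from the pricing page: how many OCUs the collection holds at minimum, and the hourly rate per OCU. The sketch below is illustrative. The 2 OCUs and $0.24 per OCU-hour are assumptions chosen because they land near the ~$350/month floor used in this article, not confirmed current prices.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def opensearch_serverless_compute(ocus: float, usd_per_ocu_hour: float) -> float:
    """Monthly compute cost in USD for a collection holding `ocus` around the clock."""
    return ocus * usd_per_ocu_hour * HOURS_PER_MONTH


# ASSUMPTION: 2 OCUs at $0.24 per OCU-hour. Replace both with the values
# on the current OpenSearch Serverless pricing page for your region.
print(f"${opensearch_serverless_compute(2, 0.24):,.2f} / month")
```

Because neither input depends on query volume, the result is the same at zero queries as at a few hundred a day, which is exactly the shape described above. Storage is billed separately and is not included here.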
Aurora PostgreSQL with the pgvector extension stores vectors in a relational database you may already operate. Cost is the Aurora instance or, with Aurora Serverless v2, Aurora Capacity Units (ACUs) that scale with load (down to a low minimum), plus storage and I/O. If you already run Aurora for application data, the marginal cost of adding vectors can be near zero — no new system, no separate baseline. Even standalone, a small Aurora Serverless v2 configuration can sit below the OpenSearch Serverless floor for small corpora, which is why pgvector is the usual recommendation when cost-at-small-scale matters and you do not need OpenSearch's native hybrid search.
Pinecone is a vector-native managed database available through the AWS Marketplace and selectable in Bedrock Knowledge Bases. Its serverless tier bills closer to actual usage (storage + reads + writes) with a low or no idle floor, which can make a small index genuinely cheap, while large or high-QPS workloads are priced on consumption. It is a third-party service billed separately (Marketplace billing can route through your AWS invoice and, usefully, can be covered by AWS credits in many cases). The trade is that data lives in a non-AWS-native service.
Redis-based vector search (Amazon MemoryDB) is in-memory, so it delivers single-digit-millisecond retrieval — ideal for real-time chat and agent loops. The cost shape is per node, priced by RAM, and because your entire index must fit in memory, cost scales directly with corpus size and is the most expensive option for large archival corpora. Reach for it when latency is the binding constraint, not when cost is.
| Vector store | Billed on | Idle floor | Small corpus (~thousands of docs) | Large corpus / high QPS | When it is the cheap choice |
|---|---|---|---|---|---|
| OpenSearch Serverless | OCUs (index + search) + storage | Meaningful (a few hundred $/mo class) | Floor dominates — often the biggest line | Scales well; cost grows with OCUs | Mid/large corpora; want native hybrid search |
| Aurora pgvector | Instance or ACU + storage + I/O | Low (Serverless v2 scales down) | Often cheapest; ~zero marginal if Aurora already runs | Good into millions of vectors; specialist wins beyond | You already run Postgres; cost-at-small-scale matters |
| Pinecone (Marketplace) | Usage: storage + reads + writes | Low / none (serverless tier) | Can be very cheap when idle | Consumption-priced; predictable at scale | Small idle indexes; want zero infra to operate |
| MemoryDB / Redis | Per node, priced by RAM | High (must keep nodes hot) | Overkill unless latency-critical | Most expensive — whole index in RAM | Latency is the binding constraint, not cost |
The per-query lines are what turn a fixed monthly baseline into a bill that grows with usage. Three things happen on every question — a query embedding, an optional re-rank, and a generation call — and the third one dominates. The key insight for RAG specifically: retrieved context inflates the input token count, so generation in RAG costs more per query than a bare chatbot.
Walk a single query through the pipeline and price each step. Query embedding: one short embedding call (the question, ~50–200 tokens) at ~$0.00002/1K — a small fraction of a cent. Across 100,000 queries a month that is still pennies. Ignore it in any budget.
Re-ranking: the retriever returns a wide net (say top-30 to top-50 chunks), and the re-ranker scores each against the question. Re-rank pricing is representatively per query (sometimes per the volume of text scored), and lands in the small-fraction-of-a-cent to low-cents range per query depending on how wide a net you re-rank. At 100,000 queries/month this is typically a few dollars to low tens of dollars — real but rarely dominant. The lever is obvious: re-rank top-30, not top-300, and skip re-ranking on trivial queries.
Generation — the line that matters. This is where RAG diverges from a plain chatbot. The input to the generation model is not just the question; it is the system prompt + the re-ranked chunks + the question. Those chunks are commonly 2,000–4,000 tokens of retrieved context per query — far more than the question itself. So a RAG query's input is dominated by context, and since you pay per input token, the amount of context you pass is a direct cost lever, not just a quality one. Output is whatever the model writes (often 300–800 tokens for a grounded answer).
Make it concrete with a representative mid-tier setup: Claude Sonnet-class generation, ~3,500 input tokens (≈3,000 of context + ~500 for the system prompt and question) and ~600 output tokens per query. At representative Sonnet rates ($3/1M input, $15/1M output): input ≈ $0.0105, output ≈ $0.009 → roughly $0.02 per query. At 20,000 queries/month that is ~$400; at 100,000 it is ~$2,000, for generation alone. Swap to a cheaper model for the easy questions (Nova Lite, Claude Haiku) and the per-query cost on those questions drops by 5–20×; that is the single biggest per-query lever.
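The same arithmetic as a small calculator, using the representative Sonnet-class rates quoted above. The routing part is a sketch under two labelled assumptions: that 80% of questions are easy, and that the cheaper model costs one tenth as much (a point inside the 5–20× range). Function names are hypothetical.

```python
def generation_cost(input_tokens: int, output_tokens: int,
                    usd_per_1m_input: float, usd_per_1m_output: float) -> float:
    """Cost in USD of one generation call."""
    return (input_tokens * usd_per_1m_input + output_tokens * usd_per_1m_output) / 1_000_000


def blended_cost(base: float, easy_share: float, cheap_factor: float) -> float:
    """Average per-query cost when `easy_share` of queries go to a model
    that is `cheap_factor` times cheaper than the base model."""
    return easy_share * base / cheap_factor + (1 - easy_share) * base


# Representative rates from the text: $3 / 1M input, $15 / 1M output.
per_query = generation_cost(3_500, 600, usd_per_1m_input=3.0, usd_per_1m_output=15.0)
print(f"per query: ${per_query:.4f}")
for queries in (20_000, 100_000):
    print(f"{queries:>7,} queries/mo: ${per_query * queries:,.0f}")

# ASSUMPTION: 80% easy questions routed to a model 10x cheaper.
print(f"routed average: ${blended_cost(per_query, easy_share=0.8, cheap_factor=10):.5f}")
```

The unrounded results are $0.0195 per query, $390 at 20,000 queries and $1,950 at 100,000, which the text rounds to $0.02, ~$400 and ~$2,000.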
A bare chatbot pays for a short prompt; a RAG query pays for the retrieved context too — commonly 2,000–4,000 input tokens of chunks per question on top of the prompt. Because you pay per input token, how much context you pass is a cost decision. Fewer, tighter, re-ranked chunks and prompt caching on the static system prompt are the two levers that cut the biggest line.
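A rough way to compare the two levers is to price the input side of a query as a function of chunk count and cache discount. Every number below other than the $3/1M input rate is an assumption made for illustration: the 1,500-token system prompt, the 500-token chunks, the 100-token question, and the 90% discount on cached reads. Check the caching terms for your model before relying on them.

```python
def input_cost(system_tokens: int, chunks: int, tokens_per_chunk: int,
               question_tokens: int, usd_per_1m_input: float,
               cached_read_discount: float = 0.0) -> float:
    """Input-side cost in USD of one RAG query.

    cached_read_discount: fraction taken off the static system prompt when it
    is read from the prompt cache (0.0 means no caching). Cache-write
    surcharges are ignored in this sketch.
    """
    billed_system = system_tokens * (1 - cached_read_discount)
    context_tokens = chunks * tokens_per_chunk
    return (billed_system + context_tokens + question_tokens) * usd_per_1m_input / 1_000_000


# ASSUMPTIONS (not from the article): prompt, chunk and question sizes,
# and the 90% cached-read discount.
common = dict(system_tokens=1_500, tokens_per_chunk=500, question_tokens=100,
              usd_per_1m_input=3.0)
wide = input_cost(chunks=8, **common)            # 4,000 tokens of context
tight = input_cost(chunks=4, **common)           # 2,000 tokens of context
tight_cached = input_cost(chunks=4, cached_read_discount=0.9, **common)
print(f"8 chunks: ${wide:.5f}  4 chunks: ${tight:.5f}  4 chunks + cache: ${tight_cached:.5f}")
```

In this setup, halving the chunk count removes more cost than caching does, because the context is larger than the system prompt. With a long system prompt and few chunks the ranking would flip.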
This is the table most people come for: representative all-in monthly cost for RAG systems of different sizes, broken into the lines that move. It shows the two regimes clearly — small systems are dominated by the vector-store baseline, large systems by generation tokens — and where the crossover happens. Figures are illustrative 2026 estimates to show shape and order of magnitude, not quotes.
Read the table as four representative systems, smallest to largest. The embedding column is the one-time corpus embedding (amortised, it is a rounding error monthly). The vector store column assumes OpenSearch Serverless near its floor for the small systems and growing capacity for the large ones — choosing pgvector instead would cut the small-system rows substantially (shown in the note). The generation column assumes a sensible model mix (cheap model for most queries, a frontier model for the hard minority) at ~$0.01–$0.02 blended per query. The takeaway is the shape: the bill barely moves with corpus size but moves a lot with query volume and model choice.
| System | Corpus | Queries / mo | Embedding (one-time) | Vector store / mo | Re-rank / mo | Generation / mo | Approx. total / mo |
|---|---|---|---|---|---|---|---|
| Small internal KB | ~5K docs (~25M tok) | ~3,000 | ~$0.50 once | ~$350 (OSS floor) | ~$1 | ~$30 (Haiku/Nova) | ~$380 / mo |
| Team support assistant | ~50K docs (~250M tok) | ~20,000 | ~$5 once | ~$400 (OSS) | ~$8 | ~$400 (Sonnet mix) | ~$810 / mo |
| Mid-size production | ~500K docs (~2.5B tok) | ~100,000 | ~$50 once | ~$700 (OSS, more OCUs) | ~$40 | ~$2,000 (mixed) | ~$2,740 / mo |
| Large multi-tenant | ~5M docs (~25B tok) | ~500,000 | ~$500 once | ~$2,000 (OSS at scale) | ~$200 | ~$9,000 (mixed) | ~$11,200 / mo |
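
The totals can be reproduced from the recurring columns, which is a useful check when you substitute your own numbers. The sketch below copies the illustrative figures from the table and also derives an all-in cost per query, a figure the table does not show; `SYSTEMS` and `monthly_total` are hypothetical names.

```python
# Rows copied from the table above (USD per month, illustrative 2026 figures).
# The one-time embedding charge is left out because it is not a monthly cost.
SYSTEMS = {
    "Small internal KB":      {"queries": 3_000,   "vector_store": 350,   "rerank": 1,   "generation": 30},
    "Team support assistant": {"queries": 20_000,  "vector_store": 400,   "rerank": 8,   "generation": 400},
    "Mid-size production":    {"queries": 100_000, "vector_store": 700,   "rerank": 40,  "generation": 2_000},
    "Large multi-tenant":     {"queries": 500_000, "vector_store": 2_000, "rerank": 200, "generation": 9_000},
}


def monthly_total(row: dict) -> int:
    """Sum the recurring lines: vector store + re-rank + generation."""
    return row["vector_store"] + row["rerank"] + row["generation"]


for name, row in SYSTEMS.items():
    total = monthly_total(row)
    print(f"{name:<24} ${total:>6,} / mo   ${total / row['queries']:.3f} per query all-in")
```

The unrounded sums are 381, 808, 2,740 and 11,200, which the table rounds. All-in cost per query falls from roughly $0.13 to roughly $0.02 as the fixed store is spread over more queries.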
A common question is whether Amazon Bedrock Knowledge Bases (the managed RAG service) costs more than building the pipeline yourself. The honest answer surprises people: the managed service does not add a large markup on the AWS resources — you pay for the same embeddings, the same vector store, and the same generation tokens either way. The real cost difference is engineering time and the freedom to optimise.
Bedrock Knowledge Bases is largely a pass-through on the underlying resources. When you use it, you still pay Bedrock for embedding tokens, you still provision and pay for the vector store (your choice of OpenSearch Serverless, Aurora pgvector, Pinecone, or Redis), and you still pay generation tokens on every RetrieveAndGenerate call. The managed convenience is not priced as a big per-query premium; what you give up instead is the ability to hand-tune every stage. So on pure infrastructure dollars, managed and DIY are close for the same architecture.
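For orientation, a managed query is a single call to the `bedrock-agent-runtime` client in boto3. The sketch below is minimal and assumes a knowledge base already exists; the knowledge base ID and model ARN you pass in are your own values, and `build_rag_request` and `ask` are hypothetical helper names. Each call to `ask` incurs retrieval and generation charges.

```python
def build_rag_request(question: str, knowledge_base_id: str, model_arn: str) -> dict:
    """Keyword arguments for bedrock-agent-runtime's retrieve_and_generate."""
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
            },
        },
    }


def ask(question: str, knowledge_base_id: str, model_arn: str) -> str:
    """Send one billable query. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the request builder works without it

    client = boto3.client("bedrock-agent-runtime")
    response = client.retrieve_and_generate(
        **build_rag_request(question, knowledge_base_id, model_arn)
    )
    return response["output"]["text"]
```

The model ARN in the request is where the generation line of the bill is decided, so routing easy questions to a cheaper model means passing a different `model_arn` per call.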
Where they diverge is on the two things that are not on the AWS invoice. First, engineering cost: managed ships in hours with almost no pipeline to maintain, while DIY is days-to-weeks of build plus ongoing maintenance of the parser, chunker, retriever, and re-ranker. For most teams that engineering time is the larger real cost, and it favours managed decisively. Second, optimisation headroom: DIY lets you squeeze the AWS bill in ways managed cannot — custom chunking to cut context tokens, your own hybrid-search tuning, sharing one vector store across many indexes to amortise the baseline, aggressive caching, and routing each query to the cheapest adequate model. At very high volume, that headroom can pay for the extra engineering.
The practical rule mirrors the build decision: start managed (the infrastructure cost is essentially the same and you ship far faster), and move specific stages to DIY only when volume is high enough that the optimisation headroom — not the managed markup, which is small — justifies the engineering. The cross-over is about scale and the value of fine-grained cost control, not about Knowledge Bases being expensive.
Managed Bedrock Knowledge Bases does not add a large markup — you pay for the same embeddings, vector store, and generation tokens either way. The true difference is engineering time (managed wins big — hours vs weeks) versus optimisation headroom (DIY wins — you can hand-tune every stage to cut the AWS bill at high volume). Choose on scale and control, not on a per-query premium.
Most RAG budget overruns trace to a short list of avoidable surprises. None is exotic; each is a place where the cost is decoupled from what you think you are paying for. Knowing them in advance is worth more than any single price figure.
Every number above prices RAG if you pay AWS directly. For most startups and many companies the relevant figure is different, because all five RAG cost lines are credit-eligible — and AWS will frequently fund the entire build with credits that draw down before your card is ever touched.
All five lines — embeddings, the vector store, query embeddings, re-ranking, and generation — plus the supporting services draw down AWS credits automatically against your bill until exhausted. (Pinecone via Marketplace can often be covered too.) The relevant pools: AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups); a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed squarely at proving out a GenAI use case like RAG; and the competitive Generative AI Accelerator (credit awards up to $1M for a small cohort of AI-first startups). For a mid-size RAG system at a few thousand dollars a month, even the POC pool alone funds many months of operation.
The practical mechanic is that most of these pools are partner-filed — requested through the AWS Partner Network (the ACE program), not a public self-serve form. That is why teams route through an AWS partner rather than applying alone, and it is the gap CloudRoute fills. CloudRoute matches you to the right credit pool for your stage and to a vetted AWS GenAI/ML partner who both files the credit application and builds the RAG system — the ingestion, chunking, embeddings, vector store, re-ranking, grounded-answer prompting, access control, and evaluation. The customer pays $0: AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice.
Put together with the cost levers above, the picture for a startup is: build the RAG system aggressively on Bedrock, draw down a $25K–$100K credit pool while you prove the use case and find product-market fit, and only start paying real money once usage — and ideally revenue — has scaled past the credits. Related: see the cross-cluster pages on AWS credits for generative-AI startups and on Bedrock POC funding for the full credit mechanics.
The most useful thing to internalise about RAG cost is that the dominant line changes completely as you scale. A small system and a large system have the same five line items but a totally different cost profile, and optimising the wrong one wastes effort. The table below shows those same five lines by their share of the bill in each regime. Figures are representative 2026 illustrations.
| Cost line | Small RAG (idle-ish POC) | Mid-size production | Large / high-volume | Primary lever |
|---|---|---|---|---|
| Embeddings (one-time) | Rounding error | Rounding error | Small (~hundreds once) | Incremental sync; avoid full re-embeds |
| Vector store (baseline) | Dominant — the whole bill | Significant | Smaller share of a big bill | pgvector at small scale; right-size OCUs |
| Query embeddings | Negligible | Negligible | Negligible | None needed |
| Re-ranking | Tiny | Small | Modest | Re-rank a narrow net; skip trivial queries |
| Generation | Tiny | Largest line | Dominant — the whole bill | Model routing; fewer chunks; prompt caching |
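
The qualitative labels in this table can be turned into percentages with the monthly figures from the master table. The sketch below does that for the smallest and largest systems only; the inputs are the article's illustrative numbers, and `shares` is a hypothetical helper.

```python
def shares(lines: dict[str, float]) -> dict[str, float]:
    """Return each cost line as a percentage of the total."""
    total = sum(lines.values())
    return {name: 100 * cost / total for name, cost in lines.items()}


# Monthly USD figures for the smallest and largest systems in the master table.
small = {"vector store": 350, "re-rank": 1, "generation": 30}
large = {"vector store": 2_000, "re-rank": 200, "generation": 9_000}

for label, lines in (("small", small), ("large", large)):
    breakdown = ", ".join(f"{name} {pct:.0f}%" for name, pct in shares(lines).items())
    print(f"{label}: {breakdown}")
```

On those figures the vector store is about 92% of the small system's bill and about 18% of the large one's, while generation moves from about 8% to about 80%.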
Situation: The team had modeled a Bedrock RAG build and landed at roughly $2.6K/month — and they were alarmed less by the generation tokens than by the OpenSearch Serverless baseline, which a first proof-of-concept had run at several hundred dollars a month while serving almost no traffic. A frontier model was being called for every query, the full retrieved context was passed uncapped, and the whole thing was on-demand. They wanted both to bring the number down and to avoid spending runway on it during the prove-out.
What CloudRoute did: CloudRoute matched them in under 24 hours to a US AWS partner with GenAI cost-engineering experience. The partner (1) moved the early-stage vector store to Aurora pgvector to escape the OpenSearch floor while the corpus was still small; (2) added a tiered model router — Nova Lite / Claude Haiku for the easy ~80% of questions, Sonnet only for the hard ones; (3) turned on prompt caching for the static system prompt and re-ranked down to the best 4 chunks to cut context tokens; and (4) filed a Bedrock POC credit application plus an Activate Portfolio application to fund the entire build and early operation.
Outcome: Modeled all-in cost fell from ~$2.6K to ~$650/month — most of the drop from escaping the vector-store floor and model-routing the generation line — and even that was fully covered by the approved credits, so the team paid $0 during the build and early launch. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.
cost cut: ~$2.6K → ~$650/mo modeled · biggest win: vector-store floor + model routing · credits secured: POC + Activate · out-of-pocket during build: $0
Whatever your Bedrock RAG system would cost — the vector-store baseline, the embeddings, the generation tokens — AWS credits can cover all of it. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS partner to build and cost-tune the pipeline. Customer pays $0.