
bedrock rag cost · the worked example · 2026

How much does a Bedrock RAG system actually cost?

A complete, neutral cost breakdown for retrieval-augmented generation on Amazon Bedrock in 2026: the five line items that make up a RAG bill, the one-time cost to embed your corpus, the always-on vector store (the part that surprises everyone — OpenSearch Serverless has a minimum), per-query retrieval and generation tokens, the cost to re-embed when data changes, and a master table that prices real systems across corpus sizes and query volumes. Plus managed Knowledge Bases vs DIY cost, and how AWS credits make all of it $0 to build.


cost line items: five · biggest surprise: vector-store floor · dominant cost: generation tokens · cost with credits: $0

TL;DR

A Bedrock RAG bill is five line items: (1) a one-time cost to embed your corpus, (2) the always-on vector store, (3) per-query question embeddings (negligible), (4) per-query re-ranking, and (5) per-query generation — which is almost always the largest. Embeddings are cheap; the vector store is a fixed monthly baseline you pay even at zero traffic; generation scales with tokens × query volume.
The number that surprises everyone is the vector store. OpenSearch Serverless bills by OpenSearch Compute Units (OCUs) with a minimum floor — representatively on the order of a few hundred dollars a month even for a tiny corpus that is idle. That floor often dwarfs the embedding and inference cost of a small RAG system, and it is the single biggest reason a "small" RAG proof-of-concept costs more than people expect.
Rough shape, representative 2026: a small internal RAG (a few thousand docs, low query volume) is dominated by the vector-store floor and lands in the low-to-mid hundreds of dollars a month; a mid-size production system (hundreds of thousands of docs, tens of thousands of queries) runs into the high hundreds to low thousands, generation-dominated. That gap is exactly what AWS credits cover — CloudRoute routes you to the credit pool (Activate up to $100K, Bedrock/GenAI POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted partner who builds it, so you pay $0.

the anatomy of the bill

I. The five line items in a Bedrock RAG bill

Almost every team budgets a RAG system by pricing the generation model and stops there. That misses more than half the bill. A RAG system has five distinct cost lines, each priced on a different basis, and two costs are the ones that surprise people: the vector-store baseline and re-embedding (the recurring side of Line 1). Get all five on the table before you build.

It helps to separate the five lines by when you pay them, because the timing is what makes the bill hard to reason about. One is tied to your data (you pay when you index or re-index, regardless of traffic), one is a continuous baseline (you pay every hour the system exists, even at zero queries), and three are per-query (they scale with how much the system is used). A RAG bill that looks wrong is almost always a confusion between these three timing buckets.

Line 1 — embeddings (indexing). Before anyone asks a question, every chunk of your corpus is turned into a vector by an embedding model (Amazon Titan Text Embeddings v2 or Cohere Embed on Bedrock). You pay per input token embedded, once, when you first index the corpus. Embedding is among the cheapest things on Bedrock — representatively cents-to-low-dollars per million tokens — so even a large corpus is usually a one-time charge in single-to-low-double-digit dollars. This line is almost never the problem; people just forget it exists.

Line 2 — the vector store (continuous baseline). The vectors have to live somewhere that can answer nearest-neighbour queries, and that store runs 24/7 whether or not anyone uses it. This is the line that breaks budgets, because the common default on AWS — Amazon OpenSearch Serverless — has a minimum capacity floor (see §III). It is a fixed monthly cost decoupled from both corpus size at small scale and from traffic. For a small system it is frequently the largest line on the entire bill.

Line 3 — query embeddings (per query). Every incoming question is embedded with the same model before retrieval. This is one short embedding call per query — genuinely negligible (fractions of a cent), even at high volume. Mentioned for completeness; it never moves the total.

Line 4 — re-ranking (per query). If you re-rank (and you usually should — it is the highest-leverage retrieval-quality step), each query sends its top-K candidate chunks through a cross-encoder re-ranker (Amazon Rerank or Cohere Rerank on Bedrock), billed per query and per the volume of text re-ranked. Modest per query, but it scales with query volume and with how wide a net you re-rank.

Line 5 — generation (per query, usually the largest at scale). The final step sends the system prompt plus the re-ranked chunks plus the question to a generation model (Claude, Amazon Nova, Llama, Mistral on Bedrock) and pays for input and output tokens. In RAG, the input is inflated by the retrieved context — often several thousand tokens of chunks per question — so generation is typically the dominant cost once query volume is non-trivial. It is also the line with the most levers (model choice, chunk count, prompt caching, max-output).

the three timing buckets

Data-tied (Line 1 and re-embedding): paid when you index or re-index, regardless of traffic. Continuous baseline (Line 2, vector store): paid every hour the system exists, even at zero queries; this is the one that surprises people. Per-query (Lines 3, 4, 5): scale with usage; generation dominates. Confusing these three is why RAG bills feel unpredictable.

line 1 + re-embedding

II. The one-time embedding cost, and re-embedding on updates

The first real number to compute is what it costs to embed your corpus once. It is almost always small, but it is the cleanest place to start because the math is exact: corpus size in tokens × the embedding price per token. The subtler cost is re-embedding, which turns a one-time charge into a recurring one whenever your data changes.

Start by sizing the corpus in tokens, not documents. A rough rule: 1,000 tokens ≈ 750 words, and a typical text-heavy page is 500–800 words, so call it ~1,000 tokens per page as a planning figure. A 10,000-document corpus averaging 5 pages each is therefore on the order of 50 million tokens. Chunking adds a little overhead (chunk boundaries and any repeated titles in metadata), but for budgeting, total corpus tokens is the right input.

Now multiply by the embedding price. Representatively in 2026, Amazon Titan Text Embeddings v2 lands around $0.00002 per 1K tokens (i.e. ~$0.02 per million); Cohere Embed is in a similar low band. So embedding that 50M-token corpus costs roughly 50 × $0.02 ≈ $1 — a rounding error. Even a very large corpus of 1 billion tokens is on the order of $20 as a one-time charge. This is why embedding is never the line to optimise for cost; it is cheap by design because AWS wants your data indexed and resident.
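The arithmetic above can be sketched in a few lines of Python. The rate is the representative figure quoted in this section rather than a live price, and the function and variable names are hypothetical.

```python
# Representative 2026 rate quoted above; an assumption, not a live price.
PRICE_PER_1K_TOKENS = 0.00002


def embedding_cost(corpus_tokens: int, price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """One-time cost in dollars to embed a corpus of the given token count."""
    return corpus_tokens / 1_000 * price_per_1k


# Planning figure from the text: ~1,000 tokens per page.
docs, pages_per_doc, tokens_per_page = 10_000, 5, 1_000
corpus_tokens = docs * pages_per_doc * tokens_per_page  # 50 million tokens

print(f"{corpus_tokens:,} tokens -> ${embedding_cost(corpus_tokens):.2f}")
print(f"1,000,000,000 tokens -> ${embedding_cost(1_000_000_000):.2f}")
```

Swapping in the current price from the Bedrock pricing page is the only change needed to turn this from an illustration into a budget line.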

Embedding dimensionality is a lever, but it is a storage and latency lever, not an embedding-cost one. Titan v2 lets you choose output dimensions (e.g. 256 / 512 / 1024); smaller vectors cost the same to generate but take less room in the vector store and search faster. So dimensionality affects Line 2 (the vector store) and retrieval latency — not the one-time embedding bill in any meaningful way.
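To see why dimensionality is a storage lever, here is a rough sketch of the raw vector footprint at each output size. It assumes 32-bit floats and a hypothetical corpus of one million chunks, and it ignores index overhead (graph structures, metadata, replicas), which a real store adds on top.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as 32-bit floats


def raw_vector_bytes(num_chunks: int, dimensions: int) -> int:
    """Bytes needed for the vectors alone, before any index overhead."""
    return num_chunks * dimensions * BYTES_PER_FLOAT32


num_chunks = 1_000_000  # hypothetical corpus size in chunks
for dims in (256, 512, 1024):
    gib = raw_vector_bytes(num_chunks, dims) / 2**30
    print(f"{dims:>4} dims: {gib:.2f} GiB of raw vectors")
```

The footprint scales linearly with dimensions, so 256-dimension vectors take a quarter of the space of 1024-dimension ones; whether retrieval quality holds up at the smaller size is something to benchmark on your own data.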

Re-embedding — the recurring cost hiding inside the one-time one

A corpus is never static. Documents are added, edited, and removed, and every changed chunk must be re-embedded to stay searchable. Incremental re-embedding — only re-embedding what changed — keeps this trivial: if 5% of a 50M-token corpus changes each month, that is 2.5M tokens, or about 5 cents. With incremental sync (Bedrock Knowledge Bases does this automatically; a DIY pipeline triggers it on an S3 change event), re-embedding is a non-event.

The expensive version is a full re-embed, and there is exactly one situation that forces it: changing the embedding model or its version. Because queries and documents must be embedded by the same model, switching from Titan v2 to Cohere (or to a new Titan version) means re-embedding the entire corpus from scratch. For a 1B-token corpus that is a ~$20 one-time hit — still cheap in absolute terms, but it also means rebuilding the whole index and re-validating retrieval quality, which is the real cost. Treat the embedding-model choice as semi-permanent for this reason, and benchmark before you commit.
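The gap between the two cases is easy to quantify. A minimal sketch, using the representative rate from this section and a hypothetical helper name:

```python
PRICE_PER_1K_TOKENS = 0.00002  # representative rate from this section, not a live price


def reembed_cost(corpus_tokens: int, changed_fraction: float) -> float:
    """Cost in dollars to re-embed the changed share of a corpus."""
    if not 0.0 <= changed_fraction <= 1.0:
        raise ValueError("changed_fraction must be between 0 and 1")
    return corpus_tokens * changed_fraction / 1_000 * PRICE_PER_1K_TOKENS


incremental = reembed_cost(50_000_000, 0.05)  # 5% of a 50M-token corpus changes
full = reembed_cost(1_000_000_000, 1.0)       # model change: re-embed everything

print(f"incremental sync: ${incremental:.2f}/month")
print(f"full re-embed:    ${full:.2f} one-time")
```

Neither number is large; the token cost is not what makes a full re-embed painful, the index rebuild and re-validation are.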

embedding cost, in one line

One-time embedding ≈ (corpus tokens ÷ 1,000) × ~$0.00002. A 50M-token corpus ≈ $1; a 1B-token corpus ≈ $20. Incremental re-embedding on updates is a few cents. A full re-embed (only forced by changing the embedding model) repeats the one-time cost and rebuilds the index. Representative 2026 figures — confirm on the AWS Bedrock pricing page.

line 2 — the surprise

III. The vector store: the always-on cost that surprises everyone

This is the section that changes how teams budget RAG. Unlike embeddings and inference, the vector store is a continuous cost you pay every hour the system exists — and the most common option on AWS, OpenSearch Serverless, has a minimum floor that makes a tiny, idle RAG system cost far more than its token usage suggests. Understanding the cost shape of each store is the difference between a $30/month surprise and a $700/month one.

There are four mainstream vector stores on AWS, and they have fundamentally different cost shapes. Two bill on reserved/baseline capacity (you pay for headroom whether you use it or not), and two bill closer to actual usage. The store you pick is the single biggest swing on a small RAG system's monthly bill — bigger than the model choice — because at low traffic the store baseline dominates everything else.

OpenSearch Serverless — the default, and the one with the floor

Amazon OpenSearch Serverless is the default vector store behind Bedrock Knowledge Bases, and the one most teams land on. It bills by OpenSearch Compute Units (OCUs) — separate units for indexing and for search — plus storage. The catch is the minimum: a collection requires a baseline of OCUs to exist, and that baseline is representatively on the order of a few hundred dollars per month even for a small corpus serving little traffic. (As of 2026, AWS lowered the entry point for dev/test collections, but a production-grade, redundant collection still carries a meaningful floor — check the current OpenSearch Serverless pricing page for the exact OCU minimum and hourly rate.)

The practical consequence: for a small internal knowledge base — a few thousand documents, a few hundred queries a day — the OpenSearch Serverless baseline is frequently the largest single line on the bill, dwarfing the dollar or two of embeddings and the handful of dollars of monthly inference. This is the number one "why is my RAG POC so expensive?" answer. It does not mean OpenSearch is wrong — it scales beautifully and gives you native hybrid search — but it means a tiny RAG system is paying for capacity it is not using.
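The floor is simple to estimate once you know two numbers: the minimum OCUs a collection holds and the hourly OCU rate. The sketch below uses placeholder values for both (2 OCUs at $0.24 per OCU-hour). These are assumptions chosen to land in the "few hundred dollars" class described above, so substitute the figures from the current OpenSearch Serverless pricing page.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8,760 / 12)

# Placeholder assumptions, NOT quoted prices: check the AWS pricing page.
ASSUMED_MIN_OCUS = 2.0
ASSUMED_PRICE_PER_OCU_HOUR = 0.24


def ocu_floor_per_month(min_ocus: float, price_per_ocu_hour: float) -> float:
    """Monthly cost of the always-on OCU baseline, excluding storage."""
    return min_ocus * price_per_ocu_hour * HOURS_PER_MONTH


floor = ocu_floor_per_month(ASSUMED_MIN_OCUS, ASSUMED_PRICE_PER_OCU_HOUR)
print(f"Idle baseline: ~${floor:.0f}/month before storage")
```

Because the result does not depend on corpus size or query count, it is the same for an idle proof-of-concept as for a lightly used production system.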

Aurora PostgreSQL (pgvector) — pay for the database you may already run

Aurora PostgreSQL with the pgvector extension stores vectors in a relational database you may already operate. Cost is the Aurora instance or, with Aurora Serverless v2, Aurora Capacity Units (ACUs) that scale with load (down to a low minimum), plus storage and I/O. If you already run Aurora for application data, the marginal cost of adding vectors can be near zero — no new system, no separate baseline. Even standalone, a small Aurora Serverless v2 configuration can sit below the OpenSearch Serverless floor for small corpora, which is why pgvector is the usual recommendation when cost-at-small-scale matters and you do not need OpenSearch's native hybrid search.

Pinecone — usage-based, third-party, via Marketplace

Pinecone is a vector-native managed database available through the AWS Marketplace and selectable in Bedrock Knowledge Bases. Its serverless tier bills closer to actual usage (storage + reads + writes) with a low or no idle floor, which can make a small index genuinely cheap, while large or high-QPS workloads are priced on consumption. It is a third-party service billed separately (Marketplace billing can route through your AWS invoice and, usefully, can be covered by AWS credits in many cases). The trade is that data lives in a non-AWS-native service.

Amazon MemoryDB / Redis — fast, but RAM is the cost

Redis-based vector search (Amazon MemoryDB) is in-memory, so it delivers single-digit-millisecond retrieval — ideal for real-time chat and agent loops. The cost shape is per node, priced by RAM, and because your entire index must fit in memory, cost scales directly with corpus size and is the most expensive option for large archival corpora. Reach for it when latency is the binding constraint, not when cost is.

vector-store cost shapes for rag · representative as of 2026 — check the AWS pricing page for current rates

Vector store	Billed on	Idle floor	Small corpus (~thousands of docs)	Large corpus / high QPS	When it is the cheap choice
OpenSearch Serverless	OCUs (index + search) + storage	Meaningful (a few hundred $/mo class)	Floor dominates — often the biggest line	Scales well; cost grows with OCUs	Mid/large corpora; want native hybrid search
Aurora pgvector	Instance or ACU + storage + I/O	Low (Serverless v2 scales down)	Often cheapest; ~zero marginal if Aurora already runs	Good into millions of vectors; specialist wins beyond	You already run Postgres; cost-at-small-scale matters
Pinecone (Marketplace)	Usage: storage + reads + writes	Low / none (serverless tier)	Can be very cheap when idle	Consumption-priced; predictable at scale	Small idle indexes; want zero infra to operate
MemoryDB / Redis	Per node, priced by RAM	High (must keep nodes hot)	Overkill unless latency-critical	Most expensive — whole index in RAM	Latency is the binding constraint, not cost

For a SMALL RAG system the vector store is usually the largest line on the bill, and the choice between OpenSearch Serverless and pgvector can swing the monthly cost by hundreds of dollars. For a LARGE system, generation tokens take over as the dominant cost and the store becomes a smaller fraction. Pinecone Marketplace billing can often be covered by AWS credits.

lines 3–5

IV. Per-query cost: retrieval, re-ranking, and generation

The per-query lines are what turn a fixed monthly baseline into a bill that grows with usage. Three things happen on every question — a query embedding, an optional re-rank, and a generation call — and the third one dominates. The key insight for RAG specifically: retrieved context inflates the input token count, so generation in RAG costs more per query than a bare chatbot.

Walk a single query through the pipeline and price each step. Query embedding: one short embedding call (the question, ~50–200 tokens) at ~$0.00002/1K — a small fraction of a cent. Across 100,000 queries a month that is still pennies. Ignore it in any budget.

Re-ranking: the retriever returns a wide net (say top-30 to top-50 chunks), and the re-ranker scores each against the question. Re-rank pricing is representatively per query (sometimes per the volume of text scored), and lands in the small-fraction-of-a-cent to low-cents range per query depending on how wide a net you re-rank. At 100,000 queries/month this is typically a few dollars to low tens of dollars — real but rarely dominant. The lever is obvious: re-rank top-30, not top-300, and skip re-ranking on trivial queries.

Generation — the line that matters. This is where RAG diverges from a plain chatbot. The input to the generation model is not just the question; it is the system prompt + the re-ranked chunks + the question. Those chunks are commonly 2,000–4,000 tokens of retrieved context per query — far more than the question itself. So a RAG query's input is dominated by context, and since you pay per input token, the amount of context you pass is a direct cost lever, not just a quality one. Output is whatever the model writes (often 300–800 tokens for a grounded answer).

Make it concrete with a representative mid-tier setup: Claude Sonnet-class generation, ~3,500 input tokens (≈3,000 of context plus ~500 of system prompt and question) and ~600 output tokens per query. At representative Sonnet rates ($3/1M input, $15/1M output): input ≈ $0.0105, output ≈ $0.009, so roughly $0.02 per query. At 20,000 queries/month that is ~$400; at 100,000 it is ~$2,000, for generation alone. Swap to a cheaper model for the easy questions (Nova Lite, Claude Haiku) and the per-query cost of those questions drops by 5–20×; that is the single biggest per-query lever.
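The same calculation as a small sketch, so the token counts and rates can be swapped for your own. The rates are the representative Sonnet-class figures from the paragraph above; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRates:
    """Price in dollars per one million tokens."""
    input_per_million: float
    output_per_million: float


# Representative rates from the text, not a live quote.
SONNET_CLASS = ModelRates(input_per_million=3.0, output_per_million=15.0)


def cost_per_query(rates: ModelRates, input_tokens: int, output_tokens: int) -> float:
    """Generation cost in dollars for a single RAG query."""
    return (
        input_tokens * rates.input_per_million
        + output_tokens * rates.output_per_million
    ) / 1_000_000


per_query = cost_per_query(SONNET_CLASS, input_tokens=3_500, output_tokens=600)
for queries in (20_000, 100_000):
    print(f"{queries:>7,} queries/month -> ${per_query * queries:,.0f}")
```

The unrounded figure is $0.0195 per query, so the monthly totals come out at about $390 and $1,950; the prose rounds to $0.02, $400 and $2,000.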

why RAG generation costs more than a chatbot

A bare chatbot pays for a short prompt; a RAG query pays for the retrieved context too — commonly 2,000–4,000 input tokens of chunks per question on top of the prompt. Because you pay per input token, how much context you pass is a cost decision. Fewer, tighter, re-ranked chunks and prompt caching on the static system prompt are the two levers that cut the biggest line.
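A quick way to see the context lever is to hold everything else fixed and vary only the number of chunks passed to the model. The chunk size and fixed prompt size below are illustrative assumptions, and the input rate is the representative Sonnet-class figure used in this section.

```python
INPUT_PRICE_PER_MILLION = 3.0  # representative Sonnet-class input rate
FIXED_PROMPT_TOKENS = 500      # assumed: system prompt plus question
TOKENS_PER_CHUNK = 500         # hypothetical average chunk size


def input_cost_per_query(num_chunks: int) -> float:
    """Input-token cost in dollars for one query with num_chunks of context."""
    tokens = FIXED_PROMPT_TOKENS + num_chunks * TOKENS_PER_CHUNK
    return tokens * INPUT_PRICE_PER_MILLION / 1_000_000


queries_per_month = 100_000
for chunks in (4, 6, 9):
    monthly = input_cost_per_query(chunks) * queries_per_month
    print(f"{chunks} chunks: ${monthly:,.0f}/month of input tokens")
```

Under these assumptions, going from nine chunks to four halves the input bill, which is why re-ranking down to a few tight chunks pays for itself at volume.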

the worked numbers

V. RAG cost across corpus sizes and query volumes

This is the table most people come for: representative all-in monthly cost for RAG systems of different sizes, broken into the lines that move. It shows the two regimes clearly — small systems are dominated by the vector-store baseline, large systems by generation tokens — and where the crossover happens. Figures are illustrative 2026 estimates to show shape and order of magnitude, not quotes.

Read the table as four representative systems, smallest to largest. The embedding column is the one-time corpus embedding (amortised, it is a rounding error monthly). The vector store column assumes OpenSearch Serverless near its floor for the small systems and growing capacity for the large ones — choosing pgvector instead would cut the small-system rows substantially (shown in the note). The generation column assumes a sensible model mix (cheap model for most queries, a frontier model for the hard minority) at ~$0.01–$0.02 blended per query. The takeaway is the shape: the bill barely moves with corpus size but moves a lot with query volume and model choice.

representative all-in monthly bedrock rag cost by corpus size × query volume · 2026 illustration — check the AWS pricing page

System	Corpus	Queries / mo	Embedding (one-time)	Vector store / mo	Re-rank / mo	Generation / mo	Approx. total / mo
Small internal KB	~5K docs (~25M tok)	~3,000	~$0.50 once	~$350 (OSS floor)	~$1	~$30 (Haiku/Nova)	~$380 / mo
Team support assistant	~50K docs (~250M tok)	~20,000	~$5 once	~$400 (OSS)	~$8	~$400 (Sonnet mix)	~$810 / mo
Mid-size production	~500K docs (~2.5B tok)	~100,000	~$50 once	~$700 (OSS, more OCUs)	~$40	~$2,000 (mixed)	~$2,740 / mo
Large multi-tenant	~5M docs (~25B tok)	~500,000	~$500 once	~$2,000 (OSS at scale)	~$200	~$9,000 (mixed)	~$11,200 / mo

Illustrative 2026 figures, OpenSearch Serverless assumed for the vector store. Two big swings: (1) on the small rows, switching to Aurora pgvector can cut the vector-store line from ~$350 to well under $100 — often halving the small-system total; (2) on the large rows, generation dominates, so model routing (cheap model for the easy 80%) and prompt caching move the total far more than the store does. Re-embedding on updates (incremental) is pennies and omitted; a full re-embed repeats the one-time embedding cost.
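The table's totals and the shift between regimes can be reproduced with a short sketch. The monthly figures are copied from the table above and are illustrative, and the structure and names are hypothetical.

```python
# Monthly lines in dollars, copied from the table above (illustrative figures).
SYSTEMS = {
    "Small internal KB":      {"vector_store": 350,   "rerank": 1,   "generation": 30},
    "Team support assistant": {"vector_store": 400,   "rerank": 8,   "generation": 400},
    "Mid-size production":    {"vector_store": 700,   "rerank": 40,  "generation": 2_000},
    "Large multi-tenant":     {"vector_store": 2_000, "rerank": 200, "generation": 9_000},
}


def monthly_total(lines: dict) -> int:
    """Sum of the recurring monthly lines (one-time embedding excluded)."""
    return sum(lines.values())


def share(lines: dict, key: str) -> float:
    """Fraction of the monthly total taken by one line."""
    return lines[key] / monthly_total(lines)


for name, lines in SYSTEMS.items():
    print(
        f"{name:<24} ${monthly_total(lines):>6,}/mo  "
        f"store {share(lines, 'vector_store'):.0%}  "
        f"generation {share(lines, 'generation'):.0%}"
    )
```

The store's share falls from over 90% on the smallest row to under 20% on the largest, while generation's share moves in the opposite direction.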

who orchestrates the stages

VI. Managed Knowledge Bases vs DIY: the cost difference

A common question is whether Amazon Bedrock Knowledge Bases (the managed RAG service) costs more than building the pipeline yourself. The honest answer surprises people: the managed service does not add a large markup on the AWS resources — you pay for the same embeddings, the same vector store, and the same generation tokens either way. The real cost difference is engineering time and the freedom to optimise.

Bedrock Knowledge Bases is largely a pass-through on the underlying resources. When you use it, you still pay Bedrock for embedding tokens, you still provision and pay for the vector store (your choice of OpenSearch Serverless, Aurora pgvector, Pinecone, or Redis), and you still pay generation tokens on every RetrieveAndGenerate call. The managed convenience is not priced as a big per-query premium; what you give up is the ability to hand-tune every stage. So on pure infrastructure dollars, managed and DIY are close for the same architecture.
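For orientation, this is roughly what a RetrieveAndGenerate request looks like through boto3's `bedrock-agent-runtime` client. The knowledge base ID and model ARN are placeholders, the helper name is hypothetical, and the call itself is left commented out because it needs AWS credentials and a provisioned Knowledge Base.

```python
def build_rag_request(question: str, knowledge_base_id: str, model_arn: str, top_k: int = 4) -> dict:
    """Keyword arguments for bedrock-agent-runtime retrieve_and_generate."""
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
                # Fewer retrieved chunks means fewer input tokens billed.
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {"numberOfResults": top_k}
                },
            },
        },
    }


request = build_rag_request(
    "What is our refund policy?",
    knowledge_base_id="KB_ID_PLACEHOLDER",    # placeholder, not a real ID
    model_arn="MODEL_ARN_PLACEHOLDER",        # placeholder, not a real ARN
)

# With credentials and a provisioned Knowledge Base, the call would be:
#   import boto3
#   client = boto3.client("bedrock-agent-runtime")
#   response = client.retrieve_and_generate(**request)
#   print(response["output"]["text"])
```

Every such call is billed for the query embedding, the vector-store lookup and the generation tokens, which is the pass-through described above.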

Where they diverge is on the two things that are not on the AWS invoice. First, engineering cost: managed ships in hours with almost no pipeline to maintain, while DIY is days-to-weeks of build plus ongoing maintenance of the parser, chunker, retriever, and re-ranker. For most teams that engineering time is the larger real cost, and it favours managed decisively. Second, optimisation headroom: DIY lets you squeeze the AWS bill in ways managed cannot — custom chunking to cut context tokens, your own hybrid-search tuning, sharing one vector store across many indexes to amortise the baseline, aggressive caching, and routing each query to the cheapest adequate model. At very high volume, that headroom can pay for the extra engineering.

The practical rule mirrors the build decision: start managed (the infrastructure cost is essentially the same and you ship far faster), and move specific stages to DIY only when volume is high enough that the optimisation headroom — not the managed markup, which is small — justifies the engineering. The cross-over is about scale and the value of fine-grained cost control, not about Knowledge Bases being expensive.

the managed-vs-DIY cost myth

Managed Bedrock Knowledge Bases does not add a large markup — you pay for the same embeddings, vector store, and generation tokens either way. The true difference is engineering time (managed wins big — hours vs weeks) versus optimisation headroom (DIY wins — you can hand-tune every stage to cut the AWS bill at high volume). Choose on scale and control, not on a per-query premium.

the gotchas

VII. The cost surprises: what blows up a RAG budget

Most RAG budget overruns trace to a short list of avoidable surprises. None is exotic; each is a place where the cost is decoupled from what you think you are paying for. Knowing them in advance is worth more than any single price figure.

The OpenSearch Serverless minimum — The number one surprise. A production OSS collection carries a baseline of OCUs you pay 24/7 — representatively a few hundred dollars a month — independent of corpus size at small scale and of traffic. A "small" RAG POC can cost more in idle vector-store capacity than in everything else combined. Fix: use Aurora pgvector for small/early systems, share one collection across indexes, or use a usage-priced store (Pinecone serverless) until volume justifies OSS.
Context bloat in generation — Every query pays for the retrieved chunks as input tokens. Passing 8–10 fat chunks "to be safe" can triple the input cost per query versus 3–5 tight, re-ranked ones — across 100K queries that is real money. Fix: re-rank down to the few best chunks, use hierarchical chunking, and cap retrieved context.
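Frontier model for every query — Using a top-tier model for all traffic when most questions are easy is the most expensive default. The easy 80% of queries answer fine on a cheap model. Fix: a tiered router (cheap model first, frontier model only for the hard minority) cuts generation cost several-fold. With exactly 80% routed the saving tops out below 5×, because the frontier 20% still costs full price; reaching 5–10× means routing closer to 90–95% of traffic to the cheap model.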
Not using prompt caching on the system prompt — RAG sends the same long system prompt (and sometimes shared context) on every call. Without prompt caching you re-pay full input price for it every single query. Fix: enable Bedrock prompt caching for the static prefix — on high-volume systems it cuts a large slice of input cost.
Full re-embeds you did not budget for — Incremental re-embedding is pennies, but changing the embedding model forces a full corpus re-embed and an index rebuild. The token cost is small; the surprise is the operational work and the re-validation. Fix: treat the embedding model as semi-permanent and benchmark before committing.
Forgetting the supporting services — S3 storage, data-transfer, Lambda/Glue for a DIY ingestion pipeline, CloudWatch logging, and Textract for PDF parsing all add cost around the core five lines. Individually small, collectively a tail. Fix: include them in the budget from the start.
Over-provisioned vector store at launch — Sizing the vector store for "peak imagination" rather than actual corpus size means paying for capacity you will not use for months. Fix: right-size to current corpus and traffic, and scale the store as the corpus grows — both OSS and Aurora Serverless scale up later.
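How much a tiered router saves depends on two things: how much cheaper the cheap model is, and what share of traffic it absorbs. The sketch below computes the blended per-query cost for a few combinations. The $0.02 frontier figure is the Sonnet-class estimate from section IV, and the price ratios and routing shares are illustrative assumptions.

```python
def blended_cost(frontier_cost: float, cheap_cost: float, cheap_share: float) -> float:
    """Average cost per query when cheap_share of traffic goes to the cheap model."""
    return cheap_share * cheap_cost + (1.0 - cheap_share) * frontier_cost


FRONTIER_COST = 0.02  # ~dollars per query, Sonnet-class estimate from section IV

for cheap_share in (0.80, 0.95):
    for price_ratio in (10, 20):  # assumed: how many times cheaper the cheap model is
        blended = blended_cost(FRONTIER_COST, FRONTIER_COST / price_ratio, cheap_share)
        print(
            f"{cheap_share:.0%} routed, cheap model {price_ratio}x cheaper: "
            f"${blended:.4f}/query, {FRONTIER_COST / blended:.1f}x saving"
        )
```

The frontier minority sets the floor: at an 80% cheap share the saving stays below 5× however cheap the small model is, so larger savings come from routing more traffic, not from a cheaper cheap model.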

how it becomes $0

VIII. How AWS credits make the whole RAG bill $0 to build

Every number above prices RAG if you pay AWS directly. For most startups and many companies the relevant figure is different, because all five RAG cost lines are credit-eligible — and AWS will frequently fund the entire build with credits that draw down before your card is ever touched.

All five lines — embeddings, the vector store, query embeddings, re-ranking, and generation — plus the supporting services draw down AWS credits automatically against your bill until exhausted. (Pinecone via Marketplace can often be covered too.) The relevant pools: AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups); a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed squarely at proving out a GenAI use case like RAG; and the competitive Generative AI Accelerator (credit awards up to $1M for a small cohort of AI-first startups). For a mid-size RAG system at a few thousand dollars a month, even the POC pool alone funds many months of operation.
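To turn "many months" into a number, divide the credit pool by the monthly bill. The sketch uses the mid-size production total from the cost table ($2,740 a month) and the pool sizes named above; it assumes a flat bill, which a growing system will not have.

```python
def runway_months(credit_pool: float, monthly_bill: float) -> float:
    """Months a credit pool lasts at a constant monthly bill."""
    if monthly_bill <= 0:
        raise ValueError("monthly_bill must be positive")
    return credit_pool / monthly_bill


MONTHLY_BILL = 2_740  # mid-size production row of the cost table, illustrative

for pool in (10_000, 50_000, 100_000):
    print(f"${pool:>7,} pool -> {runway_months(pool, MONTHLY_BILL):.1f} months")
```

At that bill the POC pool covers roughly 4 to 18 months depending on where in the $10K–$50K range the award lands.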

The practical mechanic is that most of these pools are partner-filed — requested through the AWS Partner Network (the ACE program), not a public self-serve form. That is why teams route through an AWS partner rather than applying alone, and it is the gap CloudRoute fills. CloudRoute matches you to the right credit pool for your stage and to a vetted AWS GenAI/ML partner who both files the credit application and builds the RAG system — the ingestion, chunking, embeddings, vector store, re-ranking, grounded-answer prompting, access control, and evaluation. The customer pays $0: AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice.

Put together with the cost levers above, the picture for a startup is: build the RAG system aggressively on Bedrock, draw down a $25K–$100K credit pool while you prove the use case and find product-market fit, and only start paying real money once usage — and ideally revenue — has scaled past the credits. Related: see the cross-cluster pages on AWS credits for generative-AI startups and on Bedrock POC funding for the full credit mechanics.

the regime shift

Small RAG vs large RAG — where the money actually goes

The most useful thing to internalise about RAG cost is that the dominant line changes completely as you scale. A small system and a large system have the same five line items but a totally different cost profile, and optimising the wrong one wastes effort. The table below shows the same five lines as a share of the bill in each regime. Figures are representative 2026 illustrations.

Cost line	Small RAG (idle-ish POC)	Mid-size production	Large / high-volume	Primary lever
Embeddings (one-time)	Rounding error	Rounding error	Small (~hundreds once)	Incremental sync; avoid full re-embeds
Vector store (baseline)	Dominant — the whole bill	Significant	Smaller share of a big bill	pgvector at small scale; right-size OCUs
Query embeddings	Negligible	Negligible	Negligible	None needed
Re-ranking	Tiny	Small	Modest	Re-rank a narrow net; skip trivial queries
Generation	Tiny	Largest line	Dominant — the whole bill	Model routing; fewer chunks; prompt caching

The crossover: at small scale, fix the vector store (it IS the bill — pgvector or a usage-priced store). At large scale, fix generation (model routing + prompt caching + tight context move the number far more than the store). Optimising generation on a tiny idle POC, or the store on a high-volume system, is effort spent on the wrong line.
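The crossover point itself is a one-line calculation: the query volume at which monthly generation spend equals the vector-store baseline. The sketch uses the illustrative ~$350 floor and the $0.01 to $0.02 blended per-query range from section V; the function name is hypothetical.

```python
def crossover_queries(store_floor_per_month: float, cost_per_query: float) -> float:
    """Monthly query volume at which generation spend matches the store baseline."""
    if cost_per_query <= 0:
        raise ValueError("cost_per_query must be positive")
    return store_floor_per_month / cost_per_query


STORE_FLOOR = 350  # illustrative OpenSearch Serverless floor, dollars per month

for label, per_query in (("cheap-model mix", 0.01), ("Sonnet-class", 0.02)):
    volume = crossover_queries(STORE_FLOOR, per_query)
    print(f"{label}: generation overtakes the store at ~{volume:,.0f} queries/month")
```

Below that volume the store is the line to fix; above it, generation is. On these figures the switch happens somewhere between roughly 17,500 and 35,000 queries a month.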


a recent match

A RAG system that priced out at ~$2.6K/month — built on $0 — anonymized

inquiry · Series-A vertical SaaS, knowledge assistant, US

Series-A vertical SaaS, 26 people, ~400K support + product docs, planning an in-product knowledge assistant

Situation: The team had modeled a Bedrock RAG build and landed at roughly $2.6K/month — and they were alarmed less by the generation tokens than by the OpenSearch Serverless baseline, which a first proof-of-concept had run at several hundred dollars a month while serving almost no traffic. A frontier model was being called for every query, the full retrieved context was passed uncapped, and the whole thing was on-demand. They wanted both to bring the number down and to avoid spending runway on it during the prove-out.

What CloudRoute did: CloudRoute matched them in under 24 hours to a US AWS partner with GenAI cost-engineering experience. The partner (1) moved the early-stage vector store to Aurora pgvector to escape the OpenSearch floor while the corpus was still small; (2) added a tiered model router — Nova Lite / Claude Haiku for the easy ~80% of questions, Sonnet only for the hard ones; (3) turned on prompt caching for the static system prompt and re-ranked down to the best 4 chunks to cut context tokens; and (4) filed a Bedrock POC credit application plus an Activate Portfolio application to fund the entire build and early operation.

Outcome: Modeled all-in cost fell from ~$2.6K to ~$650/month — most of the drop from escaping the vector-store floor and model-routing the generation line — and even that was fully covered by the approved credits, so the team paid $0 during the build and early launch. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.

cost cut: ~$2.6K → ~$650/mo modeled · biggest win: vector-store floor + model routing · credits secured: POC + Activate · out-of-pocket during build: $0

faq

Common questions

How much does a Bedrock RAG system cost?

It depends mostly on query volume and which generation model you use, not on corpus size. Representative 2026 shape: a small internal RAG (a few thousand docs, low traffic) is dominated by the vector-store baseline and lands in the low-to-mid hundreds of dollars a month; a mid-size production system (hundreds of thousands of docs, tens of thousands of queries) runs into the high hundreds to low thousands, dominated by generation tokens; a large high-volume system reaches five figures monthly. The five line items are embeddings (one-time, cheap), the vector store (always-on baseline), query embeddings (negligible), re-ranking (modest), and generation (usually the largest). Always confirm current rates on the AWS pricing page.

Why is my small RAG proof-of-concept so expensive?

Almost always the vector store. The common default, Amazon OpenSearch Serverless, bills by OpenSearch Compute Units (OCUs) with a minimum floor — representatively a few hundred dollars a month — that you pay 24/7 regardless of how small your corpus is or how little traffic you serve. On a tiny RAG POC that idle baseline frequently costs more than the embeddings and inference combined. The fix is to use Aurora PostgreSQL with pgvector for small/early systems (often well under $100/month, or near-zero marginal if you already run Aurora), share one OpenSearch collection across multiple indexes, or use a usage-priced store like Pinecone serverless until volume justifies OpenSearch.

How much does it cost to embed my documents for RAG?

Very little: embedding is one of the cheapest things on Bedrock. The formula is simple: (corpus tokens ÷ 1,000) × the embedding price per 1K tokens (representatively ~$0.00002 for Amazon Titan Text Embeddings v2 in 2026). A 50-million-token corpus (~10,000 documents averaging about 5,000 tokens each) costs roughly $1 to embed once; a 1-billion-token corpus costs about $20. This is a one-time charge. Incremental re-embedding when documents change is a few cents. The only costly case is a full re-embed, which is forced only by changing the embedding model or version, and even then the token charge stays small; the real cost is rebuilding the index and re-validating retrieval quality.
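As a rough sketch, the embedding arithmetic above can be written in a few lines of Python. The function name is hypothetical, and the default rate is the representative Titan Text Embeddings v2 figure quoted in the answer, not a live price, so check the AWS pricing page before relying on the output.

```python
def embedding_cost_usd(corpus_tokens: int, price_per_1k_tokens: float = 0.00002) -> float:
    """One-time cost to embed a corpus: (tokens / 1,000) x price per 1K tokens.

    The default price is the representative figure from the text, not a quoted rate.
    """
    return corpus_tokens / 1_000 * price_per_1k_tokens


if __name__ == "__main__":
    # The two corpus sizes used in the answer above.
    for tokens in (50_000_000, 1_000_000_000):
        print(f"{tokens:>13,} tokens -> ${embedding_cost_usd(tokens):,.2f}")
```

Run against the two corpus sizes above, this reproduces the roughly $1 and $20 figures; swap in your own token count and the current rate to price a real corpus.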

What is the most expensive part of a RAG system on AWS?

It changes with scale. On a small, low-traffic system the vector-store baseline (especially OpenSearch Serverless's OCU minimum) is the dominant cost — often the whole bill. On a mid-size or large system, generation tokens take over and become the largest line, because each RAG query pays for the retrieved context (commonly 2,000–4,000 input tokens of chunks) plus the answer. The practical implication: optimise the vector store at small scale (use pgvector, right-size capacity) and optimise generation at large scale (route easy queries to a cheaper model, re-rank to fewer chunks, enable prompt caching).
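To see where the crossover happens, a minimal sketch can compare a fixed vector-store baseline with generation spend at different query volumes. Everything numeric here is a placeholder assumption rather than a quoted AWS rate: the $350 monthly floor, the 300-token answer length, and both per-1K token prices. Only the 3,000-token context sits inside the 2,000–4,000 range cited above. The class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RagWorkload:
    queries_per_month: int
    context_tokens: int = 3_000          # retrieved chunks per query (text cites 2,000-4,000)
    answer_tokens: int = 300             # assumed answer length
    vector_store_floor: float = 350.0    # assumed fixed monthly baseline, USD
    input_price_per_1k: float = 0.003    # placeholder input-token rate, USD
    output_price_per_1k: float = 0.015   # placeholder output-token rate, USD

    def generation_cost(self) -> float:
        """Monthly generation spend: per-query token cost times query volume."""
        per_query = (
            self.context_tokens / 1_000 * self.input_price_per_1k
            + self.answer_tokens / 1_000 * self.output_price_per_1k
        )
        return per_query * self.queries_per_month

    def dominant_line(self) -> str:
        """Which of the two modeled lines is larger this month."""
        if self.generation_cost() > self.vector_store_floor:
            return "generation"
        return "vector store"


if __name__ == "__main__":
    for volume in (2_000, 50_000):
        w = RagWorkload(queries_per_month=volume)
        print(f"{volume:>6,} queries: generation ${w.generation_cost():,.2f} "
              f"vs floor ${w.vector_store_floor:,.2f} -> {w.dominant_line()}")
```

Under these assumed numbers the floor dominates at a couple of thousand queries a month and generation takes over in the tens of thousands. The sketch deliberately ignores embeddings and re-ranking, which the answer above treats as minor lines.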

How much does the vector store cost for RAG on AWS?

It depends entirely on which store you choose. OpenSearch Serverless carries a meaningful minimum (on the order of a few hundred dollars a month) because it bills on OCUs with a baseline floor; it scales well but stings small corpora. Aurora PostgreSQL with pgvector can be much cheaper at small scale (Aurora Serverless v2 scales down, and the marginal cost is near zero if you already run Aurora). Pinecone's serverless tier bills closer to usage with a low idle floor. Amazon MemoryDB/Redis is RAM-priced and the most expensive, justified only when retrieval latency is critical. For a small RAG system this choice is the single biggest swing on the monthly bill, often hundreds of dollars.

Does Bedrock Knowledge Bases cost more than building RAG myself?

Not in infrastructure dollars. Bedrock Knowledge Bases is largely a pass-through on the underlying resources — you pay for the same embedding tokens, the same vector store, and the same generation tokens whether you use the managed service or a DIY pipeline; there is no large per-query managed markup. The real difference is elsewhere: managed wins big on engineering time (ship in hours, almost no pipeline to maintain) while DIY wins on optimisation headroom (you can hand-tune chunking, caching, model routing, and a shared vector store to cut the AWS bill at high volume). Start managed; move stages to DIY only when scale makes the optimisation worth the engineering.

How much does re-embedding cost when my data changes?

Almost nothing for normal updates. With incremental re-embedding — re-embedding only the chunks that changed — updating, say, 5% of a 50-million-token corpus costs a few cents, and Bedrock Knowledge Bases does this automatically on sync (a DIY pipeline triggers it on an S3 change event). The only expensive case is a full re-embed of the entire corpus, which is forced solely by changing the embedding model or its version (because queries and documents must use the same model). Even then the token cost is small (~$20 for a 1B-token corpus); the real cost is rebuilding the index and re-validating retrieval quality, which is why you should treat the embedding model as semi-permanent.
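For a DIY pipeline, one common way to re-embed only what changed is to keep a content hash per chunk and compare on each sync. The sketch below assumes that approach; it is not a description of how Bedrock Knowledge Bases detects changes internally. The function names and the chunk-ID scheme are hypothetical, and the default rate is the representative figure used elsewhere on this page.

```python
import hashlib


def chunk_hash(text: str) -> str:
    """Stable fingerprint of a chunk's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunks_to_reembed(previous_hashes: dict[str, str], current_chunks: dict[str, str]) -> list[str]:
    """Return IDs of chunks that are new or whose content changed since the last sync."""
    return [
        chunk_id
        for chunk_id, text in current_chunks.items()
        if previous_hashes.get(chunk_id) != chunk_hash(text)
    ]


def reembed_cost_usd(changed_tokens: int, price_per_1k_tokens: float = 0.00002) -> float:
    """Token cost of re-embedding only the changed portion (representative rate)."""
    return changed_tokens / 1_000 * price_per_1k_tokens


if __name__ == "__main__":
    previous = {"doc1#0": chunk_hash("alpha"), "doc1#1": chunk_hash("beta")}
    current = {"doc1#0": "alpha", "doc1#1": "beta, revised", "doc2#0": "gamma"}
    print(chunks_to_reembed(previous, current))
    # 5% of a 50-million-token corpus is 2.5 million tokens.
    print(f"${reembed_cost_usd(2_500_000):.2f}")
```

Only the edited and the new chunk are flagged, and the 5% update prices out at about five cents under the assumed rate, consistent with the "few cents" figure above. A fuller pipeline would also remove vectors for chunks that were deleted, which this sketch leaves out.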

Can AWS credits cover RAG costs on Bedrock?

Yes — all five RAG cost lines (embeddings, the vector store, query embeddings, re-ranking, and generation) plus supporting services are credit-eligible and draw down AWS credits automatically against your bill; Pinecone via Marketplace can often be covered too. The relevant pools are AWS Activate (up to $100K), a dedicated Bedrock/GenAI POC pool ($10K–$50K aimed at exactly this kind of use case), and the GenAI Accelerator (up to $1M for selected startups). These are largely partner-filed via the AWS Partner Network, which is why teams route through a partner. CloudRoute matches you to the right pool and a vetted AWS partner who files the application and builds the RAG system — customer pays $0, AWS funds it.

Stop pricing your RAG bill — get it funded

Whatever your Bedrock RAG system would cost — the vector-store baseline, the embeddings, the generation tokens — AWS credits can cover all of it. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS partner to build and cost-tune the pipeline. Customer pays $0.

Get matched in 24h →

→ see the AI-team persona detail

matched within: < 24h · GenAI credit ceiling: up to $1M · cost to you: $0