amazon bedrock embeddings models · the RAG decision · 2026

Bedrock embeddings models — Titan vs Cohere, and how to choose.

A neutral, build-focused reference for the embeddings models on Amazon Bedrock in 2026: what an embeddings model actually does, the choices on offer (Amazon Titan Text Embeddings V2, Cohere Embed English and Multilingual, and Titan Multimodal Embeddings), how they compare on dimensions, max tokens, languages, and normalization — and the part most guides skip: how that one choice quietly drives both your retrieval quality and your vector-store bill, how to match the model to the store you picked, and what re-embedding actually costs if you switch later.

main text models: Titan V2 · Cohere Embed
Titan V2 dimensions: 256 / 512 / 1024
switching models: re-embed everything
cost with credits: $0
TL;DR
  • An embeddings model turns text into a vector — a list of numbers that captures meaning — so a vector store can retrieve passages by similarity. On Bedrock the practical choices are Amazon Titan Text Embeddings V2, Cohere Embed (English and Multilingual), and Titan Multimodal Embeddings for image+text. The model is the quiet foundation of every RAG and semantic-search build.
  • The choice drives two things at once: retrieval quality (does the right passage come back for a query — where Cohere Multilingual leads for many-language corpora, and Titan V2 is a strong, low-cost English default) and vector-store cost (vector dimension × number of chunks = how much you store and search). Titan V2 lets you pick 256, 512, or 1024 dimensions, so you can trade a little accuracy for materially cheaper storage; a 1024-dim model can cost ~4× the storage of a 256-dim one for the same corpus.
  • Match the model to the store you already chose, and decide deliberately up front — the embeddings model and its index are bound together, so changing models later means re-embedding the entire corpus (an embeddings-token bill plus a full re-ingestion) and rebuilding the index. The embeddings tokens, the vector store, and the inference are all AWS-credit-eligible; CloudRoute routes you to the right credit pool (Activate up to $100K, a Bedrock/GenAI POC pool $10K–$50K, the GenAI Accelerator up to $1M) and a vetted partner to build it, so the customer pays $0.
the concept

I. What an embeddings model does, and why the choice matters

Before comparing models it is worth being precise about what an embeddings model is, because the comparison only makes sense once the job is clear. An embeddings model has one job: turn a piece of text into a vector — a fixed-length list of numbers — such that texts with similar meaning produce vectors that sit close together in space.

That single property is what powers semantic search and retrieval-augmented generation. When you ingest a corpus, every chunk of text is passed through the embeddings model and the resulting vector is stored in a vector store alongside the original text. At query time, the question is embedded with the same model, and the store returns the chunks whose vectors are nearest to the query vector — by cosine similarity or a related distance metric. The model never "understands" your documents the way a chat model answers questions; it simply places text in a geometric space where closeness means relatedness. Everything downstream — which passages a RAG system retrieves, how relevant a search result feels — rests on how well that placement reflects real meaning.
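The retrieval step described above can be sketched in a few lines. This is a minimal illustration, not how any particular vector store implements search: the three-dimensional vectors and chunk names are made up for the example, and a real store would use an approximate index rather than scoring every vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec))
              for chunk_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional "embeddings" (hypothetical); real ones have hundreds of dimensions.
store = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "api-rate-limits": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for the embedded question

results = top_k(query, store, k=2)
for chunk_id, score in results:
    print(f"{chunk_id}: {score:.3f}")
```

The important detail is that the query and the chunks are scored in the same space, which is why both must come from the same model.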

This is why the embeddings model is the quiet foundation of a RAG build, and why getting it wrong is expensive in a way that is easy to miss. A weak generation model produces an obviously bad answer you can see and fix. A weak embeddings model fails silently: it retrieves the wrong chunks, the generation model dutifully writes a fluent answer from irrelevant context, and the failure looks like a hallucination rather than a retrieval miss. Retrieval quality caps answer quality — a brilliant model cannot answer from context it never received.

On Amazon Bedrock, embeddings models are first-class: they are exposed through the same InvokeModel API as text models, they are what Bedrock Knowledge Bases uses under the hood to build a managed RAG index, and they are billed on the same per-token basis. The two big things this page helps you decide are therefore (1) which model gives the retrieval quality your corpus needs, and (2) at what vector dimension, because that number sets your storage and search cost. Those two decisions interact, and they are effectively permanent for a given index — which is the recurring theme of the sections below.

One clarification that saves confusion: an embeddings model and a generation (chat) model are different things, even when they share a brand name. "Amazon Titan" and "Cohere" both ship text-generation models and embeddings models; on this page "Titan" and "Cohere" refer to their embeddings families unless stated otherwise. You can freely mix vendors across the two roles — for example, embed with Cohere and generate with Claude — because the embeddings model only touches retrieval, never the final answer.

the one-sentence definition

An embeddings model converts text into a fixed-length vector whose position encodes meaning, so a vector store can retrieve the most relevant passages by similarity. On Bedrock the practical choices are Amazon Titan Text Embeddings V2, Cohere Embed (English / Multilingual), and Titan Multimodal Embeddings for image+text. Pick deliberately: the model and its index are bound together for the life of the corpus.

what is on offer

II. The embeddings models available on Amazon Bedrock

Bedrock offers a small, well-chosen set of embeddings models rather than an overwhelming menu. For text retrieval the decision is effectively Amazon Titan Text Embeddings V2 versus Cohere Embed; for image-aware retrieval there is Titan Multimodal Embeddings. Here is what each one is and where it fits.

All of these run through the same Bedrock surface: your data stays in your account and region, is not used to train the base models, and text embeddings are billed per input token (the output vector itself is not charged; Titan Multimodal is priced on image input as well as text). What differs is dimensionality, token limits, language coverage, and the modality each one handles. The model list and exact specifications evolve, so confirm the current details on the AWS Bedrock model page when you scope a build; the families below are the stable ones to reason about in 2026.

  • Amazon Titan Text Embeddings V2 — Amazon's current general-purpose text embeddings model and the common default on Bedrock. Its defining feature is selectable output dimensions — 256, 512, or 1024 — so you can dial vector size (and therefore storage and search cost) against accuracy. It returns normalized vectors by default, supports a large input token limit per request, and covers many languages with a strong tilt toward English. Inexpensive and well-integrated; the sensible starting point for most English-first corpora.
  • Amazon Titan Text Embeddings (V1) — The earlier generation, fixed at 1536 dimensions. Still available, but for new builds V2 is the better choice — V2 generally matches or beats it on quality while letting you choose a smaller, cheaper dimension. Mostly relevant now if you are maintaining an existing V1 index (which you cannot mix with V2 vectors).
  • Cohere Embed — English — Cohere's English-optimized embeddings model, output dimension 1024, with strong retrieval quality and a useful feature: an "input type" parameter that lets you embed documents and search queries slightly differently (asymmetric embeddings), which can sharpen retrieval. A strong choice when English retrieval quality is the priority and you want a dedicated retrieval-tuned model.
  • Cohere Embed — Multilingual — The same family tuned across 100+ languages, output dimension 1024. This is the usual pick when your corpus or your users span many languages, or when you need cross-lingual retrieval (a query in one language matching a document in another). For non-English-dominant deployments this is frequently the quality winner.
  • Amazon Titan Multimodal Embeddings — Embeds images and text into a shared vector space, so you can search images by text (and vice versa) and build retrieval over mixed media — product catalogs, design libraries, screenshots, diagrams. Output is typically 1024-dim with smaller options. Reach for it only when images are part of what you retrieve; for pure-text RAG it is the wrong tool.

A practical way to read this menu: Titan Text Embeddings V2 is the default — cheap, flexible on dimension, good on English. Cohere Embed Multilingual is the reach-for model when languages matter, and Cohere Embed English is the alternative when you want a retrieval-specialist for an English corpus. Titan Multimodal is a different job entirely — only when images are in scope. Most teams will pick between Titan V2 and a Cohere model, which is exactly the comparison the next sections drill into.
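To make the menu concrete, here is a hedged sketch of what the two request shapes look like through InvokeModel. The model IDs and JSON field names below reflect the Bedrock documentation as understood at the time of writing and should be treated as assumptions to verify against the current AWS docs. The helper names and the default region are hypothetical, and the boto3 call is only reached if you invoke `embed_with_titan` with AWS credentials configured.

```python
import json

# Assumed model IDs and body fields; confirm against the current Bedrock docs.
TITAN_V2_MODEL_ID = "amazon.titan-embed-text-v2:0"
COHERE_MULTILINGUAL_MODEL_ID = "cohere.embed-multilingual-v3"

def titan_v2_body(text, dimensions=512, normalize=True):
    """Build a Titan Text Embeddings V2 request body with a chosen dimension."""
    if dimensions not in (256, 512, 1024):
        raise ValueError("Titan V2 dimensions must be 256, 512, or 1024")
    return json.dumps({"inputText": text, "dimensions": dimensions, "normalize": normalize})

def cohere_body(texts, is_query):
    """Build a Cohere Embed request body; input_type drives asymmetric embeddings."""
    input_type = "search_query" if is_query else "search_document"
    return json.dumps({"texts": list(texts), "input_type": input_type})

def embed_with_titan(text, dimensions=512, region="us-east-1"):
    """Call Bedrock and return one embedding vector (requires boto3 and credentials)."""
    import boto3  # imported lazily so the body builders work without AWS set up
    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.invoke_model(
        modelId=TITAN_V2_MODEL_ID,
        body=titan_v2_body(text, dimensions),
    )
    return json.loads(response["body"].read())["embedding"]

print(titan_v2_body("What is the refund window?", dimensions=256))
print(cohere_body(["Refunds are accepted within 30 days."], is_query=False))
```

Note the difference in shape: Titan takes one text and a dimension, while Cohere takes a batch of texts and an input type that tells the model whether it is embedding documents or a query.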

the specs that matter

III. Dimensions, max tokens, languages, and normalization

Four technical attributes separate these models in ways that actually change a build: how big the output vector is, how much text fits in one call, which languages are covered, and whether vectors come out normalized. Each one has a downstream consequence you should choose on purpose, not by accident.

Dimensions — the single most consequential number

The output dimension is the length of the vector — 256, 512, 1024, or 1536 numbers depending on the model. Larger vectors can encode more nuance, which can lift retrieval quality on hard, subtle corpora. But the dimension is also, directly, your storage and search cost: a vector store holds one vector per chunk, so doubling the dimension roughly doubles the bytes stored and the work done per similarity search. A 1024-dimension model over the same corpus stores ~4× the vector data of a 256-dimension one. Titan Text Embeddings V2's selectable dimensions (256 / 512 / 1024) exist precisely so you can make this trade-off explicitly — many corpora retrieve almost as well at 512 as at 1024 for half the storage. Cohere Embed and Titan V1 are fixed (1024 and 1536 respectively), so with those the dimension is a consequence of the model choice rather than a separate dial.

Max input tokens — how this interacts with chunking

Each model accepts up to a maximum number of input tokens per embedding call. Titan Text Embeddings models accept a large window (thousands of tokens), while Cohere Embed has a smaller per-call token limit. In practice this rarely binds, because you almost always chunk documents into pieces far smaller than any of these limits before embedding — retrieval works best on focused chunks, not whole documents. The limit matters mainly in two cases: if you deliberately embed long passages (e.g. with hierarchical chunking returning large parents), confirm they fit; and if a model truncates over-long input silently, an oversized chunk loses its tail. The clean rule: size your chunks for retrieval quality (typically a few hundred tokens), and the token limit becomes a non-issue.
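A minimal sketch of the "size chunks for retrieval" rule follows. It assumes whitespace-separated words as a rough stand-in for tokens, which real tokenizers do not match exactly, and the `max_words` and `overlap` defaults are illustrative rather than recommended values.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into overlapping word windows of at most max_words each.

    Words are used here as an approximation of tokens; leave headroom
    below the model's real token limit when choosing max_words.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks

document = " ".join(f"word{i}" for i in range(450))  # a synthetic 450-word document
chunks = chunk_words(document, max_words=200, overlap=20)
print(len(chunks), [len(c.split()) for c in chunks])
```

Because every chunk is capped well below any of the per-call limits, silent truncation of an over-long input cannot happen in this setup.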

Languages — English-first vs genuinely multilingual

Language coverage is where the Titan-vs-Cohere choice most often gets decided. Titan Text Embeddings V2 supports many languages but is strongest in English. Cohere Embed Multilingual is purpose-built across 100+ languages and supports cross-lingual retrieval — a query in French can match a document in German because both map into the same shared space. If your corpus is English-dominant, Titan V2 is more than sufficient and cheaper. If your users or content are genuinely multilingual, or you need cross-lingual matching, Cohere Multilingual is usually the quality winner and worth the choice.

Normalization — and why it affects your distance metric

A vector is normalized when it is scaled to unit length. This matters because of how the vector store measures similarity. With normalized vectors, cosine similarity, dot product, and Euclidean distance all rank results identically, so the choice of metric is free. Titan Text Embeddings V2 returns normalized vectors by default (and offers an option to control this), and Cohere's vectors are well-suited to cosine similarity. The practical guidance: keep vectors normalized and configure your vector store's index for cosine similarity (or dot product on normalized vectors) — this is the safe default across all these models. The only time to think harder is if you deliberately turn off normalization or mix sources; then make sure the index metric matches what the model produces.
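The claim that the metric is "free" on normalized vectors follows from the identity ‖a − b‖² = 2 − 2(a · b) for unit-length a and b: a larger dot product always means a smaller Euclidean distance. The sketch below checks this on made-up vectors; the document names and values are purely illustrative.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical document vectors, normalized to unit length.
docs = {
    "a": normalize([3.0, 1.0, 0.5]),
    "b": normalize([0.2, 2.0, 1.0]),
    "c": normalize([1.0, 1.0, 4.0]),
}
query = normalize([2.0, 1.5, 0.3])

# Higher is better for cosine and dot product; lower is better for Euclidean.
by_cosine = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
by_dot = sorted(docs, key=lambda d: dot(query, docs[d]), reverse=True)
by_euclidean = sorted(docs, key=lambda d: euclidean(query, docs[d]))

print(by_cosine, by_dot, by_euclidean)
```

If normalization is switched off, the dot-product and Euclidean rankings can diverge from cosine, which is the case where the index metric has to be chosen with care.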

bedrock embeddings models: core specs · representative as of 2026 (confirm on the AWS Bedrock model page)

| Model | Modality | Output dimensions | Languages | Normalized by default | Notable feature |
|---|---|---|---|---|---|
| Titan Text Embeddings V2 | Text | 256 / 512 / 1024 (selectable) | Many; English-strongest | Yes | Pick your dimension to trade accuracy vs cost |
| Titan Text Embeddings V1 | Text | 1536 (fixed) | Many; English-strongest | Yes | Legacy; prefer V2 for new builds |
| Cohere Embed English | Text | 1024 (fixed) | English | Cosine-suited | Input-type (asymmetric query/doc) embeddings |
| Cohere Embed Multilingual | Text | 1024 (fixed) | 100+; cross-lingual | Cosine-suited | Best for many-language / cross-lingual retrieval |
| Titan Multimodal Embeddings | Image + text | 1024 (with smaller options) | n/a (image+text) | Yes | Search images by text and vice versa |

Dimension is the number that most directly sets vector-store cost. Titan V2 is the only one here that lets you choose it. For pure-text English RAG, Titan V2 at 512 or 1024 is a strong, cheap default; for many-language corpora, Cohere Multilingual; for image retrieval, Titan Multimodal. Exact specs evolve — verify current values in the AWS docs.
the quality lever

IV. How the embedding choice changes retrieval quality

Retrieval quality is the reason the embeddings model exists, so it deserves a clear-eyed treatment: where the model genuinely moves the needle, where it does not, and how to tell whether yours is good enough for your corpus rather than in the abstract.

The honest framing is that the embeddings model is a real but second-order lever compared with how you chunk and parse your documents. A good model embedding badly-chunked text retrieves poorly; an average model embedding clean, well-sized chunks retrieves well. So the first question is never "which is the best embeddings model" in the abstract — it is "is my model a good fit for this corpus and these queries." Two corpora with identical word counts can have very different best models depending on language mix, domain jargon, and how the questions are phrased.

Where the model choice clearly matters: language coverage (a model weak in your language will retrieve poorly no matter the dimension — this is the single biggest quality differentiator, and where Cohere Multilingual earns its place), domain fit (highly technical or specialized vocabularies separate models more than everyday prose does), and asymmetry (Cohere's input-type feature, embedding queries and documents differently, can sharpen retrieval because a short question and a long passage are not the same kind of text). Where it matters less than people expect: for general English prose with sensible chunking, Titan V2 and Cohere English are close enough that the dimension/cost trade-off and your vector store will influence the decision more than a small quality gap.

Dimension interacts with quality too, but with diminishing returns. Going from a very small dimension to a mid one usually helps; going from a mid one to the largest often adds little for typical corpora while multiplying cost. This is exactly why Titan V2's selectable dimension is useful: you can measure the trade-off on your own data instead of guessing. The right method is empirical — assemble a small set of representative questions with known-correct source passages, embed your corpus with each candidate model/dimension, and measure how often the correct passage appears in the top-k results (recall@k). The configuration that retrieves the right chunk most reliably on your queries wins; published benchmark leaderboards are a starting hypothesis, not the answer for your data.
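The recall@k measurement described above is simple enough to sketch directly. The query IDs, chunk IDs, and retrieved rankings here are hypothetical placeholders; in a real bake-off the rankings would come from querying each candidate index with your eval questions.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of queries whose known-correct chunk appears in the top k results.

    retrieved: dict of query id -> ranked list of chunk ids from one configuration
    relevant:  dict of query id -> the chunk id that should answer that query
    """
    if not relevant:
        raise ValueError("the eval set is empty")
    hits = sum(1 for query_id, gold in relevant.items()
               if gold in retrieved.get(query_id, [])[:k])
    return hits / len(relevant)

# Hypothetical eval set: each question paired with its correct source chunk.
gold = {"q1": "c7", "q2": "c2", "q3": "c9", "q4": "c4"}

# Hypothetical rankings returned by one candidate model/dimension.
candidate_rankings = {
    "q1": ["c7", "c1", "c3"],
    "q2": ["c5", "c2", "c8"],
    "q3": ["c1", "c3", "c6"],
    "q4": ["c4", "c9", "c2"],
}

for k in (1, 3):
    print(f"recall@{k} = {recall_at_k(candidate_rankings, gold, k):.2f}")
```

Running the same function over each candidate's rankings gives directly comparable numbers, so the choice between, say, 512 and 1024 dimensions becomes a measured difference rather than a guess.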

A final quality note that is really an architecture note: even the best embeddings model returns an imperfect ranking, so high-quality RAG systems often add a re-ranking step (a cross-encoder that re-scores the top candidates), hybrid search (combining vector similarity with keyword/BM25 matching to catch exact terms, names, and IDs that pure semantics can miss), or both. These compensate for embedding limitations and frequently lift retrieval more than swapping embeddings models would; the sibling page on RAG on AWS shows how they fit into the full pipeline.
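The text above does not prescribe how to combine vector and keyword results. One widely used option is reciprocal rank fusion (RRF), sketched here as an assumption rather than as the method any Bedrock feature uses. The chunk IDs are hypothetical, and the constant 60 is the conventional default from the RRF literature, not a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids into one, scoring each id by sum of 1/(k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results for one query from the two retrieval paths.
vector_hits = ["c1", "c2", "c3"]   # nearest neighbours by embedding similarity
keyword_hits = ["c4", "c1", "c2"]  # BM25-style matches on exact terms

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Chunks that both paths agree on rise to the top, while an exact-match hit the embeddings missed (here `c4`) still makes the candidate list that a re-ranker would then re-score.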

how to actually choose for quality

Do not pick by leaderboard. Build a small eval set — representative questions paired with the source passages that should answer them — then embed your corpus with each candidate model and dimension and measure recall@k (how often the right passage is in the top results). Let your own data decide. For many-language corpora start the bake-off with Cohere Multilingual; for English start with Titan V2 at 512 and 1024.

the cost lever

V. How the embedding choice changes vector-store cost

The embeddings model has a second, less-discussed effect: it sets how much your vector store costs to run. This is where the dimension number stops being abstract and starts showing up on the bill — and where a thoughtful choice can cut standing cost by a multiple.

There are two distinct costs tied to embeddings, and they behave very differently. The first is the embedding compute: you pay the model per input token to embed your corpus once at ingest, again for any re-ingestion, and a tiny amount per query to embed each incoming question. This is genuinely cheap — embeddings token rates are a fraction of generation rates — and for most corpora the one-time ingest embedding is a small, bounded cost. The second is the vector-store cost, and this is the one that recurs every month whether or not anyone is querying: the store holds one vector per chunk, forever, and bills for the capacity to keep and search them.

That standing cost scales with dimension × number of chunks. The number of chunks comes from your corpus size and chunking strategy; the dimension comes from your embeddings model. This is the precise mechanism by which the model choice drives infrastructure cost: a 1024-dim model stores four times the vector bytes of a 256-dim model for the identical corpus, which means more storage, more memory, and more compute per similarity search — across every vector, every month. For a small corpus the absolute numbers are tiny either way; for a large or fast-growing corpus, the dimension you chose at the start becomes one of the largest lines in the RAG bill.
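The dimension × chunks relationship can be put into numbers with a back-of-the-envelope calculation. This sketch assumes each vector component is stored as a 4-byte float and counts raw vector data only; real stores add index structures, metadata, and replication on top, and some support smaller numeric types.

```python
def raw_vector_bytes(num_chunks, dimension, bytes_per_value=4):
    """Raw bytes needed to hold one vector per chunk (float32 assumed by default)."""
    return num_chunks * dimension * bytes_per_value

NUM_CHUNKS = 1_000_000  # illustrative corpus size
baseline = raw_vector_bytes(NUM_CHUNKS, 256)

for dim in (256, 512, 1024, 1536):
    size = raw_vector_bytes(NUM_CHUNKS, dim)
    gib = size / 2**30
    print(f"{dim:>5} dims: {gib:.2f} GiB raw, {size / baseline:.0f}x the 256-dim baseline")
```

The multiplier is the part that carries over to any store: whatever the per-gigabyte price and overhead, moving from 256 to 1024 dimensions multiplies the raw vector data by four.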

Hence the practical cost playbook. Right-size the dimension: with Titan V2, test 256 and 512 before defaulting to 1024 — if recall@k holds on your eval set at a smaller dimension, you have just cut storage cost by 2–4× for free. Control chunk count: over-aggressive chunking inflates vector count (and thus cost) as much as a big dimension does, so chunking and dimension should be tuned together. Match the store to the volume: at low or bursty volume, a serverless Postgres/pgvector store is often cheaper than always-on managed search; at large scale a purpose-built vector DB may search more cost-effectively. And remember the asymmetry: embedding tokens are a small one-time-ish cost; the vector store is the standing cost, so optimize dimension and chunk count first.

how vector dimension drives relative vector-store footprint (illustrative, 1M chunks)

| Embeddings model | Dimension | Relative vectors stored | Relative storage / search cost | When the cost is worth it |
|---|---|---|---|---|
| Titan Text Embeddings V2 | 256 | 1M × 256 | 1× (baseline) | Large corpora where recall holds at 256 |
| Titan Text Embeddings V2 | 512 | 1M × 512 | ~2× | Common sweet spot; small accuracy gain |
| Titan Text Embeddings V2 / Cohere Embed | 1024 | 1M × 1024 | ~4× | Hard corpora where 1024 measurably lifts recall |
| Titan Text Embeddings V1 | 1536 | 1M × 1536 | ~6× | Legacy indexes only; prefer V2 for new builds |

Numbers are relative footprint for the same corpus, not dollar figures — exact pricing depends on the vector store. The point: dimension is a multiplier on standing cost. Right-size it on your own eval data before paying for 1024+ everywhere. Embedding tokens (compute) are separately cheap and largely one-time at ingest.
model meets store

VI. Matching the embedding model to your vector store

The embeddings model and the vector store are two halves of one decision. The model produces vectors of a certain dimension and shape; the store has to hold them, index them, and search them well at your volume and budget. Picking them in isolation is how teams end up paying too much or retrieving too slowly.

The hard constraint is simple: the store's index must be configured for the dimension your model outputs and the distance metric your vectors expect (cosine similarity for the normalized/cosine-suited models here). You cannot put 1024-dim vectors into an index built for 1536, and an index using the wrong metric will rank results subtly wrong. Once those match, the open question is cost and performance at scale — and that is where dimension and store interact. A large dimension is more punishing on an always-on managed store (you pay for that capacity continuously) than on a serverless store that scales down when idle; conversely, a purpose-built vector DB may handle high-dimensional search at large scale more efficiently than a general database.
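As one concrete illustration of the hard constraint, here is a sketch for a pgvector-backed store. The `vector(n)` column type, the HNSW index method, and the `vector_cosine_ops` operator class are pgvector features; the table name, column names, and helper functions are hypothetical, and the SQL is only generated as text here, not executed.

```python
def pgvector_ddl(table, dimension):
    """Return SQL statements for a table whose vector column matches the model's dimension."""
    if not table.isidentifier():
        raise ValueError("table must be a simple identifier")
    return [
        "CREATE EXTENSION IF NOT EXISTS vector;",
        f"CREATE TABLE {table} (id bigserial PRIMARY KEY, content text, "
        f"embedding vector({dimension}));",
        # Cosine operator class, matching normalized / cosine-suited embeddings.
        f"CREATE INDEX ON {table} USING hnsw (embedding vector_cosine_ops);",
    ]

def check_vector(vec, index_dimension):
    """Fail fast if a vector does not match the dimension the index was built for."""
    if len(vec) != index_dimension:
        raise ValueError(f"expected {index_dimension} values, got {len(vec)}")

ddl = pgvector_ddl("chunks", dimension=512)  # e.g. Titan V2 configured for 512
for statement in ddl:
    print(statement)

check_vector([0.0] * 512, index_dimension=512)  # passes silently
```

The dimension appears in the schema itself, which is why a later change of model or dimension means a new column or table rather than an in-place update.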

On Bedrock, if you use Knowledge Bases the managed pipeline wires the embeddings model to the store for you, but you still choose both, so the matching logic still applies. If you build RAG yourself, you own the wiring end to end. Either way, the pairing heuristics below cover the large majority of builds; for the full menu of stores and their trade-offs, the sibling page on Amazon Bedrock Knowledge Bases goes deeper on each option.

  • Aurora PostgreSQL (pgvector) + a right-sized dimension — If you already run Postgres, pgvector is attractive and cheap at low/bursty volume via Aurora Serverless v2. Because you may pay for some standing capacity, keep the dimension modest — Titan V2 at 256 or 512 pairs especially well here, minimizing both storage and index size while reusing infrastructure you already operate.
  • OpenSearch Serverless + any dimension (the default) — The zero-setup default for Bedrock Knowledge Bases. It carries a baseline capacity cost, so dimension still matters for the bill, but it scales and is the fastest path to a working index. Fine with Titan V2 at any dimension or Cohere's 1024 — choose dimension on quality/cost, not store compatibility.
  • Pinecone / a purpose-built vector DB + larger dimensions at scale — When the corpus is large and search performance is paramount, a dedicated vector DB is built to index and query high-dimensional vectors efficiently. This is where a 1024-dim Cohere or Titan V2 index is most comfortable at scale — the store is optimized for exactly that workload.
  • Redis + lower dimensions for latency — If ultra-low query latency matters and you already run Redis, its vector search is fast; keeping the dimension smaller (Titan V2 256/512) reduces per-query work and memory, reinforcing the latency win.
  • Neptune Analytics (graph + vector), model choice is secondary: For GraphRAG over relationship-rich data, the value is the graph plus vectors together; pick the text embeddings model on language/quality as usual (Titan V2 or Cohere), since the graph structure, not the embeddings model, does the differentiated work.

The synthesis: choose the model on language and quality, choose the dimension on your accuracy-vs-cost trade-off, and then make sure your store is configured for that dimension and cosine similarity — and let the store's cost shape (always-on vs serverless vs purpose-built) push you toward a smaller or larger dimension at the margin. Get those three aligned and the embeddings layer is both accurate and economical.

the expensive part to avoid

VII. Migration and re-embedding cost: why you choose once

The most important operational fact about embeddings models is also the easiest to overlook until it hurts: you cannot change your embeddings model in place. Vectors from one model are meaningless to another, so switching means re-embedding the entire corpus and rebuilding the index. This is why "choose deliberately up front" is not a platitude — it is the whole game.

The reason is fundamental, not a Bedrock limitation. Each embeddings model defines its own vector space; a vector from Titan V2 and a vector from Cohere Embed live in different coordinate systems and cannot be compared. The same is true across versions and even across dimensions of the same model: Titan V1 (1536-dim) and Titan V2 (256, 512, or 1024-dim) are incompatible, and a Titan V2 index built at 512 cannot be queried with 1024-dim vectors. An index is permanently tied to the exact model and dimension that built it. Mixing is not "degraded," it is broken: similarity scores become noise.

So a migration is a full re-ingestion. Concretely it means: re-embed every chunk in the corpus with the new model (an embeddings-token bill proportional to total corpus tokens — cheap per token, but it is the whole corpus, and large corpora make this non-trivial); stand up a new index sized for the new dimension/metric; write all the new vectors; and cut over queries from the old index to the new one. Until cutover you are paying for two indexes. None of these steps is individually hard, but together they are real work and real cost, and they recur every time you change your mind about the model.
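The embeddings-token part of a migration can be estimated before committing to it. In this sketch the chunk count and tokens-per-chunk are illustrative inputs and, importantly, the per-1K-token price is a placeholder, not a real Bedrock rate; substitute the current published price for the model you are moving to. The cost of running two indexes until cutover is not included.

```python
def reembedding_estimate(num_chunks, avg_tokens_per_chunk, price_per_1k_tokens):
    """Return (total_tokens, token_cost) for re-embedding a whole corpus once."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    token_cost = total_tokens / 1000 * price_per_1k_tokens
    return total_tokens, token_cost

PLACEHOLDER_PRICE_PER_1K = 0.0001  # hypothetical figure for illustration only

total_tokens, token_cost = reembedding_estimate(
    num_chunks=600_000,        # illustrative corpus size
    avg_tokens_per_chunk=300,  # assumes chunks of a few hundred tokens
    price_per_1k_tokens=PLACEHOLDER_PRICE_PER_1K,
)
print(f"{total_tokens:,} tokens to re-embed, about ${token_cost:,.2f} at the placeholder rate")
```

The token bill scales linearly with corpus size, so the same calculation also shows why running the bake-off on a small subset first is cheap relative to a full re-ingestion.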

The good news is that the things you tune most often do not require re-embedding. Your generation model is independent — you can switch the chat model in RetrieveAndGenerate (or your own pipeline) from Claude to Nova to anything else without touching the index, because generation happens after retrieval. Your prompt, your top-k, your metadata filters, and adding a re-ranking step are all query-time changes that leave the embeddings untouched. Re-embedding is forced only by changing the embeddings model itself, its version, or its dimension. That clean separation is exactly why the embeddings decision deserves the up-front rigor and the generation decision can stay flexible.
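One way to keep that separation visible in a codebase is to mark which settings are bound to the index. The sketch below is an illustrative pattern, not part of any Bedrock API: the `PipelineConfig` class, its field names, and the example model identifiers are all hypothetical.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PipelineConfig:
    embedding_model_id: str   # includes the version, so V1 vs V2 differ here
    embedding_dimension: int
    generation_model_id: str
    top_k: int
    use_reranker: bool

# Only these fields are baked into the stored vectors and the index.
INDEX_BOUND_FIELDS = ("embedding_model_id", "embedding_dimension")

def requires_reembed(old, new):
    """True if moving from old to new config invalidates the existing index."""
    return any(getattr(old, name) != getattr(new, name) for name in INDEX_BOUND_FIELDS)

current = PipelineConfig(
    embedding_model_id="example-embeddings-v2",  # hypothetical identifiers
    embedding_dimension=512,
    generation_model_id="example-chat-model-a",
    top_k=5,
    use_reranker=False,
)

swap_chat_model = replace(current, generation_model_id="example-chat-model-b",
                          top_k=8, use_reranker=True)
bigger_vectors = replace(current, embedding_dimension=1024)

print(requires_reembed(current, swap_chat_model))  # query-time changes only
print(requires_reembed(current, bigger_vectors))   # dimension change
```

A check like this can run in deployment tooling so that a config change touching an index-bound field is flagged as a planned re-ingestion instead of slipping through as a routine tweak.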

Two practical mitigations. First, do the bake-off before you commit at scale: run the recall@k comparison on a representative subset so the expensive full-corpus embedding only happens once, on the winner. Second, when a migration is genuinely warranted (a markedly better model, or a hard language requirement you missed), treat it as a planned re-ingestion with a parallel index and a clean cutover rather than an in-place tweak — and note that the entire re-embedding token cost is itself AWS-credit-eligible, so even a migration can be funded.

what forces a re-embed vs what does not

Forces a full re-embed + new index: changing the embeddings model, its version (V1↔V2), or its dimension (512↔1024). Free, query-time changes: swapping the generation model, editing the prompt, changing top-k, adding metadata filters, adding a re-ranker. Choose the embeddings model once; keep everything after retrieval flexible.

building it for $0

VIII. Building the embeddings layer on AWS credits for $0

Everything in this decision — the embedding tokens at ingest, the vector store that holds them, the inference that generates answers, even a future re-embedding migration — is AWS spend. And all of it is AWS-credit-eligible, which is why teams routinely build the whole RAG stack without paying out of pocket while they prove the use case.

The cost shape of an embeddings-backed RAG system is the stack covered above: embedding tokens (a small, largely one-time cost at ingest, plus a trivial per-query amount), the vector store (the standing cost, set by dimension × chunk count), and inference (normal Bedrock token cost when a model writes the grounded answer). Add the underlying S3 storage and, if you switch models later, a one-time re-embedding bill. At prototype scale this is typically single-digit to low-tens of dollars a month; it grows with corpus size and query volume — which is exactly the window where credits matter most.

Every one of those layers draws down AWS credits automatically. The relevant pools are AWS Activate (commonly up to $100K for institutionally-funded startups), a dedicated Bedrock / generative-AI POC pool ($10K–$50K) aimed squarely at proving out exactly this kind of use case, and the competitive Generative AI Accelerator (up to $1M for selected AI-first companies). Most of these pools are partner-filed through the AWS Partner Network rather than available on a public form — which is the gap CloudRoute fills.

CloudRoute routes you to the right credit pool for your stage and to a vetted AWS DevOps/ML partner who both files the credit application and builds the embeddings layer with you — running the model bake-off (Titan V2 vs Cohere on your own eval set), right-sizing the dimension against your vector-store cost, wiring the chosen store, and shipping the retrieval integration (whether managed Knowledge Bases or a custom pipeline with re-ranking and hybrid search). The customer pays $0: AWS funds the credits, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You are never in the payment loop. See AWS credits for generative-AI startups and Bedrock POC funding for the full mechanics.

the tie-in in one line

Embedding tokens + the vector store + inference (+ any re-embedding migration) are all AWS-credit-eligible. CloudRoute matches you to the right pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted partner who files the credits and builds the embeddings layer — so the build is $0 while you prove the workload out.

the decision table

Titan Text Embeddings V2 vs Cohere Embed — side by side

For text RAG the real decision is Titan Text Embeddings V2 versus Cohere Embed (English or Multilingual). Here is how they compare on the dimensions that actually drive the choice, with Titan Multimodal included for when images enter the picture. Specs are representative as of 2026 — confirm current values on the AWS Bedrock model page.

| Attribute | Titan Text Embeddings V2 | Cohere Embed English | Cohere Embed Multilingual | Titan Multimodal Embeddings |
|---|---|---|---|---|
| Modality | Text | Text | Text | Image + text |
| Output dimensions | 256 / 512 / 1024 (you choose) | 1024 (fixed) | 1024 (fixed) | 1024 + smaller options |
| Language strength | Many; best in English | English | 100+; cross-lingual | n/a (image+text) |
| Normalized output | Yes (default) | Cosine-suited | Cosine-suited | Yes |
| Standout feature | Dimension dial → cost control | Asymmetric query/doc embeddings | Best many-language retrieval | Search across image + text |
| Relative cost posture | Lowest; tunable by dimension | Low | Low | Per image + text input |
| Reach for it when | English-first RAG; want cheapest, flexible default | English retrieval quality is the priority | Multilingual / cross-lingual corpus | Images are part of what you retrieve |

Rule of thumb: default to Titan V2 (test 512 before 1024 to cut vector-store cost); switch to Cohere Multilingual when languages are a first-class requirement; use Cohere English when you want a retrieval-specialist for an English corpus; use Titan Multimodal only when images are in scope. Whichever you pick, the model + dimension are bound to the index — choose once. Verify specs and pricing in the AWS docs.
a recent match

A multilingual search rebuild that cut vector-store cost — anonymized

inquiry · Series-A B2B SaaS, multilingual product search, Amsterdam
Series-A B2B SaaS, 22 people, a product + docs corpus of ~600,000 chunks across English, German, French, and Dutch

Situation: The team had shipped a first semantic-search feature on a default English-tuned embeddings model at its full 1536-dimension setting. Two problems showed up in production: non-English queries retrieved poorly (the model was English-first, so German and Dutch users got weak results), and the always-on vector store was already one of their larger AWS-adjacent line items because every one of 600k chunks carried a 1536-dim vector. They wanted better multilingual retrieval and a smaller standing bill — without spending runway on the rebuild or the inference while they validated it.

What CloudRoute did: CloudRoute matched them in under 24 hours to an EU AWS partner with RAG experience. The partner ran a proper bake-off on the team's own eval set (representative queries per language with known-correct passages), measuring recall@k across Titan Text Embeddings V2 at 512 and 1024 and Cohere Embed Multilingual. Cohere Multilingual won decisively on the non-English queries. On the cost side, its fixed 1024-dim output proved sufficient (no measurable recall gain justified a larger vector), and that is already a ~33% smaller vector footprint than the old 1536-dim index. They re-embedded the full corpus into a fresh Aurora pgvector index (the team already ran Postgres) configured for cosine similarity, added a re-ranking step for the top candidates, and left generation on the team's existing chat model untouched. In parallel, the partner filed a Bedrock POC credit application plus an Activate Portfolio application to fund the rebuild, with re-embedding tokens, the new vector store, and inference all included.

Outcome: Multilingual retrieval quality jumped (German, French, and Dutch queries now surfaced the right passages), and the new vector store ran materially cheaper thanks to the lower dimension and a serverless Postgres footprint. The entire rebuild — the full re-embedding of 600k chunks, the new index, the re-ranker, and inference during validation — was covered by the approved credits, so the team paid $0 during the migration and early rollout. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.

corpus: ~600k chunks, 4 languages · model: Cohere Multilingual @ 1024 · re-embed + new index: credit-funded · out-of-pocket during rebuild: $0

faq

Common questions

What embeddings models does Amazon Bedrock offer?
As of 2026 the practical choices on Bedrock are Amazon Titan Text Embeddings V2 (text; selectable 256/512/1024 dimensions; the common low-cost default, strongest in English), Amazon Titan Text Embeddings V1 (text; fixed 1536-dim; legacy — prefer V2 for new builds), Cohere Embed English (text; 1024-dim; retrieval-specialist with asymmetric query/document embeddings), Cohere Embed Multilingual (text; 1024-dim; 100+ languages and cross-lingual retrieval), and Amazon Titan Multimodal Embeddings (image + text into a shared space). For pure-text RAG the real decision is Titan V2 vs Cohere Embed; verify the current model list and specs on the AWS Bedrock model page.
Titan Text Embeddings V2 vs Cohere Embed — which should I use?
Default to Titan Text Embeddings V2 for English-first corpora: it is inexpensive and lets you choose the output dimension (256/512/1024) to trade accuracy against vector-store cost. Choose Cohere Embed Multilingual when your corpus or users span many languages or you need cross-lingual matching — it is usually the quality winner there. Choose Cohere Embed English when English retrieval quality is the top priority and you want a dedicated retrieval-tuned model with asymmetric query/document embeddings. The reliable way to decide is a bake-off on your own data measuring recall@k, not a generic leaderboard.
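A bake-off of this kind can be scored with very little code. The sketch below is a minimal illustration, not a prescribed harness: the eval set, the passage IDs, and the two `candidate_*` result tables are made-up stand-ins for "embed the query with model X and search index X", and `recall_at_k` uses the common definition (share of known-correct passages found in the top k).

```python
from typing import Callable, List, Sequence, Set, Tuple

def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the known-correct passages that appear in the top-k results."""
    if not relevant:
        raise ValueError("each eval query needs at least one known-correct passage")
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mean_recall_at_k(
    eval_set: List[Tuple[str, Set[str]]],
    search: Callable[[str], Sequence[str]],
    k: int,
) -> float:
    """Average recall@k over every (query, known-correct passages) pair."""
    scores = [recall_at_k(search(query), relevant, k) for query, relevant in eval_set]
    return sum(scores) / len(scores)

# Hypothetical eval set: queries with hand-labelled correct passage IDs.
eval_set = [
    ("reset password", {"doc-12"}),
    ("Rechnung herunterladen", {"doc-40", "doc-41"}),
]

# Hypothetical ranked results from two candidate model + index pairs.
candidate_a = {
    "reset password": ["doc-12", "doc-3"],
    "Rechnung herunterladen": ["doc-7", "doc-40", "doc-9"],
}
candidate_b = {
    "reset password": ["doc-3", "doc-12"],
    "Rechnung herunterladen": ["doc-40", "doc-41", "doc-7"],
}

score_a = mean_recall_at_k(eval_set, lambda q: candidate_a[q], k=2)
score_b = mean_recall_at_k(eval_set, lambda q: candidate_b[q], k=2)
print(f"candidate A recall@2 = {score_a:.2f}, candidate B recall@2 = {score_b:.2f}")
```

In a real run, each `search` callable would wrap one embeddings model and its own index, and the eval set would hold representative queries per language.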
How does the embeddings model affect vector-store cost?
The standing cost of a vector store scales with vector dimension × number of chunks, and the dimension comes from your embeddings model. A 1024-dim model stores ~4× the vector data of a 256-dim model for the same corpus, which means more storage, memory, and per-query search work, every month. That is why Titan V2's selectable dimension matters: testing 256 or 512 before defaulting to 1024 can shrink the vector footprint by 2× (at 512) or 4× (at 256) if recall holds on your eval set. Separately, embedding compute (per input token) is cheap and largely a one-time cost at ingest, so the vector store, not the embedding tokens, is the cost to optimize first.
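The dimension × chunks arithmetic is easy to check for your own corpus. This sketch assumes 4-byte float32 values and counts only the raw vectors; index overhead, metadata, and replicas come on top and vary by store. The 600,000-chunk figure is borrowed from the case study above purely as an example.

```python
def vector_storage_bytes(num_chunks: int, dimension: int, bytes_per_value: int = 4) -> int:
    """Raw vector payload only: chunks x dimension x bytes per value (float32 assumed)."""
    return num_chunks * dimension * bytes_per_value

NUM_CHUNKS = 600_000  # example corpus size

baseline = vector_storage_bytes(NUM_CHUNKS, 256)
for dimension in (256, 512, 1024, 1536):
    size = vector_storage_bytes(NUM_CHUNKS, dimension)
    print(f"{dimension:>4}-dim: {size / 1e9:.2f} GB raw vectors ({size / baseline:.0f}x the 256-dim footprint)")
```

Under these assumptions the same corpus needs about 0.61 GB of raw vectors at 256 dimensions and about 2.46 GB at 1024.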
What output dimensions do these models produce, and which should I pick?
Titan Text Embeddings V2 lets you choose 256, 512, or 1024; Titan V1 is fixed at 1536; Cohere Embed (English and Multilingual) is fixed at 1024; Titan Multimodal is typically 1024 with smaller options. Larger dimensions can capture more nuance but cost proportionally more to store and search, with diminishing returns. The practical advice with Titan V2 is to evaluate 512 first — many corpora retrieve almost as well at 512 as at 1024 for half the storage — and only move to 1024 if recall@k measurably improves on your own queries.
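With Titan V2 the dimension is chosen per request, so evaluating 512 against 1024 is a one-parameter change. The sketch below reflects the request shape as we understand it (model ID `amazon.titan-embed-text-v2:0`, body fields `inputText`, `dimensions`, `normalize`, response field `embedding`); treat those names and the default region as assumptions to confirm against the current AWS Bedrock documentation. The helper function names are our own.

```python
import json
from typing import List

TITAN_V2_MODEL_ID = "amazon.titan-embed-text-v2:0"  # confirm in the Bedrock console
ALLOWED_DIMENSIONS = (256, 512, 1024)

def build_titan_v2_body(text: str, dimensions: int = 512, normalize: bool = True) -> str:
    """Build the JSON request body, rejecting dimensions Titan V2 does not offer."""
    if dimensions not in ALLOWED_DIMENSIONS:
        raise ValueError(f"dimensions must be one of {ALLOWED_DIMENSIONS}, got {dimensions}")
    return json.dumps({"inputText": text, "dimensions": dimensions, "normalize": normalize})

def embed_with_titan_v2(text: str, dimensions: int = 512, region_name: str = "us-east-1") -> List[float]:
    """Call Bedrock and return the embedding (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3 installed

    client = boto3.client("bedrock-runtime", region_name=region_name)
    response = client.invoke_model(
        modelId=TITAN_V2_MODEL_ID,
        body=build_titan_v2_body(text, dimensions),
        contentType="application/json",
        accept="application/json",
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]

print(build_titan_v2_body("How do I reset my password?", dimensions=512))
```

Whatever dimension you settle on has to be used for every document and every query that touches the same index.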
Can I change my embeddings model later without re-embedding everything?
No. Vectors from different models, versions, or even dimensions are incompatible — each defines its own vector space, so a Titan V2 vector and a Cohere vector (or a Titan V1 vector, or a 512-dim vs 1024-dim Titan V2 vector) cannot be compared. Changing the embeddings model means re-embedding the entire corpus and rebuilding the index from scratch. This is why you choose the embeddings model deliberately up front. The good news: query-time things are flexible — you can swap the generation model, change the prompt, adjust top-k, add metadata filters, or add a re-ranker without re-embedding anything.
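One inexpensive safeguard is to record which model and dimension built an index and refuse any vector that comes from a different one. This is a hypothetical guard of our own design, not a Bedrock or vector-store feature; `EmbeddingSpace` and `check_vector` are illustrative names.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass(frozen=True)
class EmbeddingSpace:
    """Identifies the vector space an index was built in."""
    model_id: str
    dimension: int

def check_vector(index_space: EmbeddingSpace, vector_space: EmbeddingSpace, vector: Sequence[float]) -> None:
    """Raise if a vector was produced in a different space than the index."""
    if vector_space != index_space:
        raise ValueError(
            f"index was built with {index_space}, vector came from {vector_space}; "
            "re-embed the corpus instead of mixing spaces"
        )
    if len(vector) != index_space.dimension:
        raise ValueError(f"expected {index_space.dimension} values, got {len(vector)}")

index_space = EmbeddingSpace("amazon.titan-embed-text-v2:0", 512)
check_vector(index_space, index_space, [0.0] * 512)  # same model and dimension: accepted
```

Storing the `EmbeddingSpace` alongside the index (for example as table or collection metadata) turns a silent quality failure into a loud one.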
What does normalization mean for embeddings, and why does it matter?
A normalized vector is scaled to unit length. It matters because of how the vector store measures similarity: with normalized vectors, cosine similarity, dot product, and Euclidean distance all rank results identically, so the choice of metric does not change which passages come back. Titan Text Embeddings V2 returns normalized vectors by default (a request option lets you turn this off), and Cohere's vectors are well-suited to cosine similarity. The safe default across all these models is to keep vectors normalized and configure the vector store's index for cosine similarity (or dot product on normalized vectors).
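The ranking equivalence follows from the identity ‖a − b‖² = 2 − 2·(a · b) for unit vectors: a larger dot product always means a smaller distance. The sketch below checks it on a few made-up 3-dimensional vectors; the numbers are arbitrary and only illustrate the ordering.

```python
import math
from typing import Callable, List, Sequence

def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank(query, docs, score: Callable, higher_is_better: bool) -> List[int]:
    """Return document indexes ordered from best match to worst."""
    return sorted(range(len(docs)), key=lambda i: score(query, docs[i]), reverse=higher_is_better)

query = normalize([1.0, 2.0, 0.5])
docs = [normalize(v) for v in ([1.0, 1.9, 0.4], [0.2, -1.0, 3.0], [2.0, 0.1, 0.1], [0.9, 2.2, 1.5])]

by_cosine = rank(query, docs, cosine, higher_is_better=True)
by_dot = rank(query, docs, dot, higher_is_better=True)
by_euclidean = rank(query, docs, euclidean, higher_is_better=False)
print(by_cosine, by_dot, by_euclidean)
```

All three orderings come out the same here; without the `normalize` step, dot product and Euclidean distance can disagree with cosine.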
How do I match the embeddings model to my vector store?
The hard constraint is that the store's index must be configured for the exact dimension your model outputs and the distance metric your vectors expect (cosine similarity for these models). Beyond that, let cost and scale guide the pairing: with Aurora pgvector (cheap at low/bursty volume) keep the dimension modest — Titan V2 at 256/512 pairs well; OpenSearch Serverless is the fastest default and works with any dimension; a purpose-built vector DB like Pinecone is most comfortable with larger 1024-dim indexes at scale; Redis favours smaller dimensions for latency. Choose the model on language/quality, the dimension on cost/accuracy, then configure the store to match.
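For Aurora pgvector, "configure the store to match" comes down to two declarations: the column's dimension and the index's operator class. The sketch below generates the SQL, assuming a pgvector version with HNSW support; the table and column names are placeholders, and the statements should be checked against the pgvector version your cluster runs.

```python
from typing import List

def pgvector_ddl(table: str, dimension: int) -> List[str]:
    """SQL for a pgvector table whose index matches the model's dimension and cosine metric."""
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    if dimension <= 0:
        raise ValueError("dimension must be positive")
    return [
        "CREATE EXTENSION IF NOT EXISTS vector;",
        # The column dimension must equal the embeddings model's output dimension.
        f"CREATE TABLE {table} (id bigserial PRIMARY KEY, content text, embedding vector({dimension}));",
        # vector_cosine_ops builds the index for cosine distance.
        f"CREATE INDEX ON {table} USING hnsw (embedding vector_cosine_ops);",
    ]

def pgvector_query(table: str, top_k: int = 5) -> str:
    """Nearest-neighbour query; <=> is pgvector's cosine-distance operator."""
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    return f"SELECT id, content FROM {table} ORDER BY embedding <=> %s::vector LIMIT {int(top_k)};"

for statement in pgvector_ddl("chunks", 512):
    print(statement)
print(pgvector_query("chunks"))
```

Changing the model or the dimension later means a new column definition and a rebuilt index, which is the re-embedding cost described above.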
Is the embeddings layer covered by AWS credits?
Yes. The embedding tokens (at ingest and per query), the vector store, the generation inference, and even a future re-embedding migration are all AWS spend and all AWS-credit-eligible — they draw down your credits automatically. The relevant pools are AWS Activate (up to $100K), a Bedrock/generative-AI POC pool ($10K–$50K) aimed at exactly this kind of use case, and the GenAI Accelerator (up to $1M). These are largely partner-filed via the AWS Partner Network, which is why teams route through a partner. CloudRoute matches you to the right pool and a vetted AWS partner who files the credits and builds the embeddings layer (model bake-off, dimension right-sizing, vector-store wiring, retrieval integration), so the customer pays $0. Confirm current rates on the AWS pricing page.

Build your RAG embeddings layer on AWS — funded

Whatever the build costs — embedding tokens, the vector store, inference, even a re-embedding migration — AWS credits can cover it. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS partner to run the Titan-vs-Cohere bake-off on your own data, right-size the dimension against vector-store cost, wire the store, and ship the retrieval integration. Customer pays $0.

matched within: < 24h
GenAI credit ceiling: up to $1M
cost to you: $0