A neutral, build-focused reference for the embeddings models on Amazon Bedrock in 2026. It covers what an embeddings model actually does, the choices on offer (Amazon Titan Text Embeddings V2, Cohere Embed English and Multilingual, and Titan Multimodal Embeddings), and how they compare on dimensions, max tokens, languages, and normalization. It also covers the part most guides skip: how that one choice quietly drives both your retrieval quality and your vector-store bill, how to match the model to the store you picked, and what re-embedding actually costs if you switch later.
Before comparing models it is worth being precise about what an embeddings model is, because the comparison only makes sense once the job is clear. An embeddings model has one job: turn a piece of text into a vector — a fixed-length list of numbers — such that texts with similar meaning produce vectors that sit close together in space.
That single property is what powers semantic search and retrieval-augmented generation. When you ingest a corpus, every chunk of text is passed through the embeddings model and the resulting vector is stored in a vector store alongside the original text. At query time, the question is embedded with the same model, and the store returns the chunks whose vectors are nearest to the query vector — by cosine similarity or a related distance metric. The model never "understands" your documents the way a chat model answers questions; it simply places text in a geometric space where closeness means relatedness. Everything downstream — which passages a RAG system retrieves, how relevant a search result feels — rests on how well that placement reflects real meaning.
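The retrieval step described above can be sketched in a few lines. This is a toy illustration, not a production store: the three-dimensional vectors and chunk names are invented for the example, and a real vector store would use an approximate-nearest-neighbor index rather than a full scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the ids of the k chunks whose vectors are nearest to the query."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in store.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]

# Hypothetical 3-dimensional "embeddings"; real ones come from the model
# and have hundreds or thousands of dimensions.
store = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "api-rate-limits": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for an embedded question about refunds

nearest = top_k(query, store, k=2)  # -> ["refund-policy", "shipping-times"]
```

The query vector points mostly along the same axis as the refund chunk, so that chunk ranks first. Nothing in the code knows what a refund is; closeness in the space is the only signal.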
This is why the embeddings model is the quiet foundation of a RAG build, and why getting it wrong is expensive in a way that is easy to miss. A weak generation model produces an obviously bad answer you can see and fix. A weak embeddings model fails silently: it retrieves the wrong chunks, the generation model dutifully writes a fluent answer from irrelevant context, and the failure looks like a hallucination rather than a retrieval miss. Retrieval quality caps answer quality — a brilliant model cannot answer from context it never received.
On Amazon Bedrock, embeddings models are first-class: they are exposed through the same InvokeModel API as text models, they are what Bedrock Knowledge Bases uses under the hood to build a managed RAG index, and they are billed on the same per-token basis. The two big things this page helps you decide are therefore (1) which model gives the retrieval quality your corpus needs, and (2) at what vector dimension, because that number sets your storage and search cost. Those two decisions interact, and they are effectively permanent for a given index — which is the recurring theme of the sections below.
One clarification that saves confusion: an embeddings model and a generation (chat) model are different things, even when they share a brand name. "Amazon Titan" and "Cohere" both ship text-generation models and embeddings models; on this page "Titan" and "Cohere" refer to their embeddings families unless stated otherwise. You can freely mix vendors across the two roles — for example, embed with Cohere and generate with Claude — because the embeddings model only touches retrieval, never the final answer.
An embeddings model converts text into a fixed-length vector whose position encodes meaning, so a vector store can retrieve the most relevant passages by similarity. On Bedrock the practical choices are Amazon Titan Text Embeddings V2, Cohere Embed (English / Multilingual), and Titan Multimodal Embeddings for image+text. Pick deliberately: the model and its index are bound together for the life of the corpus.
Bedrock offers a small, well-chosen set of embeddings models rather than an overwhelming menu. For text retrieval the decision is effectively Amazon Titan Text Embeddings V2 versus Cohere Embed; for image-aware retrieval there is Titan Multimodal Embeddings. Here is what each one is and where it fits.
All of these run through the same Bedrock surface: your data stays in your account and region and is not used to train the base models. The text models are billed per input token (the output vector itself is not charged), and Titan Multimodal is additionally billed per input image. What differs is dimensionality, token limits, language coverage, and the modality each one handles. The model list and exact specifications evolve, so confirm the current details on the AWS Bedrock model page when you scope a build; the families below are the stable ones to reason about in 2026.
A practical way to read this menu: Titan Text Embeddings V2 is the default — cheap, flexible on dimension, good on English. Cohere Embed Multilingual is the reach-for model when languages matter, and Cohere Embed English is the alternative when you want a retrieval-specialist for an English corpus. Titan Multimodal is a different job entirely — only when images are in scope. Most teams will pick between Titan V2 and a Cohere model, which is exactly the comparison the next sections drill into.
Four technical attributes separate these models in ways that actually change a build: how big the output vector is, how much text fits in one call, which languages are covered, and whether vectors come out normalized. Each one has a downstream consequence you should choose on purpose, not by accident.
The output dimension is the length of the vector — 256, 512, 1024, or 1536 numbers depending on the model. Larger vectors can encode more nuance, which can lift retrieval quality on hard, subtle corpora. But the dimension is also, directly, your storage and search cost: a vector store holds one vector per chunk, so doubling the dimension roughly doubles the bytes stored and the work done per similarity search. A 1024-dimension model over the same corpus stores ~4× the vector data of a 256-dimension one. Titan Text Embeddings V2's selectable dimensions (256 / 512 / 1024) exist precisely so you can make this trade-off explicitly — many corpora retrieve almost as well at 512 as at 1024 for half the storage. Cohere Embed and Titan V1 are fixed (1024 and 1536 respectively), so with those the dimension is a consequence of the model choice rather than a separate dial.
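As a rough sketch of how the dimension dial is set in practice, the request below asks Titan Text Embeddings V2 for a specific output size. The model ID and the `inputText` / `dimensions` / `normalize` field names reflect the Bedrock request format as commonly documented; treat them as assumptions and confirm against the current AWS reference. The `client` argument is assumed to be a `boto3.client("bedrock-runtime")` object, and the helper names are hypothetical.

```python
import json

# Assumed model identifier; confirm on the AWS Bedrock model page.
TITAN_V2_MODEL_ID = "amazon.titan-embed-text-v2:0"
ALLOWED_DIMENSIONS = (256, 512, 1024)

def titan_v2_request(text, dimensions=512, normalize=True):
    """Build the JSON request body for one Titan V2 embedding call."""
    if dimensions not in ALLOWED_DIMENSIONS:
        raise ValueError(f"dimensions must be one of {ALLOWED_DIMENSIONS}")
    return json.dumps({
        "inputText": text,
        "dimensions": dimensions,  # the accuracy-vs-cost dial
        "normalize": normalize,    # unit-length output when True
    })

def embed_with_titan_v2(client, text, dimensions=512):
    """Call InvokeModel and return the embedding as a list of floats.

    `client` is assumed to behave like boto3.client("bedrock-runtime").
    """
    response = client.invoke_model(
        modelId=TITAN_V2_MODEL_ID,
        body=titan_v2_request(text, dimensions),
        contentType="application/json",
        accept="application/json",
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]
```

Whatever value is used at ingest has to be used again at query time, so it is worth keeping the dimension in one shared configuration value rather than passing it ad hoc.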
Each model accepts up to a maximum number of input tokens per embedding call. Titan Text Embeddings models accept a large window (thousands of tokens), while Cohere Embed has a smaller per-call token limit. In practice this rarely binds, because you almost always chunk documents into focused pieces before embedding: retrieval works best on focused chunks, not whole documents. The limit matters mainly in two cases. First, if you deliberately embed long passages (e.g. with hierarchical chunking returning large parents), confirm they fit, especially against Cohere's smaller limit. Second, if a model truncates over-long input silently, an oversized chunk loses its tail and that text becomes unretrievable. The clean rule: size your chunks for retrieval quality (typically a few hundred tokens), check that size against the chosen model's limit once, and the token limit becomes a non-issue.
Language coverage is where the Titan-vs-Cohere choice most often gets decided. Titan Text Embeddings V2 supports many languages but is strongest in English. Cohere Embed Multilingual is purpose-built across 100+ languages and supports cross-lingual retrieval — a query in French can match a document in German because both map into the same shared space. If your corpus is English-dominant, Titan V2 is more than sufficient and cheaper. If your users or content are genuinely multilingual, or you need cross-lingual matching, Cohere Multilingual is usually the quality winner and worth the choice.
A vector is normalized when it is scaled to unit length. This matters because of how the vector store measures similarity. With normalized vectors, cosine similarity, dot product, and Euclidean distance all rank results identically, so the choice of metric is free. Titan Text Embeddings V2 returns normalized vectors by default (and offers an option to control this), and Cohere's vectors are well-suited to cosine similarity. The practical guidance: keep vectors normalized and configure your vector store's index for cosine similarity (or dot product on normalized vectors) — this is the safe default across all these models. The only time to think harder is if you deliberately turn off normalization or mix sources; then make sure the index metric matches what the model produces.
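The claim that the metric choice is free on normalized vectors is easy to check. For unit vectors, cosine similarity equals the dot product, and squared Euclidean distance is `2 - 2 * dot`, so all three produce the same ordering. The sketch below demonstrates this with made-up vectors.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Invented example vectors, normalized to unit length.
docs = {
    "a": normalize([3.0, 1.0, 0.5]),
    "b": normalize([0.5, 2.0, 1.0]),
    "c": normalize([1.0, 1.0, 4.0]),
}
query = normalize([2.0, 1.5, 0.2])

# On unit vectors the dot product IS the cosine similarity.
by_dot = sorted(docs, key=lambda d: dot(query, docs[d]), reverse=True)
# Smaller Euclidean distance means closer, so sort ascending.
by_euclidean = sorted(docs, key=lambda d: euclidean(query, docs[d]))

same_ranking = by_dot == by_euclidean  # True for normalized vectors
```

If normalization is switched off, the identity no longer holds and the rankings can diverge, which is why the index metric then has to be chosen to match the model's output.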
| Model | Modality | Output dimensions | Languages | Normalized by default | Notable feature |
|---|---|---|---|---|---|
| Titan Text Embeddings V2 | Text | 256 / 512 / 1024 (selectable) | Many; English-strongest | Yes | Pick your dimension to trade accuracy vs cost |
| Titan Text Embeddings V1 | Text | 1536 (fixed) | Many; English-strongest | Yes | Legacy; prefer V2 for new builds |
| Cohere Embed — English | Text | 1024 (fixed) | English | Cosine-suited | Input-type (asymmetric query/doc) embeddings |
| Cohere Embed — Multilingual | Text | 1024 (fixed) | 100+; cross-lingual | Cosine-suited | Best for many-language / cross-lingual retrieval |
| Titan Multimodal Embeddings | Image + text | 1024 (with smaller options) | n/a (image+text) | Yes | Search images by text and vice versa |
Retrieval quality is the reason the embeddings model exists, so it deserves a clear-eyed treatment: where the model genuinely moves the needle, where it does not, and how to tell whether yours is good enough for your corpus rather than in the abstract.
The honest framing is that the embeddings model is a real but second-order lever compared with how you chunk and parse your documents. A good model embedding badly-chunked text retrieves poorly; an average model embedding clean, well-sized chunks retrieves well. So the first question is never "which is the best embeddings model" in the abstract — it is "is my model a good fit for this corpus and these queries." Two corpora with identical word counts can have very different best models depending on language mix, domain jargon, and how the questions are phrased.
Where the model choice clearly matters: language coverage (a model weak in your language will retrieve poorly no matter the dimension — this is the single biggest quality differentiator, and where Cohere Multilingual earns its place), domain fit (highly technical or specialized vocabularies separate models more than everyday prose does), and asymmetry (Cohere's input-type feature, embedding queries and documents differently, can sharpen retrieval because a short question and a long passage are not the same kind of text). Where it matters less than people expect: for general English prose with sensible chunking, Titan V2 and Cohere English are close enough that the dimension/cost trade-off and your vector store will influence the decision more than a small quality gap.
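To make the asymmetry point concrete, here is a minimal sketch of how Cohere Embed's input type is typically set on Bedrock: documents are embedded with one type at ingest and questions with another at query time. The model ID, the `texts` / `input_type` field names, and the `search_document` / `search_query` values are stated from the commonly documented request format and should be confirmed against the current AWS reference; `client` is assumed to be a `boto3.client("bedrock-runtime")` object, and the helper names are hypothetical.

```python
import json

# Assumed model identifier; confirm on the AWS Bedrock model page.
COHERE_MULTILINGUAL_MODEL_ID = "cohere.embed-multilingual-v3"

def cohere_embed_request(texts, is_query):
    """Build the request body, tagging texts as queries or documents."""
    input_type = "search_query" if is_query else "search_document"
    return json.dumps({"texts": list(texts), "input_type": input_type})

def embed_with_cohere(client, texts, is_query):
    """Return one embedding per input text.

    `client` is assumed to behave like boto3.client("bedrock-runtime").
    """
    response = client.invoke_model(
        modelId=COHERE_MULTILINGUAL_MODEL_ID,
        body=cohere_embed_request(texts, is_query),
        contentType="application/json",
        accept="application/json",
    )
    payload = json.loads(response["body"].read())
    return payload["embeddings"]

# Ingest:  embed_with_cohere(client, chunk_texts, is_query=False)
# Query:   embed_with_cohere(client, [question], is_query=True)
```

Using the document type for both sides still runs, but it gives up the benefit the feature exists for, so it is worth wiring the flag into the ingest and query paths separately.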
Dimension interacts with quality too, but with diminishing returns. Going from a very small dimension to a mid one usually helps; going from a mid one to the largest often adds little for typical corpora while multiplying cost. This is exactly why Titan V2's selectable dimension is useful: you can measure the trade-off on your own data instead of guessing. The right method is empirical — assemble a small set of representative questions with known-correct source passages, embed your corpus with each candidate model/dimension, and measure how often the correct passage appears in the top-k results (recall@k). The configuration that retrieves the right chunk most reliably on your queries wins; published benchmark leaderboards are a starting hypothesis, not the answer for your data.
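The measurement itself is small. The sketch below computes recall@k for one candidate configuration; `retrieve` stands in for whatever function embeds a question and queries your index, and the eval set and canned results are invented for illustration.

```python
def recall_at_k(eval_set, retrieve, k=5):
    """Fraction of questions whose known-correct chunk is in the top k.

    eval_set: list of (question, correct_chunk_id) pairs.
    retrieve: callable(question, k) -> list of chunk ids, best first.
    """
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    hits = sum(1 for question, correct_id in eval_set
               if correct_id in retrieve(question, k))
    return hits / len(eval_set)

# Hypothetical eval set and canned retrieval results for one configuration.
eval_set = [("q1", "c1"), ("q2", "c2"), ("q3", "c3"), ("q4", "c4")]
canned = {
    "q1": ["c1", "c9"],
    "q2": ["c7", "c2"],
    "q3": ["c8", "c9"],
    "q4": ["c4", "c1"],
}

def fake_retrieve(question, k):
    return canned[question][:k]

r1 = recall_at_k(eval_set, fake_retrieve, k=1)  # 0.5
r2 = recall_at_k(eval_set, fake_retrieve, k=2)  # 0.75
```

Running the same function once per candidate (for example Titan V2 at 256, 512, and 1024, and a Cohere model) gives directly comparable numbers on your own queries.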
A final quality note that is really an architecture note: even the best embeddings model returns an imperfect ranking, so high-quality RAG systems often add a re-ranking step (a cross-encoder that re-scores the top candidates) and/or hybrid search (combining vector similarity with keyword/BM25 matching to catch exact terms, names, and IDs that pure semantics can miss). These compensate for embedding limitations and frequently lift retrieval more than swapping embeddings models would; the sibling guide on RAG on AWS shows how they fit into the full pipeline.
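One common way to combine a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). The text above does not prescribe a fusion method, so this is an illustrative choice rather than the recommended one; the constant `k=60` is a conventional default and the two input rankings are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several best-first rankings into one.

    Each document scores 1 / (k + rank) in every ranking it appears in;
    documents ranked well by more than one retriever rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results from the two retrievers for the same query.
vector_hits = ["a", "b", "c"]   # semantic similarity order
keyword_hits = ["c", "a", "d"]  # BM25 / exact-term order

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# -> ["a", "c", "b", "d"]: "a" and "c" appear in both lists and lead.
```

RRF only needs ranks, not comparable scores, which is why it is a convenient default when the two retrievers score on different scales.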
Do not pick by leaderboard. Build a small eval set — representative questions paired with the source passages that should answer them — then embed your corpus with each candidate model and dimension and measure recall@k (how often the right passage is in the top results). Let your own data decide. For many-language corpora start the bake-off with Cohere Multilingual; for English start with Titan V2 at 512 and 1024.
The embeddings model has a second, less-discussed effect: it sets how much your vector store costs to run. This is where the dimension number stops being abstract and starts showing up on the bill — and where a thoughtful choice can cut standing cost by a multiple.
There are two distinct costs tied to embeddings, and they behave very differently. The first is the embedding compute: you pay the model per input token to embed your corpus once at ingest, again for any re-ingestion, and a tiny amount per query to embed each incoming question. This is genuinely cheap — embeddings token rates are a fraction of generation rates — and for most corpora the one-time ingest embedding is a small, bounded cost. The second is the vector-store cost, and this is the one that recurs every month whether or not anyone is querying: the store holds one vector per chunk, forever, and bills for the capacity to keep and search them.
That standing cost scales with dimension × number of chunks. The number of chunks comes from your corpus size and chunking strategy; the dimension comes from your embeddings model. This is the precise mechanism by which the model choice drives infrastructure cost: a 1024-dim model stores four times the vector bytes of a 256-dim model for the identical corpus, which means more storage, more memory, and more compute per similarity search — across every vector, every month. For a small corpus the absolute numbers are tiny either way; for a large or fast-growing corpus, the dimension you chose at the start becomes one of the largest lines in the RAG bill.
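A back-of-the-envelope estimate makes the scaling visible. The sketch assumes vectors are stored as 4-byte floats and counts raw vector payload only; real stores add index structures, metadata, and replication on top, so treat the result as a lower bound. The one-million-chunk corpus is an example figure.

```python
def vector_bytes(num_chunks, dimension, bytes_per_value=4):
    """Raw bytes needed to hold one vector per chunk (float32 assumed)."""
    return num_chunks * dimension * bytes_per_value

NUM_CHUNKS = 1_000_000  # example corpus size

# Raw vector payload in decimal gigabytes for each candidate dimension.
payload_gb = {d: vector_bytes(NUM_CHUNKS, d) / 1e9
              for d in (256, 512, 1024, 1536)}
# 256 -> 1.024 GB, 512 -> 2.048 GB, 1024 -> 4.096 GB, 1536 -> 6.144 GB

# Ratios relative to the 256-dimension baseline: 1x, 2x, 4x, 6x.
relative = {d: vector_bytes(NUM_CHUNKS, d) // vector_bytes(NUM_CHUNKS, 256)
            for d in payload_gb}
```

Doubling the chunk count has exactly the same effect as doubling the dimension, which is why the two are best tuned together.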
Hence the practical cost playbook. Right-size the dimension: with Titan V2, test 256 and 512 before defaulting to 1024. If recall@k holds on your eval set at a smaller dimension, you have just cut storage cost by 2–4× for free. Control chunk count: over-aggressive chunking inflates vector count (and thus cost) as much as a big dimension does, so chunking and dimension should be tuned together. Match the store to the volume: at low or bursty volume, a serverless Postgres/pgvector store is often cheaper than always-on managed search; at large scale a purpose-built vector DB may search more cost-effectively. And remember the asymmetry: embedding tokens are a small, largely one-time cost, while the vector store is the standing cost, so optimize dimension and chunk count first.
| Embeddings model | Dimension | Values stored (1M chunks × dimension) | Relative storage / search cost | When the cost is worth it |
|---|---|---|---|---|
| Titan Text Embeddings V2 | 256 | 1M × 256 | 1× (baseline) | Large corpora where recall holds at 256 |
| Titan Text Embeddings V2 | 512 | 1M × 512 | ~2× | Common sweet spot — small accuracy gain |
| Titan Text Embeddings V2 / Cohere Embed | 1024 | 1M × 1024 | ~4× | Hard corpora where 1024 measurably lifts recall |
| Titan Text Embeddings V1 | 1536 | 1M × 1536 | ~6× | Legacy indexes only; prefer V2 for new builds |
The embeddings model and the vector store are two halves of one decision. The model produces vectors of a certain dimension and shape; the store has to hold them, index them, and search them well at your volume and budget. Picking them in isolation is how teams end up paying too much or retrieving too slowly.
The hard constraint is simple: the store's index must be configured for the dimension your model outputs and the distance metric your vectors expect (cosine similarity for the normalized/cosine-suited models here). You cannot put 1024-dim vectors into an index built for 1536, and an index using the wrong metric will rank results subtly wrong. Once those match, the open question is cost and performance at scale — and that is where dimension and store interact. A large dimension is more punishing on an always-on managed store (you pay for that capacity continuously) than on a serverless store that scales down when idle; conversely, a purpose-built vector DB may handle high-dimensional search at large scale more efficiently than a general database.
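For a Postgres/pgvector store, the hard constraint shows up directly in the schema: the column type carries the dimension and the index operator class carries the metric. The sketch below generates that DDL; the table and column names are hypothetical, HNSW is one of the index types pgvector offers, and the table name is interpolated as-is, so it should only ever come from trusted configuration.

```python
# pgvector operator classes for each distance metric.
OPERATOR_CLASSES = {
    "cosine": "vector_cosine_ops",
    "l2": "vector_l2_ops",
    "inner_product": "vector_ip_ops",
}

def pgvector_ddl(table, dimension, metric="cosine"):
    """Return SQL statements for a chunk table matched to the model.

    `dimension` must equal the embeddings model's output size, and
    `metric` must match how its vectors are meant to be compared.
    """
    if metric not in OPERATOR_CLASSES:
        raise ValueError(f"metric must be one of {sorted(OPERATOR_CLASSES)}")
    ops = OPERATOR_CLASSES[metric]
    return [
        "CREATE EXTENSION IF NOT EXISTS vector;",
        f"CREATE TABLE {table} ("
        f"id bigserial PRIMARY KEY, content text, "
        f"embedding vector({dimension}));",
        f"CREATE INDEX ON {table} USING hnsw (embedding {ops});",
    ]

# Example: a 1024-dimension model compared by cosine similarity.
statements = pgvector_ddl("chunks", 1024, metric="cosine")
```

With this schema, inserting a vector of any other length is rejected by the database, which turns a silent retrieval bug into an immediate error.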
On Bedrock, if you use Knowledge Bases the managed pipeline wires the embeddings model to the store for you, but you still choose both, so the matching logic still applies. If you build RAG yourself, you own the wiring end to end. Either way, the pairing heuristics below cover the large majority of builds; for the full menu of stores and their trade-offs, the sibling guide on Amazon Bedrock Knowledge Bases goes deeper on each option.
The synthesis: choose the model on language and quality, choose the dimension on your accuracy-vs-cost trade-off, and then make sure your store is configured for that dimension and cosine similarity — and let the store's cost shape (always-on vs serverless vs purpose-built) push you toward a smaller or larger dimension at the margin. Get those three aligned and the embeddings layer is both accurate and economical.
The most important operational fact about embeddings models is also the easiest to overlook until it hurts: you cannot change your embeddings model in place. Vectors from one model are meaningless to another, so switching means re-embedding the entire corpus and rebuilding the index. This is why "choose deliberately up front" is not a platitude — it is the whole game.
The reason is fundamental, not a Bedrock limitation. Each embeddings model defines its own vector space; a vector from Titan V2 and a vector from Cohere Embed are simply different coordinate systems, not comparable in the slightest. The same is true across versions and even across dimensions of the same model — Titan V1 (1536-dim) and Titan V2 (1024-dim) are incompatible, and a Titan V2 index built at 512 cannot be queried with 1024-dim vectors. An index is permanently tied to the exact model and dimension that built it. Mixing is not "degraded," it is broken: similarity scores become noise.
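Because a mismatch produces noise rather than an error, a custom pipeline may want an explicit guard. One possible approach, sketched below with hypothetical names, is to record the model and dimension that built an index and refuse any query vector produced by a different combination.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingSpec:
    """Identifies the vector space an index or a query vector lives in."""
    model_id: str
    dimension: int

def assert_compatible(index_spec, query_spec):
    """Raise if a query was embedded in a different space than the index."""
    if index_spec != query_spec:
        raise ValueError(
            f"index was built with {index_spec}, "
            f"but the query was embedded with {query_spec}; re-embed required"
        )

# Model IDs here are illustrative labels stored alongside the index.
index_spec = EmbeddingSpec("amazon.titan-embed-text-v2:0", 512)

assert_compatible(index_spec, EmbeddingSpec("amazon.titan-embed-text-v2:0", 512))
# Same model at 1024, or another model at the same dimension, would raise.
```

Storing the spec next to the index (for example in a metadata table) keeps the check cheap and makes a misconfigured deploy fail loudly at the first query.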
So a migration is a full re-ingestion. Concretely it means: re-embed every chunk in the corpus with the new model (an embeddings-token bill proportional to total corpus tokens — cheap per token, but it is the whole corpus, and large corpora make this non-trivial); stand up a new index sized for the new dimension/metric; write all the new vectors; and cut over queries from the old index to the new one. Until cutover you are paying for two indexes. None of these steps is individually hard, but together they are real work and real cost, and they recur every time you change your mind about the model.
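The embedding-token part of that bill can be estimated before committing. In the sketch below the chunk count and average chunk size are example inputs, and the per-1,000-token price is a made-up placeholder, not a real rate; substitute the current price from the AWS Bedrock pricing page.

```python
def reembedding_cost(num_chunks, avg_tokens_per_chunk, price_per_1k_tokens):
    """Estimate total tokens and token cost of re-embedding a corpus.

    Returns (total_tokens, cost). Covers embedding tokens only; the
    second index that runs in parallel until cutover is a separate cost.
    """
    total_tokens = num_chunks * avg_tokens_per_chunk
    cost = total_tokens / 1000 * price_per_1k_tokens
    return total_tokens, cost

# Example inputs: 600k chunks of roughly 300 tokens each.
# PLACEHOLDER price, invented for illustration only.
PLACEHOLDER_PRICE_PER_1K = 0.0001

total_tokens, cost = reembedding_cost(600_000, 300, PLACEHOLDER_PRICE_PER_1K)
# total_tokens is 180,000,000; cost is about 18 at the placeholder rate.
```

The token cost grows linearly with the corpus, so the same function also shows when a corpus has become large enough that a migration needs budgeting rather than just scheduling.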
The good news is that the things you tune most often do not require re-embedding. Your generation model is independent — you can switch the chat model in RetrieveAndGenerate (or your own pipeline) from Claude to Nova to anything else without touching the index, because generation happens after retrieval. Your prompt, your top-k, your metadata filters, and adding a re-ranking step are all query-time changes that leave the embeddings untouched. Re-embedding is forced only by changing the embeddings model itself, its version, or its dimension. That clean separation is exactly why the embeddings decision deserves the up-front rigor and the generation decision can stay flexible.
Two practical mitigations. First, do the bake-off before you commit at scale: run the recall@k comparison on a representative subset so the expensive full-corpus embedding only happens once, on the winner. Second, when a migration is genuinely warranted (a markedly better model, or a hard language requirement you missed), treat it as a planned re-ingestion with a parallel index and a clean cutover rather than an in-place tweak — and note that the entire re-embedding token cost is itself AWS-credit-eligible, so even a migration can be funded.
Forces a full re-embed + new index: changing the embeddings model, its version (V1↔V2), or its dimension (512↔1024). Free, query-time changes: swapping the generation model, editing the prompt, changing top-k, adding metadata filters, adding a re-ranker. Choose the embeddings model once; keep everything after retrieval flexible.
Everything in this decision — the embedding tokens at ingest, the vector store that holds them, the inference that generates answers, even a future re-embedding migration — is AWS spend. And all of it is AWS-credit-eligible, which is why teams routinely build the whole RAG stack without paying out of pocket while they prove the use case.
The cost shape of an embeddings-backed RAG system is the stack covered above: embedding tokens (a small, largely one-time cost at ingest, plus a trivial per-query amount), the vector store (the standing cost, set by dimension × chunk count), and inference (normal Bedrock token cost when a model writes the grounded answer). Add the underlying S3 storage and, if you switch models later, a one-time re-embedding bill. At prototype scale this is typically single-digit to low-tens of dollars a month; it grows with corpus size and query volume — which is exactly the window where credits matter most.
Every one of those layers draws down AWS credits automatically. The relevant pools are AWS Activate (commonly up to $100K for institutionally-funded startups), a dedicated Bedrock / generative-AI POC pool ($10K–$50K) aimed squarely at proving out exactly this kind of use case, and the competitive Generative AI Accelerator (up to $1M for selected AI-first companies). Most of these pools are partner-filed through the AWS Partner Network rather than available on a public form — which is the gap CloudRoute fills.
CloudRoute routes you to the right credit pool for your stage and to a vetted AWS DevOps/ML partner who both files the credit application and builds the embeddings layer with you — running the model bake-off (Titan V2 vs Cohere on your own eval set), right-sizing the dimension against your vector-store cost, wiring the chosen store, and shipping the retrieval integration (whether managed Knowledge Bases or a custom pipeline with re-ranking and hybrid search). The customer pays $0: AWS funds the credits, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You are never in the payment loop. See AWS credits for generative-AI startups and Bedrock POC funding for the full mechanics.
Embedding tokens + the vector store + inference (+ any re-embedding migration) are all AWS-credit-eligible. CloudRoute matches you to the right pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted partner who files the credits and builds the embeddings layer — so the build is $0 while you prove the workload out.
For text RAG the real decision is Titan Text Embeddings V2 versus Cohere Embed (English or Multilingual). Here is how they compare on the dimensions that actually drive the choice, with Titan Multimodal included for when images enter the picture. Specs are representative as of 2026 — confirm current values on the AWS Bedrock model page.
| Dimension | Titan Text Embeddings V2 | Cohere Embed — English | Cohere Embed — Multilingual | Titan Multimodal Embeddings |
|---|---|---|---|---|
| Modality | Text | Text | Text | Image + text |
| Output dimensions | 256 / 512 / 1024 (you choose) | 1024 (fixed) | 1024 (fixed) | 1024 + smaller options |
| Language strength | Many; best in English | English | 100+; cross-lingual | n/a (image+text) |
| Normalized output | Yes (default) | Cosine-suited | Cosine-suited | Yes |
| Standout feature | Dimension dial → cost control | Asymmetric query/doc embeddings | Best many-language retrieval | Search across image + text |
| Relative cost posture | Lowest; tunable by dimension | Low | Low | Per image + text input |
| Reach for it when | English-first RAG; want cheapest, flexible default | English retrieval quality is the priority | Multilingual / cross-lingual corpus | Images are part of what you retrieve |
Situation: The team had shipped a first semantic-search feature on a default English-tuned embeddings model at its full 1536-dimension setting. Two problems showed up in production: non-English queries retrieved poorly (the model was English-first, so German and Dutch users got weak results), and the always-on vector store was already one of their larger AWS-adjacent line items because every one of 600k chunks carried a 1536-dim vector. They wanted better multilingual retrieval and a smaller standing bill — without spending runway on the rebuild or the inference while they validated it.
What CloudRoute did: CloudRoute matched them in under 24 hours to an EU AWS partner with RAG experience. The partner ran a proper bake-off on the team's own eval set (representative queries per language with known-correct passages), measuring recall@k across Titan Text Embeddings V2 at 512 and 1024 and Cohere Embed Multilingual. Cohere Multilingual won decisively on the non-English queries; on the cost side, the partner confirmed 1024-dim was sufficient (no measurable recall gain justified going higher) — already a ~33% smaller vector footprint than the old 1536-dim index. They re-embedded the full corpus into a fresh Aurora pgvector index (the team already ran Postgres) configured for cosine similarity, added a re-ranking step for the top candidates, and kept generation on the team's existing chat model untouched. In parallel, the partner filed a Bedrock POC credit application plus an Activate Portfolio application to fund the rebuild — re-embedding tokens, the new vector store, and inference included.
Outcome: Multilingual retrieval quality jumped (German, French, and Dutch queries now surfaced the right passages), and the new vector store ran materially cheaper thanks to the lower dimension and a serverless Postgres footprint. The entire rebuild — the full re-embedding of 600k chunks, the new index, the re-ranker, and inference during validation — was covered by the approved credits, so the team paid $0 during the migration and early rollout. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.
corpus: ~600k chunks, 4 languages · model: Cohere Multilingual @ 1024 · re-embed + new index: credit-funded · out-of-pocket during rebuild: $0
Whatever the build costs — embedding tokens, the vector store, inference, even a re-embedding migration — AWS credits can cover it. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS partner to run the Titan-vs-Cohere bake-off on your own data, right-size the dimension against vector-store cost, wire the store, and ship the retrieval integration. Customer pays $0.