AI search — semantic, vector, and hybrid search — makes your product's search box understand meaning, not just match keywords, so a query for "can't log in" finds the "authentication troubleshooting" article. This is the full build guide: keyword vs semantic vs hybrid explained, the reference architecture (embed → index → query → re-rank → optionally generate an answer), every vector-store option on AWS, choosing embeddings (Titan vs Cohere), re-ranking for precision, generative answers via Amazon Bedrock, how to tune relevance, what it costs — and when to reach for the fully-managed Amazon Kendra instead of building it yourself.
AI search is search that understands meaning. Instead of matching the literal words in a query against the literal words in your documents, it compares the *semantics* of the query to the semantics of each item, so "laptop won't charge" surfaces the "battery and power adapter troubleshooting" page even though they share almost no words.
Traditional search is lexical: it scores documents by how often the query's terms appear, weighted by how rare each term is across the corpus (TF-IDF / BM25), optionally with stemming and synonym lists you maintain by hand. It is fast, exact, and explainable, and it fails the moment a user phrases something differently from your content. Search for "remote work policy" when the document says "telecommuting guidelines" and, unless someone has added that synonym by hand, lexical search returns nothing. Semantic search fixes this by comparing meaning instead of words.
The mechanics: offline, you run each document (or product, ticket, FAQ, listing) through an embedding model that converts text into a vector — a list of numbers that encodes meaning, where semantically similar text lands near each other in vector space. You store those vectors in a vector index. At query time, you embed the user's query with the same model and ask the index for the nearest vectors (approximate nearest-neighbour search). The results are ranked by semantic closeness, so synonyms, paraphrases, and intent all match without anyone maintaining a synonym list.
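To make the mechanics concrete, here is a minimal sketch of nearest-neighbour lookup by cosine similarity. The three-dimensional vectors are hand-made stand-ins for real embedding output (a real model returns hundreds of dimensions), the document ids are illustrative, and the search is an exact brute-force scan rather than the approximate index a production system would use.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query_vec, index, k=2):
    """Exact top-k by cosine similarity; an ANN index avoids scanning every vector."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Hand-made vectors standing in for model output (illustrative only).
index = {
    "authentication-troubleshooting": [0.9, 0.1, 0.0],
    "billing-faq": [0.0, 0.2, 0.9],
    "password-reset": [0.8, 0.3, 0.1],
}
query = [0.85, 0.2, 0.05]  # pretend embedding of "can't log in"
print(nearest(query, index))
```

The two login-related pages come back ahead of the billing page even though the query shares no words with their ids; all of the work is done by vector geometry.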
Where this shows up in a product: the search box in a SaaS app, help centre, or docs site; product / catalogue search in e-commerce (find by description, not just title); internal knowledge search across wikis and tickets; recommendations and "more like this"; and as the retrieval layer underneath a chat assistant. The last one is RAG — retrieval-augmented generation — and AI search is the retrieval half of it. This page is about making search itself better; if your goal is specifically a grounded chatbot over your docs, see the dedicated RAG-on-AWS guide.
On AWS, every stage of this maps to a managed service, and there is also a fully-managed end-to-end option (Amazon Kendra) if you would rather buy than build. The next section sets up the decision that determines everything else: keyword vs semantic vs hybrid.
AI search = embed your content and the query into vectors with the same model, then return the items whose meaning is nearest the query — so search matches intent and synonyms, not just the exact words the user typed.
The single most important thing to understand before building is that semantic search is not strictly better than keyword search — it is better at different things. The production answer on almost every real corpus is to run both and fuse the results.
Each mode has a failure shape. Keyword (lexical / BM25) search is exact and unbeatable for precise terms — a part number, a SKU, an error code, a person's name, an acronym — but it returns nothing when the user's words differ from your content's words. Semantic (vector) search is the opposite: it shines on intent, synonyms, and natural-language questions, but it can miss or blur an exact token (it might rank a conceptually-similar product above the exact SKU someone typed) and it can return plausible-but-wrong neighbours. Hybrid search runs both and combines the scores, so exact matches and semantic matches both surface.
The standard way to combine them is Reciprocal Rank Fusion (RRF): each result earns 1 / (k + rank) from every list it appears in, where rank is its position in that list and k is a small smoothing constant, and the contributions are summed. Because only rank positions are used, RRF needs no score normalisation and is robust across very different scoring scales. The typical pattern: run a BM25 query and a vector query in parallel, fuse with RRF, then (often) re-rank the fused top-N with a cross-encoder for final precision. Amazon OpenSearch supports this hybrid flow natively because it is both a keyword engine and a vector engine in one system.
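Reciprocal Rank Fusion is small enough to sketch in a few lines. The version below assumes each retriever returns an ordered list of ids and uses k = 60, a commonly used default rather than anything mandated; the ids are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked id lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_hits = ["sku-4417", "rain-jacket", "umbrella"]            # keyword side
vector_hits = ["waterproof-shell", "rain-jacket", "sku-4417"]  # semantic side
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Items that appear in both lists rise to the top, while an item found by only one retriever still survives into the fused list, which is the behaviour hybrid search relies on.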
The practical guidance: start hybrid, do not start pure-vector. Teams that ship pure semantic search are frequently surprised when it cannot find an exact product code that lexical search would have nailed instantly. Hybrid is a small amount of extra wiring for a large, reliable quality gain — and it is the configuration that most often clears a relevance bar that pure-vector alone misses.
| Dimension | Keyword / BM25 | Semantic / vector | Hybrid (fused) |
|---|---|---|---|
| Matches on | Exact terms, stems | Meaning, synonyms, intent | Both — fused |
| Wins at | SKUs, codes, names, acronyms | Natural-language questions, paraphrase | Real mixed queries |
| Fails at | Different wording than content | Exact tokens; plausible-wrong neighbours | Few blind spots |
| Needs embeddings? | No | Yes | Yes |
| Synonym list to maintain? | Yes (by hand) | No | No |
| Explainability | High (you see the matched terms) | Lower (distance in vector space) | Medium |
| AWS implementation | OpenSearch / Aurora full-text | OpenSearch k-NN / pgvector / Kendra | OpenSearch hybrid + RRF |
A semantic / hybrid search system runs five logical stages. Two are offline (indexing), three are at query time. Almost every relevance problem traces back to a specific stage, so it pays to know each one before you build.
It helps to split the stages into an indexing phase (embed → index) that runs whenever your content changes, and a query phase (query → re-rank → optionally generate) that runs on every search. The table below maps each stage to the AWS service that typically implements it. Note that if you choose Amazon Kendra, it collapses all five stages behind a single managed API; that build-vs-buy choice is covered in section VII.
1. Embed. Run each searchable item — a doc, product, FAQ, ticket, listing — through an embedding model (Amazon Titan Text Embeddings v2 or Cohere Embed, both on Amazon Bedrock) to produce a vector. For long documents you first split them into passages (chunks) so each vector represents one coherent unit; for short items (a product title + description) one vector per item is fine. The model and its output dimensions are fixed for the life of the index — change the model and you must re-embed everything.
2. Index. Write each vector plus the original text and structured metadata (id, title, category, price, tags, ACL/tenant, timestamp) into a vector index with an approximate-nearest-neighbour (ANN) algorithm such as HNSW. On AWS this is Amazon OpenSearch Serverless, Aurora PostgreSQL with pgvector, or — if you went managed — Kendra's own index. Metadata is not optional: it is what lets you filter (in-stock only, this tenant only, this category) and what powers hybrid search's keyword side.
3. Query. Embed the incoming query with the same model and run a nearest-neighbour search for the top-K candidates (K is commonly 20–100 at this stage). In a hybrid setup you also run a BM25 keyword query in parallel and fuse the two result lists (RRF). Apply metadata filters here — never after generation — so a user only ever sees items they are entitled to and that match their facets.
4. Re-rank. Pass the fused top-K through a cross-encoder re-ranker (Amazon Rerank or Cohere Rerank on Bedrock) that scores each candidate against the query directly and keeps the best handful. This is the highest-leverage precision step — re-ranking routinely turns a mediocre result list into a sharp one, especially for the top 3–5 positions users actually look at.
5. Generate (optional). If you want a direct answer rather than a list of links — an "AI answer" or "answer box" above the results — pass the re-ranked top results to a generation model on Bedrock (Claude, Amazon Nova, Llama, Mistral) with an instruction to answer only from those results and cite them. This is exactly the RAG pattern; for pure search (return ranked items), you stop at stage 4.
| Stage | Phase | What it does | Typical AWS service |
|---|---|---|---|
| 1. Embed | Indexing (offline) | Text → vectors | Titan Text Embeddings v2 / Cohere Embed (Bedrock) |
| 2. Index | Indexing (offline) | Store vectors + metadata for ANN search | OpenSearch Serverless / Aurora pgvector / Kendra |
| 3. Query | Query (real time) | Nearest-neighbour (+ BM25 hybrid + filters) | OpenSearch k-NN + hybrid / pgvector / Kendra |
| 4. Re-rank | Query (real time) | Score + keep the best few results | Amazon Rerank / Cohere Rerank (Bedrock) |
| 5. Generate (optional) | Query (real time) | Direct cited answer above results | Claude / Nova / Llama / Mistral (Bedrock) |
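As an illustration of the query stage, the sketch below builds a search body that pairs a BM25 clause with a k-NN clause under OpenSearch's `hybrid` query type and applies the same tenant filter to both sides. The field names (`body_text`, `embedding`, `tenant_id`) are hypothetical, and a hybrid query also needs a search pipeline that normalises or rank-fuses the sub-query scores; check what your OpenSearch version or Serverless collection supports before relying on it.

```python
def build_hybrid_query(query_text, query_vector, tenant_id, k=50):
    """Return a search body combining a BM25 sub-query and a k-NN sub-query."""
    tenant_filter = {"term": {"tenant_id": tenant_id}}  # hypothetical field name
    return {
        "size": k,
        "query": {
            "hybrid": {
                "queries": [
                    # Lexical side: BM25 match, restricted to the caller's tenant.
                    {"bool": {
                        "must": [{"match": {"body_text": query_text}}],
                        "filter": [tenant_filter],
                    }},
                    # Semantic side: k-NN with the same filter applied during search.
                    {"knn": {"embedding": {
                        "vector": query_vector,
                        "k": k,
                        "filter": tenant_filter,
                    }}},
                ]
            }
        },
    }

body = build_hybrid_query("rain jacket", [0.12, 0.08, 0.41], tenant_id="merchant-17")
```

Putting the filter inside both sub-queries, rather than trimming results afterwards, keeps other tenants' items out of the candidate set entirely.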
Here is the fastest credible path from a keyword-only search box to AI-powered hybrid search on AWS, using Bedrock embeddings and OpenSearch Serverless. Each step maps to a stage above; the order matters because every step depends on the one before it being clean.
The vector index holds your embeddings and answers nearest-neighbour queries. For an in-app search feature the three relevant AWS options are OpenSearch Serverless, Aurora PostgreSQL with pgvector, and — at the fully-managed end — Amazon Kendra, which is an index and a search engine in one.
Amazon OpenSearch Serverless is the default for building search on AWS, and the natural choice for AI search specifically because it is both a vector engine (k-NN) and a mature keyword engine (BM25) in one system — so native hybrid search and Reciprocal Rank Fusion work out of the box, plus faceting, filtering, and aggregations you already expect from a search backend. It is fully managed and auto-scales; the trade is that it bills by OpenSearch Compute Units (OCUs) with a baseline minimum, so it can feel expensive for a very small index even when idle.
Aurora PostgreSQL with the pgvector extension is the pragmatic choice when your app already runs on Postgres. Your vectors live next to your relational data, so you can combine a vector search with SQL WHERE filters and joins to business tables in a single query — ideal for product search where you must filter by price, stock, and category. With HNSW indexing it scales comfortably into the millions of vectors. Native keyword search is weaker than OpenSearch's (Postgres full-text plus pgvector is workable hybrid, but not as turnkey), and at very high query concurrency a purpose-built engine pulls ahead.
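The single-query combination of vector search and SQL filters might look like the sketch below. The `products` table and its columns are hypothetical, `<=>` is pgvector's cosine-distance operator, and the `%s` placeholders assume a psycopg-style driver; the function only builds the statement and parameters, so it can be inspected without a database.

```python
def build_product_search(query_vector, max_price, category, limit=20):
    """Return (sql, params) for a filtered cosine-distance search with pgvector."""
    sql = (
        "SELECT id, title, price "
        "FROM products "                                    # hypothetical table
        "WHERE in_stock AND price <= %s AND category = %s "
        "ORDER BY embedding <=> %s::vector "                # cosine distance, nearest first
        "LIMIT %s"
    )
    # pgvector accepts a '[x,y,z]' text literal cast to the vector type.
    vector_literal = "[" + ",".join(str(x) for x in query_vector) + "]"
    return sql, (max_price, category, vector_literal, limit)

sql, params = build_product_search([0.1, 0.2, 0.3], max_price=150, category="outerwear")
# With a live connection this would run as: cursor.execute(sql, params)
```

Ordering directly by the distance expression with a `LIMIT` is the query shape pgvector's HNSW index is designed to serve.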
Amazon Kendra is a different animal: a fully-managed intelligent-search service, not just an index. It ingests from 40+ connectors (S3, SharePoint, Confluence, Salesforce, databases, web crawl), builds and tunes its own semantic ranking, supports natural-language queries and FAQ matching, and enforces document-level access control by reading source-system ACLs, all without you running an embedding pipeline or a vector store. You give up fine-grained control over ranking and embeddings, and granular cost tuning, in exchange for speed-to-launch and far less to operate. The full build-vs-buy comparison is in section VII.
Two more AWS-native vector options exist and are worth knowing for adjacent needs: Amazon MemoryDB (Redis-compatible and in-memory, with single-digit-millisecond vector queries) when latency is the hard constraint, and the newer Amazon S3 Vectors capability for very large, cost-optimised vector sets with infrequent queries. For an in-product search feature, though, the decision is almost always among the three in the table below.
| Option | What it is | Hybrid search | Best fit | Cost shape | Watch-out |
|---|---|---|---|---|---|
| OpenSearch Serverless | Managed vector + keyword engine | Native (vector + BM25 + RRF) | In-app search wanting one engine for everything | Per OCU + storage; baseline minimum | Baseline cost stings tiny indexes |
| Aurora PostgreSQL (pgvector) | Managed Postgres + vector extension | Vector + SQL/full-text filters | Apps already on Postgres; faceted product search | Per Aurora instance / ACU + storage | Keyword side weaker than OpenSearch |
| Amazon Kendra | Fully-managed intelligent search | Built-in semantic + keyword ranking | Buy-not-build; 40+ connectors; ACL-aware search | Per index edition (Developer / Enterprise) + queries | Less control; index pricing is a fixed baseline |
Whether AI search feels magic or mediocre comes down to retrieval quality, not the fanciness of any single component. Four knobs dominate: the embedding model, chunking, hybrid fusion, and re-ranking — and you tune all four against a labelled query set, not by feel.
On Bedrock the two mainstream embedding families are Amazon Titan Text Embeddings v2 and Cohere Embed. Titan v2 is AWS-native, inexpensive, and supports configurable output dimensions (e.g. 256 / 512 / 1024) — smaller dimensions cut index size and speed up search at a modest recall cost, which is a real lever for a large catalogue. Cohere Embed (English and multilingual variants) scores strongly on retrieval benchmarks and is a frequent pick for multilingual search or search-heavy products.
Two rules outweigh the choice itself. First, embed queries and documents with the same model and version — mixing them produces incomparable vectors and silently wrecks relevance. Second, changing the embedding model means re-embedding the entire corpus, so treat it as semi-permanent and benchmark candidates on a sample of your own queries before committing — leaderboards rarely predict your domain.
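One way to enforce the same-model rule is to route every embedding call, for documents and queries alike, through a single function. The sketch below assumes a boto3 `bedrock-runtime` client is passed in (for example `boto3.client("bedrock-runtime")`) and uses the Titan Text Embeddings v2 request shape; treat the default dimension as an illustrative choice, not a recommendation.

```python
import json

TITAN_V2_MODEL_ID = "amazon.titan-embed-text-v2:0"

def embed_text(bedrock_runtime, text, dimensions=512):
    """Embed one string with Titan Text Embeddings v2 and return the vector.

    Call this for both documents and queries so every vector comes from the
    same model, version, and output dimension.
    """
    response = bedrock_runtime.invoke_model(
        modelId=TITAN_V2_MODEL_ID,
        contentType="application/json",
        accept="application/json",
        body=json.dumps({
            "inputText": text,
            "dimensions": dimensions,  # smaller means a smaller index, at some recall cost
            "normalize": True,
        }),
    )
    return json.loads(response["body"].read())["embedding"]
```

Because the client is a parameter, the function can be exercised with a stub in tests, and changing the model later means editing one constant and re-embedding the corpus.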
For document or article search, how you split text into chunks is the highest-variance decision: chunks too large dilute the vector (one embedding trying to represent many topics) and hurt precision; chunks too small lose the context that makes a passage answerable. A sensible default is 300–800 tokens with 10–20% overlap, then tune. For short structured items (products, listings), the lever is instead what you embed — title alone vs title + description vs title + key attributes — and concatenating the most search-relevant fields usually beats embedding the whole record verbatim.
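A minimal sketch of fixed-size chunking with overlap follows. It counts whitespace-separated words as a rough stand-in for tokens (an assumption; a real pipeline would count with the embedding model's tokenizer), and the defaults of 400 words with 60 words of overlap sit inside the range suggested above.

```python
def chunk_words(text, chunk_size=400, overlap=60):
    """Split text into overlapping windows of words (a rough proxy for tokens)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.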
Hybrid fusion (vector + BM25 via RRF) is the first big relevance gain; re-ranking is the second and larger one. A re-ranker is a cross-encoder that reads the query and each candidate together and scores true relevance — far more accurate than ANN distance. The standard pattern: cast a wide net cheaply (fused top-50 to top-100), then re-rank and keep only the best 5–10 to show. On AWS, Amazon Rerank and Cohere Rerank run on Bedrock. If you do one thing to fix a struggling search, add re-ranking before you touch the embedding model.
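The "wide net, then narrow" control flow can be sketched independently of any particular re-ranker. In the example below `score_fn` is the pluggable part: in production it would call a hosted cross-encoder, while the `toy_overlap_score` used here is only a word-overlap placeholder so the snippet runs on its own.

```python
def rerank(query, candidates, score_fn, keep=5):
    """Score every (query, candidate) pair with score_fn and keep the best `keep`."""
    ordered = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return ordered[:keep]

def toy_overlap_score(query, candidate):
    """Placeholder scorer: fraction of query words present in the candidate.

    This is NOT a cross-encoder; swap in a call to a real re-ranking model.
    """
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & set(candidate.lower().split())) / len(query_words)

fused_candidates = [
    "billing and invoices",
    "reset your password",
    "password reset for locked accounts",
]
top = rerank("password reset", fused_candidates, toy_overlap_score, keep=2)
```

The shape matters more than the scorer: the expensive model only ever sees the fused candidates, and only the best few of those reach the user.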
Tune against numbers, not vibes. Build a labelled set of real queries with judged-relevant results, then track nDCG (rank-aware quality), recall@K (did the right items make the candidate set), and MRR (how high the first good result lands). Run it on every change — new chunk size, new fusion weights, re-ranking on/off — so you can prove an improvement instead of guessing. Pair offline metrics with online signals (click-through rate, zero-result rate, search-to-conversion) once it is live, because real user behaviour catches what an offline set misses.
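The three offline metrics are short enough to implement directly. This sketch assumes each labelled query comes with a set of relevant ids (for recall@K and reciprocal rank) or a dict of graded gains (for nDCG); MRR is the mean of `reciprocal_rank` across the whole query set.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Share of the relevant ids that appear in the top k results."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant id, or 0.0 if none was returned."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0

def ndcg_at_k(ranked, gains, k):
    """nDCG@k with graded gains (dict of id -> gain); unjudged ids count as 0."""
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))
    actual = dcg([gains.get(doc_id, 0) for doc_id in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal else 0.0

def mean_reciprocal_rank(runs):
    """runs is a list of (ranked, relevant) pairs, one per labelled query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```

Running these over the same labelled set before and after each change gives a like-for-like comparison of chunk sizes, fusion settings, or re-ranking on versus off.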
When results are bad, debug retrieval in order: (1) is the right item even in the index (ingestion / what you embedded)? (2) does it come back in the candidate set (embedding model / chunking / recall@K)? (3) does hybrid fusion surface it above noise (add BM25 + RRF)? (4) does re-ranking push it into the top-5? Tune the embedding model or generation last — most AI-search quality problems are in retrieval and ranking, not the model.
Before choosing an embedding model or a vector store, decide whether to build the pipeline at all. Amazon Kendra is AWS's fully-managed intelligent-search service; building on Bedrock + OpenSearch is the DIY path. This one choice sets how much you build, how much you control, and how fast you ship.
The honest framing: Kendra if you want managed search with connectors and access control fast; build if you need control, custom ranking, or the lowest per-query cost at scale. Kendra is a search service — it crawls 40+ data sources, builds and tunes its own semantic ranking, answers natural-language queries, matches FAQs, and (critically for enterprise) reads source-system ACLs so each user only sees what they are permitted to, with no embedding pipeline or vector store to run. The trade is less granular control over ranking and embeddings, and a pricing model that is a fixed index baseline plus queries rather than something you tune stage by stage.
Building on Bedrock + OpenSearch Serverless gives you the opposite: full control of the embedding model, chunking, hybrid fusion weights, re-ranking, and exactly what you index — and typically a lower marginal cost per query at high volume because you are paying for tokens and OCUs rather than a managed-search premium. You own the connectors (you write the ingestion), the access-control logic (metadata filters you design), and the maintenance. For a custom in-product search box with bespoke relevance rules, or a very high-QPS workload where per-query cost dominates, building usually wins. For "make our internal wiki and Confluence and SharePoint searchable, with permissions, by next month," Kendra usually wins.
A common pattern is to prototype on Kendra to prove the use case and get a strong baseline in days, then migrate to a built pipeline only if a concrete requirement — custom ranking, a vector store you already run, or cost at scale — forces it. Many teams never need to: Kendra is enough. Others start built because search relevance is their product and they want every knob. The comparison table makes the trade explicit.
Choose Amazon Kendra when speed, connectors, and out-of-the-box access control matter more than control — especially for enterprise knowledge search across many sources with permissions. Build on Bedrock + OpenSearch when you need custom ranking, custom chunking, an existing vector store, or the lowest per-query cost at high volume. Prototype on Kendra; graduate to built only when a hard requirement forces it.
A search demo and a production search feature differ on freshness, access control, latency, and a bill that scales with usage. Each has a concrete AWS answer, and the cost stack has predictable line items that surprise teams who budgeted only for embeddings.
On freshness, your index is only as current as your last sync: trigger re-embedding on a data-change event (an S3 event or DB CDC stream via Lambda, or a scheduled job) and store an updated-at in each record so stale items can be filtered or down-weighted.
On access control, enforce it in the query: tag each record with ACL / tenant metadata at index time and apply a filter on every search so a user only ever retrieves what they are entitled to; for multi-tenant SaaS, filter by tenant at minimum, or use separate indexes for hard isolation. (Kendra does this for you by reading source ACLs.)
On latency, end-to-end time is query + re-rank + optional generation; cache query embeddings for repeated searches, keep re-ranking to a sensible top-N, and stream any generated answer so perceived latency stays low.
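Caching query embeddings can be as simple as memoising the embedding call on a normalised form of the query. The sketch below uses an in-process LRU cache; `embed_fn` stands for whatever function calls your embedding model, and lower-casing plus whitespace collapsing is an assumed normalisation that you should only keep if it suits your content.

```python
from functools import lru_cache

def make_cached_embedder(embed_fn, max_entries=10_000):
    """Wrap embed_fn so identical normalised queries are embedded only once."""
    @lru_cache(maxsize=max_entries)
    def _cached(normalised_query):
        # Store a tuple so the cached value is immutable and safe to share.
        return tuple(embed_fn(normalised_query))

    def embed_query(query):
        # Assumed normalisation: lower-case and collapse runs of whitespace.
        return _cached(" ".join(query.lower().split()))

    return embed_query
```

An in-process cache is lost on restart and is not shared across instances; a shared cache follows the same pattern with a different store behind it.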
The cost lines below show the shape of the bill as of 2026 rather than exact rates; always check the AWS pricing page (and the third-party vendor for any non-AWS component) for current prices. For pure semantic/hybrid search the dominant cost is the always-on index baseline (OpenSearch OCUs or the Kendra index edition); embeddings are a one-time-per-corpus cost plus updates; per-query embedding and re-ranking are small; and generation only appears if you added an answer box, at which point generation tokens usually become the largest line.
| Cost line | When you pay | Driver | Main lever to control it |
|---|---|---|---|
| Embeddings (indexing) | One-time per corpus + on updates | Total tokens embedded | Chunk size; smaller embedding dimensions; only re-embed changed records |
| Search index | Continuous (baseline) | OpenSearch OCUs / Kendra edition / Aurora ACUs | Right-size the engine; pgvector if Postgres already runs; tune dimensions |
| Query embeddings | Per query | Search volume | Negligible per call; cache embeddings for repeated queries |
| Re-ranking | Per query | Candidates re-ranked × queries | Re-rank top-50/100, not top-1000; skip on trivial exact-match queries |
| Generation (only with an answer box) | Per query | Input + output tokens × model price | Cheaper model for easy queries; fewer chunks; prompt caching; tight max-tokens |
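To see how the recurring lines combine, a back-of-envelope estimator like the one below can help. Every rate is a parameter you supply from the current pricing pages; the numbers in the usage line are placeholders chosen for easy arithmetic, not AWS prices, and the one-time indexing embeddings are deliberately left out.

```python
def monthly_search_cost(queries_per_month, index_baseline_usd,
                        embed_usd_per_query, rerank_usd_per_query,
                        generation_usd_per_query=0.0):
    """Break a monthly bill into the recurring lines from the cost table."""
    lines = {
        "index_baseline": index_baseline_usd,
        "query_embeddings": queries_per_month * embed_usd_per_query,
        "re_ranking": queries_per_month * rerank_usd_per_query,
        "generation": queries_per_month * generation_usd_per_query,
    }
    lines["total"] = sum(lines.values())
    return lines

# Placeholder rates only; substitute figures from the current pricing pages.
estimate = monthly_search_cost(
    queries_per_month=1000,
    index_baseline_usd=100.0,
    embed_usd_per_query=0.0,
    rerank_usd_per_query=0.001,
    generation_usd_per_query=0.01,
)
```

Setting `generation_usd_per_query` to zero and then to a realistic value shows quickly whether an answer box would overtake the index baseline at your query volume.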
This is the comparison that decides your architecture. Read it as "Kendra if speed, connectors, and access control matter most; build if you need control, custom ranking, or lowest per-query cost at scale."
| Dimension | Amazon Kendra (managed) | Build (Bedrock + OpenSearch Serverless) |
|---|---|---|
| Time to first search | Days — connect a source, it indexes + ranks | Days to weeks — build embed→index→query→re-rank |
| Pipeline you maintain | Almost none — AWS runs ingestion + ranking | All of it — embeddings, index, hybrid, re-ranking |
| Data ingestion | 40+ built-in connectors (S3, SharePoint, Confluence, Salesforce…) | You write ingestion for each source |
| Ranking control | Managed semantic ranking + relevance tuning knobs | Full — your embeddings, fusion weights, re-ranker |
| Hybrid search | Built in | Native in OpenSearch (vector + BM25 + RRF) |
| Access control | Reads source-system ACLs automatically | You design metadata filters / per-tenant isolation |
| Embedding model choice | Managed (not yours to pick) | Titan v2 / Cohere — your choice and dimensions |
| Cost shape | Fixed index edition + per-query | Per OCU + tokens — lower marginal cost at high QPS |
| Best for | Enterprise knowledge search across many sources, fast, with permissions | Custom in-product search, bespoke ranking, high volume |
Situation: Their in-app product search was lexical only: shoppers who typed descriptions or synonyms ("rain jacket" when the listing said "waterproof shell") got zero results, and the zero-result rate was visibly hurting conversion. They wanted semantic search that still matched exact SKUs and brand names, with per-merchant isolation so one merchant's catalogue never leaked into another's results. The two engineers who could build it were committed to the core roadmap, and the projected Bedrock + OpenSearch bill made the founder hesitate to start.
What CloudRoute did: Routed within 24 hours to a US-region AWS partner with a search / GenAI track record. The partner built it on AWS: Titan v2 embeddings over title + key attributes, OpenSearch Serverless as the vector + keyword engine with native hybrid search and Reciprocal Rank Fusion, Cohere Rerank for top-result precision, per-merchant metadata filtering for isolation, and a 300-query labelled set scored on nDCG and recall@K to tune chunking and fusion weights. The whole engagement was funded by AWS credits the partner filed for — Activate Portfolio plus a Bedrock POC allocation.
Outcome: Hybrid semantic search in production in about 5 weeks. Zero-result rate fell sharply while exact SKU and brand lookups still resolved instantly; per-merchant isolation enforced at query time. The build and the first months of search + inference ran on AWS credits — the customer paid $0. CloudRoute's commission was paid by the partner from AWS engagement funding.
engagement window: ~5 weeks · founder time: ~7 hours · stack: Titan v2 + OpenSearch Serverless (hybrid + RRF) + Cohere Rerank · cost to customer: $0
CloudRoute routes you to a vetted AWS search / GenAI partner who designs and ships it — semantic + hybrid search on Bedrock embeddings and OpenSearch (or managed Amazon Kendra), the right vector store, re-ranking, access control, and relevance tuning. AWS credits fund the build and the inference. You pay $0.