for AWS partners →Have a partner build your chatbot — funded by AWS credits →

build a chatbot on aws · the 2026 guide

How to build a chatbot on AWS (2026).

A modern AWS chatbot is a foundation model on Amazon Bedrock wrapped in a thin application: an API and compute layer to mediate calls, a place to keep conversation history, optional retrieval over your own documents, safety guardrails, and a streaming front end. This is the full build guide — the reference architecture end to end, how to choose the model, a step-by-step build outline, what production actually costs, the concerns that separate a demo from a real deployment (latency, history, safety), and the common variations (customer-support bot vs internal assistant).

Have a partner build your chatbot — funded by AWS credits →→ jump to the reference architecture

core building blocks

managed model API

Bedrock

time to a prototype

hours

credits to fund it

up to $100K

TL;DR

An AWS chatbot is, at its core, a foundation model on Amazon Bedrock plus a thin app around it: an API/compute layer (API Gateway + Lambda, or a container/app) that calls the Bedrock Converse API, a store for conversation memory (usually DynamoDB), optional RAG via Bedrock Knowledge Bases when it must answer from your own data, Bedrock Guardrails for safety, and a streaming UI so replies appear token-by-token.
Pick the model for the job: a small, fast, cheap model (Amazon Nova Lite/Micro, Claude Haiku) for high-volume routing and simple Q&A; a mid or frontier model (Claude Sonnet, Nova Pro, Llama) when answers need real reasoning. You can route easy turns to a cheap model and hard ones to a frontier model. Bedrock exposes them all behind one Converse API, so swapping is a config change, not a rewrite.
The hard parts are not the wiring — they are managing conversation history within the context window, grounding answers in your data without hallucination, enforcing safety with Guardrails, and keeping latency and cost in check as traffic grows. GenAI inference bills add up fast; CloudRoute routes you to AWS credits (Activate Portfolio up to $100K, Bedrock/GenAI POC $10K–$50K, GenAI Accelerator up to $1M) and vetted ML partners who build the bot — you pay $0.

the core idea

IWhat "a chatbot on AWS" actually means in 2026

In 2026 a chatbot is no longer a rules-and-buttons decision tree. It is a conversational interface to a large language model: the user types in natural language, a foundation model interprets and responds in natural language, and the surrounding application supplies memory, knowledge, safety, and a channel to talk through. On AWS, that foundation model almost always lives behind Amazon Bedrock.

It helps to separate the two things people mean by "AWS chatbot," because they lead to completely different builds. The first is AWS Chatbot, a specific AWS service that pipes CloudWatch alarms and operational notifications into Slack, Microsoft Teams, and Amazon Chime — a ChatOps tool for your engineers, not a conversational AI. The second — and what this guide is about — is a generative-AI chatbot you build on AWS: a custom application powered by a foundation model on Amazon Bedrock that talks to your users or your staff. Two different things that share a name; this page covers the second.

The mental model for the generative-AI version is simple. A foundation model is a stateless text-in, text-out function: it has no memory of previous turns and no knowledge of your private data. Everything that makes it feel like a chatbot — remembering the conversation, answering from your documents, staying safe and on-brand — lives in the application you wrap around the model. That is good news: the heavy lifting (the model) is a managed API call, and the parts you build are well-understood components AWS provides managed services for.

Concretely, a turn of conversation flows like this. The user sends a message. Your application loads the recent conversation history, optionally retrieves relevant passages from your knowledge base, assembles a prompt (a system instruction + history + any retrieved context + the new message), passes it through input guardrails, and calls the model on Bedrock. The model streams back a reply; you run it through output guardrails, show it to the user token-by-token, and append both the user message and the reply to the stored history for next time. Every chatbot on AWS — support bot, internal assistant, agentic copilot — is a variation on that loop.

Because the model is reached through the Bedrock Converse API, the same code works across Anthropic Claude, Amazon Nova, Meta Llama, Mistral, Cohere, and others. You are not locking yourself to one model vendor; you are building against one AWS API and choosing (and changing) the model underneath it as a configuration value.

the one-sentence definition

A chatbot on AWS = a foundation model on Amazon Bedrock wrapped in a thin application that supplies conversation memory, optional retrieval over your own data, safety guardrails, and a streaming interface — so a stateless model behaves like a stateful, grounded, safe assistant.

end to end

IIThe reference chatbot architecture on AWS

Almost every production chatbot on AWS is assembled from the same five building blocks. Understanding each one — and which AWS service implements it — is what lets you reason about cost, latency, and where a problem lives when something goes wrong.

The blocks are: (1) a front end / channel the user talks through; (2) an API + compute layer that orchestrates a turn; (3) the model on Bedrock; (4) conversation memory; and (5) optional knowledge (RAG), with Guardrails sitting across the model call as a sixth, cross-cutting concern. The table at the end of this section maps each to its typical AWS service. Walk them in the order a request travels.

1. Front end and channel

This is where the conversation happens: a web widget or app (commonly a React front end on Amazon S3 + CloudFront, or Amplify), a messaging channel (Slack, WhatsApp, your in-app chat), or voice via Amazon Connect for a contact centre. The channel matters mostly for one reason: streaming. Users will tolerate a multi-second answer if they see it being written token-by-token; the same wait with a spinner feels broken. Choose a channel and transport (WebSocket, server-sent events, Bedrock streaming responses) that can stream.

2. API and compute (the orchestrator)

This layer receives the user message and runs the turn: load history, retrieve context, assemble the prompt, apply guardrails, call Bedrock, stream the reply, and persist the turn. The two common shapes are serverless — Amazon API Gateway (REST or WebSocket) in front of AWS Lambda — and container/app — an app on AWS Fargate, ECS, EKS, or App Runner. Serverless is the fast default: it scales to zero, you pay per request, and it suits spiky chat traffic. A long-running container is better when you need persistent connections at scale, heavy in-process orchestration, or a framework (LangChain, LlamaIndex) that you would rather host than cram into a Lambda. Either way, this layer holds no model logic itself — it orchestrates calls to Bedrock.

3. The model on Amazon Bedrock

The brain of the bot is a foundation model called through the Bedrock Converse API — a single, model-agnostic API for multi-turn conversation that also supports tool use (function calling) and streaming. You send the system prompt, the message history, and the new user turn; Bedrock returns the assistant reply. Because the API is uniform across Claude, Nova, Llama, Mistral, and Cohere, the model is a parameter you pick (section III) and can change without rewriting the orchestrator.

4. Conversation memory

Foundation models are stateless, so the application must remember the conversation and replay it on every turn. The standard store is Amazon DynamoDB: one item per message (or per session), keyed by a session ID, low-latency and pay-per-request — a near-perfect fit for chat history. On each turn you read the recent history for the session, include it in the prompt, and write the new turn back. The subtlety is that history cannot grow forever (the context window is finite and longer prompts cost more), which is why managing it is a first-class production concern covered in section VI.

5. Knowledge — optional RAG via Bedrock Knowledge Bases

If the bot must answer from your content — a help centre, product docs, policies, past tickets — bolt on retrieval-augmented generation so it grounds answers in your data instead of guessing. The managed path is Amazon Bedrock Knowledge Bases: point it at an S3 bucket (or a connector), and it handles chunking, embedding, vector storage, retrieval, and re-ranking, exposing a RetrieveAndGenerate call that returns a cited answer. In a chatbot you typically retrieve relevant passages for the user's message and inject them into the prompt before calling the model. A bot that only needs general knowledge and conversation can skip this entirely; a support or internal-knowledge bot lives or dies by it. (For the full RAG build, see the RAG-on-AWS guide in the related links.)

6. Guardrails — the cross-cutting safety layer

Wrapping the model call, Amazon Bedrock Guardrails screen both the user input and the model output for denied topics, harmful content, profanity, prompt-injection attempts, and — importantly — contextual grounding, which can block or flag answers that are not supported by the retrieved context (a strong anti-hallucination control for RAG bots). Guardrails also detect and redact PII. They are configured once and applied on every Bedrock call, independent of the model, so safety policy is consistent even if you change models.

the chatbot building blocks mapped to AWS services · representative as of 2026

Building block	What it does	Typical AWS service	Required?
Front end / channel	Where the user chats; must support streaming	S3 + CloudFront / Amplify · Slack/WhatsApp · Amazon Connect (voice)	Yes
API + compute	Orchestrates a turn end to end	API Gateway + Lambda (serverless) · or Fargate/ECS/App Runner	Yes
Model	Generates the reply	Amazon Bedrock (Converse API) — Claude / Nova / Llama / Mistral	Yes
Conversation memory	Stores + replays chat history	Amazon DynamoDB (session-keyed)	Yes
Knowledge (RAG)	Grounds answers in your documents	Amazon Bedrock Knowledge Bases (+ S3, vector store)	Optional
Guardrails	Input/output safety + grounding + PII	Amazon Bedrock Guardrails	Strongly recommended
Auth + observability	Who is talking; logs, metrics, tracing	Amazon Cognito · CloudWatch · X-Ray	Recommended

A minimal conversational bot needs only the first four rows. A support or internal-knowledge bot adds RAG and Guardrails. Every block above is a managed AWS service — you assemble them, you do not host the model.

the central choice

IIIChoosing the model — cost, latency, and capability

The single most consequential decision is which foundation model answers your users. It sets answer quality, response latency, and the per-message cost that determines whether the bot is economical at scale. Because Bedrock serves them all through one Converse API, you can — and often should — choose per use case, and even per turn.

Think along three axes: capability (how hard is the reasoning?), latency (how fast must the first token appear?), and cost (how many messages a day, at what price per 1K tokens?). They trade off: the most capable models are slower and pricier; the fastest, cheapest models handle routing, classification, and simple Q&A brilliantly but struggle with multi-step reasoning. There is no single right answer — there is a right answer for this bot.

A pattern worth knowing: model routing. Use a small, cheap model (Amazon Nova Micro/Lite, Claude Haiku) as the default for the majority of turns — greetings, FAQs, routing, structured extraction — and escalate only the genuinely hard turns to a frontier model (Claude Sonnet/Opus, Amazon Nova Pro/Premier). Most production chatbots find that a large share of traffic is easy, so routing can cut the model bill substantially while keeping quality high where it matters. With Bedrock's uniform API, the "escalation" is just calling a different model ID.

Two more levers shape the model decision. Streaming (ConverseStream / InvokeModelWithResponseStream) does not change the model but transforms perceived latency — always stream a chatbot. And prompt caching lets Bedrock cache a static, repeated prefix (your system prompt, tool definitions, even a long shared context) so you stop re-paying to process it on every turn — a real saving for chatbots, which send the same system prompt thousands of times a day.

choosing a chatbot model on bedrock · representative profile as of 2026 — check the AWS pricing page for current rates

Model tier	Example models (Bedrock)	Best for in a chatbot	Latency	Relative cost
Small / fast	Amazon Nova Micro & Lite · Claude Haiku	High-volume FAQ, routing, classification, extraction	Lowest	$ (cheapest)
Mid	Amazon Nova Pro · Claude Sonnet · Llama (mid)	The everyday workhorse — good reasoning, sane cost	Medium	$$
Frontier	Claude Opus · Amazon Nova Premier · large Llama/Mistral	Hard multi-step reasoning, nuanced or high-stakes replies	Highest	$$$ (priciest)
Routed (hybrid)	Small default + frontier escalation	Best blended economics at scale	Mostly low	$ → $$$ as needed

Benchmark on your own conversations, not a leaderboard. A common, economical default: a small/fast model for most turns, escalating to a mid or frontier model only when a turn needs deeper reasoning — all behind the same Converse API.

state + knowledge

IVConversation memory and grounding in your data

Two things make a model feel like an assistant rather than a one-shot text generator: it remembers what was just said, and it can answer from your specific knowledge. Both are application concerns you build around the stateless model — here is how each works on AWS, and where they get hard.

Conversation memory — and why it cannot grow forever

Because the model is stateless, "memory" means: store every turn, and replay the relevant history in the prompt on the next turn. Amazon DynamoDB is the standard home for this — session-keyed items, single-digit-millisecond reads, pay-per-request — and it slots naturally next to a Lambda orchestrator. The naïve approach (replay the entire conversation every turn) breaks for two reasons: the context window is finite, and you pay for every input token on every call, so an ever-growing history means ever-rising latency and cost.

The production answer is a memory strategy. The common ones: a sliding window (keep the last N turns verbatim); summarisation (periodically compress older turns into a running summary the model maintains, keeping recent turns verbatim); and semantic / long-term memory (embed past turns or extracted facts into a vector store and retrieve only the relevant ones — effectively RAG over the conversation). Most chatbots start with a sliding window, add summarisation when conversations run long, and reach for semantic memory only when they need to recall facts from far earlier or across sessions.

Grounding — optional RAG so the bot answers from your data

If users will ask about things only your organisation knows, the bot needs retrieval-augmented generation. The flow inside a turn: take the user's message, retrieve the most relevant passages from your corpus, and inject them into the prompt as grounding context with an instruction to answer only from that context and cite sources. On AWS the fast path is Amazon Bedrock Knowledge Bases, which manages ingestion, chunking, embeddings, the vector store (OpenSearch Serverless, Aurora pgvector, Pinecone, or Redis), retrieval, and re-ranking behind one RetrieveAndGenerate call. A do-it-yourself pipeline gives more control (custom chunking, hybrid search, multi-tenant access control) at the cost of more engineering.

The most important grounding control for a chatbot is honesty about limits: instruct the model to say "I don't know" when retrieval returns nothing relevant, and pair that with Guardrails' contextual-grounding check, which can flag or block answers not supported by the retrieved passages. A support bot that confidently invents a refund policy is worse than one that admits it cannot find the answer and hands off to a human. RAG depth — chunking, embeddings, evaluation — is its own topic; the related RAG-on-AWS guide covers it in full.

memory vs knowledge — keep them separate

Memory is what was said in this conversation (DynamoDB, replayed in the prompt). Knowledge is what your organisation knows (a corpus, retrieved via RAG). They are different stores solving different problems — conflating them ("just dump everything in the prompt") blows the context window and the budget. Replay recent memory; retrieve relevant knowledge.

the build, in order

VA step-by-step build outline

Here is the fastest credible path from zero to a streaming, grounded, guard-railed chatbot on AWS using the serverless shape (API Gateway + Lambda + Bedrock). A container-based build follows the same logical order with the orchestrator hosted differently. Ship the thin vertical slice first, then layer on memory, RAG, and safety.

Step 1 — Enable Bedrock model access — In the Bedrock console, request access to a generation model (start with a small/fast one like Amazon Nova Lite or Claude Haiku, plus a frontier model for escalation) in your chosen Region. Confirm it is available there — model and Region availability vary.
Step 2 — Call the Converse API from a Lambda — Write a Lambda that takes a user message, calls the Bedrock Converse API with a system prompt, and returns the reply. Put Amazon API Gateway in front of it. This is the thin slice — a stateless single-turn bot — and proves model access end to end.
Step 3 — Add conversation memory — Create a DynamoDB table keyed by session ID. On each turn, read recent history, include it in the Converse call, and write the new turn back. Start with a sliding window of the last N turns; you now have a stateful multi-turn bot.
Step 4 — Stream the responses — Switch to ConverseStream (or InvokeModelWithResponseStream) and stream tokens to the client over a WebSocket API or server-sent events. This is the single biggest perceived-latency win — do it before you optimise anything else.
Step 5 — Add RAG if it must answer from your data — Stand up an Amazon Bedrock Knowledge Base over your documents in S3 (or skip this step for a general-purpose bot). In each turn, retrieve relevant passages and inject them into the prompt, instructing the model to answer only from context and cite sources, and to say "I don't know" when nothing relevant is found.
Step 6 — Attach Guardrails — Create a Bedrock Guardrail (denied topics, content filters, PII redaction, and — for RAG bots — contextual grounding) and apply it on every Converse call, on both input and output. Safety policy now holds regardless of which model you use.
Step 7 — Add auth, observability, and cost controls — Gate the API with Amazon Cognito (or your IdP), log every turn (prompt, retrieved context, response) to CloudWatch, add tracing with X-Ray, and set billing alarms. Wire prompt caching for the static system prompt, and add model routing if traffic justifies it.
Step 8 — Evaluate, then iterate — Build a fixed set of real conversations and score answer quality (and, for RAG, faithfulness and relevance) on every change — Bedrock model/RAG evaluation can score these automatically. Tune the model choice, memory window, retrieval, and prompt against the numbers, with a human-review sample before scaling traffic.

ship the vertical slice first

Steps 1–2 give you a working bot in an afternoon. Resist building memory, RAG, Guardrails, and a routing layer up front — get one real conversation flowing end to end, then add each capability as a thin increment. Most stalled chatbot projects over-engineered before they had a single answer streaming to a user.

what it costs

VIWhat a chatbot on AWS costs

A chatbot bill is dominated by one line — model inference — with a handful of small supporting costs around it. Knowing the shape lets you estimate before you build and control it after you ship. The figures below are representative as of 2026 to show the shape, not a quote; always check the AWS pricing page for current rates.

The driver of the model bill is tokens × price-per-token × messages. Every turn sends input tokens (system prompt + replayed history + any retrieved context + the user message) and receives output tokens, each priced per 1K and varying by model. That makes the big levers obvious: a cheaper model for easy turns, a tighter memory window and fewer retrieved chunks (fewer input tokens), prompt caching for the static prefix, and capping output length. The supporting costs — Lambda invocations, API Gateway requests, DynamoDB reads/writes, the vector store if you run RAG, and Guardrails evaluations — are usually small next to inference, but the always-on vector-store baseline (if you use RAG) is the one fixed cost worth right-sizing.

A rough way to sanity-check: estimate messages per day, average input + output tokens per message, and multiply by your model's per-1K rates, then add the per-message supporting costs and any fixed RAG baseline. A low-volume internal assistant on a small model can run for very little; a high-volume customer-support bot on a frontier model is dominated entirely by inference and is exactly where model routing, prompt caching, and tight prompts pay for themselves. For a full breakdown with worked examples, see the dedicated chatbot-cost page linked below.

chatbot cost stack on aws · representative shape as of 2026 — check the AWS pricing page for current rates

Cost line	When you pay	Driver	Main lever to control it
Model inference	Per message (usually the largest)	Input + output tokens × model price	Cheaper model for easy turns; prompt caching; tighter history + fewer chunks; cap output
Compute (orchestrator)	Per request / running time	Lambda invocations or container hours	Serverless scales to zero; right-size containers
API layer	Per request / connection	API Gateway requests + WebSocket minutes	Usually minor; batch/trim chatty calls
Conversation memory	Per read/write + storage	DynamoDB ops × turns	On-demand capacity; TTL old sessions
Knowledge (RAG)	Continuous baseline + per query	Vector store size + retrieval + embeddings	Right-size the vector store; only if you need RAG
Guardrails	Per evaluated request	Input/output units checked	Minor; scope policies to what you need

Inference dominates almost every chatbot bill. The two highest-leverage savings are prompt caching (stop re-paying for the static system prompt every turn) and model routing (send the bulk of easy traffic to a small model). The full chatbot-cost guide works through concrete monthly examples.

shipping it for real

VIIProduction concerns — latency, history, safety, and reliability

A chatbot demo and a production chatbot differ on a handful of axes that rarely show up in a prototype: how fast it feels, how it handles long and messy conversations, how safe it stays under adversarial input, and how it behaves when something fails. Each has a concrete AWS answer.

Latency (perceived and real)

Total latency is retrieval (if any) + the model's time-to-first-token + generation time. The cheapest win is streaming: showing the answer as it is written makes a 4-second reply feel instant. Beyond that, a smaller/faster model lowers real latency, fewer input tokens (tight history, re-ranked context, prompt caching) speed up the first token, and keeping the orchestrator and Bedrock in the same Region avoids cross-Region hops. For strict latency SLAs, Bedrock cross-Region inference can also smooth out regional capacity pressure.

Conversation history (the silent failure mode)

Long conversations are where naïve bots break: replay everything and you eventually overflow the context window, costs creep up turn over turn, and the model starts losing the thread (early instructions get buried). Pick a memory strategy deliberately — sliding window, summarisation, or semantic memory (section IV) — set a maximum prompt budget, and store a session TTL in DynamoDB so stale sessions age out. Test explicitly with very long conversations; this rarely shows up in a five-message demo.

Safety and abuse resistance

A public chatbot is an open input box, so assume adversarial use: prompt-injection ("ignore your instructions and…"), attempts to extract the system prompt, requests for disallowed content, and PII flowing in and out. Bedrock Guardrails are the first line — denied topics, content filters, PII redaction, contextual grounding to curb hallucination — applied on every call. Layer defensive prompting (keep secrets and tools out of reach of user-controllable text), rate limiting at API Gateway, and, for RAG bots, retrieval-time access control so a user can never retrieve documents they are not entitled to. Safety is not one feature; it is Guardrails + prompt design + access control + rate limits together.

Reliability and observability

Treat the bot like any production service. Handle Bedrock throttling and transient errors with retries and backoff (and consider Provisioned Throughput for guaranteed capacity at steady high volume). Log every turn — prompt, retrieved context, model, latency, token counts, and the final answer — to CloudWatch so you can reproduce and audit any response, and trace requests with X-Ray. Add a human-handoff path for when the bot is unsure or the user asks for a person, and watch quality with a periodic human-review sample on real traffic, because automated metrics miss domain-specific errors a person catches instantly.

production readiness checklist

Before launch: streaming responses · a deliberate memory strategy with a prompt-token budget and session TTL · Guardrails on input and output · retrieval-time access control (RAG bots) · rate limiting + auth on the API · full per-turn logging and tracing · retries/backoff for Bedrock throttling · a human-handoff path · a golden conversation set scored on every change · billing alarms and a cost ceiling. Miss one and the gap shows up in production, not the demo.

the common shapes

VIIICommon variations — support bot, internal assistant, and agentic copilot

The reference architecture flexes into a few recurring shapes. They share the same building blocks but weight them differently, which changes the model you pick, whether you need RAG, and how hard you push on safety and access control.

Across all four, the spine is identical — a Bedrock model, an orchestrator, memory, optional RAG, and Guardrails. What changes is emphasis: how capable a model, whether grounding is central, how strict the access control, and whether the bot only talks or also acts. Start from the reference architecture and dial each block to the variation you are building.

Customer-support bot (external) — Answers customers from your help centre and policies, ideally with deflection to a human when unsure. Heavy on RAG (Bedrock Knowledge Bases over your docs + tickets), heavy on Guardrails (it speaks for your brand to the public), and sensitive to latency and cost at volume — so model routing and prompt caching matter. Often integrated into existing chat or a contact centre (Amazon Connect). The grounding-and-citations discipline from the RAG guide is non-negotiable here.
Internal knowledge assistant (employee-facing) — Answers staff from internal wikis, runbooks, HR/IT docs, and code. Lower abuse risk than a public bot but much stricter on access control — different employees may see different documents, so retrieval-time metadata filtering (per user/group/tenant) is essential. RAG-centric, with conversation memory for multi-step troubleshooting. Amazon Q Business is the off-the-shelf alternative when you would rather buy this pattern than build it (see related links).
Agentic copilot (takes actions, not just answers) — Goes beyond Q&A to call tools and APIs — create a ticket, look up an order, trigger a workflow — using the Converse API's tool use or Amazon Bedrock Agents to plan and execute multi-step tasks. Adds real power and real risk: every tool the bot can call is an action it can take, so scope tool permissions tightly, require confirmation for consequential actions, and log every tool call. Bedrock Agents (and AgentCore for running agents at scale) is the managed route; the Bedrock Agents guide in the related links goes deep.
Voice / contact-centre bot — The chatbot loop fronted by speech: Amazon Connect (or Transcribe + Polly) handles speech-to-text and text-to-speech, with the same Bedrock-backed brain in the middle. Latency budgets are tighter (people expect fast spoken turns), so favour faster models and streaming, and design explicit fallbacks to a human agent.

build vs buy, side by side

Build a custom chatbot on Bedrock vs buy Amazon Q Business

Before building, it is worth asking whether you should. AWS offers a managed enterprise assistant — Amazon Q Business — that is essentially a pre-built RAG chatbot over your data. Build a custom bot on Bedrock when you need control or a customer-facing/branded experience; buy Q Business when you want an internal assistant fast with minimal engineering.

Dimension	Custom chatbot on Amazon Bedrock	Amazon Q Business (managed)
What it is	You assemble model + app + memory + RAG + Guardrails	Pre-built enterprise RAG assistant over your connectors
Time to value	Hours to a prototype; weeks to production-grade	Fast — connect data sources, configure, go
Control / customisation	Total — any model, prompt, UX, logic, channel	Limited to the product's configuration surface
Model choice	Any model on Bedrock; route per turn	Managed by AWS under the hood
Customer-facing / branded	Yes — embed anywhere, own the UX	Primarily internal, employee-facing
Engineering effort	Higher — you build and maintain the app	Low — configuration over code
Pricing shape	Pay per token + supporting AWS services	Per-user subscription (+ usage)
Best for	Custom/agentic bots, customer support, full control	Internal knowledge assistant, minimal build

Not mutually exclusive: many organisations buy Q Business for the internal-knowledge use case and build a custom Bedrock chatbot for the customer-facing or agentic one. Decide per use case, not once for the whole company.

building this for real?

Have a vetted AWS partner build your chatbot — and let AWS credits pay for it

Start in 3 minutes →

a recent match

A grounded customer-support chatbot — anonymized

inquiry · series-a b2b SaaS, support automation, US

Series-A B2B SaaS, ~30 people, ~8k help-centre articles + a busy support queue, US-based, already on AWS

Situation: Support volume was outgrowing the team and the founders wanted a chatbot on their site that answered strictly from their own docs, with citations, that escalated to a human when unsure — and that would not hallucinate a wrong answer about billing or security. An early in-house prototype on a single model with no memory and no grounding gave confident-but-wrong answers and had no safety story. The one engineer who could build it properly was fully committed to the core product, and the projected Bedrock inference bill at their support volume made the founders hesitant to commit.

What CloudRoute did: Routed within 24 hours to a US-region AWS partner with a GenAI/ML track record. The partner built the reference architecture in the team's existing account: API Gateway + Lambda orchestrator, the Converse API with model routing (a small fast model for the bulk of turns, escalating to Claude Sonnet for hard ones), DynamoDB conversation memory with a sliding window plus summarisation, Bedrock Knowledge Bases over the help centre for grounded, cited answers, Bedrock Guardrails on input and output with contextual grounding, streaming responses, prompt caching for the static system prompt, full CloudWatch logging, and a human-handoff path. The whole engagement — build plus the first months of inference — was funded by AWS credits the partner filed for: Activate Portfolio plus a Bedrock POC allocation.

Outcome: A cited, grounded support chatbot in production in under 5 weeks, deflecting a meaningful share of routine tickets and handing off cleanly when unsure. Model routing and prompt caching kept the inference bill well below the founders' worst-case estimate. The build and early inference ran on AWS credits — the customer paid $0. CloudRoute's commission was paid by the partner from AWS engagement funding.

engagement window: ~5 weeks · founder time: ~7 hours · stack: API Gateway + Lambda + Bedrock Converse (routed) + DynamoDB + Bedrock KB + Guardrails · cost to customer: $0

faq

Common questions

How do I build a chatbot on AWS?

Build it as a foundation model on Amazon Bedrock wrapped in a thin application. The minimal path: enable a Bedrock model, call the Converse API from a Lambda behind API Gateway (a single-turn bot), add Amazon DynamoDB for conversation memory (multi-turn), and stream the responses. Then layer on what your use case needs: Amazon Bedrock Knowledge Bases for retrieval-augmented generation if it must answer from your own documents, Amazon Bedrock Guardrails for input/output safety, and auth, logging, and cost controls for production. A prototype takes hours; production-grade is typically a few weeks.

Which AWS service do I use to build an AI chatbot?

Amazon Bedrock is the core — it serves the foundation models (Claude, Amazon Nova, Llama, Mistral, Cohere) through one Converse API. Around it you typically use API Gateway + AWS Lambda (or a container on Fargate/ECS/App Runner) to orchestrate, Amazon DynamoDB for conversation history, Amazon Bedrock Knowledge Bases for RAG, and Amazon Bedrock Guardrails for safety. Note this is different from the service literally named "AWS Chatbot," which routes operational alerts into Slack/Teams and is not a conversational AI builder.

Should I use Amazon Bedrock or Amazon Q Business to build a chatbot?

Build a custom chatbot on Amazon Bedrock when you need control — any model, custom prompts and UX, agentic actions, or a customer-facing/branded experience. Choose Amazon Q Business when you want a managed, pre-built enterprise assistant over your internal data with minimal engineering: you connect data sources, configure it, and it handles the RAG chatbot pattern for you, priced per user. A common split is Q Business for the internal-knowledge assistant and a custom Bedrock build for the customer-facing or agentic bot — decide per use case.

Which foundation model should I use for a chatbot?

Match the model to the job and benchmark on your own conversations. Use a small, fast, cheap model (Amazon Nova Micro/Lite, Claude Haiku) for high-volume FAQ, routing, and classification; a mid model (Amazon Nova Pro, Claude Sonnet) as the everyday workhorse; and a frontier model (Claude Opus, Amazon Nova Premier) for hard multi-step reasoning. A cost-effective pattern is model routing — a small model handles the bulk of easy turns and escalates only hard ones to a frontier model. Because all run behind the Bedrock Converse API, changing models is a configuration change, not a rewrite.

How does a chatbot remember the conversation if the model is stateless?

The model has no memory, so the application supplies it: store every turn (commonly in Amazon DynamoDB, keyed by session) and replay the relevant history in the prompt on the next turn. Because the context window is finite and you pay per input token, you cannot replay everything forever — use a memory strategy: a sliding window of recent turns, periodic summarisation of older turns, or semantic long-term memory (embed past turns and retrieve the relevant ones). Most bots start with a sliding window and add summarisation as conversations get longer.

How do I stop my AWS chatbot from hallucinating or going off the rails?

Combine four things. Ground it: use retrieval-augmented generation (Bedrock Knowledge Bases) so it answers from your documents, instruct it to answer only from the retrieved context and to say "I don't know" otherwise, and return citations. Guard it: apply Amazon Bedrock Guardrails on input and output for denied topics, content filtering, PII redaction, and contextual grounding (which flags answers unsupported by the context). Defend the prompt: keep secrets and tools out of reach of user text to resist prompt injection. And measure it: score faithfulness and relevance on a fixed conversation set so you catch regressions.

What does it cost to run a chatbot on AWS?

The bill is dominated by model inference, priced per 1K input + output tokens and varying by model, multiplied by message volume. Supporting costs — Lambda/compute, API Gateway, DynamoDB, the vector store if you use RAG, and Guardrails — are usually small by comparison, with the always-on vector-store baseline the main fixed cost. The biggest levers are routing easy turns to a cheaper model, prompt caching for the static system prompt, tighter conversation history and fewer retrieved chunks (fewer input tokens), and capping output length. Figures are representative as of 2026 — check the AWS pricing page for current rates, and see the dedicated chatbot-cost guide for worked examples.

How long does it take to build a chatbot on AWS?

A streaming prototype on Amazon Bedrock — Lambda calling the Converse API behind API Gateway, with basic memory — can be working in hours to a day. Getting to genuinely production-ready (RAG grounding with citations, Guardrails, access control, conversation-history management, logging and tracing, a human-handoff path, evaluation, and cost controls) is typically 2–6 weeks depending on data cleanliness and requirements. The slowest part is usually preparing the knowledge for RAG, not wiring the bot. A specialist AWS ML partner compresses this — the engagement CloudRoute routes, funded by AWS credits, so the customer pays $0.

Build your chatbot on AWS — funded by AWS credits

CloudRoute routes you to a vetted AWS GenAI/ML partner who designs and ships the bot — the Bedrock model (or a routed set), the API and compute layer, conversation memory, RAG grounding, Guardrails, streaming, and evaluation. AWS credits fund the build and the inference. You pay $0.

Get matched with a chatbot build partner →→ see the AI-team persona detail

matched within< 24h

credits to fund itup to $100K

cost to you$0