A modern AWS chatbot is a foundation model on Amazon Bedrock wrapped in a thin application: an API and compute layer to mediate calls, a place to keep conversation history, optional retrieval over your own documents, safety guardrails, and a streaming front end. This is the full build guide — the reference architecture end to end, how to choose the model, a step-by-step build outline, what production actually costs, the concerns that separate a demo from a real deployment (latency, history, safety), and the common variations (customer-support bot vs internal assistant).
In 2026 a chatbot is no longer a rules-and-buttons decision tree. It is a conversational interface to a large language model: the user types in natural language, a foundation model interprets and responds in natural language, and the surrounding application supplies memory, knowledge, safety, and a channel to talk through. On AWS, that foundation model almost always lives behind Amazon Bedrock.
It helps to separate the two things people mean by "AWS chatbot," because they lead to completely different builds. The first is AWS Chatbot, a specific AWS service that pipes CloudWatch alarms and operational notifications into Slack, Microsoft Teams, and Amazon Chime — a ChatOps tool for your engineers, not a conversational AI. The second — and what this guide is about — is a generative-AI chatbot you build on AWS: a custom application powered by a foundation model on Amazon Bedrock that talks to your users or your staff. Two different things that share a name; this page covers the second.
The mental model for the generative-AI version is simple. A foundation model is a stateless text-in, text-out function: it has no memory of previous turns and no knowledge of your private data. Everything that makes it feel like a chatbot — remembering the conversation, answering from your documents, staying safe and on-brand — lives in the application you wrap around the model. That is good news: the heavy lifting (the model) is a managed API call, and the parts you build are well-understood components AWS provides managed services for.
Concretely, a turn of conversation flows like this. The user sends a message. Your application loads the recent conversation history, optionally retrieves relevant passages from your knowledge base, assembles a prompt (a system instruction + history + any retrieved context + the new message), passes it through input guardrails, and calls the model on Bedrock. The model streams back a reply; you run it through output guardrails, show it to the user token-by-token, and append both the user message and the reply to the stored history for next time. Every chatbot on AWS — support bot, internal assistant, agentic copilot — is a variation on that loop.
Because the model is reached through the Bedrock Converse API, the same code works across Anthropic Claude, Amazon Nova, Meta Llama, Mistral, Cohere, and others. You are not locking yourself to one model vendor; you are building against one AWS API and choosing (and changing) the model underneath it as a configuration value.
A chatbot on AWS = a foundation model on Amazon Bedrock wrapped in a thin application that supplies conversation memory, optional retrieval over your own data, safety guardrails, and a streaming interface — so a stateless model behaves like a stateful, grounded, safe assistant.
Almost every production chatbot on AWS is assembled from the same five building blocks. Understanding each one — and which AWS service implements it — is what lets you reason about cost, latency, and where a problem lives when something goes wrong.
The blocks are: (1) a front end / channel the user talks through; (2) an API + compute layer that orchestrates a turn; (3) the model on Bedrock; (4) conversation memory; and (5) optional knowledge (RAG), with Guardrails sitting across the model call as a sixth, cross-cutting concern. The table at the end of this section maps each to its typical AWS service. Walk them in the order a request travels.
This is where the conversation happens: a web widget or app (commonly a React front end on Amazon S3 + CloudFront, or Amplify), a messaging channel (Slack, WhatsApp, your in-app chat), or voice via Amazon Connect for a contact centre. The channel matters mostly for one reason: streaming. Users will tolerate a multi-second answer if they see it being written token-by-token; the same wait with a spinner feels broken. Choose a channel and transport (WebSocket, server-sent events, Bedrock streaming responses) that can stream.
This layer receives the user message and runs the turn: load history, retrieve context, assemble the prompt, apply guardrails, call Bedrock, stream the reply, and persist the turn. The two common shapes are serverless — Amazon API Gateway (REST or WebSocket) in front of AWS Lambda — and container/app — an app on AWS Fargate, ECS, EKS, or App Runner. Serverless is the fast default: it scales to zero, you pay per request, and it suits spiky chat traffic. A long-running container is better when you need persistent connections at scale, heavy in-process orchestration, or a framework (LangChain, LlamaIndex) that you would rather host than cram into a Lambda. Either way, this layer holds no model logic itself — it orchestrates calls to Bedrock.
The brain of the bot is a foundation model called through the Bedrock Converse API — a single, model-agnostic API for multi-turn conversation that also supports tool use (function calling) and streaming. You send the system prompt, the message history, and the new user turn; Bedrock returns the assistant reply. Because the API is uniform across Claude, Nova, Llama, Mistral, and Cohere, the model is a parameter you pick (section III) and can change without rewriting the orchestrator.
Foundation models are stateless, so the application must remember the conversation and replay it on every turn. The standard store is Amazon DynamoDB: one item per message (or per session), keyed by a session ID, low-latency and pay-per-request — a near-perfect fit for chat history. On each turn you read the recent history for the session, include it in the prompt, and write the new turn back. The subtlety is that history cannot grow forever (the context window is finite and longer prompts cost more), which is why managing it is a first-class production concern covered in section VI.
If the bot must answer from your content — a help centre, product docs, policies, past tickets — bolt on retrieval-augmented generation so it grounds answers in your data instead of guessing. The managed path is Amazon Bedrock Knowledge Bases: point it at an S3 bucket (or a connector), and it handles chunking, embedding, vector storage, retrieval, and re-ranking, exposing a RetrieveAndGenerate call that returns a cited answer. In a chatbot you typically retrieve relevant passages for the user's message and inject them into the prompt before calling the model. A bot that only needs general knowledge and conversation can skip this entirely; a support or internal-knowledge bot lives or dies by it. (For the full RAG build, see the RAG-on-AWS guide in the related links.)
Wrapping the model call, Amazon Bedrock Guardrails screen both the user input and the model output for denied topics, harmful content, profanity, prompt-injection attempts, and — importantly — contextual grounding, which can block or flag answers that are not supported by the retrieved context (a strong anti-hallucination control for RAG bots). Guardrails also detect and redact PII. They are configured once and applied on every Bedrock call, independent of the model, so safety policy is consistent even if you change models.
| Building block | What it does | Typical AWS service | Required? |
|---|---|---|---|
| Front end / channel | Where the user chats; must support streaming | S3 + CloudFront / Amplify · Slack/WhatsApp · Amazon Connect (voice) | Yes |
| API + compute | Orchestrates a turn end to end | API Gateway + Lambda (serverless) · or Fargate/ECS/App Runner | Yes |
| Model | Generates the reply | Amazon Bedrock (Converse API) — Claude / Nova / Llama / Mistral | Yes |
| Conversation memory | Stores + replays chat history | Amazon DynamoDB (session-keyed) | Yes |
| Knowledge (RAG) | Grounds answers in your documents | Amazon Bedrock Knowledge Bases (+ S3, vector store) | Optional |
| Guardrails | Input/output safety + grounding + PII | Amazon Bedrock Guardrails | Strongly recommended |
| Auth + observability | Who is talking; logs, metrics, tracing | Amazon Cognito · CloudWatch · X-Ray | Recommended |
The single most consequential decision is which foundation model answers your users. It sets answer quality, response latency, and the per-message cost that determines whether the bot is economical at scale. Because Bedrock serves them all through one Converse API, you can — and often should — choose per use case, and even per turn.
Think along three axes: capability (how hard is the reasoning?), latency (how fast must the first token appear?), and cost (how many messages a day, at what price per 1K tokens?). They trade off: the most capable models are slower and pricier; the fastest, cheapest models handle routing, classification, and simple Q&A brilliantly but struggle with multi-step reasoning. There is no single right answer — there is a right answer for this bot.
A pattern worth knowing: model routing. Use a small, cheap model (Amazon Nova Micro/Lite, Claude Haiku) as the default for the majority of turns — greetings, FAQs, routing, structured extraction — and escalate only the genuinely hard turns to a frontier model (Claude Sonnet/Opus, Amazon Nova Pro/Premier). Most production chatbots find that a large share of traffic is easy, so routing can cut the model bill substantially while keeping quality high where it matters. With Bedrock's uniform API, the "escalation" is just calling a different model ID.
Two more levers shape the model decision. Streaming (ConverseStream / InvokeModelWithResponseStream) does not change the model but transforms perceived latency — always stream a chatbot. And prompt caching lets Bedrock cache a static, repeated prefix (your system prompt, tool definitions, even a long shared context) so you stop re-paying to process it on every turn — a real saving for chatbots, which send the same system prompt thousands of times a day.
| Model tier | Example models (Bedrock) | Best for in a chatbot | Latency | Relative cost |
|---|---|---|---|---|
| Small / fast | Amazon Nova Micro & Lite · Claude Haiku | High-volume FAQ, routing, classification, extraction | Lowest | $ (cheapest) |
| Mid | Amazon Nova Pro · Claude Sonnet · Llama (mid) | The everyday workhorse — good reasoning, sane cost | Medium | $$ |
| Frontier | Claude Opus · Amazon Nova Premier · large Llama/Mistral | Hard multi-step reasoning, nuanced or high-stakes replies | Highest | $$$ (priciest) |
| Routed (hybrid) | Small default + frontier escalation | Best blended economics at scale | Mostly low | $ → $$$ as needed |
Two things make a model feel like an assistant rather than a one-shot text generator: it remembers what was just said, and it can answer from your specific knowledge. Both are application concerns you build around the stateless model — here is how each works on AWS, and where they get hard.
Because the model is stateless, "memory" means: store every turn, and replay the relevant history in the prompt on the next turn. Amazon DynamoDB is the standard home for this — session-keyed items, single-digit-millisecond reads, pay-per-request — and it slots naturally next to a Lambda orchestrator. The naïve approach (replay the entire conversation every turn) breaks for two reasons: the context window is finite, and you pay for every input token on every call, so an ever-growing history means ever-rising latency and cost.
The production answer is a memory strategy. The common ones: a sliding window (keep the last N turns verbatim); summarisation (periodically compress older turns into a running summary the model maintains, keeping recent turns verbatim); and semantic / long-term memory (embed past turns or extracted facts into a vector store and retrieve only the relevant ones — effectively RAG over the conversation). Most chatbots start with a sliding window, add summarisation when conversations run long, and reach for semantic memory only when they need to recall facts from far earlier or across sessions.
If users will ask about things only your organisation knows, the bot needs retrieval-augmented generation. The flow inside a turn: take the user's message, retrieve the most relevant passages from your corpus, and inject them into the prompt as grounding context with an instruction to answer only from that context and cite sources. On AWS the fast path is Amazon Bedrock Knowledge Bases, which manages ingestion, chunking, embeddings, the vector store (OpenSearch Serverless, Aurora pgvector, Pinecone, or Redis), retrieval, and re-ranking behind one RetrieveAndGenerate call. A do-it-yourself pipeline gives more control (custom chunking, hybrid search, multi-tenant access control) at the cost of more engineering.
The most important grounding control for a chatbot is honesty about limits: instruct the model to say "I don't know" when retrieval returns nothing relevant, and pair that with Guardrails' contextual-grounding check, which can flag or block answers not supported by the retrieved passages. A support bot that confidently invents a refund policy is worse than one that admits it cannot find the answer and hands off to a human. RAG depth — chunking, embeddings, evaluation — is its own topic; the related RAG-on-AWS guide covers it in full.
Memory is what was said in this conversation (DynamoDB, replayed in the prompt). Knowledge is what your organisation knows (a corpus, retrieved via RAG). They are different stores solving different problems — conflating them ("just dump everything in the prompt") blows the context window and the budget. Replay recent memory; retrieve relevant knowledge.
Here is the fastest credible path from zero to a streaming, grounded, guard-railed chatbot on AWS using the serverless shape (API Gateway + Lambda + Bedrock). A container-based build follows the same logical order with the orchestrator hosted differently. Ship the thin vertical slice first, then layer on memory, RAG, and safety.
Steps 1–2 give you a working bot in an afternoon. Resist building memory, RAG, Guardrails, and a routing layer up front — get one real conversation flowing end to end, then add each capability as a thin increment. Most stalled chatbot projects over-engineered before they had a single answer streaming to a user.
A chatbot bill is dominated by one line — model inference — with a handful of small supporting costs around it. Knowing the shape lets you estimate before you build and control it after you ship. The figures below are representative as of 2026 to show the shape, not a quote; always check the AWS pricing page for current rates.
The driver of the model bill is tokens × price-per-token × messages. Every turn sends input tokens (system prompt + replayed history + any retrieved context + the user message) and receives output tokens, each priced per 1K and varying by model. That makes the big levers obvious: a cheaper model for easy turns, a tighter memory window and fewer retrieved chunks (fewer input tokens), prompt caching for the static prefix, and capping output length. The supporting costs — Lambda invocations, API Gateway requests, DynamoDB reads/writes, the vector store if you run RAG, and Guardrails evaluations — are usually small next to inference, but the always-on vector-store baseline (if you use RAG) is the one fixed cost worth right-sizing.
A rough way to sanity-check: estimate messages per day, average input + output tokens per message, and multiply by your model's per-1K rates, then add the per-message supporting costs and any fixed RAG baseline. A low-volume internal assistant on a small model can run for very little; a high-volume customer-support bot on a frontier model is dominated entirely by inference and is exactly where model routing, prompt caching, and tight prompts pay for themselves. For a full breakdown with worked examples, see the dedicated chatbot-cost page linked below.
| Cost line | When you pay | Driver | Main lever to control it |
|---|---|---|---|
| Model inference | Per message (usually the largest) | Input + output tokens × model price | Cheaper model for easy turns; prompt caching; tighter history + fewer chunks; cap output |
| Compute (orchestrator) | Per request / running time | Lambda invocations or container hours | Serverless scales to zero; right-size containers |
| API layer | Per request / connection | API Gateway requests + WebSocket minutes | Usually minor; batch/trim chatty calls |
| Conversation memory | Per read/write + storage | DynamoDB ops × turns | On-demand capacity; TTL old sessions |
| Knowledge (RAG) | Continuous baseline + per query | Vector store size + retrieval + embeddings | Right-size the vector store; only if you need RAG |
| Guardrails | Per evaluated request | Input/output units checked | Minor; scope policies to what you need |
A chatbot demo and a production chatbot differ on a handful of axes that rarely show up in a prototype: how fast it feels, how it handles long and messy conversations, how safe it stays under adversarial input, and how it behaves when something fails. Each has a concrete AWS answer.
Total latency is retrieval (if any) + the model's time-to-first-token + generation time. The cheapest win is streaming: showing the answer as it is written makes a 4-second reply feel instant. Beyond that, a smaller/faster model lowers real latency, fewer input tokens (tight history, re-ranked context, prompt caching) speed up the first token, and keeping the orchestrator and Bedrock in the same Region avoids cross-Region hops. For strict latency SLAs, Bedrock cross-Region inference can also smooth out regional capacity pressure.
Long conversations are where naïve bots break: replay everything and you eventually overflow the context window, costs creep up turn over turn, and the model starts losing the thread (early instructions get buried). Pick a memory strategy deliberately — sliding window, summarisation, or semantic memory (section IV) — set a maximum prompt budget, and store a session TTL in DynamoDB so stale sessions age out. Test explicitly with very long conversations; this rarely shows up in a five-message demo.
A public chatbot is an open input box, so assume adversarial use: prompt-injection ("ignore your instructions and…"), attempts to extract the system prompt, requests for disallowed content, and PII flowing in and out. Bedrock Guardrails are the first line — denied topics, content filters, PII redaction, contextual grounding to curb hallucination — applied on every call. Layer defensive prompting (keep secrets and tools out of reach of user-controllable text), rate limiting at API Gateway, and, for RAG bots, retrieval-time access control so a user can never retrieve documents they are not entitled to. Safety is not one feature; it is Guardrails + prompt design + access control + rate limits together.
Treat the bot like any production service. Handle Bedrock throttling and transient errors with retries and backoff (and consider Provisioned Throughput for guaranteed capacity at steady high volume). Log every turn — prompt, retrieved context, model, latency, token counts, and the final answer — to CloudWatch so you can reproduce and audit any response, and trace requests with X-Ray. Add a human-handoff path for when the bot is unsure or the user asks for a person, and watch quality with a periodic human-review sample on real traffic, because automated metrics miss domain-specific errors a person catches instantly.
Before launch: streaming responses · a deliberate memory strategy with a prompt-token budget and session TTL · Guardrails on input and output · retrieval-time access control (RAG bots) · rate limiting + auth on the API · full per-turn logging and tracing · retries/backoff for Bedrock throttling · a human-handoff path · a golden conversation set scored on every change · billing alarms and a cost ceiling. Miss one and the gap shows up in production, not the demo.
The reference architecture flexes into a few recurring shapes. They share the same building blocks but weight them differently, which changes the model you pick, whether you need RAG, and how hard you push on safety and access control.
Across all four, the spine is identical — a Bedrock model, an orchestrator, memory, optional RAG, and Guardrails. What changes is emphasis: how capable a model, whether grounding is central, how strict the access control, and whether the bot only talks or also acts. Start from the reference architecture and dial each block to the variation you are building.
Before building, it is worth asking whether you should. AWS offers a managed enterprise assistant — Amazon Q Business — that is essentially a pre-built RAG chatbot over your data. Build a custom bot on Bedrock when you need control or a customer-facing/branded experience; buy Q Business when you want an internal assistant fast with minimal engineering.
| Dimension | Custom chatbot on Amazon Bedrock | Amazon Q Business (managed) |
|---|---|---|
| What it is | You assemble model + app + memory + RAG + Guardrails | Pre-built enterprise RAG assistant over your connectors |
| Time to value | Hours to a prototype; weeks to production-grade | Fast — connect data sources, configure, go |
| Control / customisation | Total — any model, prompt, UX, logic, channel | Limited to the product's configuration surface |
| Model choice | Any model on Bedrock; route per turn | Managed by AWS under the hood |
| Customer-facing / branded | Yes — embed anywhere, own the UX | Primarily internal, employee-facing |
| Engineering effort | Higher — you build and maintain the app | Low — configuration over code |
| Pricing shape | Pay per token + supporting AWS services | Per-user subscription (+ usage) |
| Best for | Custom/agentic bots, customer support, full control | Internal knowledge assistant, minimal build |
Situation: Support volume was outgrowing the team and the founders wanted a chatbot on their site that answered strictly from their own docs, with citations, that escalated to a human when unsure — and that would not hallucinate a wrong answer about billing or security. An early in-house prototype on a single model with no memory and no grounding gave confident-but-wrong answers and had no safety story. The one engineer who could build it properly was fully committed to the core product, and the projected Bedrock inference bill at their support volume made the founders hesitant to commit.
What CloudRoute did: Routed within 24 hours to a US-region AWS partner with a GenAI/ML track record. The partner built the reference architecture in the team's existing account: API Gateway + Lambda orchestrator, the Converse API with model routing (a small fast model for the bulk of turns, escalating to Claude Sonnet for hard ones), DynamoDB conversation memory with a sliding window plus summarisation, Bedrock Knowledge Bases over the help centre for grounded, cited answers, Bedrock Guardrails on input and output with contextual grounding, streaming responses, prompt caching for the static system prompt, full CloudWatch logging, and a human-handoff path. The whole engagement — build plus the first months of inference — was funded by AWS credits the partner filed for: Activate Portfolio plus a Bedrock POC allocation.
Outcome: A cited, grounded support chatbot in production in under 5 weeks, deflecting a meaningful share of routine tickets and handing off cleanly when unsure. Model routing and prompt caching kept the inference bill well below the founders' worst-case estimate. The build and early inference ran on AWS credits — the customer paid $0. CloudRoute's commission was paid by the partner from AWS engagement funding.
engagement window: ~5 weeks · founder time: ~7 hours · stack: API Gateway + Lambda + Bedrock Converse (routed) + DynamoDB + Bedrock KB + Guardrails · cost to customer: $0
CloudRoute routes you to a vetted AWS GenAI/ML partner who designs and ships the bot — the Bedrock model (or a routed set), the API and compute layer, conversation memory, RAG grounding, Guardrails, streaming, and evaluation. AWS credits fund the build and the inference. You pay $0.