A voice assistant is three jobs in a loop: turn speech into text, reason over it, and turn the answer back into speech. On AWS that maps cleanly onto Amazon Transcribe (speech-to-text), Amazon Bedrock (the reasoning model), and Amazon Polly (text-to-speech), with Amazon Lex for structured dialog and Amazon Connect when the surface is a phone line. This guide covers the reference architecture, the latency budget that decides whether the assistant feels alive or laggy, streaming and barge-in, a step-by-step build, the cost model, and where it fits: IVR, in-app voice, and beyond.
Strip away the branding and every voice assistant, from a smart speaker to a bank's phone line, is the same loop: listen, think, speak — then listen again. On AWS each of those three jobs is a managed service, and the engineering challenge is making the hand-offs between them fast enough that the loop feels like a conversation rather than a walkie-talkie.
The first job is speech-to-text (STT), also called automatic speech recognition (ASR): converting the audio of someone speaking into a transcript your software can work with. On AWS this is Amazon Transcribe. The critical distinction is batch versus streaming — batch transcription processes a finished recording, while streaming transcription returns words as the person is still talking, which is the only mode that works for a live assistant.
The second job is reasoning: deciding what to say back. This is where the intelligence lives, and on AWS it is Amazon Bedrock — a managed API to foundation models like Anthropic Claude, Amazon Nova, Meta Llama, and Mistral. The model takes the transcript (plus conversation history, system instructions, and usually retrieved context from your own data) and produces a reply. For anything beyond chit-chat you ground the model on your knowledge with a Bedrock Knowledge Base so it answers from your policies, catalog, or account data rather than from generic training.
The third job is text-to-speech (TTS), also called speech synthesis: turning the model's text reply back into natural-sounding audio. On AWS this is Amazon Polly, whose neural and generative voices sound far closer to a human than the robotic TTS of a decade ago. Like Transcribe, Polly can stream its output, so audio starts playing before the whole sentence has been synthesized.
Two more services sit around this core. Amazon Lex is the dialog-management layer — the same technology that powers Alexa — handling intents, slots (the pieces of information you need to collect), and the back-and-forth flow; it has Transcribe and Polly built in, so for many bots Lex is the STT+TTS+dialog layer and you only add Bedrock for open-ended reasoning. Amazon Connect is the cloud contact center: it gives you a phone number, call routing, queues, and agents, and it integrates Lex natively so a caller can talk to your bot and be handed to a human seamlessly. Which of these you reach for depends on the surface — covered in section II.
A voice AI assistant on AWS = Amazon Transcribe (speech → text) → Amazon Bedrock (a foundation model reasons, grounded on your data) → Amazon Polly (text → speech), with Amazon Lex for structured dialog and Amazon Connect when it lives on a phone line — all serverless, all streaming, with the whole turn engineered to land under ~1 second.
Five AWS services do the work, and choosing the right subset is most of the architecture decision. The table below maps each to its job, its streaming story (the property that makes or breaks a live assistant), and roughly how it is priced — so you can see at a glance which services your specific surface needs.
Read the table with one question in mind: am I building for a phone line or for an app/device? A phone-line / IVR experience leans on Amazon Connect and Amazon Lex, which bundle telephony, dialog, and the STT/TTS plumbing, and you add Bedrock for open-ended answers. An in-app or on-device voice feature usually skips Connect and Lex and calls Transcribe streaming, Bedrock, and Polly directly from your own backend, because you already own the microphone, the UI, and the audio transport.
| Service | Job in the pipeline | Streaming? | Priced on | Reach for it when |
|---|---|---|---|---|
| Amazon Transcribe | Speech-to-text (ASR) | Yes — streaming + batch | Per second / minute of audio | You need live transcription in any custom pipeline |
| Amazon Bedrock | Reasoning / response generation | Yes — ConverseStream | Per 1K input/output tokens | The assistant must understand and answer open-endedly |
| Amazon Polly | Text-to-speech (synthesis) | Yes — streaming synthesis | Per million characters synthesized | You need natural spoken audio out |
| Amazon Lex | Dialog, intents & slots (STT+TTS built in) | Yes (voice streaming) | Per speech / text request | You have structured tasks (book, check, route) |
| Amazon Connect | Cloud contact center / telephony | Yes (real-time media) | Per minute of connected call | The surface is an actual phone number / IVR |
There is no single voice-AI architecture on AWS; there are two canonical ones, and they diverge on whether the entry point is a telephone or an application. Both share the same Transcribe → Bedrock → Polly core; they differ in what wraps that core and who manages the audio transport.
Whichever shape you pick, the data path is the same in spirit: audio in, transcript, a grounded model call, a text reply, audio out — looped. What changes is the orchestration layer and the telephony.
The caller dials a number provisioned by Amazon Connect. Connect's contact flow routes the call to an Amazon Lex bot, which handles the speech I/O (its built-in Transcribe + Polly), captures intents and slots, and — for any request its intent model cannot resolve on its own — invokes a fulfillment AWS Lambda. That Lambda calls Amazon Bedrock (typically with a Knowledge Base for grounding) to generate the answer, returns the text to Lex, and Lex speaks it via Polly. If the caller needs a human, Connect routes the call to an agent queue — and the agent can be assisted live by Amazon Q in Connect. This is the most managed path: AWS owns the telephony, the dialog runtime, and most of the STT/TTS, and you own a Lambda and a Knowledge Base.
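A minimal sketch of the fulfillment Lambda in this pattern, assuming a Lex V2 bot. `generate_answer` is a hypothetical stand-in for your Bedrock call (with Knowledge Base grounding), and the event and response fields follow the Lex V2 Lambda format as commonly documented, so verify them against the current Lex developer guide before relying on this.

```python
def generate_answer(transcript: str) -> str:
    """Hypothetical stand-in for the grounded Bedrock call; replace with your own."""
    return f"You asked: {transcript}"


def lambda_handler(event, context, answer_fn=generate_answer):
    """Lex V2 fulfillment: answer the caller's utterance and close the intent."""
    transcript = event.get("inputTranscript", "")
    # Copy the intent so the incoming event is not mutated.
    intent = dict(event["sessionState"]["intent"])
    intent["state"] = "Fulfilled"
    return {
        "sessionState": {
            "dialogAction": {"type": "Close"},
            "intent": intent,
        },
        # Lex speaks this message back to the caller.
        "messages": [
            {"contentType": "PlainText", "content": answer_fn(transcript)}
        ],
    }
```

Passing `answer_fn` as a parameter keeps the handler testable without AWS credentials; Lambda itself only supplies `event` and `context`.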
Your app captures microphone audio and streams it to your backend (commonly over a WebSocket). The backend opens an Amazon Transcribe streaming session and receives partial and final transcripts as the user speaks. When a final result arrives (Transcribe marks it with IsPartial set to false, and partial-results stabilization reduces how much the in-progress text changes before then), the backend calls Amazon Bedrock with ConverseStream, grounding the model on your data, and feeds the model's output tokens into Amazon Polly streaming synthesis as they arrive. The synthesized audio is streamed back to the app and played. You own the audio transport and the orchestration, which is more code than Pattern A but gives you full control over latency, UX, and where the assistant lives (web, mobile, kiosk, vehicle, embedded device).
Both patterns share three concerns. Grounding: a raw model will confidently invent answers, so production voice assistants retrieve from a Bedrock Knowledge Base (managed RAG over your documents) before answering — see also RAG on AWS. Safety: Bedrock Guardrails filter harmful content, block denied topics, and redact PII consistently, which matters more in voice because there is no UI to soften a bad answer. State: conversation history and session attributes (the caller's account, what they have already said) are typically kept in DynamoDB or in Lex/Connect session attributes so the model has context across turns. None of this is voice-specific, but all of it is non-negotiable for a voice assistant people will trust.
If the entry point is a phone number → Amazon Connect + Lex + a Bedrock Lambda (most managed). If the entry point is your app or device → call Transcribe streaming + Bedrock (ConverseStream) + Polly streaming directly from your backend (most control). Everything else — grounding via a Knowledge Base, Guardrails, session state — is shared.
A voice assistant lives or dies on response latency. In text chat a two-second wait is fine; in voice it is an awkward silence that makes the caller say "hello? are you there?" The target is for the user to hear a reply begin within about a second of finishing their turn, and ideally closer to half a second. Hitting that target is an exercise in streaming everything and overlapping the stages.
The naive pipeline runs the stages in series: wait for the full recording, transcribe it, send the whole transcript to the model, wait for the entire answer, synthesize all of it, then play. Each stage adds its full duration, and the delays stack into multiple seconds of dead air. The production pipeline instead streams and overlaps: Transcribe emits partial results while the user is still speaking, so the transcript is essentially ready the instant they stop; Bedrock's ConverseStream returns the first tokens of the answer a fraction of a second after the request; and those first tokens are fed straight into Polly streaming, so audio starts playing before the model has finished thinking. The user hears the beginning of the answer while its end is still being generated.
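One common way to overlap the model and the synthesizer is to buffer the streamed text only until a sentence boundary and hand each finished sentence to TTS immediately. The sketch below is an assumption about how you might do that, not a Bedrock or Polly feature: `sentence_chunks` is a hypothetical helper, and the boundary rule (split after `.`, `!` or `?` followed by whitespace) is deliberately naive and will mis-split abbreviations.

```python
import re
from typing import Iterable, Iterator

# Split after ., ! or ? when followed by whitespace (naive sentence boundary).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_chunks(deltas: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the streamed text contains them."""
    buffer = ""
    for delta in deltas:
        buffer += delta
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever is left when the model stops generating.
    if buffer.strip():
        yield buffer.strip()
```

Because this is a generator, the first sentence can be sent for synthesis while later tokens are still arriving, which is where the overlap comes from.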
A second lever is endpointing: deciding when the user has actually finished their turn. End the turn too eagerly and you cut people off mid-sentence; wait too long and the assistant feels slow to respond. In a custom pipeline, Transcribe streaming signals the end of a speech segment by returning a result that is no longer partial, and partial-results stabilization lets you act on the leading words sooner; Lex has built-in endpointing whose timeouts are tunable. A third lever is model choice: a smaller, faster model (Claude Haiku, Amazon Nova Lite/Micro, or Mistral) typically has much lower first-token latency than a frontier model, so many voice assistants route the common, simple turns to a small fast model and reserve a larger model for genuinely hard questions. This is the same routing pattern that controls cost on Amazon Bedrock generally. For the streaming mechanics of the model call itself, see Bedrock streaming & tool use.
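The routing idea can be as simple as a heuristic in front of the model call. The sketch below is illustrative only: the model IDs are placeholders, and the cue words and word-count threshold are assumptions you would replace with rules (or a classifier) tuned on your own traffic.

```python
import re

FAST_MODEL = "small-fast-model-id"   # placeholder, not a real model ID
LARGE_MODEL = "larger-model-id"      # placeholder, not a real model ID
HARD_HINTS = {"why", "compare", "explain", "difference"}  # assumed cue words


def pick_model(transcript: str, max_fast_words: int = 12) -> str:
    """Send short, simple turns to the fast model and the rest to the larger one."""
    words = re.findall(r"[a-z']+", transcript.lower())
    looks_hard = len(words) > max_fast_words or bool(HARD_HINTS & set(words))
    return LARGE_MODEL if looks_hard else FAST_MODEL
```

The returned value would be passed as `modelId` to the ConverseStream call, so routing adds no extra network hop.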
| Stage | Naive (serial) | Streaming (overlapped) | How the win happens |
|---|---|---|---|
| Speech-to-text | After user stops: ~0.5–2s | ≈ ready at end of speech | Partial results stream while the user talks |
| Reasoning (first token) | Full answer: ~1–4s | First token: ~0.3–0.8s | ConverseStream + a small fast model for easy turns |
| Text-to-speech | Whole reply: ~0.3–1s | First audio: ~0.1–0.3s | Polly streaming starts on the first tokens |
| Perceived turn latency | ~2–7s of dead air | ~0.5–1s to first audio | Stages overlap instead of stacking |
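
The last row of the table is simply the sum of the rows above it. As a quick check of that arithmetic, the sketch below adds the low and high ends of each stage, treating streaming speech-to-text as roughly zero added delay (an assumption standing in for "ready at end of speech").

```python
# (low, high) seconds per stage, taken from the table above.
NAIVE = {
    "stt_after_speech": (0.5, 2.0),
    "llm_full_answer": (1.0, 4.0),
    "tts_whole_reply": (0.3, 1.0),
}
STREAMING = {
    "stt_after_speech": (0.0, 0.0),   # assumed ~0: transcript ready at end of speech
    "llm_first_token": (0.3, 0.8),
    "tts_first_audio": (0.1, 0.3),
}


def time_to_first_audio(stages):
    """Sum the low and high bounds of every stage on the critical path."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return round(low, 2), round(high, 2)
```

Summing the naive column gives 1.8 to 7.0 seconds and the streaming column 0.4 to 1.1 seconds, which the table rounds to "~2–7s" and "~0.5–1s".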
The difference between a real assistant and a hold message is whether you can interrupt it. Barge-in — the caller starting to speak while the assistant is mid-sentence, and the assistant stopping to listen — is the single feature that makes a voice bot feel human, and it falls out of the streaming architecture rather than being a separate product.
Mechanically, barge-in means running the input and output paths concurrently: even while Polly audio is playing, the microphone stream stays open and Transcribe keeps listening. When inbound speech is detected above a threshold, the system stops playback immediately, cancels the in-flight model generation if appropriate, and treats the new audio as the next turn. In Amazon Lex and Amazon Connect, barge-in is a configurable setting you enable on prompts. In a custom Pattern-B build you implement it yourself — keep the Transcribe stream alive during synthesis, and wire a "voice detected" event to halt the Polly playback buffer. Either way it depends on the same full-duplex, streaming foundation as low latency.
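For a custom build, the barge-in logic reduces to a small state machine. The sketch below assumes you already have some voice-activity signal and two callbacks of your own (`stop_playback` and `cancel_generation` are hypothetical names); it shows only the control flow, not audio handling.

```python
class TurnController:
    """Tracks whether the assistant is speaking and handles interruptions."""

    def __init__(self, stop_playback, cancel_generation):
        self._stop_playback = stop_playback          # e.g. flush the audio buffer
        self._cancel_generation = cancel_generation  # e.g. stop reading the model stream
        self.state = "listening"

    def start_speaking(self):
        self.state = "speaking"

    def finish_speaking(self):
        self.state = "listening"

    def on_voice_detected(self) -> bool:
        """Call when inbound speech crosses your detection threshold."""
        if self.state != "speaking":
            return False  # nothing to interrupt; this is just the user's turn
        self._stop_playback()
        self._cancel_generation()
        self.state = "listening"
        return True
```

Keeping the controller free of audio and network code makes the interruption path easy to test in isolation before wiring it to real streams.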
Beyond barge-in, a few conversational details separate good from grating. Confirmation and grounding for risky actions: a voice assistant that transfers money or cancels an order should read back the request and confirm, because there is no screen to double-check. Graceful fallback: when the model is unsure or the request is out of scope, the assistant should say so and offer a human, not hallucinate — this is where routing a "fallback intent" to a human via Amazon Connect matters. Short, speakable answers: text that reads fine on screen is exhausting to hear, so voice system prompts instruct the model to answer in one or two concise sentences and offer detail on request. SSML: Polly supports Speech Synthesis Markup Language for pauses, emphasis, pronunciation, and numbers/dates read naturally, which materially improves perceived quality. Latency masking: a brief, natural filler ("let me check that for you") while a slow tool call runs keeps the line from going silent.
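As a small illustration of the SSML point, the sketch below wraps sentences in a `<speak>` element with a short `<break>` between them and escapes characters that would otherwise break the markup. The helper name and the 250 ms default are assumptions; the result would be sent to Polly with `TextType="ssml"`.

```python
from xml.sax.saxutils import escape


def to_ssml(sentences, pause_ms=250):
    """Join sentences into one SSML document with a short pause between them."""
    pause = f'<break time="{pause_ms}ms"/>'
    # Escape &, < and > so model output cannot produce invalid SSML.
    body = pause.join(escape(sentence) for sentence in sentences)
    return f"<speak>{body}</speak>"
```

Escaping matters more than it looks: an unescaped ampersand in a model reply is enough to make the synthesis request fail.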
Standing up a first voice assistant on AWS is a matter of days, not quarters, especially with the Connect + Lex path. Below is the build sequence for the more general custom pipeline (Pattern B), with notes on where the managed Pattern A short-circuits the work. None of it requires provisioning a GPU.
The order matters: get each stage working and measured on its own before you chain them, because a latency or quality problem is far easier to localize in isolation than in the full loop.
In your chosen Region, enable Amazon Bedrock model access for one workhorse model (e.g. Claude Sonnet) and a small fast model (e.g. a Nova or Claude Haiku tier) for easy turns; Transcribe and Polly need no enablement. Grant least-privilege IAM permissions to your backend role: transcribe:StartStreamTranscription, bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream (the IAM actions that authorize Converse and ConverseStream respectively), and polly:SynthesizeSpeech. Pick the Region for data residency and latency, and verify your chosen models and voices exist there.
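A least-privilege policy for the backend role might look like the sketch below, generated in Python so the model ARNs stay in one place. Treat it as a starting point under stated assumptions: Converse and ConverseStream are authorized through the `bedrock:InvokeModel` and `bedrock:InvokeModelWithResponseStream` actions, the Transcribe and Polly statements use `"*"` as the resource, and models reached through inference profiles or a direct WebSocket connection to Transcribe may need additional entries.

```python
import json


def backend_policy(region: str, model_ids: list[str]) -> dict:
    """Build an IAM policy document for the voice backend role."""
    model_arns = [
        f"arn:aws:bedrock:{region}::foundation-model/{model_id}"
        for model_id in model_ids
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "StreamingTranscription", "Effect": "Allow",
             "Action": "transcribe:StartStreamTranscription", "Resource": "*"},
            {"Sid": "ModelInvocation", "Effect": "Allow",
             "Action": ["bedrock:InvokeModel",
                        "bedrock:InvokeModelWithResponseStream"],
             "Resource": model_arns},
            {"Sid": "SpeechSynthesis", "Effect": "Allow",
             "Action": "polly:SynthesizeSpeech", "Resource": "*"},
        ],
    }


def policy_json(region: str, model_ids: list[str]) -> str:
    """Serialize the policy for pasting into IAM or an infrastructure template."""
    return json.dumps(backend_policy(region, model_ids), indent=2)
```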
Open an Amazon Transcribe streaming session from your backend and pipe microphone audio in (typically PCM over a WebSocket from the client). Render partial results so you can see recognition working, and tune endpointing so turns end naturally. Confirm the transcript is essentially ready the instant the user stops talking before moving on.
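Whichever client library you use for the streaming session, each transcript event carries a list of results flagged as partial or final. The sketch below works on the event as a plain dictionary and assumes the documented `Transcript` → `Results` → `Alternatives` shape with an `IsPartial` flag; `final_utterances` is a hypothetical helper, and the exact object types depend on your SDK.

```python
def final_utterances(transcript_event: dict) -> list[str]:
    """Return the text of every finalized result in one Transcribe streaming event."""
    finals = []
    for result in transcript_event.get("Transcript", {}).get("Results", []):
        # Treat a missing flag as partial, so we never act on unstable text.
        if result.get("IsPartial", True):
            continue
        alternatives = result.get("Alternatives", [])
        if alternatives and alternatives[0].get("Transcript"):
            finals.append(alternatives[0]["Transcript"])
    return finals
```

Partial results are still worth rendering in the UI; only the finalized text should trigger the model call.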
Create a Bedrock Knowledge Base over your documents in S3 so the assistant answers from your content. Then call ConverseStream with a voice-tuned system prompt ("answer in one or two short, spoken sentences; if unsure, say so and offer a human"), the retrieved context, and the conversation history. Wrap the call in Guardrails. Measure time-to-first-token and, if it is too slow for common turns, route those to a smaller model.
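The pieces of this step can be assembled into a single request before calling ConverseStream. The sketch below assumes the passages have already been retrieved from your Knowledge Base as plain strings; `build_converse_request` and `VOICE_PROMPT` are hypothetical names, and the `guardrailConfig` keys should be checked against the current Bedrock API reference.

```python
VOICE_PROMPT = (
    "Answer in one or two short, spoken sentences. "
    "If unsure, say so and offer a human."
)


def build_converse_request(model_id, transcript, passages, history=(),
                           guardrail_id=None, guardrail_version=None):
    """Assemble keyword arguments for bedrock-runtime converse_stream."""
    context = "\n\n".join(passages) if passages else "No context retrieved."
    request = {
        "modelId": model_id,
        "system": [
            {"text": VOICE_PROMPT},
            {"text": f"Answer only from this context:\n{context}"},
        ],
        # Earlier turns first, then the new user utterance.
        "messages": [*history,
                     {"role": "user", "content": [{"text": transcript}]}],
        "inferenceConfig": {"maxTokens": 256, "temperature": 0.2},
    }
    if guardrail_id and guardrail_version:
        request["guardrailConfig"] = {
            "guardrailIdentifier": guardrail_id,
            "guardrailVersion": guardrail_version,
        }
    return request
```

With a `bedrock-runtime` client named `brt`, the call would then be `brt.converse_stream(**build_converse_request(...))`, which keeps prompt assembly testable without credentials.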
Feed the model's streaming text into Amazon Polly streaming synthesis, choosing a neural or generative voice and applying SSML for natural pacing and pronunciation. Stream the audio back to the client and start playback on the first chunk. You now have a full turn — measure end-to-end "time to first audio."
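A minimal sketch of the synthesis side, assuming a boto3 Polly client is passed in and that you synthesize one short piece of text (for example a sentence) per call. `synthesize_chunks` is a hypothetical helper; the voice, engine and sample rate are example values to verify for your Region.

```python
def synthesize_chunks(polly, text, voice_id="Joanna", chunk_size=4096):
    """Yield PCM audio chunks for one piece of text as Polly returns them."""
    response = polly.synthesize_speech(
        Text=text,
        TextType="text",        # use "ssml" when sending SSML markup
        VoiceId=voice_id,       # example voice; confirm availability
        Engine="neural",
        OutputFormat="pcm",
        SampleRate="16000",
    )
    audio = response["AudioStream"]
    # Read in small chunks so playback can start before the stream ends.
    while True:
        chunk = audio.read(chunk_size)
        if not chunk:
            break
        yield chunk
```

In use, you would create the client with `boto3.client("polly")` and forward each yielded chunk to the client connection as soon as it arrives.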
Keep the Transcribe stream open during playback and wire detected speech to stop Polly immediately (barge-in). Persist conversation history and session attributes (DynamoDB or Lex/Connect session state). Add a fallback path that hands off to a human when confidence is low. For the contact-center surface, this is also where you build the Amazon Connect contact flow and the Amazon Lex bot. Lex provides the speech-to-text and text-to-speech stages (the Transcribe and Polly steps above) and the dialog runtime out of the box, so Pattern A collapses much of the above into "configure Lex, write the Bedrock fulfillment Lambda."
```python
import boto3

# Transcript already produced by Transcribe streaming (placeholder value here).
transcript = "What time do you open on Saturday?"


def polly_feed(text: str) -> None:
    """Placeholder: hand a text delta to your Polly synthesis queue."""
    print(text, end="", flush=True)


brt = boto3.client("bedrock-runtime", region_name="us-east-1")
stream = brt.converse_stream(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example fast model; confirm the current ID
    system=[{"text": "Answer in one or two short spoken sentences. If unsure, offer a human."}],
    messages=[{"role": "user", "content": [{"text": transcript}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.2},
)

# Feed each text delta straight into Polly streaming as it arrives.
for event in stream["stream"]:
    delta = event.get("contentBlockDelta", {}).get("delta", {}).get("text")
    if delta:
        polly_feed(delta)
```
Model IDs and API shapes are illustrative — copy current IDs from the Bedrock console and check the latest SDK docs.
A voice assistant's bill is the sum of several metered services, and the units are different for each: audio is billed per minute (Transcribe, Connect), synthesized speech per character (Polly), and reasoning per token (Bedrock). Understanding which line item dominates tells you where to optimize. The shares below are representative as of 2026 and are meant to show relative scale; always confirm current rates on each AWS pricing page.
The useful mental model is "cost per conversation minute." For a typical assistant, the Bedrock reasoning line is usually the largest and most variable — it scales with how much context you send (system prompt + retrieved passages + history) and which model you use, which is exactly why grounding efficiency, prompt caching, and model routing matter. Transcribe bills per second/minute of audio processed; streaming is the relevant tier. Polly bills per million characters synthesized, so it scales with how much the assistant talks (another reason to keep answers short). If you use Amazon Connect, you add per-minute telephony, and Amazon Lex bills per speech/text request.
The cost levers are familiar from Bedrock generally: route easy turns to a small fast model and reserve a frontier model for hard ones (this cuts both cost and latency at once); turn on prompt caching so a large, stable system prompt and tool schema are not re-billed at full price every single turn — a big win for chatty voice loops; keep retrieved context tight so you are not paying to re-send a whole manual on every utterance; and keep spoken answers short, which lowers both Bedrock output tokens and Polly characters. For the deep dives see Bedrock pricing and the Bedrock pricing calculator.
| Component | Billed per | Typical share of bill | Biggest cost lever |
|---|---|---|---|
| Bedrock reasoning | 1K input / output tokens | Often the largest | Model routing + prompt caching + tight RAG context |
| Amazon Transcribe | Second / minute of audio (streaming) | Moderate | Only stream while the caller is actually talking |
| Amazon Polly | Million characters synthesized | Low–moderate | Short spoken answers; cache repeated prompts/audio |
| Amazon Lex (if used) | Speech / text request | Low | Handle structured turns in Lex; escalate only when needed |
| Amazon Connect (if used) | Minute of connected call | Telephony baseline | Resolve in self-service before routing to a paid agent |
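
The "cost per conversation minute" model is straightforward to turn into a calculator. The sketch below is arithmetic only: the `Rates` fields mirror the billing units in the table, and every number you pass in is your own input. No real AWS prices are embedded, so take the rates from the current pricing pages.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Unit prices in your currency; fill these in from the AWS pricing pages."""
    transcribe_per_min: float
    polly_per_million_chars: float
    bedrock_in_per_1k: float
    bedrock_out_per_1k: float


def cost_per_conversation_minute(rates, audio_min, chars_spoken,
                                 tokens_in, tokens_out):
    """Break one minute of conversation down by service."""
    costs = {
        "transcribe": rates.transcribe_per_min * audio_min,
        "polly": rates.polly_per_million_chars * chars_spoken / 1_000_000,
        "bedrock": (rates.bedrock_in_per_1k * tokens_in / 1000
                    + rates.bedrock_out_per_1k * tokens_out / 1000),
    }
    costs["total"] = sum(costs.values())
    return costs
```

Running it with your measured tokens per turn and characters spoken per minute shows quickly whether reasoning or audio dominates, and therefore which lever in the table is worth pulling first.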
Voice AI on AWS earns its keep wherever a conversation is the natural interface and the alternative is either a frustrating phone tree or a human doing repetitive work. The two anchor use cases are the contact center (IVR) and in-product voice, but the same Transcribe → Bedrock → Polly core powers a wider range.
The flagship use case is the intelligent IVR / contact-center assistant. Traditional phone trees ("press 1 for billing") are slow and rigid; a Bedrock-backed Amazon Connect + Lex assistant lets callers say what they want in their own words, answers grounded questions from a Knowledge Base, completes routine tasks (check an order, reset a password, book an appointment), and hands off to a human — with full context — only when needed. This deflects a large share of calls from human agents while improving the caller experience, which is why it is the most common production deployment.
The second is in-app and on-device voice: a voice assistant embedded in a mobile app, web app, kiosk, vehicle, or hardware device, built on the direct Transcribe + Bedrock + Polly pipeline. Here voice is a feature of your product rather than a phone line — hands-free control, accessibility, a conversational help agent, or a voice-driven workflow. Beyond these two, the same building blocks power outbound voice (reminders, confirmations), real-time agent assist (transcribe a live human call and surface answers, via Amazon Q in Connect), voice analytics (transcribe and summarize calls for QA and insight), and multilingual support (Transcribe and Polly cover dozens of languages, so one architecture serves many markets).
The first real decision is not which model but which architecture. Amazon Connect + Lex is the managed, telephony-first path; a direct Transcribe + Bedrock + Polly pipeline is the build-it-yourself, control-first path. This is the head-to-head; both share the same reasoning core.
| Dimension | Connect + Lex + Bedrock (managed IVR) | Direct Transcribe + Bedrock + Polly (custom) |
|---|---|---|
| Best surface | Phone line / contact center / IVR | In-app, web, mobile, kiosk, device |
| Telephony | Built in (Amazon Connect) | You provide the audio transport |
| STT + TTS | Built into Lex (Transcribe + Polly under the hood) | You call Transcribe & Polly directly |
| Dialog management | Lex intents, slots, flows out of the box | You orchestrate turns yourself |
| Reasoning | Bedrock via a fulfillment Lambda | Bedrock ConverseStream from your backend |
| Latency control | Good; tuned via Lex/Connect settings | Maximum — you own every stage |
| Build effort | Lower (configure + one Lambda) | Higher (own the full pipeline) |
| Human handoff | Native (Connect agent queues + Q in Connect) | You build the escalation path |
Situation: Clinic front desks were drowning in inbound appointment calls, and the company's rigid touch-tone IVR deflected almost nothing — most callers mashed 0 for a human. They wanted a voice assistant that could understand natural requests ("I need to move my Thursday appointment"), answer from their own scheduling rules, and complete or reschedule bookings, while staying within their HIPAA-eligible posture and handing complex cases to staff. They had no speech or ML engineers and no appetite to run inference infrastructure.
What CloudRoute did: Routed within 22 hours to a US-East AWS partner with a healthcare + Amazon Connect track record. The partner built it on the managed path: Amazon Connect for telephony, Amazon Lex for the dialog and built-in speech I/O, and a Bedrock fulfillment Lambda calling Claude (grounded on a Bedrock Knowledge Base over the clinic policies and the scheduling API) with Guardrails for PII redaction and denied topics. Barge-in was enabled, easy turns were routed to a fast model for sub-second responses, and low-confidence calls fell back to a human agent queue with full context. In parallel the partner filed a Bedrock/GenAI proof-of-concept credit application and an Activate Portfolio application.
Outcome: GenAI POC credits ($25K) approved in under 2 weeks and Portfolio ($100K) shortly after, so the first ~6 months of Transcribe + Bedrock + Polly + Connect usage were effectively credit-funded. A grounded voice assistant went live in 6 weeks, deflected a large share of routine scheduling calls, kept all data in-Region under the HIPAA-eligible services, and handed edge cases to staff cleanly. CloudRoute's commission was paid by the partner from AWS engagement funding; the customer paid $0.
time-to-match: < 24h · credits secured: $125K · go-live: ~6 weeks · cost to customer: $0