
voice ai on aws · the 2026 builder guide

How to add a voice AI assistant on AWS — Transcribe → Bedrock → Polly.

Q: How do I add a voice AI assistant on AWS?

Build a three-stage streaming loop: Amazon Transcribe converts the caller's speech to text in real time, Amazon Bedrock runs a foundation model (such as Claude or Amazon Nova) to reason over the transcript — usually grounded on your own data via a Bedrock Knowledge Base — and Amazon Polly converts the reply back to natural speech. For a phone-line / IVR experience, use Amazon Connect plus Amazon Lex (which has Transcribe and Polly built in) with a Bedrock fulfillment Lambda; for an in-app or on-device assistant, call Transcribe streaming, Bedrock ConverseStream, and Polly streaming directly from your backend. Stream at every stage so the turn lands under about a second.

Q: Which AWS services do I need for a voice assistant?

The core three are Amazon Transcribe (speech-to-text), Amazon Bedrock (reasoning / response generation, typically with a Knowledge Base for grounding and Guardrails for safety), and Amazon Polly (text-to-speech). Add Amazon Lex for structured dialog (intents and slots) and Amazon Connect when the assistant lives on a phone number — Connect provides the telephony, routing, and human handoff, and integrates Lex natively.

Q: What is the difference between Amazon Lex and using Bedrock directly for voice?

Amazon Lex is a managed dialog system (the technology behind Alexa) with built-in speech recognition and synthesis and a model for intents and slots — great for structured tasks like booking or status checks. Amazon Bedrock provides open-ended foundation-model reasoning. They are complementary: a common pattern uses Lex for the structured turns and routes anything Lex cannot resolve (the fallback / "anything else" intent) to a Bedrock-backed Lambda for a free-form, grounded answer. For an in-app assistant you may skip Lex and call Bedrock directly, handling dialog yourself.

Q: How do I make the voice assistant fast enough to feel natural?

Stream and overlap every stage. Use Amazon Transcribe streaming so the transcript is ready the moment the user stops talking, call Bedrock with ConverseStream so the first tokens arrive in a fraction of a second, and feed those tokens into Amazon Polly streaming so audio starts before the full answer exists. Also route common, simple turns to a smaller, faster model (Claude Haiku, Amazon Nova Lite/Micro, or Mistral) for low first-token latency, and tune endpointing so turns end naturally. The target is well under a second of perceived delay; measure end-to-end "time to first audio."

Q: What is barge-in and how do I implement it?

Barge-in is letting the caller interrupt the assistant mid-sentence — when they start talking, the assistant stops speaking and listens. In Amazon Lex and Amazon Connect it is a configurable setting on prompts. In a custom pipeline you implement it by keeping the microphone and Transcribe stream open while Polly audio plays, then halting playback (and optionally cancelling the in-flight Bedrock generation) the moment inbound speech is detected. It depends on the same full-duplex, streaming design as low latency.

Q: How much does a voice AI assistant on AWS cost?

You pay per metered service with different units: Amazon Transcribe per second/minute of audio (streaming tier), Amazon Bedrock per 1,000 input/output tokens, and Amazon Polly per million characters synthesized; add Amazon Connect per connected-call minute and Amazon Lex per request if you use them. Bedrock reasoning is usually the largest and most variable line. The big cost levers are routing easy turns to a small model, enabling prompt caching so a large stable system prompt is not re-billed every turn, keeping retrieved context tight, and keeping spoken answers short. Rates change — confirm current figures on each AWS pricing page.

Q: Is a voice assistant on AWS secure and compliant enough for regulated industries?

Yes, within the usual AWS controls. Bedrock does not use your prompts or outputs to train the base models and processes them in the AWS Region you call; data is encrypted in transit and at rest, traffic can stay off the public internet via VPC endpoints, access is governed by IAM with CloudTrail audit logging, and Guardrails redact PII. Transcribe, Polly, Bedrock, and Connect are included in AWS compliance programs (commonly SOC, ISO 27001, HIPAA eligibility, and PCI DSS depending on Region) — confirm the current scope for your Region and service in AWS Artifact. This is why regulated buyers (healthcare, financial services, public sector) can deploy voice AI on AWS.

Q: Can AWS credits cover the cost of building voice AI?

Yes. The same GenAI credit programs that fund Bedrock builds apply to a voice assistant, since Bedrock is the reasoning core: Activate Portfolio (up to $100K) for institutionally-funded startups, Bedrock / GenAI proof-of-concept funding ($10K–$50K) for a defined build, and the competitive Generative AI Accelerator (up to $1M). These pools are largely partner-filed and invisible on the public Activate page. CloudRoute routes you to a vetted AWS partner who files the credit application and, if you need hands, builds the Transcribe + Bedrock + Polly workload with you — AWS funds the credits and the engagement, so you pay $0.

A voice assistant is three jobs in a loop: turn speech into text, reason over it, and turn the answer back into speech. On AWS that maps cleanly onto Amazon Transcribe (speech-to-text), Amazon Bedrock (the reasoning model), and Amazon Polly (text-to-speech), with Amazon Lex for structured dialog and Amazon Connect when the surface is a phone line. This guide covers the reference architecture, the latency budget that decides whether the assistant feels alive or laggy, streaming and barge-in, a step-by-step build, the real cost model, and the use cases where it fits: IVR, in-app voice, and beyond.



TL;DR

A voice AI assistant on AWS is a pipeline: Amazon Transcribe converts the caller's speech to text in real time, Amazon Bedrock runs the reasoning (a foundation model such as Claude or Amazon Nova, usually grounded on your own data via a Knowledge Base), and Amazon Polly converts the model's reply back to natural speech. Amazon Lex adds structured dialog and intent handling; Amazon Connect is the managed contact center that puts the whole thing on a phone number.
The hard part is not wiring the three services together — it is latency. A turn that feels natural lands under ~1 second of perceived delay, which means you must stream at every stage: streaming transcription (partial results as the user talks), a streaming model response (first tokens fed to TTS before the full answer exists), and streaming audio synthesis. Barge-in — letting the caller interrupt the assistant mid-sentence — is what separates a real assistant from a recording, and it depends on that streaming design.
Two patterns dominate. For a contact-center / IVR experience, Amazon Connect + Amazon Lex (with a Bedrock-backed fulfillment Lambda) is the managed path. For an in-app or device voice feature, you call Transcribe streaming, Bedrock, and Polly directly from your own backend. GenAI voice bills scale with minutes of audio and tokens of reasoning — CloudRoute routes you to AWS credits (Activate Portfolio up to $100K, Bedrock/GenAI POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted partner to build it; you pay $0.

the core idea

I. The anatomy of a voice assistant: three jobs in a loop

Strip away the branding and every voice assistant, from a smart speaker to a bank's phone line, is the same loop: listen, think, speak — then listen again. On AWS each of those three jobs is a managed service, and the engineering challenge is making the hand-offs between them fast enough that the loop feels like a conversation rather than a walkie-talkie.

The first job is speech-to-text (STT), also called automatic speech recognition (ASR): converting the audio of someone speaking into a transcript your software can work with. On AWS this is Amazon Transcribe. The critical distinction is batch versus streaming — batch transcription processes a finished recording, while streaming transcription returns words as the person is still talking, which is the only mode that works for a live assistant.

The second job is reasoning: deciding what to say back. This is where the intelligence lives, and on AWS it is Amazon Bedrock — a managed API to foundation models like Anthropic Claude, Amazon Nova, Meta Llama, and Mistral. The model takes the transcript (plus conversation history, system instructions, and usually retrieved context from your own data) and produces a reply. For anything beyond chit-chat you ground the model on your knowledge with a Bedrock Knowledge Base so it answers from your policies, catalog, or account data rather than from generic training.

The third job is text-to-speech (TTS), also called speech synthesis: turning the model's text reply back into natural-sounding audio. On AWS this is Amazon Polly, whose neural and generative voices sound far closer to a human than the robotic TTS of a decade ago. Like Transcribe, Polly can stream its output, so audio starts playing before the whole sentence has been synthesized.

Two more services sit around this core. Amazon Lex is the dialog-management layer — the same technology that powers Alexa — handling intents, slots (the pieces of information you need to collect), and the back-and-forth flow; it has Transcribe and Polly built in, so for many bots Lex is the STT+TTS+dialog layer and you only add Bedrock for open-ended reasoning. Amazon Connect is the cloud contact center: it gives you a phone number, call routing, queues, and agents, and it integrates Lex natively so a caller can talk to your bot and be handed to a human seamlessly. Which of these you reach for depends on the surface — covered in section II.

the one-sentence definition

A voice AI assistant on AWS = Amazon Transcribe (speech → text) → Amazon Bedrock (a foundation model reasons, grounded on your data) → Amazon Polly (text → speech), with Amazon Lex for structured dialog and Amazon Connect when it lives on a phone line — all serverless, all streaming, with the whole turn engineered to land under ~1 second.

every aws service in the pipeline

II. The AWS building blocks: what each service does

Five AWS services do the work, and choosing the right subset is most of the architecture decision. The table below maps each to its job, its streaming story (the property that makes or breaks a live assistant), and roughly how it is priced — so you can see at a glance which services your specific surface needs.

Read the table with one question in mind: am I building for a phone line or for an app/device? A phone-line / IVR experience leans on Amazon Connect and Amazon Lex, which bundle telephony, dialog, and the STT/TTS plumbing, and you add Bedrock for open-ended answers. An in-app or on-device voice feature usually skips Connect and Lex and calls Transcribe streaming, Bedrock, and Polly directly from your own backend, because you already own the microphone, the UI, and the audio transport.

aws voice-AI building blocks · job, streaming support, and pricing basis · representative as of 2026 — check each service's pricing page

Service	Job in the pipeline	Streaming?	Priced on	Reach for it when
Amazon Transcribe	Speech-to-text (ASR)	Yes — streaming + batch	Per second / minute of audio	You need live transcription in any custom pipeline
Amazon Bedrock	Reasoning / response generation	Yes — ConverseStream	Per 1K input/output tokens	The assistant must understand and answer open-endedly
Amazon Polly	Text-to-speech (synthesis)	Yes — streaming synthesis	Per million characters synthesized	You need natural spoken audio out
Amazon Lex	Dialog, intents & slots (STT+TTS built in)	Yes (voice streaming)	Per speech / text request	You have structured tasks (book, check, route)
Amazon Connect	Cloud contact center / telephony	Yes (real-time media)	Per minute of connected call	The surface is an actual phone number / IVR

Lex has Transcribe and Polly built in, so a Lex voice bot already does STT, dialog, and TTS — you add Bedrock only for the open-ended reasoning Lex's intent model does not cover. A common 2026 pattern is "Lex for the structured turns, Bedrock for everything else," with Lex routing fallback/anything-else intents to a Bedrock-backed Lambda. Pricing dimensions differ per service and change over time; confirm current rates on each AWS pricing page.

how the pieces fit

III. The reference architecture: two shapes for two surfaces

There is no single voice-AI architecture on AWS; there are two canonical ones, and they diverge on whether the entry point is a telephone or an application. Both share the same Transcribe → Bedrock → Polly core; they differ in what wraps that core and who manages the audio transport.

Whichever shape you pick, the data path is the same in spirit: audio in, transcript, a grounded model call, a text reply, audio out — looped. What changes is the orchestration layer and the telephony.

Pattern A — Contact center / IVR (Amazon Connect + Lex + Bedrock)

The caller dials a number provisioned by Amazon Connect. Connect's contact flow routes the call to an Amazon Lex bot, which handles the speech I/O (its built-in Transcribe + Polly), captures intents and slots, and — for any request its intent model cannot resolve on its own — invokes a fulfillment AWS Lambda. That Lambda calls Amazon Bedrock (typically with a Knowledge Base for grounding) to generate the answer, returns the text to Lex, and Lex speaks it via Polly. If the caller needs a human, Connect routes the call to an agent queue — and the agent can be assisted live by Amazon Q in Connect. This is the most managed path: AWS owns the telephony, the dialog runtime, and most of the STT/TTS, and you own a Lambda and a Knowledge Base.
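The fulfillment step in Pattern A can be sketched as a small Lambda handler. This is a minimal sketch, assuming the Lex V2 event and response shapes (`inputTranscript`, `sessionState`, `messages`); `generate_answer` is a hypothetical stand-in for the Bedrock call and is injected so the handler can be exercised without AWS access.

```python
def generate_answer(question: str) -> str:
    """Hypothetical stand-in for the Bedrock call; replace with a real model call."""
    return f"Here is what I found about: {question}"


def lambda_handler(event, context, answer_fn=generate_answer):
    """Answer the caller's utterance and close the current Lex intent."""
    intent = event["sessionState"]["intent"]
    question = event.get("inputTranscript", "")
    reply = answer_fn(question)
    return {
        "sessionState": {
            # "Close" tells Lex the turn is finished and the reply should be spoken.
            "dialogAction": {"type": "Close"},
            "intent": {"name": intent["name"], "state": "Fulfilled"},
        },
        "messages": [{"contentType": "PlainText", "content": reply}],
    }
```

Lex speaks the returned `messages` content with its built-in synthesis, so the Lambda only deals in text.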

Pattern B — In-app / on-device voice (direct Transcribe + Bedrock + Polly)

Your app captures microphone audio and streams it to your backend (commonly over a WebSocket). The backend opens an Amazon Transcribe streaming session and receives partial and final transcripts as the user speaks. On a final utterance (detected via Transcribe's endpointing / partial-stabilization), the backend calls Amazon Bedrock with ConverseStream, grounding the model on your data, and feeds the model's output tokens — as they stream — into Amazon Polly streaming synthesis. The synthesized audio is streamed back to the app and played. You own the audio transport and the orchestration, which is more code than Pattern A but gives you full control over latency, UX, and where the assistant lives (web, mobile, kiosk, vehicle, embedded device).

What stays constant: grounding, guardrails, and state

Both patterns share three concerns. Grounding: a raw model will confidently invent answers, so production voice assistants retrieve from a Bedrock Knowledge Base (managed RAG over your documents) before answering — see also RAG on AWS. Safety: Bedrock Guardrails filter harmful content, block denied topics, and redact PII consistently, which matters more in voice because there is no UI to soften a bad answer. State: conversation history and session attributes (the caller's account, what they have already said) are typically kept in DynamoDB or in Lex/Connect session attributes so the model has context across turns. None of this is voice-specific, but all of it is non-negotiable for a voice assistant people will trust.
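The state concern can be illustrated with a short sketch that turns stored turns into the message list a Converse-style call expects. It assumes history is kept as (user, assistant) text pairs; where those pairs are stored (DynamoDB, session attributes) is left out, and the `max_turns` window is an illustrative choice, not a recommended value.

```python
def build_messages(history, transcript, max_turns=6):
    """Build an alternating user/assistant message list ending with the new utterance.

    history: list of (user_text, assistant_text) pairs, oldest first.
    """
    recent = history[-max_turns:] if max_turns > 0 else []
    messages = []
    for user_text, assistant_text in recent:
        messages.append({"role": "user", "content": [{"text": user_text}]})
        messages.append({"role": "assistant", "content": [{"text": assistant_text}]})
    # The newest caller utterance always goes last.
    messages.append({"role": "user", "content": [{"text": transcript}]})
    return messages
```

Capping the window keeps the token count per turn bounded, which matters for both latency and cost in a long call.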

the decision in one line

If the entry point is a phone number → Amazon Connect + Lex + a Bedrock Lambda (most managed). If the entry point is your app or device → call Transcribe streaming + Bedrock (ConverseStream) + Polly streaming directly from your backend (most control). Everything else — grounding via a Knowledge Base, Guardrails, session state — is shared.

the property that decides everything

IV. The latency budget: why streaming is non-negotiable

A voice assistant lives or dies on response latency. In text chat a two-second wait is fine; in voice it is an awkward silence that makes the caller say "hello? are you there?" The target is that the user perceives a reply beginning in well under a second after they stop talking. Hitting that target is an exercise in streaming everything and overlapping the stages.

The naive pipeline runs the stages in series: wait for the full recording, transcribe it, send the whole transcript to the model, wait for the entire answer, synthesize all of it, then play. Each stage adds its full duration, and the delays stack into multiple seconds of dead air. The production pipeline instead streams and overlaps: Transcribe emits partial results while the user is still speaking, so the transcript is essentially ready the instant they stop; Bedrock's ConverseStream returns the first tokens of the answer a fraction of a second after the request; and those first tokens are fed straight into Polly streaming, so audio starts playing before the model has finished thinking. The user hears the beginning of the answer while its end is still being generated.

A second lever is endpointing — deciding when the user has actually finished their turn. End the turn too eagerly and you cut people off mid-sentence; wait too long and the assistant feels slow to respond. Transcribe's streaming partial-results stabilization and Lex's built-in endpointing handle this, and the timeout is tunable. A third lever is model choice: a smaller, faster model (Claude Haiku, Amazon Nova Lite/Micro, or Mistral) has dramatically lower first-token latency than a frontier model, so many voice assistants route the common, simple turns to a small fast model and reserve a larger model for genuinely hard questions — the same routing pattern that controls cost on Amazon Bedrock generally. For the streaming mechanics of the model call itself, see Bedrock streaming & tool use.
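The model-routing lever can be sketched as a tiny function. The heuristic here (utterance length plus a few marker words) is purely an assumption for illustration; real routers are often tuned on logged traffic. The model IDs are passed in by the caller rather than hard-coded.

```python
# Hypothetical marker words that suggest a harder, more open-ended question.
HARD_MARKERS = ("why", "explain", "compare", "difference")


def choose_model(transcript, small_model_id, large_model_id, max_small_words=12):
    """Send short, simple turns to the small model and everything else to the large one."""
    words = transcript.lower().split()
    looks_hard = len(words) > max_small_words or any(
        word.strip(".,?!") in HARD_MARKERS for word in words
    )
    return large_model_id if looks_hard else small_model_id
```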

illustrative per-stage latency · naive (serial) vs streaming (overlapped) · representative figures, not guarantees

Stage	Naive (serial)	Streaming (overlapped)	How the win happens
Speech-to-text	After user stops: ~0.5–2s	≈ ready at end of speech	Partial results stream while the user talks
Reasoning (first token)	Full answer: ~1–4s	First token: ~0.3–0.8s	ConverseStream + a small fast model for easy turns
Text-to-speech	Whole reply: ~0.3–1s	First audio: ~0.1–0.3s	Polly streaming starts on the first tokens
Perceived turn latency	~2–7s of dead air	~0.5–1s to first audio	Stages overlap instead of stacking

Figures are illustrative orders of magnitude to show the shape of the win, not benchmarks — actual latency depends on model, Region, network, audio length, and concurrency. The single biggest design rule for voice on AWS: never wait for a full result at any stage you can stream. Measure end-to-end "time to first audio," not stage-by-stage averages.
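The arithmetic behind the table's last row can be checked with a few lines. This sketch simply adds the low and high ends of the illustrative per-stage figures above; treating streaming speech-to-text as roughly zero added delay at end of speech is an assumption taken from the "ready at end of speech" cell.

```python
# (low, high) seconds per stage, copied from the illustrative table above.
NAIVE = {"stt": (0.5, 2.0), "llm_full_answer": (1.0, 4.0), "tts_whole_reply": (0.3, 1.0)}
STREAMING = {"stt": (0.0, 0.0), "llm_first_token": (0.3, 0.8), "tts_first_audio": (0.1, 0.3)}


def time_to_first_audio(stages):
    """Sum each stage's delay before the next stage can start producing output."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return round(low, 2), round(high, 2)
```

With these inputs the naive path sums to 1.8 to 7.0 seconds and the streaming path to 0.4 to 1.1 seconds, which is where the rounded "~2–7s" and "~0.5–1s" ranges come from.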

making it feel alive

V. Barge-in, turn-taking, and the conversational details

The difference between a real assistant and a hold message is whether you can interrupt it. Barge-in — the caller starting to speak while the assistant is mid-sentence, and the assistant stopping to listen — is the single feature that makes a voice bot feel human, and it falls out of the streaming architecture rather than being a separate product.

Mechanically, barge-in means running the input and output paths concurrently: even while Polly audio is playing, the microphone stream stays open and Transcribe keeps listening. When inbound speech is detected above a threshold, the system stops playback immediately, cancels the in-flight model generation if appropriate, and treats the new audio as the next turn. In Amazon Lex and Amazon Connect, barge-in is a configurable setting you enable on prompts. In a custom Pattern-B build you implement it yourself — keep the Transcribe stream alive during synthesis, and wire a "voice detected" event to halt the Polly playback buffer. Either way it depends on the same full-duplex, streaming foundation as low latency.
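The concurrency described above can be sketched with `asyncio`. This is a minimal sketch, not a production audio loop: `play_audio` is a hypothetical playback routine that sleeps instead of playing sound, and `speech_detected` is an event that your Transcribe or voice-activity code would set when the caller starts talking.

```python
import asyncio


async def play_audio(chunks, played):
    """Hypothetical playback loop: records each chunk, then waits as if playing it."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.01)  # stands in for real playback time


async def speak_with_barge_in(chunks, speech_detected):
    """Play chunks until they finish or the caller starts speaking, whichever is first."""
    played = []
    playback = asyncio.create_task(play_audio(chunks, played))
    barge_in = asyncio.create_task(speech_detected.wait())
    done, _ = await asyncio.wait(
        {playback, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = barge_in in done and not playback.done()
    # Stop whichever task is still running: playback on barge-in, the listener otherwise.
    for task in (playback, barge_in):
        if not task.done():
            task.cancel()
    await asyncio.gather(playback, barge_in, return_exceptions=True)
    return played, interrupted
```

When `interrupted` is true, the same place in the code is where you would also cancel the in-flight model generation and start treating inbound audio as the next turn.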

Beyond barge-in, a few conversational details separate good from grating. Confirmation and grounding for risky actions: a voice assistant that transfers money or cancels an order should read back the request and confirm, because there is no screen to double-check. Graceful fallback: when the model is unsure or the request is out of scope, the assistant should say so and offer a human, not hallucinate — this is where routing a "fallback intent" to a human via Amazon Connect matters. Short, speakable answers: text that reads fine on screen is exhausting to hear, so voice system prompts instruct the model to answer in one or two concise sentences and offer detail on request. SSML: Polly supports Speech Synthesis Markup Language for pauses, emphasis, pronunciation, and numbers/dates read naturally, which materially improves perceived quality. Latency masking: a brief, natural filler ("let me check that for you") while a slow tool call runs keeps the line from going silent.

Full-duplex audio — Keep the mic and Transcribe stream open while Polly is speaking, so the caller can interrupt at any moment — the technical basis of barge-in.
Interrupt handling — On detected inbound speech, stop playback instantly and (where sensible) cancel the in-flight Bedrock generation so the assistant pivots to the new turn rather than finishing the old one.
Confirm risky actions — Read back and confirm anything consequential (payments, cancellations, address changes) — there is no screen for the user to verify against.
Speakable responses — Instruct the model to reply in one or two short sentences; long screen-style answers are painful to listen to. Offer "want the details?" instead.
SSML for naturalness — Use Polly SSML for pauses, emphasis, and correct pronunciation of numbers, dates, currencies, and names — it is a large, cheap quality win.
Graceful human handoff — On low confidence or out-of-scope requests, route to a human (Amazon Connect agent queue) instead of guessing — voice errors are costlier than chat errors.
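The SSML point in the list above can be made concrete with a small builder. This sketch uses two standard SSML tags, `<break>` and `<say-as interpret-as="digits">`; the helper names are hypothetical, and tag support varies by Polly voice engine, so check the Polly documentation for the voice you choose.

```python
from xml.sax.saxutils import escape


def text(plain):
    """Escape plain text so characters like & and < are valid inside SSML."""
    return escape(plain)


def digits(code):
    """Ask the voice to read a code one digit at a time."""
    return f'<say-as interpret-as="digits">{escape(code)}</say-as>'


def speak(*fragments, pause_ms=300):
    """Join already-built SSML fragments with a short pause between them."""
    pause = f'<break time="{pause_ms}ms"/>'
    return "<speak>" + pause.join(fragments) + "</speak>"
```

For example, `speak(text("Your confirmation code is"), digits("4821"))` produces a document you would send to Polly with the text type set to SSML.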

from zero to a talking assistant

VI. Step-by-step: building the assistant

Standing up a first voice assistant on AWS is a matter of days, not quarters, especially with the Connect + Lex path. Below is the build sequence for the more general custom pipeline (Pattern B), with notes on where the managed Pattern A short-circuits the work. None of it requires provisioning a GPU.

The order matters: get each stage working and measured on its own before you chain them, because a latency or quality problem is far easier to localize in isolation than in the full loop.

Step 1 — Enable the services and lock down access

In your chosen Region, enable Amazon Bedrock model access for one workhorse model (e.g. Claude Sonnet) and a small fast model (e.g. a Nova or Claude Haiku tier) for easy turns; Transcribe and Polly need no enablement. Grant least-privilege IAM permissions to your backend role: transcribe:StartStreamTranscription, bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream (the IAM actions that authorize the Converse and ConverseStream calls), and polly:SynthesizeSpeech. Pick the Region for data-residency and latency, and verify your chosen models and voices exist there.
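A least-privilege policy for the backend role might look like the sketch below, written as a Python dict so it can be serialized and attached with your tooling of choice. Note that Bedrock's Converse and ConverseStream calls are authorized through the `bedrock:InvokeModel` and `bedrock:InvokeModelWithResponseStream` IAM actions. The `"*"` resources and the `Sid` names are placeholders; in practice you would narrow the Bedrock statement to the specific model ARNs you use.

```python
import json


def voice_backend_policy():
    """Return an IAM policy document covering the three pipeline stages."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "StreamingTranscription",
                "Effect": "Allow",
                "Action": ["transcribe:StartStreamTranscription"],
                "Resource": "*",
            },
            {
                "Sid": "BedrockInference",
                "Effect": "Allow",
                "Action": [
                    "bedrock:InvokeModel",
                    "bedrock:InvokeModelWithResponseStream",
                ],
                "Resource": "*",  # placeholder: restrict to your model ARNs
            },
            {
                "Sid": "SpeechSynthesis",
                "Effect": "Allow",
                "Action": ["polly:SynthesizeSpeech"],
                "Resource": "*",
            },
        ],
    }


policy_json = json.dumps(voice_backend_policy(), indent=2)
```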

Step 2 — Stand up streaming transcription

Open an Amazon Transcribe streaming session from your backend and pipe microphone audio in (typically PCM over a WebSocket from the client). Render partial results so you can see recognition working, and tune endpointing so turns end naturally. Confirm the transcript is essentially ready the instant the user stops talking before moving on.
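Deciding when a turn is ready to send to the model comes down to separating partial results from final ones. The sketch below assumes each streaming event has been decoded into a dict that mirrors the Transcribe streaming API's transcript event (`Transcript.Results[].IsPartial` and `Alternatives[].Transcript`); how you obtain those events depends on the SDK you use and is not shown.

```python
def final_utterances(events):
    """Yield the top transcript of each finalized (non-partial) result."""
    for event in events:
        for result in event.get("Transcript", {}).get("Results", []):
            if result.get("IsPartial", True):
                continue  # still being revised; wait for the final version
            alternatives = result.get("Alternatives", [])
            if not alternatives:
                continue
            transcript = alternatives[0].get("Transcript", "").strip()
            if transcript:
                yield transcript
```

Partial results are still useful for rendering a live caption or for barge-in detection; only the final ones should trigger a model call.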

Step 3 — Ground and call the model

Create a Bedrock Knowledge Base over your documents in S3 so the assistant answers from your content. Then call ConverseStream with a voice-tuned system prompt ("answer in one or two short, spoken sentences; if unsure, say so and offer a human"), the retrieved context, and the conversation history. Wrap the call in Guardrails. Measure time-to-first-token and, if it is too slow for common turns, route those to a smaller model.
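Assembling the grounded, guarded request can be sketched as a pure function that returns the keyword arguments for a `converse_stream` call. The retrieval itself is not shown: `passages` is assumed to be a list of text snippets already fetched from your Knowledge Base, and the prompt wording and `guardrail` tuple are illustrative choices.

```python
VOICE_RULES = (
    "Answer in one or two short, spoken sentences. "
    "If unsure, say so and offer a human."
)


def build_converse_request(model_id, transcript, passages, history=(), guardrail=None):
    """Return kwargs for a Converse-style streaming call.

    guardrail: optional (identifier, version) tuple.
    """
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    system_text = VOICE_RULES
    if context:
        system_text += "\n\nUse only these passages as your source of facts:\n" + context
    request = {
        "modelId": model_id,
        "system": [{"text": system_text}],
        "messages": list(history)
        + [{"role": "user", "content": [{"text": transcript}]}],
        "inferenceConfig": {"maxTokens": 256, "temperature": 0.2},
    }
    if guardrail:
        identifier, version = guardrail
        request["guardrailConfig"] = {
            "guardrailIdentifier": identifier,
            "guardrailVersion": version,
        }
    return request
```

The result would be passed as `client.converse_stream(**request)` on a `bedrock-runtime` client. Keeping request construction separate from the network call makes the prompt easy to inspect and test.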

Step 4 — Synthesize and stream audio out

Feed the model's streaming text into Amazon Polly streaming synthesis, choosing a neural or generative voice and applying SSML for natural pacing and pronunciation. Stream the audio back to the client and start playback on the first chunk. You now have a full turn — measure end-to-end "time to first audio."
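Polly's SynthesizeSpeech call takes a complete piece of text per request, so one common way to connect a token stream to it is to buffer deltas until a sentence boundary and synthesize sentence by sentence. The sketch below shows only that buffering step; the punctuation-based boundary rule is a simplifying assumption and will mis-split abbreviations such as "Dr.".

```python
import re

# A sentence ends at ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_deltas(deltas):
    """Group streamed text deltas into whole sentences ready for synthesis."""
    buffer = ""
    for delta in deltas:
        buffer += delta
        parts = SENTENCE_END.split(buffer)
        # Every part except the last is a complete sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever remains when the model stops generating.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence can be synthesized and queued for playback while the model is still producing the next one, which is what keeps time to first audio low.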

Step 5 — Add barge-in, state, and fallback

Keep the Transcribe stream open during playback and wire detected speech to stop Polly immediately (barge-in). Persist conversation history and session attributes (DynamoDB or Lex/Connect session state). Add a fallback path that hands off to a human when confidence is low. For the contact-center surface, this is also where you build the Amazon Connect contact flow and the Amazon Lex bot — Lex gives you Steps 2 and 4 (STT + TTS) and the dialog runtime out of the box, so Pattern A collapses much of the above into "configure Lex, write the Bedrock fulfillment Lambda."

a minimal Bedrock turn for voice (python / boto3, illustrative)

import boto3

def polly_feed(text):
    # Hypothetical hook: pass each text delta on to your Polly synthesis stage.
    print(text, end="", flush=True)

def speak_reply(transcript):
    # transcript: final text already produced by Transcribe streaming
    brt = boto3.client("bedrock-runtime", region_name="us-east-1")
    response = brt.converse_stream(
        modelId="anthropic.claude-haiku",  # placeholder; copy a current fast-model ID from the Bedrock console
        system=[{"text": "Answer in one or two short spoken sentences. If unsure, offer a human."}],
        messages=[{"role": "user", "content": [{"text": transcript}]}],
        inferenceConfig={"maxTokens": 256, "temperature": 0.2},
    )
    # Forward each text delta as it arrives instead of waiting for the full answer.
    for event in response["stream"]:
        delta = event.get("contentBlockDelta", {}).get("delta", {}).get("text")
        if delta:
            polly_feed(delta)

Model IDs and API shapes are illustrative — copy current IDs from the Bedrock console and check the latest SDK docs.

how you actually pay

VII. The cost model: what a voice minute really costs

A voice assistant's bill is the sum of several metered services, and the units are different for each: speech recognition and telephony are billed per minute of audio (Transcribe, Connect), speech synthesis per character of text (Polly), and reasoning per token (Bedrock). Understanding which line item dominates tells you where to optimize. The breakdown below is a representative 2026 framing of relative scale, not a price list; always confirm current rates on each AWS pricing page.

The useful mental model is "cost per conversation minute." For a typical assistant, the Bedrock reasoning line is usually the largest and most variable — it scales with how much context you send (system prompt + retrieved passages + history) and which model you use, which is exactly why grounding efficiency, prompt caching, and model routing matter. Transcribe bills per second/minute of audio processed; streaming is the relevant tier. Polly bills per million characters synthesized, so it scales with how much the assistant talks (another reason to keep answers short). If you use Amazon Connect, you add per-minute telephony, and Amazon Lex bills per speech/text request.

The cost levers are familiar from Bedrock generally: route easy turns to a small fast model and reserve a frontier model for hard ones (this cuts both cost and latency at once); turn on prompt caching so a large, stable system prompt and tool schema are not re-billed at full price every single turn — a big win for chatty voice loops; keep retrieved context tight so you are not paying to re-send a whole manual on every utterance; and keep spoken answers short, which lowers both Bedrock output tokens and Polly characters. For the deep dives see Bedrock pricing and the Bedrock pricing calculator.

voice-AI cost components on AWS · unit + the lever that controls it · representative 2026 framing — verify on each pricing page

Component	Billed per	Typical share of bill	Biggest cost lever
Bedrock reasoning	1K input / output tokens	Often the largest	Model routing + prompt caching + tight RAG context
Amazon Transcribe	Second / minute of audio (streaming)	Moderate	Only stream while the caller is actually talking
Amazon Polly	Million characters synthesized	Low–moderate	Short spoken answers; cache repeated prompts/audio
Amazon Lex (if used)	Speech / text request	Low	Handle structured turns in Lex; escalate only when needed
Amazon Connect (if used)	Minute of connected call	Telephony baseline	Resolve in self-service before routing to a paid agent

Shares are directional, not audited — the actual mix depends on conversation length, model, context size, and whether you use Connect/Lex at all. Numbers and tiers change; confirm on aws.amazon.com pricing pages for Transcribe, Bedrock, Polly, Lex, and Connect. The highest-leverage moves for a voice bill: route to small models, cache the system prompt, keep RAG context and spoken answers short.
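The "cost per conversation minute" mental model reduces to a short formula over the billing units in the table. The sketch below takes rates and usage as inputs; the example values are made-up round numbers chosen so the arithmetic is easy to follow and are NOT AWS prices. Substitute current figures from each pricing page.

```python
def cost_per_minute(rates, usage):
    """Sum the metered costs for one conversation minute."""
    return (
        usage["audio_minutes"] * rates["transcribe_per_minute"]
        + usage["input_tokens"] / 1000 * rates["bedrock_per_1k_input"]
        + usage["output_tokens"] / 1000 * rates["bedrock_per_1k_output"]
        + usage["tts_characters"] / 1_000_000 * rates["polly_per_million_chars"]
        + usage.get("lex_requests", 0) * rates.get("lex_per_request", 0.0)
        + usage["audio_minutes"] * rates.get("connect_per_minute", 0.0)
    )


# Placeholder rates in arbitrary units, for arithmetic only (NOT real prices).
EXAMPLE_RATES = {
    "transcribe_per_minute": 1.0,
    "bedrock_per_1k_input": 2.0,
    "bedrock_per_1k_output": 4.0,
    "polly_per_million_chars": 10.0,
}
# Placeholder usage for one minute of conversation.
EXAMPLE_USAGE = {
    "audio_minutes": 1,
    "input_tokens": 3000,
    "output_tokens": 500,
    "tts_characters": 100_000,
}
example_total = cost_per_minute(EXAMPLE_RATES, EXAMPLE_USAGE)
```

Plugging in your own measured token counts per turn shows quickly how much of the total comes from the context you resend on every utterance.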

where it fits

VIII. Use cases: IVR, in-app voice, and where it pays off

Voice AI on AWS earns its keep wherever a conversation is the natural interface and the alternative is either a frustrating phone tree or a human doing repetitive work. The two anchor use cases are the contact center (IVR) and in-product voice, but the same Transcribe → Bedrock → Polly core powers a wider range.

The flagship use case is the intelligent IVR / contact-center assistant. Traditional phone trees ("press 1 for billing") are slow and rigid; a Bedrock-backed Amazon Connect + Lex assistant lets callers say what they want in their own words, answers grounded questions from a Knowledge Base, completes routine tasks (check an order, reset a password, book an appointment), and hands off to a human — with full context — only when needed. This deflects a large share of calls from human agents while improving the caller experience, which is why it is the most common production deployment.

The second is in-app and on-device voice: a voice assistant embedded in a mobile app, web app, kiosk, vehicle, or hardware device, built on the direct Transcribe + Bedrock + Polly pipeline. Here voice is a feature of your product rather than a phone line — hands-free control, accessibility, a conversational help agent, or a voice-driven workflow. Beyond these two, the same building blocks power outbound voice (reminders, confirmations), real-time agent assist (transcribe a live human call and surface answers, via Amazon Q in Connect), voice analytics (transcribe and summarize calls for QA and insight), and multilingual support (Transcribe and Polly cover dozens of languages, so one architecture serves many markets).

Intelligent IVR / call deflection — Amazon Connect + Lex + Bedrock: callers speak naturally, get grounded answers and self-service task completion, and reach a human only when truly needed — deflecting routine calls.
In-app & on-device voice assistant — Direct Transcribe + Bedrock + Polly inside your app, kiosk, vehicle, or device: hands-free control, accessibility, and conversational help as a product feature.
Real-time agent assist — Transcribe a live human call and use Bedrock (via Amazon Q in Connect) to surface answers and next-best actions to the agent in real time.
Voice analytics & QA — Batch-transcribe calls with Transcribe and summarize/score them with Bedrock for quality assurance, compliance, and customer insight.
Outbound & reminder calls — Connect places calls; Lex + Bedrock + Polly deliver appointment reminders, confirmations, and simple interactive follow-ups.
Multilingual support at scale — Transcribe and Polly support dozens of languages, so one Transcribe → Bedrock → Polly architecture serves many markets without a rebuild.

pick the right path

Two ways to build voice AI on AWS — managed IVR vs custom pipeline

The first real decision is not which model but which architecture. Amazon Connect + Lex is the managed, telephony-first path; a direct Transcribe + Bedrock + Polly pipeline is the build-it-yourself, control-first path. This is the head-to-head; both share the same reasoning core.

Dimension	Connect + Lex + Bedrock (managed IVR)	Direct Transcribe + Bedrock + Polly (custom)
Best surface	Phone line / contact center / IVR	In-app, web, mobile, kiosk, device
Telephony	Built in (Amazon Connect)	You provide the audio transport
STT + TTS	Built into Lex (Transcribe + Polly under the hood)	You call Transcribe & Polly directly
Dialog management	Lex intents, slots, flows out of the box	You orchestrate turns yourself
Reasoning	Bedrock via a fulfillment Lambda	Bedrock ConverseStream from your backend
Latency control	Good; tuned via Lex/Connect settings	Maximum — you own every stage
Build effort	Lower (configure + one Lambda)	Higher (own the full pipeline)
Human handoff	Native (Connect agent queues + Q in Connect)	You build the escalation path

Rule of thumb: if the assistant answers a phone, start with Amazon Connect + Lex and add a Bedrock fulfillment Lambda — you get telephony, dialog, STT, TTS, and human handoff with the least code. If the assistant lives inside your product or a device, build the direct Transcribe → Bedrock → Polly pipeline for full control over latency and UX. The reasoning layer (Bedrock + a Knowledge Base + Guardrails) is identical either way.

building a voice assistant on aws?

Get AWS credits to fund the Transcribe + Bedrock + Polly bill — and a vetted partner to build it. You pay $0.

Get matched in 24h →

a recent match

A voice IVR assistant, funded by AWS credits — anonymized

inquiry · series-a healthcare-scheduling startup, US

Series-A health-tech SaaS, 24 people, scheduling product for clinics; HIPAA-eligibility requirement; already on AWS at ~$6K/month

Situation: Clinic front desks were drowning in inbound appointment calls, and the company's rigid touch-tone IVR deflected almost nothing — most callers mashed 0 for a human. They wanted a voice assistant that could understand natural requests ("I need to move my Thursday appointment"), answer from their own scheduling rules, and complete or reschedule bookings, while staying within their HIPAA-eligible posture and handing complex cases to staff. They had no speech or ML engineers and no appetite to run inference infrastructure.

What CloudRoute did: Routed within 22 hours to a US-East AWS partner with a healthcare + Amazon Connect track record. The partner built it on the managed path: Amazon Connect for telephony, Amazon Lex for the dialog and built-in speech I/O, and a Bedrock fulfillment Lambda calling Claude (grounded on a Bedrock Knowledge Base over the clinic policies and the scheduling API) with Guardrails for PII redaction and denied topics. Barge-in was enabled, easy turns were routed to a fast model for sub-second responses, and low-confidence calls fell back to a human agent queue with full context. In parallel the partner filed a Bedrock/GenAI proof-of-concept credit application and an Activate Portfolio application.

Outcome: GenAI POC credits ($25K) approved in under 2 weeks and Portfolio ($100K) shortly after, so the first ~6 months of Transcribe + Bedrock + Polly + Connect usage were effectively credit-funded. A grounded voice assistant went live in 6 weeks, deflected a large share of routine scheduling calls, kept all data in-Region under the HIPAA-eligible services, and handed edge cases to staff cleanly. CloudRoute's commission was paid by the partner from AWS engagement funding; the customer paid $0.

time-to-match: < 24h · credits secured: $125K · go-live: ~6 weeks · cost to customer: $0

faq

Common questions

How do I add a voice AI assistant on AWS?

Build a three-stage streaming loop: Amazon Transcribe converts the caller's speech to text in real time, Amazon Bedrock runs a foundation model (such as Claude or Amazon Nova) to reason over the transcript — usually grounded on your own data via a Bedrock Knowledge Base — and Amazon Polly converts the reply back to natural speech. For a phone-line / IVR experience, use Amazon Connect plus Amazon Lex (which has Transcribe and Polly built in) with a Bedrock fulfillment Lambda; for an in-app or on-device assistant, call Transcribe streaming, Bedrock ConverseStream, and Polly streaming directly from your backend. Stream at every stage so the turn lands under about a second.
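
As a rough illustration of the Bedrock and Polly stages of that loop, here is a minimal, non-streaming Python sketch of a single turn. It assumes the transcript has already come back from Amazon Transcribe and that the two clients are boto3 clients for `bedrock-runtime` and `polly`. `MODEL_ID`, `VOICE_ID`, the system prompt, and the function name `answer_turn` are placeholders for illustration, not values from this guide.

```python
from typing import Any

MODEL_ID = "example-model-id"  # placeholder: use a model ID enabled in your account
VOICE_ID = "Joanna"            # assumption: any Polly voice you have access to


def answer_turn(transcript: str, bedrock: Any, polly: Any) -> bytes:
    """One non-streaming turn: transcript text in, synthesized audio bytes out."""
    # Stage 2: ask the foundation model for a short, speakable reply.
    reply = bedrock.converse(
        modelId=MODEL_ID,
        system=[{"text": "You are a concise voice assistant. Answer in one or two sentences."}],
        messages=[{"role": "user", "content": [{"text": transcript}]}],
        inferenceConfig={"maxTokens": 200},
    )
    text = reply["output"]["message"]["content"][0]["text"]

    # Stage 3: turn the reply text into audio.
    speech = polly.synthesize_speech(Text=text, OutputFormat="mp3", VoiceId=VOICE_ID)
    return speech["AudioStream"].read()
```

With real clients this would be called as `answer_turn(text, boto3.client("bedrock-runtime"), boto3.client("polly"))`. Passing the clients in keeps the function easy to test with stand-ins. A production build would swap in the streaming variants of each call.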

Which AWS services do I need for a voice assistant?

The core three are Amazon Transcribe (speech-to-text), Amazon Bedrock (reasoning / response generation, typically with a Knowledge Base for grounding and Guardrails for safety), and Amazon Polly (text-to-speech). Add Amazon Lex for structured dialog (intents and slots) and Amazon Connect when the assistant lives on a phone number — Connect provides the telephony, routing, and human handoff, and integrates Lex natively.

What is the difference between Amazon Lex and using Bedrock directly for voice?

Amazon Lex is a managed dialog system (the technology behind Alexa) with built-in speech recognition and synthesis and a model for intents and slots — great for structured tasks like booking or status checks. Amazon Bedrock provides open-ended foundation-model reasoning. They are complementary: a common pattern uses Lex for the structured turns and routes anything Lex cannot resolve (the fallback / "anything else" intent) to a Bedrock-backed Lambda for a free-form, grounded answer. For an in-app assistant you may skip Lex and call Bedrock directly, handling dialog yourself.
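
The fallback routing described above could look roughly like the following Lambda handler sketch. It assumes the Lex V2 event and response shapes as we understand them (`sessionState.intent.name`, `inputTranscript`, a `Close` or `Delegate` dialog action) and assumes the fallback intent is named `FallbackIntent`. Confirm both against the current Lex documentation. `make_handler` and `answer_fn` are hypothetical names; `answer_fn` stands in for whatever function wraps your Bedrock call.

```python
from typing import Callable

FALLBACK_INTENT = "FallbackIntent"  # assumption: the bot's fallback intent name


def make_handler(answer_fn: Callable[[str], str]):
    """Build a Lex V2 fulfillment handler that sends fallback turns to answer_fn."""

    def handler(event: dict, context: object = None) -> dict:
        intent = event["sessionState"]["intent"]

        if intent["name"] != FALLBACK_INTENT:
            # Structured intents (booking, status checks): let Lex keep driving.
            return {
                "sessionState": {
                    "dialogAction": {"type": "Delegate"},
                    "intent": intent,
                }
            }

        # Anything Lex could not resolve gets a free-form, model-generated answer.
        reply = answer_fn(event.get("inputTranscript", ""))
        return {
            "sessionState": {
                "dialogAction": {"type": "Close"},
                "intent": {"name": intent["name"], "state": "Fulfilled"},
            },
            "messages": [{"contentType": "PlainText", "content": reply}],
        }

    return handler
```

In a deployed Lambda, the module would expose something like `lambda_handler = make_handler(ask_bedrock)`, where `ask_bedrock` is your own grounded Bedrock call.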

How do I make the voice assistant fast enough to feel natural?

Stream and overlap every stage. Use Amazon Transcribe streaming so the transcript is ready the moment the user stops talking, call Bedrock with ConverseStream so the first tokens arrive in a fraction of a second, and feed those tokens into Amazon Polly streaming so audio starts before the full answer exists. Also route common, simple turns to a smaller, faster model (Claude Haiku, Amazon Nova Lite/Micro, or Mistral) for low first-token latency, and tune endpointing so turns end naturally. The target is well under a second of perceived delay; measure end-to-end "time to first audio."
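
One way to overlap the Bedrock and Polly stages is to cut the token stream into sentences and synthesize each sentence as soon as it is complete. The sketch below assumes ConverseStream events carry text under `contentBlockDelta.delta.text`, which is our understanding of the event shape and worth verifying. The helper names `text_deltas` and `sentences` are illustrative. The sentence splitter is deliberately naive: it would also split after abbreviations such as "Dr.".

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def text_deltas(stream: Iterable[dict]) -> Iterator[str]:
    """Pull text fragments out of ConverseStream-style events, skipping other events."""
    for event in stream:
        delta = event.get("contentBlockDelta", {}).get("delta", {})
        if "text" in delta:
            yield delta["text"]


def sentences(deltas: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as its terminator has arrived."""
    buffer = ""
    for piece in deltas:
        buffer += piece
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence; the last may still grow.
        for done in parts[:-1]:
            if done.strip():
                yield done.strip()
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends
```

Each yielded sentence can be handed to Polly immediately, so audio for the first sentence plays while the model is still generating the rest. That is what keeps "time to first audio" low.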

What is barge-in and how do I implement it?

Barge-in is letting the caller interrupt the assistant mid-sentence — when they start talking, the assistant stops speaking and listens. In Amazon Lex and Amazon Connect it is a configurable setting on prompts. In a custom pipeline you implement it by keeping the microphone and Transcribe stream open while Polly audio plays, then halting playback (and optionally cancelling the in-flight Bedrock generation) the moment inbound speech is detected. It depends on the same full-duplex, streaming design as low latency.
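
For the custom-pipeline case, the control logic can be sketched as a small controller shared by the playback loop and the code that watches the still-open Transcribe stream. All names here (`BargeInController`, `play`, `sink`) are hypothetical. How inbound speech is detected, for example from a partial transcript or a voice-activity detector, is left to the caller.

```python
import threading
from typing import Callable, Iterable


class BargeInController:
    """Tracks whether the assistant is speaking and cancels the turn on interruption."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._speaking = False
        # Set on barge-in; the playback and generation loops can both poll it.
        self.cancelled = threading.Event()

    def start_speaking(self) -> None:
        with self._lock:
            self.cancelled.clear()
            self._speaking = True

    def finish_speaking(self) -> None:
        with self._lock:
            self._speaking = False

    def on_inbound_speech(self) -> bool:
        """Call when the open microphone stream reports speech. True means barge-in."""
        with self._lock:
            if not self._speaking:
                return False
            self._speaking = False
            self.cancelled.set()
            return True


def play(chunks: Iterable[bytes], controller: BargeInController,
         sink: Callable[[bytes], None]) -> int:
    """Send audio chunks to sink until done or interrupted; return chunks played."""
    controller.start_speaking()
    played = 0
    for chunk in chunks:
        if controller.cancelled.is_set():
            break  # caller started talking: stop speaking immediately
        sink(chunk)
        played += 1
    controller.finish_speaking()
    return played
```

The same `cancelled` event can be checked inside the loop that reads the Bedrock stream, so an interrupted answer stops generating as well as stops playing.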

How much does a voice AI assistant on AWS cost?

You pay per metered service with different units: Amazon Transcribe per second/minute of audio (streaming tier), Amazon Bedrock per 1,000 input/output tokens, and Amazon Polly per million characters synthesized; add Amazon Connect per connected-call minute and Amazon Lex per request if you use them. Bedrock reasoning is usually the largest and most variable line. The big cost levers are routing easy turns to a small model, enabling prompt caching so a large stable system prompt is not re-billed every turn, keeping retrieved context tight, and keeping spoken answers short. Rates change — confirm current figures on each AWS pricing page.
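
Because each service bills in a different unit, it can help to work out a per-turn estimate. The sketch below shows the arithmetic only: every rate in `Rates` is a made-up placeholder, not an AWS price, and the names `Rates` and `turn_cost` are illustrative. Replace the placeholders with current figures from the AWS pricing pages before relying on any result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """PLACEHOLDER rates in USD for illustration only; these are not AWS prices."""
    transcribe_per_min: float = 0.03     # per minute of streamed audio
    bedrock_in_per_1k: float = 0.002     # per 1,000 input tokens
    bedrock_out_per_1k: float = 0.01     # per 1,000 output tokens
    polly_per_1m_chars: float = 20.0     # per million characters synthesized


def turn_cost(audio_seconds: float, input_tokens: int, output_tokens: int,
              reply_chars: int, rates: Rates = Rates()) -> dict:
    """Estimate the cost of one conversational turn, broken down by service."""
    transcribe = audio_seconds / 60 * rates.transcribe_per_min
    bedrock = (input_tokens / 1000 * rates.bedrock_in_per_1k
               + output_tokens / 1000 * rates.bedrock_out_per_1k)
    polly = reply_chars / 1_000_000 * rates.polly_per_1m_chars
    return {
        "transcribe": transcribe,
        "bedrock": bedrock,
        "polly": polly,
        "total": transcribe + bedrock + polly,
    }
```

Calling `turn_cost(audio_seconds=6, input_tokens=3000, output_tokens=50, reply_chars=200)` and then varying `input_tokens` or `reply_chars` makes the cost levers above concrete: trimming retrieved context lowers the Bedrock line, and shorter spoken answers lower both the Bedrock and Polly lines.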

Is a voice assistant on AWS secure and compliant enough for regulated industries?

Yes, within the usual AWS controls. Bedrock does not use your prompts or outputs to train the base models and processes them in the AWS Region you call; data is encrypted in transit and at rest, traffic can stay off the public internet via VPC endpoints, access is governed by IAM with CloudTrail audit logging, and Guardrails redact PII. Transcribe, Polly, Bedrock, and Connect are included in AWS compliance programs (commonly SOC, ISO 27001, HIPAA eligibility, and PCI DSS depending on Region) — confirm the current scope for your Region and service in AWS Artifact. This is why regulated buyers (healthcare, financial services, public sector) can deploy voice AI on AWS.

Can AWS credits cover the cost of building voice AI?

Yes. The same GenAI credit programs that fund Bedrock builds apply to a voice assistant, since Bedrock is the reasoning core: Activate Portfolio (up to $100K) for institutionally-funded startups, Bedrock / GenAI proof-of-concept funding ($10K–$50K) for a defined build, and the competitive Generative AI Accelerator (up to $1M). These pools are largely partner-filed and invisible on the public Activate page. CloudRoute routes you to a vetted AWS partner who files the credit application and, if you need hands, builds the Transcribe + Bedrock + Polly workload with you — AWS funds the credits and the engagement, so you pay $0.

Build voice AI on AWS — and let AWS credits pay for it.

CloudRoute routes you to a vetted AWS partner who files your Bedrock/GenAI credit application (Activate Portfolio up to $100K, GenAI POC $10K–$50K, GenAI Accelerator up to $1M) and, if you need hands, builds the Transcribe + Bedrock + Polly voice assistant with you. AWS funds the credits and the engagement. You pay $0.

Get matched in 24h →

→ see the data & AI persona detail

matched within: < 24h

GenAI credit ceiling: up to $1M

cost to you: $0