An AI agent is a language model that does not just answer — it plans, calls your tools, reads your data, and loops until a task is done. This is the full build guide on AWS: what an agent actually is, the two paths to ship one — fully managed with Amazon Bedrock Agents, or a custom tool-use loop you build on the Bedrock Converse API plus Lambda and Step Functions — how to define tools and actions, wire in knowledge (RAG), add memory, attach guardrails, and instrument observability, plus the managed-vs-custom decision, a concrete step-by-step outline, the real cost stack, and the production concerns nobody warns you about.
An agent is not a smarter prompt. It is a foundation model placed inside a loop, given tools and knowledge, and allowed to take actions toward a goal — deciding each next step from what it has seen so far, rather than following a script you wrote.
A plain language-model call is one-shot: text in, text out. A chatbot adds conversation memory but still only produces words. An AI agent adds two things that change the category entirely: tools (the ability to call functions, APIs, and databases — to act on the world) and a loop (the ability to take a result, reason about it, and decide the next action). Give an agent the goal "issue a refund for order 4471 and email the customer," and it will look up the order, check the refund policy, call the refund API, then call the email API — observing each result and choosing the next move. No fixed script decided that sequence; the model did, at runtime.
The canonical pattern is a plan → act → observe cycle, often called ReAct (reason + act). The model reasons about the goal and emits either a final answer or a request to call a tool; something executes that tool; the result is fed back; the model reasons again. The cycle repeats until the model decides the task is done. Everything else in this guide — tools, knowledge, memory, guardrails, observability — is in service of making that loop reliable, grounded, safe, and affordable.
On AWS, the model that powers the loop runs on Amazon Bedrock regardless of which path you choose — Bedrock is AWS's managed API to foundation models (Anthropic's Claude, Amazon Nova, Meta's Llama, Mistral, and others) with enterprise security and data privacy (your data is not used to train base models; it stays in your account and Region). The real architectural decision is not which cloud or even which model — it is who runs the loop: Bedrock Agents (managed) or your own code on the Converse API (custom). That decision is the spine of this guide.
One framing worth internalizing before you build: an agent that can take real actions on your systems is a security and reliability boundary, not just a feature. The same autonomy that makes it useful makes a careless agent dangerous — it can call the wrong tool, loop forever, or be steered by a malicious document. The sections on guardrails, observability, and production concerns are not optional polish; they are the difference between a demo and something you can put in front of customers.
A single model call answers. A chatbot converses. An agent takes a goal, then plans, calls your tools and knowledge, observes results, and loops until the task is done. Reach for an agent only when the work is multi-step and the path depends on intermediate results — otherwise a simpler pattern is cheaper and more predictable.
Whether you build managed or custom, every working agent is assembled from the same five parts. Getting these right — especially the tools and the instructions — is most of what makes an agent behave.
The parts below are vocabulary you will use in both build paths. In Bedrock Agents they are configuration fields; in a custom loop they are code and data structures you assemble yourself. Either way the concepts are identical, which is why it is worth defining them once before the paths diverge.
Every agent is backed by one foundation model and a block of instructions — natural-language text that sets the agent's role, what it may and may not do, its tone, and the business rules it must obey ("never issue a refund over $500 without escalating; always confirm the order ID before acting"). This is effectively the system prompt, and it is the single highest-leverage thing you write. Clear, specific instructions are the difference between an agent that stays on task and one that improvises. Model choice matters in parallel: stronger reasoning models follow multi-step plans and tool schemas more reliably, while smaller, faster models cut latency and cost for simpler agents.
Tools are how an agent does things — query a database, hit an internal API, trigger a workflow, send a message. Each tool has two halves: a schema that describes the tool to the model (its name, what it does, its parameters, and what it returns) and an executor that actually runs it — on AWS, most commonly an AWS Lambda function. The model reads the schema descriptions to decide which tool to call and what to pass; the executor returns a result that re-enters the loop. Tool design is covered in depth in section IV, because thin tool descriptions are the leading cause of agents calling the wrong tool or inventing parameters.
Tools are for doing; knowledge is for knowing. An agent grounded only in the model's training data cannot answer "what is our refund policy for EU customers?" with your actual policy — it will guess. Retrieval-augmented generation (RAG) fixes this: you index your documents into a vector store and let the agent retrieve the most relevant passages at any step. On AWS this is what Bedrock Knowledge Bases provides as a managed capability, or you build the retrieval yourself. Knowledge integration is covered in section V; the full RAG architecture is in the rag-on-aws sibling.
Within one task the agent needs session state — the running context of this interaction (what the user said, what tools returned). Across tasks, useful agents need longer-term memory so a returning user does not start from scratch. Memory is a deliberate design choice, not a default — it has privacy and cost implications because you are storing and re-injecting user context. Section VI covers both kinds and how each path implements them.
An agent that can take real actions and surface retrieved content needs a policy layer that screens inputs and outputs — blocking prompt-injection attempts, denied topics, profanity, and PII leakage, and keeping the agent on-scope. On AWS this is Amazon Bedrock Guardrails, a configurable safety filter you attach to the agent or apply around your own loop. For anything customer- or production-facing this is non-negotiable; section VII explains why agents are uniquely exposed.
When an agent misbehaves, fix in this order: (1) tool schemas — is the description precise enough for the model to choose correctly? (2) instructions — are the rules explicit? (3) knowledge — is the right context being retrieved? (4) memory — is stale or missing context confusing it? (5) model — only after the above, is the model itself the limit? Most "the agent is dumb" problems are actually thin tool descriptions.
The first real decision is not which model or which tools — it is whether to let Amazon Bedrock Agents run the orchestration loop for you, or to build the loop yourself on the Converse API. This one choice determines how much you build, how much you control, and how fast you ship.
The honest framing, mirroring the rest of the GenAI stack: start managed, move to custom only when a specific requirement forces it. Most teams either overbuild a bespoke agent framework they then have to maintain when Bedrock Agents would have shipped the same behavior in days, or they force everything into the managed path and fight it when they hit a hard requirement it does not express. Knowing where the line sits saves weeks. Both paths run the model on Bedrock and use the same concepts from section II — the difference is who owns the loop.
With Amazon Bedrock Agents, you declare the pieces and AWS runs the loop. You pick a model and write instructions, define one or more action groups (each a set of tools described by an OpenAPI or function schema and backed by a Lambda function — or returned to your app via return-of-control), associate any Knowledge Bases for RAG, attach a Guardrail, and optionally enable memory. Bedrock then runs the ReAct-style plan/act/observe loop: it prompts the model, parses the tool-call intent, invokes your Lambda, feeds the result back, and re-prompts — none of which you hand-write. You build and test against a draft (inspecting the step-by-step trace), cut an immutable version, point an alias at it, and invoke with the InvokeAgent API.
Choose managed when: you want to ship in days not weeks; the task is a fairly standard reason-and-act loop over a handful of tools and a knowledge base; you do not need exotic control flow or a specific orchestration framework; and you are happy to let AWS own the plumbing. This covers the large majority of support agents, operational copilots, and internal assistants. The amazon-bedrock-agents sibling is the deep reference on this path.
In the custom path you still call Bedrock for the model, but you own the loop. The foundation is the Bedrock Converse API, which supports tool use (a.k.a. function calling) natively: you send the model the conversation plus a list of tool specs; if the model wants to call a tool it returns a structured toolUse request with the chosen tool and arguments; your code executes that tool (typically in Lambda) and sends the result back as a toolResult; you repeat until the model returns a final answer. You write that while-loop — including how many iterations you allow, how you handle errors, how you log, and how you assemble context.
For anything beyond a short in-memory loop, you reach for AWS Step Functions to orchestrate the agent as a durable state machine: each model call and each tool call becomes a state, so you get built-in retries, error handling, branching, parallelism, timeouts, and — critically — durable, long-running and human-in-the-loop execution (a Step Functions workflow can pause for minutes, hours, or days waiting for an approval via a task token, then resume). Step Functions can invoke Bedrock and Lambda directly, which makes it a natural backbone for multi-step or multi-agent workflows that must survive failures and restarts. Orchestration libraries (LangGraph, the Strands Agents SDK, CrewAI, LlamaIndex) are common in this path too, often running inside Lambda or on a container.
Choose custom when: you need bespoke control flow the managed agent does not expose; durable, long-running, or human-in-the-loop workflows; multi-agent coordination; deep integration with an existing framework; custom retry/caching/routing logic; or you are squeezing cost and latency hard enough that owning every step pays off. The trade is real: you build and maintain the plumbing Bedrock would otherwise run.
Prototype on Bedrock Agents to prove the use case fast and get a baseline. Graduate to a custom Converse loop (with Step Functions for durability) only when a concrete requirement — long-running or human-in-the-loop workflows, multi-agent coordination, framework integration, or aggressive cost/latency control — actually forces it. Many production stacks are a hybrid: a managed Agent for the open-ended sub-task, Step Functions for the durable spine around it.
Tools are where agents succeed or fail. The model can only act through the tools you give it, and it chooses among them using nothing but their descriptions. Tool design is the highest-leverage engineering in the whole build.
A tool (an "action" in Bedrock Agents terms) is a callable capability the agent can invoke: get_order(order_id), search_policy(query), issue_refund(order_id, amount), send_email(to, subject, body). Each needs a schema the model reads and an executor that runs it. In Bedrock Agents you group related actions into action groups, describe them with an OpenAPI schema or a simpler function-definition format, and back each with a Lambda. In a custom Converse loop you pass a toolConfig — a list of tool specs with JSON-Schema input definitions — and dispatch the model's toolUse requests to your own Lambda handlers. The shape differs; the discipline is the same.
The model decides which tool to call and what arguments to pass purely from the schema text. Vague names and thin descriptions are the number-one cause of an agent calling the wrong tool or hallucinating a parameter. Invest in precise tool names, a one-line description of when to use each tool (not just what it does), clear per-parameter descriptions, correct required/optional flags, and explicit enums for constrained values. A good rule: a competent human who had never seen your system should be able to pick the right tool and fill its arguments from the descriptions alone. If they can't, neither can the model.
A small set of sharply scoped tools beats a few god-tools that take a free-form command. Narrow tools are easier for the model to choose correctly and far easier to secure. Because each executor is usually a Lambda, give every tool Lambda an IAM role with least privilege — only the permissions that one action needs — so that even a hijacked or confused agent cannot exceed its mandate. Make state-changing actions idempotent (a retried issue_refund must not double-refund), set sensible timeouts, and validate arguments inside the executor rather than trusting the model's output.
What a tool returns goes straight back into the model's context, so return clean, structured data, not a raw stack trace or an HTML error page. Equally important: handle failure gracefully. When a tool fails — order not found, API timeout, malformed input — return a descriptive, structured error the model can reason about ("order_not_found: no order matches 4471; ask the user to re-check the ID") rather than throwing. A thrown exception leaves the agent blind; a good error message lets it recover or ask the user. This single practice prevents a large share of agent loops and dead ends.
Not every action should be fully automated. For sensitive or irreversible operations (large refunds, account deletion, sending money), keep a human or your own backend in the loop. In Bedrock Agents this is return-of-control: instead of executing the action, Bedrock returns the chosen action and parameters to your application to run (or to gate behind an approval), then you send the result back. In a custom loop you simply do not auto-execute that tool — you route it to a Step Functions approval state or surface it for confirmation. The model still plans; you control execution of the dangerous parts.
For every tool: a precise name · a description of when to use it · clear per-parameter descriptions · correct required/optional + enums · a least-privilege IAM role on its executor · idempotency for state changes · structured success and error returns · a hand-off (return-of-control / approval) for high-impact actions. Get these right and most "the agent is unreliable" problems disappear.
Tools let an agent act; knowledge lets it answer from your data; memory lets it carry context across turns and sessions. Most genuinely useful agents need all three.
An agent that must answer from your documents needs retrieval-augmented generation: index your content into a vector store, retrieve the most relevant chunks for the question at hand, and let the model answer from those passages with citations. On AWS the managed route is Amazon Bedrock Knowledge Bases — point it at an S3 bucket (or a connector like SharePoint, Confluence, or a web crawler), pick an embedding model and a vector store (OpenSearch Serverless, Aurora pgvector, Pinecone, Redis), and it handles chunking, embedding, indexing, retrieval, and optional re-ranking. With Bedrock Agents you simply associate the Knowledge Base and the orchestration loop can query it at any step; in a custom loop you expose retrieval as a tool (e.g. search_docs(query)) the model can call, or you retrieve up front and inject the context.
The division of labor is worth stating plainly: tools are for doing, knowledge is for knowing. A support agent retrieves the refund policy from a Knowledge Base (knowing), then calls the refund tool (doing). The hard parts of RAG — chunking, embedding choice, re-ranking, freshness, and access control — are not agent-specific; they are covered in depth in the rag-on-aws and amazon-bedrock-knowledge-bases siblings. The one agent-specific rule: treat retrieved content as untrusted input, because a poisoned document can carry prompt-injection instructions (see section VII).
Within a single task the agent maintains session state: the running context of the interaction. In Bedrock Agents you invoke with a session identifier and Bedrock keeps the turn-by-turn context (plus session attributes like a logged-in customer ID) tied to it. In a custom loop, the message list you maintain across iterations is the session state — you assemble it, and you decide what to keep, summarize, or drop as it grows.
Across tasks, useful agents need longer-term memory so a returning user is not a stranger. Bedrock Agents offer a configurable cross-session memory feature that retains a summary of prior conversations per memory ID, with its own retention controls. In a custom loop you build this yourself — commonly a DynamoDB table (or a vector store for semantic recall) keyed by user, written to at the end of a session and read back at the start of the next. Either way, memory is a deliberate choice: it improves continuity but stores and re-injects user context, which has privacy and cost implications. Enable it where it earns its keep, scope what you retain, and respect data-retention and deletion requirements.
Tools/actions let the agent do (call APIs, change state). Knowledge / RAG lets it know (retrieve from your data, with citations). Memory lets it remember (within a session, and across sessions). Managed Bedrock Agents provide all three as configuration; a custom loop builds each from Converse + Lambda + a vector store + DynamoDB. Treat all retrieved/tool content as untrusted.
An agent that can act on your systems is a security boundary, and a loop you cannot see is a loop you cannot trust. Guardrails and observability are not finishing touches; they are prerequisites for letting an agent anywhere near production.
Amazon Bedrock Guardrails is a configurable safety filter that screens both the user input and the model output against content filters (hate, violence, sexual, etc.), denied topics you define, word/profanity filters, sensitive-information (PII) detection and redaction, and a prompt-attack filter that helps defend against prompt-injection and jailbreaks. In Bedrock Agents you attach a Guardrail to the agent and it applies across the loop; in a custom loop you call the guardrail (via the ApplyGuardrail API or inline on Converse) around your model calls. Because an agent both reads untrusted content and can take real actions, the Guardrail is an essential boundary, not an optional add-on. The amazon-bedrock-guardrails sibling covers configuration in depth.
The defining operational requirement of an agent is traceability: for any given run you need to see what the model reasoned, which tool it chose and with what arguments, what the tool returned, and how that shaped the next step. In Bedrock Agents this is the trace — a structured, step-by-step record of the orchestration you enable on invocation; it is the single most important debugging tool, because it turns an opaque loop into something you can inspect line by line. In a custom loop you build the equivalent: log every model request/response and every tool call, ideally with a shared trace/correlation ID per task.
Around that, use the standard AWS surface. Amazon CloudWatch for metrics (invocations, latency, errors) and Logs; Bedrock model-invocation logging to capture request/response detail to CloudWatch Logs or S3 for audit; AWS X-Ray for distributed tracing across the Lambda tool calls (and the Step Functions execution graph, which is itself a visual trace of every state). The production-standard setup: enable model-invocation logging, capture the agent trace on a sampled or error-only basis (it is verbose and adds payload), alarm on Lambda errors and latency, and watch token consumption as a first-class metric — it is both your cost signal and an early warning that the agent is looping.
Before an agent touches production: (1) attach a Guardrail (or call ApplyGuardrail in your loop); (2) capture a trace of every step (managed trace, or your own structured logs with a correlation ID); (3) least-privilege every tool Lambda's IAM role; (4) treat tool/retrieved content as untrusted; (5) alarm on errors, latency, and token spend. An agent you cannot see and cannot constrain is not production-ready.
Both paths run the same model on Bedrock and use the same five building blocks. The choice comes down to control versus plumbing: how much of the orchestration you need to own, against how much you want to build and maintain.
Default to managed Bedrock Agents. It collapses the loop, the trace, versioning, memory, and KB integration into configuration, and it is the right answer for the large majority of single-purpose agents — support automation, operational copilots, internal assistants. Reach for the custom Converse loop when a row in the right-hand column of the table below is a hard requirement: durable long-running or human-in-the-loop workflows (Step Functions), multi-agent coordination, a specific orchestration framework, bespoke control flow, or aggressive cost/latency tuning. The table makes the trade explicit.
| Dimension | Managed — Bedrock Agents | Custom — Converse + Lambda + Step Functions |
|---|---|---|
| Who runs the loop | Bedrock (managed orchestration) | You (your code on the Converse API) |
| Time to first agent | Days — declare model, tools, KB, guardrail | Weeks — build the loop, tool dispatch, state, logging |
| Tools / actions | Action groups + Lambda + OpenAPI/function schema | toolConfig + your Lambda handlers + JSON-Schema |
| Knowledge (RAG) | Associate a Bedrock Knowledge Base | Retrieval-as-a-tool, or your own RAG pipeline |
| Memory | Built-in session + cross-session memory | Your message list + DynamoDB / vector store |
| Guardrails | Attach to the agent | ApplyGuardrail / inline on Converse |
| Observability | Built-in step-by-step trace | Your structured logs + X-Ray + Step Functions graph |
| Long-running / human-in-the-loop | Return-of-control for hand-offs | Native via Step Functions (task tokens, waits) |
| Multi-agent / bespoke control flow | Limited (managed loop) | Full — you design the orchestration |
| You maintain | Almost nothing — AWS runs the loop | All of it — the loop and its plumbing |
| Best for | Most single-purpose agents; ship fast | Durable/multi-agent workflows, framework integration, cost-squeeze |
Here is the fastest credible path from zero to a working, production-leaning agent on AWS. The managed steps come first; the note on each step says what changes if you go custom. The order matters — most teams skip the scoping and evaluation steps and pay for it later.
There is no separate "agent" fee on AWS. An agent costs the sum of what it consumes, and the dominant line is almost always model tokens — because an agent re-sends its instructions, tool schemas, and accumulated context on every step of the loop.
The figures and shape below are representative as of 2026 to show where the money goes, not a quote — always check the AWS pricing page (and any third-party vendor, e.g. Pinecone) for current rates. The thing to internalize is the multiplier: a single user request can trigger several model calls, each re-sending the fixed prompt prefix plus everything observed so far. A four-step agent task can therefore cost several times a single chat completion. Same loop, same cost dynamics, whether managed or custom.
| Cost line | When you pay | Driver | Main lever to control it |
|---|---|---|---|
| Model tokens (the loop) | Per orchestration step (usually the largest) | Steps × (instructions + tool schemas + context) tokens | Smaller model; tight instructions/schemas; prompt caching; cap steps |
| Lambda (tool execution) | Per tool call | Invocations × duration × memory | Right-size memory; fast handlers; avoid chatty tools |
| Knowledge Base / RAG | Indexing + per query | Embedding tokens + vector-store baseline + retrieval | Re-rank to few chunks; right-size the vector store; only re-embed changed docs |
| Memory store | Continuous (if enabled) | DynamoDB / vector store reads + writes + storage | Retain only what you need; summarize instead of storing raw transcripts |
| Guardrails | Per evaluation | Text units screened (in + out) | Screen what matters; do not double-screen the same text |
| Step Functions (custom) | Per state transition | Transitions per execution (Standard) × volume | Use Express workflows for high-volume short runs; collapse trivial states |
Agents that demo beautifully struggle in production for predictable reasons. Here are the failure modes that bite teams most, and the mitigation for each — the same list applies to both build paths.
Before launch: narrowly scoped tools with rich schemas · least-privilege IAM on every executor · idempotent state-changing actions · structured error returns · a Guardrail on inputs and outputs · high-impact actions behind approval/return-of-control · a captured trace per run · CloudWatch alarms on errors/latency/token spend · a hard iteration cap · a scenario-based evaluation set in CI · a cost ceiling. Miss one and it surfaces in production, not the demo.
This is the comparison that decides your architecture. Read it as "default to managed Bedrock Agents; move to a custom Converse loop only when a row in the right column is a hard requirement for you."
| Dimension | Bedrock Agents (managed) | Custom (Converse + Lambda + Step Functions) |
|---|---|---|
| Best for | Most single-purpose agents; ship in days | Durable/multi-agent workflows; framework integration; cost-squeeze |
| Who runs the loop | Bedrock (managed ReAct orchestration) | You (your while-loop on the Converse API) |
| Build + maintenance effort | Low — declare components, AWS runs the loop | High — build and own the loop and its plumbing |
| Control over orchestration | Medium — instructions + prompt templates | Total — you design every step |
| Long-running / human-in-the-loop | Return-of-control hand-offs | Native via Step Functions (task tokens, waits) |
| Multi-agent coordination | Limited | Full — orchestrate however you like |
| Knowledge, memory, guardrails | Built-in (KB assoc., session/cross-session memory, attach Guardrail) | You assemble (retrieval-as-tool, DynamoDB/vector store, ApplyGuardrail) |
| Observability | Built-in step-by-step trace | Your structured logs + X-Ray + Step Functions graph |
Situation: The ops team was hand-handling a flood of shipment exceptions — look up the shipment, check the carrier status, decide on a reroute or refund within policy, notify the customer, and escalate the hard ones. They wanted an agent to do the routine 80% end-to-end, but it had to be durable (some cases wait hours for a carrier response), keep a human approval step for refunds and reroutes above a threshold, answer policy questions from their internal docs, and leave a full audit trail. A first single-prompt prototype could not take actions, hallucinated policy, and had no approval or audit story. They also did not want to fund the inference out of a runway earmarked for hiring.
What CloudRoute did: CloudRoute matched them in under 24 hours to an AWS partner in the Singapore Region with GenAI and Step Functions experience. The partner built a custom tool-use loop on the Bedrock Converse API (Claude as the reasoning model with tight instructions), with each action — shipment lookup, carrier status, reroute, refund, notify — a least-privileged Lambda described by a JSON-Schema tool spec returning structured results and errors. AWS Step Functions orchestrated the workflow for durability: long carrier waits via task tokens, refunds and reroutes above a threshold paused for human approval, retries and branching on failure. Policy answers came from a Bedrock Knowledge Base over the ops docs, exposed as a retrieval tool; a Bedrock Guardrail screened inputs and outputs; CloudWatch, X-Ray, and full per-task logging gave end-to-end traceability. The partner filed a Bedrock POC credit application plus an Activate Portfolio application to fund the build and the inference.
Outcome: The agent resolved the routine majority of shipment exceptions end-to-end within policy, with high-impact actions held for human approval and every run fully auditable. The Step Functions backbone handled the multi-hour waits without losing state. Model inference, Lambda, the Knowledge Base, Guardrails, and Step Functions were covered by the approved AWS credits, so the build and early production ran at $0 out of pocket. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.
custom Converse loop + Step Functions · tools as least-privileged Lambdas · refunds/reroutes behind approval · KB-grounded policy · credits: POC + Activate · out-of-pocket: $0
Whatever your agent would cost to build and run on AWS — Bedrock inference, Lambda tools, a Knowledge Base, Step Functions, guardrails — AWS credits can cover it. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS partner who builds the agent, managed or custom, the right way. Customer pays $0.