A complete, neutral reference for the Evaluations feature in Amazon Bedrock. It covers the three job types (automatic, which uses built-in algorithmic metrics; human; and LLM-as-a-judge); how to evaluate a RAG system on both retrieval and generation quality against a Knowledge Base; how to build an evaluation dataset that actually predicts production; what the metrics mean (accuracy, robustness, toxicity, faithfulness, completeness, relevance); and how to read the scores to choose a model objectively and right-size it so you are not over-paying for capability you do not need. It also explains how AWS credits fund the evaluation and the inference behind it, so the work costs you $0.
Model evaluation in Amazon Bedrock is a managed feature for measuring how well a model — or a whole retrieval-augmented generation system — performs on your task, using your data, so you can choose between candidates with evidence instead of intuition. You create an evaluation job, point it at one or more models and a dataset of prompts, pick the metrics you care about, and Bedrock produces a scored report you can compare.
Every team building on Bedrock faces the same fork: which model do we use? Many foundation models are available through the single Bedrock API (Anthropic's Claude, Meta's Llama, Amazon's Nova and Titan, Mistral, Cohere, AI21, DeepSeek, and more) at very different price, latency, and capability points. Picking by a public leaderboard, a blog post, or the model everyone is talking about is how teams end up over-paying for a frontier model on a task a far cheaper one would have aced, or shipping a model that looks great in a demo and fails on the long tail of real inputs. Evaluation replaces that guesswork with a measurement on your prompts.
On Bedrock this is delivered as a managed capability, not a framework you assemble. You create an evaluation job in the console (or via the API), choose what to evaluate, supply a prompt dataset, select metrics, and Bedrock runs the candidates, applies the scoring, and writes the results — including per-metric scores and the underlying generations — to an Amazon S3 bucket you control. You can evaluate a single model to get an absolute read, or run several models and configurations head-to-head to compare them on the same prompts under the same scoring.
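As a rough illustration of the API route, the sketch below assembles the request for an automatic evaluation job and hands it to boto3's `create_evaluation_job`. The nested field names, the `Builtin.*` metric names, and the task type reflect one reading of the Bedrock API and should be checked against the current AWS documentation. The job name, role ARN, model ID, dataset label, and S3 URIs are hypothetical placeholders, and model-specific inference parameters are left out.

```python
def build_automatic_eval_request(job_name, role_arn, model_id,
                                 dataset_s3_uri, output_s3_uri):
    """Assemble keyword arguments for an automatic model-evaluation job.

    Field names are assumptions based on the Bedrock CreateEvaluationJob
    request shape; verify them against the current AWS documentation.
    """
    return {
        "jobName": job_name,
        "roleArn": role_arn,
        "evaluationConfig": {
            "automated": {
                "datasetMetricConfigs": [{
                    "taskType": "QuestionAndAnswer",
                    "dataset": {
                        "name": "support-questions",  # hypothetical dataset label
                        "datasetLocation": {"s3Uri": dataset_s3_uri},
                    },
                    # Metric names are assumptions; confirm the current catalog.
                    "metricNames": [
                        "Builtin.Accuracy",
                        "Builtin.Robustness",
                        "Builtin.Toxicity",
                    ],
                }]
            }
        },
        "inferenceConfig": {
            # Some API versions may also expect an "inferenceParams" JSON string here.
            "models": [{"bedrockModel": {"modelIdentifier": model_id}}]
        },
        "outputDataConfig": {"s3Uri": output_s3_uri},
    }


def submit(request):
    """Send the request to Bedrock. Needs boto3, AWS credentials, and an IAM role."""
    import boto3  # imported lazily so the request can be built without AWS access
    client = boto3.client("bedrock")
    return client.create_evaluation_job(**request)["jobArn"]


# All identifiers below are placeholders, not real resources.
request = build_automatic_eval_request(
    job_name="support-qa-eval",
    role_arn="arn:aws:iam::123456789012:role/ExampleBedrockEvalRole",
    model_id="example.model-id-v1",
    dataset_s3_uri="s3://example-bucket/eval/dataset.jsonl",
    output_s3_uri="s3://example-bucket/eval/results/",
)
```

Keeping the request builder separate from `submit` means the configuration can be reviewed, versioned, and reused across candidate models before anything is billed.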
There are two broad things you can evaluate. The first is a model (or model configuration) directly: how well does this model, with this prompt and these inference settings, do the task. The second is a RAG system built on a Bedrock Knowledge Base: how well does retrieval-plus-generation answer questions over your own documents — scored on both the retrieval step and the generated answer, separately, which is what lets you diagnose where a bad answer came from. Both are covered below.
One caveat, stated once and meant throughout: the exact metric names, the available built-in metrics, the supported judge models, and pricing change on Bedrock over time. Everything here is representative as of 2026 to convey how evaluation works and how to reason about it. Confirm the current metric catalog, supported models, and rates on the official AWS Bedrock documentation and pricing pages before you design your evaluation around a specific feature.
Bedrock model evaluation = a managed way to score one or more models (or a RAG system) on your prompts and the metrics that matter for your task, so you can choose a model objectively and right-size it. Three job types — automatic, human, and LLM-as-a-judge — plus a dedicated RAG evaluation that scores retrieval and generation separately.
Bedrock evaluation comes in three flavours, and choosing the right one is most of the skill. They trade off cost, speed, scale, and how well they capture subjective quality. The rule of thumb: objective, checkable tasks lean automatic; nuanced quality leans human; LLM-as-a-judge sits in between — close to human judgement at close to automatic cost and scale.
Read these as a spectrum from cheapest-and-most-mechanical to most-expensive-and-most-nuanced. Many real evaluations use more than one — for example, an automatic pass to screen many candidates cheaply, then a human or LLM-judge pass on the finalists for the qualities a metric cannot capture.
Automatic (or "programmatic") evaluation scores outputs with built-in algorithms and reference-based metrics — no humans in the loop. You supply prompts (and, for many metrics, a reference/expected answer), Bedrock runs the candidate model, and computes scores for dimensions such as accuracy, robustness, and toxicity using established methods. It is fast, cheap, repeatable, and scales to large datasets, which makes it the right first pass for objective tasks where there is a checkable right answer — classification, extraction, summarization against a reference, question-answering with known answers. Its limit is that algorithmic metrics struggle with open-ended quality ("is this answer genuinely helpful, well-reasoned, on-brand?"); for that you want one of the next two.
Human evaluation puts real people in the loop to score or compare outputs against instructions and a rubric you define: rating helpfulness, correctness, tone, or relevance, or doing side-by-side preference comparisons between two models. Bedrock lets you bring your own work team (your domain experts) or use an AWS-managed workforce. Humans are the gold standard for subjective, high-stakes, or domain-specific quality where nuance matters and a wrong call is costly. The trade-off is the obvious one: it is slower, more expensive, and harder to scale, so it is usually reserved for the metrics and the finalists that truly need a human eye.
LLM-as-a-judge uses a capable "judge" model to evaluate another model's outputs against a rubric — approximating human judgement on open-ended quality at a small fraction of human cost and at machine speed and scale. You define what "good" means (helpfulness, coherence, faithfulness to a reference, relevance, harmlessness), and the judge model scores each output, often with a written rationale. This has become the pragmatic default for subjective quality at scale: far cheaper and faster than human review, far more nuanced than algorithmic metrics. Treat it carefully — judge models have biases (e.g., toward longer or more confident answers) — so validate the judge against a sample of human labels and avoid having a model judge itself. Used well, it lets you evaluate thousands of outputs on qualities a metric cannot express.
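One way to validate a judge against human labels is to score the same sample twice, once by people and once by the judge, and measure how often they agree. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance) in plain Python. The labels are invented for illustration, and what counts as "good enough" agreement is a call you make for your own task.

```python
from collections import Counter


def agreement_rate(human, judge):
    """Fraction of examples where the judge label matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Agreement corrected for the agreement expected by chance alone."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human) | set(judge)
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented labels for eight outputs scored by a human panel and by the judge.
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]

print(f"agreement: {agreement_rate(human_labels, judge_labels):.2f}")
print(f"kappa:     {cohens_kappa(human_labels, judge_labels):.2f}")
```

If kappa is low, tighten the rubric or try a different judge model before trusting the judge on the full dataset.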
The most useful thing on this page is a clear rule for choosing the job type. Match the method to the <em>kind</em> of thing you are measuring — whether there is a checkable right answer, how subjective the quality is, and how many outputs you need to score — not to which sounds most rigorous.
Diagnose by asking what "good" means for your task and how you would recognize it, then read across to the right method:
Checkable answer → automatic. Subjective + high-stakes → human. Subjective + needs scale → LLM-as-a-judge. RAG system → RAG evaluation. And in practice: screen cheaply with automatic, then judge the finalists with humans or an LLM-judge. Pick the lightest method that actually captures what "good" means for your task.
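The rule above is simple enough to write down as a function. This is only a sketch of the decision order. The function name and return values are made up for illustration, and sending a task that is both high-stakes and high-volume to human review is an assumption, not something Bedrock prescribes.

```python
def choose_job_type(is_rag, checkable_answer, high_stakes):
    """Map the diagnostic questions to a starting evaluation method."""
    if is_rag:
        return "rag_evaluation"   # score retrieval and generation separately
    if checkable_answer:
        return "automatic"        # a right answer exists, so use algorithmic metrics
    if high_stakes:
        return "human"            # subjective and costly to get wrong
    return "llm_as_a_judge"       # subjective, and the lightest method that fits


print(choose_job_type(is_rag=False, checkable_answer=True, high_stakes=False))
print(choose_job_type(is_rag=False, checkable_answer=False, high_stakes=True))
print(choose_job_type(is_rag=False, checkable_answer=False, high_stakes=False))
```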
The same three methods, lined up against the dimensions that drive the choice: what scores them, how well they capture subjective quality, how they scale, what they cost, and the task they fit best.
| Job type | Who/what scores | Captures subjective quality? | Speed & scale | Relative cost | Best for |
|---|---|---|---|---|---|
| Automatic | Built-in algorithms / reference metrics | Weak (objective only) | Fast, large scale | Lowest (inference only) | Accuracy, robustness, toxicity on checkable tasks |
| Human (your team) | Your domain experts, via a rubric | Strong (expert nuance) | Slow, limited scale | Highest (expert time) | High-stakes, domain-specific quality |
| Human (AWS-managed) | AWS-managed workforce, via a rubric | Strong (general) | Medium, managed scale | High | Subjective quality without standing up your own panel |
| LLM-as-a-judge | A capable judge model, via a rubric | Good (approximates human) | Fast, large scale | Low–medium (judge inference) | Subjective quality at scale; finalist scoring |
If your application answers from your own documents through a Bedrock Knowledge Base, evaluating only the final answer hides the most useful information: <em>where</em> it went wrong. Bedrock's RAG evaluation scores the two halves of the pipeline independently — did retrieval fetch the right context, and did generation use it well — so a bad answer is diagnosable instead of just disappointing.
A RAG answer can fail in two fundamentally different places. Retrieval can fail — the system pulls the wrong, irrelevant, or incomplete chunks from the vector store, so even a perfect model cannot answer well because it never saw the right context. Or generation can fail — retrieval found the right context, but the model ignored it, contradicted it, hallucinated beyond it, or answered incompletely. These call for opposite fixes (improve chunking/embeddings/retrieval settings vs. improve the prompt/model), so a single end-to-end score that cannot tell them apart leaves you guessing.
Bedrock's RAG (Knowledge Base) evaluation addresses this by scoring the pipeline on metrics grouped around the two stages. On the retrieval side, it assesses whether the retrieved passages are relevant to the question and whether they cover the information needed to answer — i.e., did the right context come back at all. On the generation side, it assesses the answer for faithfulness / groundedness (does the answer stay true to the retrieved context, or does it hallucinate beyond it), relevance to the question, completeness (does it cover what was asked), and often correctness against a reference answer where you have one. Many of these generation-side judgements are made with an LLM-as-a-judge under the hood, because faithfulness and relevance are exactly the open-ended qualities a judge model captures well.
The practical payoff is a clean diagnosis. Low retrieval scores, decent generation scores → fix the retrieval layer: revisit chunk size and overlap, the embeddings model, the number of results returned, metadata filtering, or the source documents themselves. Good retrieval scores, low faithfulness/completeness → fix the generation layer: tighten the prompt to force grounding ("answer only from the provided context"), try a stronger or differently-tuned model, or adjust how context is assembled. Without the split you would re-tune the wrong half of the system and wonder why the answers did not improve. See the rag-on-aws and amazon-bedrock-knowledge-bases siblings for building the pipeline this evaluation measures.
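The diagnosis described above can be expressed as a small lookup on the two stage scores. In this sketch the 0-to-1 score scale and the 0.7 bar are hypothetical, and the "both low" branch (fix retrieval first, since generation cannot be judged fairly on the wrong context) is an assumption layered on top of the text.

```python
def diagnose_rag(retrieval_score, generation_score, bar=0.7):
    """Point at the half of the pipeline to fix, given per-stage scores in [0, 1]."""
    retrieval_ok = retrieval_score >= bar
    generation_ok = generation_score >= bar
    if retrieval_ok and generation_ok:
        return "healthy"
    if not retrieval_ok and generation_ok:
        return "fix retrieval: chunking, embeddings, result count, filters, sources"
    if retrieval_ok and not generation_ok:
        return "fix generation: grounding prompt, model choice, context assembly"
    # Both low: generation never saw the right context, so start upstream.
    return "fix retrieval first, then re-evaluate generation"


print(diagnose_rag(retrieval_score=0.45, generation_score=0.80))
print(diagnose_rag(retrieval_score=0.88, generation_score=0.52))
```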
A RAG answer fails for one of two reasons: retrieval brought back the wrong context, or generation misused the right context. Bedrock RAG evaluation scores them separately — retrieval relevance/coverage vs. generation faithfulness/relevance/completeness — so you fix the half that is actually broken instead of re-tuning the whole pipeline blind.
An evaluation is only as good as the prompts it runs on. A dataset that mirrors your real workload — including its hard and unusual cases — predicts production; a dataset of easy, hand-picked prompts produces a flattering score that collapses on contact with real users. Bedrock expects the dataset as a JSONL file in Amazon S3, one prompt (and, for reference-based metrics, an expected answer) per line.
The format is JSONL — "JSON Lines" — a plain-text file where each line is one self-contained JSON object describing a single evaluation example: at minimum the input prompt, and for reference-based metrics a reference/expected output to score against. (For RAG evaluation you provide the questions, and the system retrieves and generates; you can also supply ground-truth answers where you have them.) The exact field names depend on the evaluation type and metrics, so check the AWS docs for the schema your job expects. You upload the file to an S3 bucket and Bedrock writes the scored results back to S3.
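To make the format concrete, here is a minimal sketch that serializes a few examples to JSONL and reads them back. The field names `prompt`, `referenceResponse`, and `category` are assumptions about the schema for a reference-based model evaluation. As noted above, confirm the exact schema for your job type in the AWS docs. The example questions and answers are invented.

```python
import json

# Invented examples; field names are assumptions to verify against the AWS docs.
records = [
    {"prompt": "How do I reset my password?",
     "referenceResponse": "Use the 'Forgot password' link on the sign-in page.",
     "category": "account"},
    {"prompt": "Can I get a refund after 30 days?",
     "referenceResponse": "No. Refunds are available within 30 days of purchase.",
     "category": "billing"},
]


def to_jsonl(rows):
    """One self-contained JSON object per line, each terminated by a newline."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)


def from_jsonl(text):
    """Parse JSONL text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = to_jsonl(records)
print(jsonl_text, end="")
```

Because `json.dumps` escapes newlines inside strings, a multi-line prompt still occupies exactly one line of the file. Reading the file back before uploading it to S3 is a cheap way to catch malformed rows.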
Representativeness is everything. The prompts must look like what real users actually send — same phrasing, same length distribution, same messiness — and the dataset must include the hard cases: ambiguous questions, edge inputs, adversarial or out-of-scope requests, the long tail where models diverge. A model that scores 95% on twenty clean prompts can be the wrong choice if it falls apart on the 5% of gnarly inputs that drive your support tickets. The point of an evaluation set is to surface where models differ, so deliberately include the cases that separate them.
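A small worked example shows why the hard cases need their own label in the dataset: an aggregate score can look excellent while one slice fails completely. The sketch below uses invented results, with a hypothetical `slice` tag on each example, and breaks accuracy down per slice.

```python
from collections import defaultdict


def overall_accuracy(results):
    """Share of all examples the model got right."""
    return sum(int(r["correct"]) for r in results) / len(results)


def accuracy_by_slice(results):
    """Accuracy computed separately for each slice tag."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in results:
        totals[r["slice"]] += 1
        correct[r["slice"]] += int(r["correct"])
    return {name: correct[name] / totals[name] for name in totals}


# Invented results: every common prompt is right, every long-tail prompt is wrong.
results = ([{"slice": "common", "correct": True}] * 95
           + [{"slice": "long_tail", "correct": False}] * 5)

print("overall:", overall_accuracy(results))
print("by slice:", accuracy_by_slice(results))
```

The headline number alone would pass this model. The per-slice view shows it fails exactly the inputs that separate candidates.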
Size and balance. Bigger and more diverse beats small and cherry-picked, but quality and coverage matter more than raw count — a few hundred well-chosen, representative prompts that span your real input distribution and class mix usually tell you more than thousands of near-duplicates. Balance the categories you care about so a model cannot win by being good at only the common case. Keep the evaluation set separate from any fine-tuning training data — evaluating on data the model trained on measures memorization, not real performance.
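Two of these hygiene checks are easy to automate: dropping near-identical prompts and confirming the evaluation set does not overlap the fine-tuning data. The sketch below matches prompts after normalizing case and whitespace. That is a deliberately crude assumption: it catches exact repeats only, so paraphrased duplicates would need fuzzy or embedding-based matching. The prompts are invented.

```python
import re


def normalize(prompt):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", prompt).strip().lower()


def dedupe(prompts):
    """Keep the first occurrence of each normalized prompt, preserving order."""
    seen = set()
    unique = []
    for p in prompts:
        key = normalize(p)
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique


def leaked(eval_prompts, training_prompts):
    """Evaluation prompts that also appear in the fine-tuning training data."""
    train_keys = {normalize(p) for p in training_prompts}
    return [p for p in eval_prompts if normalize(p) in train_keys]


eval_prompts = ["How do I reset my password?",
                "how do I  reset my password?",
                "Cancel my subscription"]
training_prompts = ["cancel my   subscription", "Where is my invoice?"]

clean = dedupe(eval_prompts)
print("after dedupe:", clean)
print("leaked:", leaked(clean, training_prompts))
```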
Practically, assembling a clean, representative, labelled evaluation set out of raw logs, tickets, and documents is where most of the human effort in an evaluation goes — and it is exactly the work a vetted AWS ML partner does efficiently. Because the engagement is credit-funded, the customer does not pay for it (see §IX). A good evaluation set is also a durable asset: you reuse it every time a new model is released to re-run the comparison cheaply.
A score is only useful if you know what it measures. Bedrock evaluation reports across several families of metric, and the ones you select should follow directly from what your task needs. Here is what the common ones mean and when each should drive your decision.
Group the metrics by what they protect. Quality/correctness metrics ask "is the output right and good." Safety metrics ask "is the output harmful." And for RAG specifically, grounding metrics ask "does the answer stay true to the retrieved source." You will usually care about a small handful, not all of them — choose the two or three that map to how your product can fail.
Do not score on every metric — score on the two or three that map to how your product breaks. A structured-extraction tool lives or dies on accuracy and robustness; a public chatbot adds toxicity; a knowledge assistant is dominated by faithfulness, relevance, and completeness. Pick the metrics that match your failure modes, then let them drive the model choice.
An evaluation report is not the answer — it is the evidence. The decision is a judgement about which model is <em>good enough</em> for the task at the lowest cost and latency. The most valuable outcome of evaluation is usually not "the best model wins" but "the cheapest model that clears the bar wins" — right-sizing, which is one of the biggest levers on an AI product's unit economics.
Read the results as a trade-off, not a ranking. The frontier model will often top the quality scores, but if a model two or three tiers cheaper and faster also clears your quality bar on your prompts, that cheaper model is usually the right production choice, because per-token cost and latency compound across every request you will ever serve. The discipline is to set the quality bar before you look at cost ("for this task we need ≥ X on faithfulness and ≤ Y on toxicity"), then pick the cheapest, fastest candidate that meets it. This is what "right-sizing" means: matching model capability to task difficulty instead of reflexively buying the most powerful option.
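The selection rule (fix the bar first, then take the cheapest candidate that clears it) fits in a few lines. Everything in this sketch is hypothetical: the candidate names, the scores, the cost figures, and the thresholds are invented to show the mechanics, not drawn from any real evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    faithfulness: float          # higher is better
    toxicity: float              # lower is better
    cost_per_1k_requests: float  # in dollars, hypothetical


def right_size(candidates, min_faithfulness, max_toxicity):
    """Return the cheapest candidate that clears the quality bar, or None."""
    passing = [c for c in candidates
               if c.faithfulness >= min_faithfulness and c.toxicity <= max_toxicity]
    return min(passing, key=lambda c: c.cost_per_1k_requests) if passing else None


candidates = [
    Candidate("frontier", faithfulness=0.94, toxicity=0.01, cost_per_1k_requests=12.0),
    Candidate("mid-tier", faithfulness=0.91, toxicity=0.01, cost_per_1k_requests=3.0),
    Candidate("small", faithfulness=0.82, toxicity=0.02, cost_per_1k_requests=0.6),
]

choice = right_size(candidates, min_faithfulness=0.90, max_toxicity=0.02)
print(choice.name if choice else "no candidate clears the bar")
```

Returning `None` when nothing passes is deliberate: it forces a decision about lowering the bar or improving the pipeline, rather than silently picking the least-bad model.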
Evaluation also lets you compare more than just models. Run the same dataset across candidate configurations — a base model with strong prompting vs. the same model with RAG; a smaller model with RAG vs. a larger model without; different temperature or system-prompt variants; a fine-tuned model vs. the base it came from. Often the winning answer is not "a better model" but "a cheaper model plus RAG plus a better prompt," and only a head-to-head on your data reveals it. This is also how you justify (or reject) a fine-tune: evaluate the custom model against the base model with good prompting and decide whether the gain is worth the standing hosting cost (see the amazon-bedrock-fine-tuning sibling).
Finally, make evaluation repeatable, not a one-off. Models are released and updated constantly; a quarterly re-run of your saved evaluation set against the latest models tells you cheaply whether a newer, cheaper, or better model now clears your bar — letting you migrate deliberately instead of either churning on every release or ossifying on an outdated choice. A standing evaluation harness turns "which model?" from a recurring argument into a measurement you can re-run on demand.
Everything above assumes you pay AWS directly for evaluation. For most startups and many companies the relevant number is different, because AWS will frequently fund the work with credits. The inference an evaluation runs (every candidate model processing every prompt), any human-review or judge-model fees, and the build of the harness itself all draw those credits down before they ever touch your card.
An evaluation's cost is mostly inference: each candidate model has to process every prompt in your dataset, and an LLM-as-a-judge run adds the judge model's tokens on top. Add any AWS-managed human-workforce fees and the S3 storage for datasets and results, and that is the bill. It is usually small relative to the savings from choosing correctly — right-sizing from a frontier model to a model a few tiers cheaper can cut per-request cost by a large multiple across every request you will ever serve, which is exactly why evaluation pays for itself. And all of it is credit-eligible: model inference on Bedrock, the judge-model calls, the embeddings and vector store behind a RAG evaluation, and the S3 storage are covered, and AWS credits apply automatically against your bill until exhausted.
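A back-of-the-envelope estimate makes the "mostly inference" point concrete. In the sketch below every number is a hypothetical placeholder: the per-1K-token prices, the token counts per prompt, and the dataset size. Substitute the current rates from the AWS pricing page and your own measurements.

```python
def run_cost(num_prompts, in_tokens, out_tokens, price_in_per_1k, price_out_per_1k):
    """Dollar cost of one model processing every prompt once."""
    per_prompt = in_tokens * price_in_per_1k + out_tokens * price_out_per_1k
    return num_prompts * per_prompt / 1000


NUM_PROMPTS = 600  # hypothetical dataset size

# Hypothetical (input, output) prices per 1K tokens for two candidates.
candidates = {"large": (0.003, 0.015), "small": (0.0002, 0.001)}
JUDGE_PRICES = (0.003, 0.015)  # hypothetical judge-model prices

# Each candidate reads about 800 tokens and writes about 300 per prompt (assumed).
inference = {name: run_cost(NUM_PROMPTS, 800, 300, *prices)
             for name, prices in candidates.items()}

# The judge reads the prompt, the answer, and a rubric (about 1,300 tokens) and
# writes a short verdict (about 150 tokens), once per candidate output (assumed).
judge = len(candidates) * run_cost(NUM_PROMPTS, 1300, 150, *JUDGE_PRICES)

total = sum(inference.values()) + judge
print({name: round(cost, 2) for name, cost in inference.items()})
print("judge:", round(judge, 2), "total:", round(total, 2))
```

Under these made-up numbers the judge pass costs more than the candidate inference it scores, which is worth checking for your own rubric length before scaling the dataset up.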
The relevant pools are AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups); a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed specifically at proving out a GenAI use case; and the competitive Generative AI Accelerator (awards up to $1M for a small cohort of AI-first startups). A model evaluation is precisely the kind of proof the POC pool targets: it is the step that de-risks which model and architecture you commit to. Funding the evaluation with credits means you can compare candidates thoroughly, including the expensive frontier models, without spending runway just to find out which one you actually need.
The practical mechanic is that most of these pools are partner-filed — requested through the AWS Partner Network (the ACE program), not a public self-serve form — which is why teams route through an AWS partner rather than applying alone. That is the gap CloudRoute fills. CloudRoute matches you to the right credit pool for your stage and to a vetted AWS ML partner who both files the credit application and builds the evaluation: assembling a representative JSONL dataset, choosing the right job types and metrics for your task, running the head-to-head across candidate models and configurations, evaluating your RAG pipeline on retrieval and generation, and turning the scores into an honest model choice and right-sizing recommendation. The customer pays $0 — AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice. Related: AWS credits for generative-AI startups and Bedrock POC funding.
The headline decision, on one screen. Match the row to what you are actually measuring — whether there is a checkable answer, how subjective the quality is, how much scale you need, and whether you are testing a model or a whole retrieval pipeline. Representative 2026 guidance, not quotes.
| Method | Best when… | Captures subjective quality | Scale | Cost shape | Reach for it… |
|---|---|---|---|---|---|
| Automatic | There is a checkable right answer | Weak | Large | Inference only (cheapest) | First — to screen candidates cheaply |
| Human (your team) | High-stakes, domain-specific quality | Strongest | Small | Expert time (highest) | On finalists where nuance is costly |
| Human (AWS-managed) | Subjective quality, no panel of your own | Strong | Medium | Managed-workforce fees | When you lack in-house reviewers |
| LLM-as-a-judge | Subjective quality at scale and speed | Good (approximates human) | Large | Judge-model inference | For open-ended quality at volume |
| RAG evaluation | Answers come from a Knowledge Base | Good (judge under the hood) | Large | Inference + embeddings/vector store | To diagnose retrieval vs. generation |
Situation: The team had shipped a RAG support assistant on the most capable (and most expensive) frontier model on Bedrock "to be safe," and inference cost was eating their margin as volume grew. They suspected a cheaper model would be fine but had no evidence — and they were also seeing occasional confidently-wrong answers they could not explain, unsure whether the fix was a better model or better retrieval.
What CloudRoute did: CloudRoute matched them in under 24 hours to a Canadian AWS ML partner. The partner built a Bedrock evaluation harness: a JSONL dataset of ~600 real support questions (including the gnarly long-tail ones), an automatic pass for accuracy, an LLM-as-a-judge pass for helpfulness and faithfulness, and a dedicated RAG evaluation scoring retrieval and generation separately. The partner ran four candidate models head-to-head, plus a "smaller model + improved retrieval" configuration, and filed a Bedrock POC credit application plus an Activate Portfolio application to fund the whole build and the evaluation inference.
Outcome: The RAG split showed the confidently-wrong answers were a <em>retrieval</em> problem (poor chunking), not the model — fixed by re-chunking, after which a model two tiers cheaper cleared the same faithfulness and helpfulness bar as the frontier model on their prompts. They right-sized to the cheaper model plus better retrieval, cutting per-request inference cost substantially with no measurable quality loss. The evaluation inference, embeddings, vector store, and judge-model calls were all covered by the approved credits, so the team paid $0 during the build and proof-out. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.
method: automatic + LLM-judge + RAG eval · outcome: right-sized down two model tiers · out-of-pocket during build: $0
Picking the wrong model — or over-paying for a frontier model a cheaper one would have aced — is one of the most expensive mistakes in an AI build. A proper Bedrock evaluation fixes it, and AWS credits cover the evaluation runs and the inference behind them. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS ML partner who builds the eval harness, runs the head-to-head, and turns the scores into an honest model choice. Customer pays $0.