bedrock model evaluation · choosing & right-sizing · 2026

Amazon Bedrock model evaluation — pick the right model with evidence, not vibes.

A complete, neutral reference for the Evaluations feature in Amazon Bedrock: the three job types — automatic (built-in algorithmic metrics), human, and LLM-as-a-judge; how to evaluate a RAG system on both retrieval and generation quality against a Knowledge Base; how to build an evaluation dataset that actually predicts production; what the metrics mean (accuracy, robustness, toxicity, faithfulness, completeness, relevance); and how to read the scores to choose a model objectively and right-size it so you are not over-paying for capability you do not need. Plus how AWS credits fund the evaluation and the inference behind it, so the work costs you $0.

job types: automatic · human · LLM-judge
evaluates: models AND RAG
dataset format: JSONL prompts
cost with credits: $0
TL;DR
  • Amazon Bedrock model evaluation lets you compare foundation models — and your RAG system — on your own data with three kinds of job: automatic (built-in algorithmic metrics like accuracy, robustness, and toxicity), human (your own team or an AWS-managed workforce scoring outputs against a rubric), and LLM-as-a-judge (a strong model grading another model's outputs at a fraction of human cost). You pick the method by what you are measuring: objective tasks lean automatic, subjective quality leans human or LLM-judge.
  • It is the antidote to picking a model by leaderboard or vibes. You build an evaluation dataset of prompts that mirror your real workload, run two or more candidate models (and configurations) through it, score them on the metrics that matter for your task, and read off which model is good enough — then right-size down to the cheapest, fastest model that still clears the bar. Bedrock also evaluates RAG specifically, scoring retrieval quality and generation quality (faithfulness, relevance, completeness) separately so you can tell whether a bad answer was a retrieval failure or a generation failure.
  • The cost is mostly the inference you run during evaluation (every candidate processes every prompt) plus any human-review or judge-model fees — usually small next to the savings from choosing correctly, because right-sizing a model is one of the largest levers on an AI product's unit economics. AWS credits (Activate up to $100K, Bedrock/GenAI POC $10K–$50K, GenAI Accelerator up to $1M) cover the evaluation runs and the build. CloudRoute routes you to the credit pool and a vetted AWS partner who builds the eval harness — so you pay $0.
definition

I. What Amazon Bedrock model evaluation actually is

Model evaluation in Amazon Bedrock is a managed feature for measuring how well a model — or a whole retrieval-augmented generation system — performs on your task, using your data, so you can choose between candidates with evidence instead of intuition. You create an evaluation job, point it at one or more models and a dataset of prompts, pick the metrics you care about, and Bedrock produces a scored report you can compare.

Every team building on Bedrock faces the same fork: which model do we use? There are many foundation models available through the one Bedrock API — Anthropic's Claude, Meta's Llama, Amazon's Nova and Titan, Mistral, Cohere, AI21, DeepSeek, and more — at very different price, latency, and capability points. Picking by a public leaderboard, a blog post, or the model everyone is talking about is how teams end up over-paying for a frontier model on a task a far cheaper one would have aced, or shipping a model that looks great in a demo and fails on the long tail of real inputs. Evaluation replaces that guesswork with a measurement on your prompts.

On Bedrock this is delivered as a managed capability, not a framework you assemble. You create an evaluation job in the console (or via the API), choose what to evaluate, supply a prompt dataset, select metrics, and Bedrock runs the candidates, applies the scoring, and writes the results — including per-metric scores and the underlying generations — to an Amazon S3 bucket you control. You can evaluate a single model to get an absolute read, or run several models and configurations head-to-head to compare them on the same prompts under the same scoring.
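As a rough sketch of what such a job definition looks like through the API, the function below assembles a request dictionary for an automatic evaluation. The nested field layout follows the boto3 `bedrock` client's `create_evaluation_job` operation as best recalled here. The job name, role ARN, bucket paths, model identifier, dataset label, and the `Builtin.*` metric names are placeholders or assumptions, so confirm them against the current API reference. Nothing in this block calls AWS.

```python
def build_automatic_eval_request(job_name, role_arn, dataset_s3_uri,
                                 output_s3_uri, model_id, metric_names):
    """Assemble keyword arguments for an automatic evaluation job.

    Assumption: the nested keys mirror boto3's bedrock
    create_evaluation_job request; verify against the current docs.
    """
    return {
        "jobName": job_name,
        "roleArn": role_arn,
        "evaluationConfig": {
            "automated": {
                "datasetMetricConfigs": [{
                    "taskType": "QuestionAndAnswer",
                    "dataset": {
                        "name": "support-questions",  # hypothetical dataset label
                        "datasetLocation": {"s3Uri": dataset_s3_uri},
                    },
                    "metricNames": list(metric_names),
                }]
            }
        },
        "inferenceConfig": {
            "models": [{"bedrockModel": {"modelIdentifier": model_id}}]
        },
        "outputDataConfig": {"s3Uri": output_s3_uri},
    }


request = build_automatic_eval_request(
    job_name="qa-eval-candidate-a",
    role_arn="arn:aws:iam::111122223333:role/ExampleEvalRole",  # placeholder
    dataset_s3_uri="s3://example-bucket/eval/prompts.jsonl",    # placeholder
    output_s3_uri="s3://example-bucket/eval/results/",          # placeholder
    model_id="example.candidate-model-v1",                      # placeholder ID
    metric_names=["Builtin.Accuracy", "Builtin.Robustness", "Builtin.Toxicity"],
)
# With credentials configured, the request could then be submitted with:
#   boto3.client("bedrock").create_evaluation_job(**request)
```

Keeping the request as plain data makes it easy to generate one job per candidate model by changing only `model_id` and `job_name`, so every candidate sees the same dataset and metrics.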

There are two broad things you can evaluate. The first is a model (or model configuration) directly: how well does this model, with this prompt and these inference settings, do the task. The second is a RAG system built on a Bedrock Knowledge Base: how well does retrieval-plus-generation answer questions over your own documents — scored on both the retrieval step and the generated answer, separately, which is what lets you diagnose where a bad answer came from. Both are covered below.

One caveat, stated once and meant throughout: the exact metric names, the available built-in metrics, the supported judge models, and pricing change on Bedrock over time. Everything here is representative as of 2026 to convey how evaluation works and how to reason about it. Confirm the current metric catalog, supported models, and rates on the official AWS Bedrock documentation and pricing pages before you design your evaluation around a specific feature.

the one-line definition

Bedrock model evaluation = a managed way to score one or more models (or a RAG system) on your prompts and the metrics that matter for your task, so you can choose a model objectively and right-size it. Three job types — automatic, human, and LLM-as-a-judge — plus a dedicated RAG evaluation that scores retrieval and generation separately.

the three job types

II. The three kinds of evaluation job: automatic, human, and LLM-as-a-judge

Bedrock evaluation comes in three flavours, and choosing the right one is most of the skill. They trade off cost, speed, scale, and how well they capture subjective quality. The rule of thumb: objective, checkable tasks lean automatic; nuanced quality leans human; LLM-as-a-judge sits in between — close to human judgement at close to automatic cost and scale.

Read these as a spectrum from cheapest-and-most-mechanical to most-expensive-and-most-nuanced. Many real evaluations use more than one — for example, an automatic pass to screen many candidates cheaply, then a human or LLM-judge pass on the finalists for the qualities a metric cannot capture.

1. Automatic evaluation — built-in algorithmic metrics

Automatic (or "programmatic") evaluation scores outputs with built-in algorithms and reference-based metrics — no humans in the loop. You supply prompts (and, for many metrics, a reference/expected answer), Bedrock runs the candidate model, and computes scores for dimensions such as accuracy, robustness, and toxicity using established methods. It is fast, cheap, repeatable, and scales to large datasets, which makes it the right first pass for objective tasks where there is a checkable right answer — classification, extraction, summarization against a reference, question-answering with known answers. Its limit is that algorithmic metrics struggle with open-ended quality ("is this answer genuinely helpful, well-reasoned, on-brand?"); for that you want one of the next two.
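To make "reference-based metric" concrete, here is a minimal sketch of two common scoring functions, exact match and token-level F1, applied to model outputs and expected answers. These are generic illustrations, not Bedrock's built-in algorithms, and the normalization rule (lowercase, split on whitespace) is an assumption chosen for simplicity.

```python
from collections import Counter


def normalize(text):
    """Lowercase and split on whitespace (a deliberately simple rule)."""
    return text.lower().split()


def exact_match(prediction, reference):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_score(metric, pairs):
    """Average a metric over (prediction, reference) pairs."""
    return sum(metric(p, r) for p, r in pairs) / len(pairs)


# Hypothetical (model output, expected answer) pairs.
pairs = [
    ("Paris", "paris"),
    ("The refund takes five days", "Refunds take five business days"),
]
print(mean_score(exact_match, pairs))  # 0.5
print(mean_score(token_f1, pairs))
```

The second pair shows why the choice of metric matters: exact match gives it zero, while token F1 gives partial credit for the shared words, even though a human would call the answer incomplete rather than wrong.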

2. Human evaluation — your team or an AWS-managed workforce

Human evaluation puts real people in the loop to score or compare outputs against instructions and a rubric you define: rating helpfulness, correctness, tone, or relevance, or doing side-by-side preference comparisons between two models. Bedrock lets you bring your own work team (your domain experts) or use an AWS-managed workforce. Humans are the gold standard for subjective, high-stakes, or domain-specific quality where nuance matters and a wrong call is costly. The trade-off is the obvious one: human review is slower, more expensive, and harder to scale, so it is usually reserved for the metrics and the finalists that truly need a human eye.

3. LLM-as-a-judge — a strong model grades another model

LLM-as-a-judge uses a capable "judge" model to evaluate another model's outputs against a rubric — approximating human judgement on open-ended quality at a small fraction of human cost and at machine speed and scale. You define what "good" means (helpfulness, coherence, faithfulness to a reference, relevance, harmlessness), and the judge model scores each output, often with a written rationale. This has become the pragmatic default for subjective quality at scale: far cheaper and faster than human review, far more nuanced than algorithmic metrics. Treat it carefully — judge models have biases (e.g., toward longer or more confident answers) — so validate the judge against a sample of human labels and avoid having a model judge itself. Used well, it lets you evaluate thousands of outputs on qualities a metric cannot express.
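Validating the judge against human labels can be as simple as comparing verdicts on a shared sample. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance). The pass/fail labels are made-up data, and what counts as "good enough" agreement is a judgement call for your task, not something Bedrock defines.

```python
from collections import Counter


def agreement_rate(judge, human):
    """Fraction of items where judge and human gave the same label."""
    return sum(j == h for j, h in zip(judge, human)) / len(human)


def cohens_kappa(judge, human):
    """Agreement corrected for what two raters would match by chance."""
    n = len(human)
    observed = agreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    expected = sum(
        (judge_counts[label] / n) * (human_counts[label] / n)
        for label in set(judge) | set(human)
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail verdicts on the same ten outputs.
human = ["pass", "pass", "fail", "pass", "fail",
         "pass", "pass", "fail", "pass", "fail"]
judge = ["pass", "pass", "fail", "pass", "pass",
         "pass", "pass", "fail", "pass", "fail"]

print(agreement_rate(judge, human))          # 0.9
print(round(cohens_kappa(judge, human), 3))  # 0.783
```

Kappa is the more honest number when one label dominates: a judge that always says "pass" can post high raw agreement on an easy dataset while adding no information.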

the decision

III. Which evaluation method should you use?

The most useful thing on this page is a clear rule for choosing the job type. Match the method to the kind of thing you are measuring (whether there is a checkable right answer, how subjective the quality is, and how many outputs you need to score), not to whichever method sounds most rigorous.

Diagnose by asking what "good" means for your task and how you would recognize it, then read across to the right method:

  • There is a checkable right answer (classification, extraction, QA with known answers) → automatic — When correctness is objective and you have references, built-in metrics are fast, cheap, repeatable, and scale to large datasets. This is the right first pass.
  • Quality is subjective and the stakes are high (medical, legal, brand-critical) → human — When nuance matters and a wrong call is costly, put domain experts (your own team) or an AWS-managed workforce in the loop with a clear rubric. Reserve it for the metrics and finalists that truly need it.
  • Quality is subjective but you need scale and speed → LLM-as-a-judge — When you must score thousands of open-ended outputs on helpfulness/coherence/relevance and cannot afford human review at that volume, a strong judge model approximates human judgement cheaply. Validate the judge against a human-labelled sample first.
  • You are evaluating a RAG system → RAG evaluation (retrieval + generation) — When answers come from a Knowledge Base, use the dedicated RAG evaluation so you score retrieval quality and generation quality separately and can tell where a bad answer came from. See §V.
  • You are screening many candidates, then choosing among finalists → combine them — A common shape is an automatic pass to cheaply narrow the field, then an LLM-judge or human pass on the top two or three for the qualities a metric cannot capture. The methods are not mutually exclusive.
  • You care about safety/toxicity specifically → automatic toxicity + human spot-check — Built-in toxicity metrics flag the bulk automatically; a human spot-check on flagged and borderline cases catches what the metric misses. Pair with Bedrock Guardrails for runtime enforcement.
the rule of thumb

Checkable answer → automatic. Subjective + high-stakes → human. Subjective + needs scale → LLM-as-a-judge. RAG system → RAG evaluation. And in practice: screen cheaply with automatic, then judge the finalists with humans or an LLM-judge. Pick the lightest method that actually captures what "good" means for your task.
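The rule of thumb can be written down as a small function, which is handy when several teams need to apply it consistently. This is only a sketch: the argument names and method labels are invented for illustration, and the handling of a subjective task that is neither high-stakes nor high-volume (defaulting to the judge as the lighter option) is an assumption, not something the rule states.

```python
def choose_methods(checkable_answer=False, subjective=False, high_stakes=False,
                   needs_scale=False, uses_knowledge_base=False):
    """Return the evaluation methods suggested by the rule of thumb, in run order."""
    methods = []
    if uses_knowledge_base:
        methods.append("rag-evaluation")
    if checkable_answer:
        methods.append("automatic")
    if subjective:
        if needs_scale or not high_stakes:
            # Assumption: without a high-stakes constraint, the judge is the lighter option.
            methods.append("llm-as-a-judge")
        if high_stakes:
            methods.append("human")
    return methods


print(choose_methods(checkable_answer=True))
# ['automatic']
print(choose_methods(subjective=True, high_stakes=True, needs_scale=True))
# ['llm-as-a-judge', 'human']
```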

job types at a glance

IV. The three job types, compared on what matters

The same three methods, lined up against the dimensions that drive the choice: what scores them, how well they capture subjective quality, how they scale, what they cost, and the task they fit best.

bedrock evaluation job types compared · 2026
Job type | Who/what scores | Captures subjective quality? | Speed & scale | Relative cost | Best for
Automatic | Built-in algorithms / reference metrics | Weak (objective only) | Fast, large scale | Lowest (inference only) | Accuracy, robustness, toxicity on checkable tasks
Human (your team) | Your domain experts, via a rubric | Strong (expert nuance) | Slow, limited scale | Highest (expert time) | High-stakes, domain-specific quality
Human (AWS-managed) | AWS-managed workforce, via a rubric | Strong (general) | Medium, managed scale | High | Subjective quality without standing up your own panel
LLM-as-a-judge | A capable judge model, via a rubric | Good (approximates human) | Fast, large scale | Low–medium (judge inference) | Subjective quality at scale; finalist scoring
A frequent production pattern: an automatic pass to screen many candidates cheaply, then an LLM-as-a-judge pass on the finalists, with a small human sample to validate the judge. Representative for 2026; confirm the current metric catalog and supported judge models on the AWS Bedrock docs.
evaluating retrieval + generation

V. Evaluating a RAG system: retrieval quality and generation quality, separately

If your application answers from your own documents through a Bedrock Knowledge Base, evaluating only the final answer hides the most useful information: where it went wrong. Bedrock's RAG evaluation scores the two halves of the pipeline independently (did retrieval fetch the right context, and did generation use it well), so a bad answer is diagnosable instead of just disappointing.

A RAG answer can fail in two fundamentally different places. Retrieval can fail — the system pulls the wrong, irrelevant, or incomplete chunks from the vector store, so even a perfect model cannot answer well because it never saw the right context. Or generation can fail — retrieval found the right context, but the model ignored it, contradicted it, hallucinated beyond it, or answered incompletely. These call for opposite fixes (improve chunking/embeddings/retrieval settings vs. improve the prompt/model), so a single end-to-end score that cannot tell them apart leaves you guessing.

Bedrock's RAG (Knowledge Base) evaluation addresses this by scoring the pipeline on metrics grouped around the two stages. On the retrieval side, it assesses whether the retrieved passages are relevant to the question and whether they cover the information needed to answer — i.e., did the right context come back at all. On the generation side, it assesses the answer for faithfulness / groundedness (does the answer stay true to the retrieved context, or does it hallucinate beyond it), relevance to the question, completeness (does it cover what was asked), and often correctness against a reference answer where you have one. Many of these generation-side judgements are made with an LLM-as-a-judge under the hood, because faithfulness and relevance are exactly the open-ended qualities a judge model captures well.

The practical payoff is a clean diagnosis. Low retrieval scores, decent generation scores → fix the retrieval layer: revisit chunk size and overlap, the embeddings model, the number of results returned, metadata filtering, or the source documents themselves. Good retrieval scores, low faithfulness/completeness → fix the generation layer: tighten the prompt to force grounding ("answer only from the provided context"), try a stronger or differently-tuned model, or adjust how context is assembled. Without the split you would re-tune the wrong half of the system and wonder why the answers did not improve. See the rag-on-aws and amazon-bedrock-knowledge-bases siblings for building the pipeline this evaluation measures.
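The diagnosis above reduces to a two-by-two check on the stage scores. A minimal sketch follows, assuming both scores are normalized to the range 0 to 1 and using a hypothetical bar of 0.7. The case where both stages are low is not covered by the text; the suggestion to fix retrieval first is an assumption, on the reasoning that generation cannot be judged fairly until it receives the right context.

```python
def diagnose_rag(retrieval_score, generation_score, bar=0.7):
    """Map the two stage scores to the layer that most likely needs work."""
    retrieval_ok = retrieval_score >= bar
    generation_ok = generation_score >= bar
    if retrieval_ok and generation_ok:
        return "both stages clear the bar"
    if not retrieval_ok and generation_ok:
        return "fix retrieval: chunking, embeddings, result count, metadata filters"
    if retrieval_ok and not generation_ok:
        return "fix generation: grounding prompt, model choice, context assembly"
    # Assumption: with both stages low, repair retrieval before re-scoring generation.
    return "fix retrieval first, then re-evaluate generation"


print(diagnose_rag(retrieval_score=0.45, generation_score=0.85))
print(diagnose_rag(retrieval_score=0.90, generation_score=0.50))
```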

why the split matters

A RAG answer fails for one of two reasons: retrieval brought back the wrong context, or generation misused the right context. Bedrock RAG evaluation scores them separately — retrieval relevance/coverage vs. generation faithfulness/relevance/completeness — so you fix the half that is actually broken instead of re-tuning the whole pipeline blind.

the evaluation dataset

VI. Building an evaluation dataset that predicts production

An evaluation is only as good as the prompts it runs on. A dataset that mirrors your real workload — including its hard and unusual cases — predicts production; a dataset of easy, hand-picked prompts produces a flattering score that collapses on contact with real users. Bedrock expects the dataset as a JSONL file in Amazon S3, one prompt (and, for reference-based metrics, an expected answer) per line.

The format is JSONL — "JSON Lines" — a plain-text file where each line is one self-contained JSON object describing a single evaluation example: at minimum the input prompt, and for reference-based metrics a reference/expected output to score against. (For RAG evaluation you provide the questions, and the system retrieves and generates; you can also supply ground-truth answers where you have them.) The exact field names depend on the evaluation type and metrics, so check the AWS docs for the schema your job expects. You upload the file to an S3 bucket and Bedrock writes the scored results back to S3.
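A short sketch of producing such a file: each example is a dictionary, serialized to one line. The field names `prompt`, `referenceResponse`, and `category` are the ones recalled for automatic model-evaluation datasets, but treat them as assumptions and check the schema for your job type. The example questions and answers are invented.

```python
import json


def to_jsonl(examples):
    """Serialize examples as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples) + "\n"


# Hypothetical evaluation examples; field names are assumptions to verify.
examples = [
    {"prompt": "How do I reset my password?",
     "referenceResponse": "Use the 'Forgot password' link on the sign-in page.",
     "category": "account"},
    {"prompt": "can i get refund after 30 days??",
     "referenceResponse": "Refunds are available within 30 days of purchase only.",
     "category": "billing"},
]

payload = to_jsonl(examples)

# Sanity check: every line must parse on its own as a complete JSON object.
for line in payload.splitlines():
    json.loads(line)

# Write as UTF-8 before uploading to S3, for example:
#   with open("prompts.jsonl", "w", encoding="utf-8") as f:
#       f.write(payload)
```

Validating each line independently before upload catches the most common failure, a stray line break or trailing comma that turns one record into invalid JSON.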

Representativeness is everything. The prompts must look like what real users actually send — same phrasing, same length distribution, same messiness — and the dataset must include the hard cases: ambiguous questions, edge inputs, adversarial or out-of-scope requests, the long tail where models diverge. A model that scores 95% on twenty clean prompts can be the wrong choice if it falls apart on the 5% of gnarly inputs that drive your support tickets. The point of an evaluation set is to surface where models differ, so deliberately include the cases that separate them.

Size and balance. Bigger and more diverse beats small and cherry-picked, but quality and coverage matter more than raw count — a few hundred well-chosen, representative prompts that span your real input distribution and class mix usually tell you more than thousands of near-duplicates. Balance the categories you care about so a model cannot win by being good at only the common case. Keep the evaluation set separate from any fine-tuning training data — evaluating on data the model trained on measures memorization, not real performance.
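Two of these checks are easy to automate before a run: how the dataset is distributed across categories, and whether any evaluation prompt also appears in fine-tuning data. The sketch below assumes each example carries `prompt` and `category` fields (hypothetical names) and uses simple case and whitespace normalization to detect overlap; near-duplicate detection would need something stronger.

```python
from collections import Counter


def category_shares(examples):
    """Share of the dataset that falls in each category."""
    counts = Counter(ex["category"] for ex in examples)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def _canonical(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def leaked_prompts(eval_examples, training_prompts):
    """Evaluation prompts that also appear in the training prompts."""
    seen = {_canonical(p) for p in training_prompts}
    return [ex["prompt"] for ex in eval_examples if _canonical(ex["prompt"]) in seen]


# Hypothetical data for illustration.
eval_set = [
    {"prompt": "How do I reset my password?", "category": "account"},
    {"prompt": "Why was I charged twice?", "category": "billing"},
    {"prompt": "Cancel my plan", "category": "billing"},
    {"prompt": "Write me a poem about taxes", "category": "out-of-scope"},
]
training = ["why was I charged  twice?", "How do I change my email?"]

print(category_shares(eval_set))
print(leaked_prompts(eval_set, training))  # ['Why was I charged twice?']
```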

Practically, assembling a clean, representative, labelled evaluation set out of raw logs, tickets, and documents is where most of the human effort in an evaluation goes — and it is exactly the work a vetted AWS ML partner does efficiently. Because the engagement is credit-funded, the customer does not pay for it (see §IX). A good evaluation set is also a durable asset: you reuse it every time a new model is released to re-run the comparison cheaply.

  • One JSON object per line; UTF-8; uploaded to an Amazon S3 bucket (results written back to S3).
  • Each line carries the input prompt — and a reference/expected output for reference-based metrics.
  • Prompts that mirror real production inputs in phrasing, length, and messiness.
  • Deliberately include hard cases: ambiguous, edge, adversarial, and out-of-scope inputs.
  • Balanced across the categories/classes you care about; quality and coverage over raw count.
  • Kept separate from any fine-tuning training data; reused to re-test every new model release.
what the scores mean

VII. The metrics: accuracy, robustness, toxicity, faithfulness, and the rest

A score is only useful if you know what it measures. Bedrock evaluation reports across several families of metric, and the ones you select should follow directly from what your task needs. Here is what the common ones mean and when each should drive your decision.

Group the metrics by what they protect. Quality/correctness metrics ask "is the output right and good." Safety metrics ask "is the output harmful." And for RAG specifically, grounding metrics ask "does the answer stay true to the retrieved source." You will usually care about a small handful, not all of them — choose the two or three that map to how your product can fail.

  • Accuracy / correctness — How often the output matches the correct/expected answer — exact-match or semantic similarity for QA and extraction, F1 or similar for classification, reference-based scoring for summarization. The core metric whenever there is a checkable right answer.
  • Robustness — How stable the output stays under small, meaning-preserving perturbations of the input (typos, rephrasings, added noise). A model can be accurate on clean prompts yet brittle on messy real-world ones; robustness catches that fragility before users do.
  • Toxicity / harmfulness — How likely the output is to contain toxic, hateful, or otherwise harmful content. A safety gate for any user-facing product. Pair the evaluation metric with Bedrock Guardrails to enforce limits at runtime, not just measure them once.
  • Faithfulness / groundedness (RAG) — For RAG: how faithfully the answer sticks to the retrieved context versus hallucinating beyond it. The single most important generation-side metric for a knowledge assistant — an unfaithful answer is confidently wrong.
  • Relevance — How on-point the answer (or, for RAG, the retrieved context) is to the actual question. Catches answers that are fluent and grounded but do not address what was asked, and retrieval that returns related-but-not-useful passages.
  • Completeness / coverage — Whether the answer (and the retrieved context) covers everything the question needed, not just part of it. A partially-correct answer that omits a critical caveat can be as bad as a wrong one in high-stakes settings.
  • Quality dimensions for open-ended tasks — Helpfulness, coherence, fluency, tone/style, harmlessness — the subjective qualities that algorithmic metrics cannot express. These are scored by humans or an LLM-as-a-judge against your rubric, not by a built-in algorithm.
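Robustness is the least intuitive of these, so here is a sketch of the idea: score the same examples on clean and perturbed inputs and look at the gap. The typo generator, the toy keyword "model", and the gap definition are all illustrative assumptions. Bedrock's built-in robustness metric applies its own perturbations and scoring.

```python
import random


def add_typos(text, rate=0.1, seed=0):
    """Swap adjacent letters at roughly `rate` of positions (seeded, repeatable)."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so it is not swapped back
        else:
            i += 1
    return "".join(chars)


def accuracy(model, examples, perturb=None):
    """Share of examples where the model output equals the expected answer."""
    hits = 0
    for prompt, expected in examples:
        if perturb is not None:
            prompt = perturb(prompt)
        hits += model(prompt) == expected
    return hits / len(examples)


def robustness_gap(model, examples, perturb):
    """Accuracy lost when inputs are perturbed; 0.0 means fully stable."""
    return accuracy(model, examples) - accuracy(model, examples, perturb)


def keyword_model(prompt):
    """Toy stand-in for a model: a brittle keyword lookup."""
    return "billing" if "refund" in prompt.lower() else "other"


examples = [("I want a refund", "billing"), ("Where is my order", "other")]
print(robustness_gap(keyword_model, examples, lambda p: add_typos(p, rate=0.5)))
```

A model that is accurate on clean prompts but shows a large gap here is the one that will struggle on real user input, which is rarely typed carefully.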
choose metrics from how you can fail

Do not score on every metric — score on the two or three that map to how your product breaks. A structured-extraction tool lives or dies on accuracy and robustness; a public chatbot adds toxicity; a knowledge assistant is dominated by faithfulness, relevance, and completeness. Pick the metrics that match your failure modes, then let them drive the model choice.

reading the results

VIII. From scores to a decision: choosing and right-sizing a model

An evaluation report is not the answer; it is the evidence. The decision is a judgement about which model is good enough for the task at the lowest cost and latency. The most valuable outcome of evaluation is usually not "the best model wins" but "the cheapest model that clears the bar wins": right-sizing, which is one of the biggest levers on an AI product's unit economics.

Read the results as a trade-off, not a ranking. The frontier model will often top the quality scores — but if a model two or three tiers cheaper and faster also clears your quality bar on your prompts, that cheaper model is usually the right production choice, because per-token cost and latency compound across every request you will ever serve. The discipline is to set the quality bar before you look at cost ("for this task we need ≥ X on faithfulness and ≤ Y toxicity"), then pick the cheapest, fastest candidate that meets it. This is what "right-sizing" means: matching model capability to task difficulty instead of reflexively buying the most powerful option.
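The selection rule, bar first and cost second, can be sketched in a few lines. The model names, scores, and prices below are invented for illustration, and the function filters on cost only; latency could be added as one more "max" bar in the same structure.

```python
def right_size(candidates, bars):
    """Return the cheapest candidate whose scores meet every bar, else None.

    bars maps a metric name to ("min" or "max", threshold).
    """
    def clears(scores):
        for metric, (kind, threshold) in bars.items():
            value = scores[metric]
            if kind == "min" and value < threshold:
                return False
            if kind == "max" and value > threshold:
                return False
        return True

    passing = [c for c in candidates if clears(c["scores"])]
    return min(passing, key=lambda c: c["cost_per_1k_requests"], default=None)


# Hypothetical scores and prices, for illustration only.
candidates = [
    {"name": "frontier", "cost_per_1k_requests": 12.0,
     "scores": {"faithfulness": 0.95, "toxicity": 0.01}},
    {"name": "mid-tier", "cost_per_1k_requests": 3.0,
     "scores": {"faithfulness": 0.91, "toxicity": 0.01}},
    {"name": "small", "cost_per_1k_requests": 0.6,
     "scores": {"faithfulness": 0.82, "toxicity": 0.02}},
]

# The quality bar is fixed before any price is considered.
bars = {"faithfulness": ("min", 0.90), "toxicity": ("max", 0.02)}

choice = right_size(candidates, bars)
print(choice["name"])  # mid-tier
```

Here the frontier model has the best score and the small model has the best price, but the mid-tier model is the cheapest one that clears both bars, so it is the production pick.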

Evaluation also lets you compare more than just models. Run the same dataset across candidate configurations — a base model with strong prompting vs. the same model with RAG; a smaller model with RAG vs. a larger model without; different temperature or system-prompt variants; a fine-tuned model vs. the base it came from. Often the winning answer is not "a better model" but "a cheaper model plus RAG plus a better prompt," and only a head-to-head on your data reveals it. This is also how you justify (or reject) a fine-tune: evaluate the custom model against the base model with good prompting and decide whether the gain is worth the standing hosting cost (see the amazon-bedrock-fine-tuning sibling).

Finally, make evaluation repeatable, not a one-off. Models are released and updated constantly; a quarterly re-run of your saved evaluation set against the latest models tells you cheaply whether a newer, cheaper, or better model now clears your bar — letting you migrate deliberately instead of either churning on every release or ossifying on an outdated choice. A standing evaluation harness turns "which model?" from a recurring argument into a measurement you can re-run on demand.

  • Set the quality bar before looking at cost — Define the minimum acceptable score on your key metrics first ("≥ X faithfulness, ≤ Y toxicity"), so cost does not bias the quality judgement.
  • Pick the cheapest candidate that clears the bar — The frontier model usually wins on quality; the right production pick is the cheapest, fastest model that still meets your bar. That is right-sizing.
  • Compare configurations, not just models — Base + prompting vs. base + RAG vs. smaller + RAG vs. fine-tuned. The winner is often "cheaper model + RAG + better prompt," visible only on your data.
  • Re-run the saved eval set on a schedule — New models ship constantly; re-running your evaluation set quarterly tells you when a newer/cheaper model now clears the bar, so migrations are evidence-led.
how it becomes $0

IX. How AWS credits fund the evaluation and the inference behind it

Everything above assumes you pay AWS directly. For most startups and many companies the relevant number is different, because AWS will frequently fund the work with credits. The inference an evaluation runs (every candidate model processing every prompt), any human-review or judge-model fees, and the build of the harness itself all draw those credits down before they ever touch your card.

An evaluation's cost is mostly inference: each candidate model has to process every prompt in your dataset, and an LLM-as-a-judge run adds the judge model's tokens on top. Add any AWS-managed human-workforce fees and the S3 storage for datasets and results, and that is the bill. It is usually small relative to the savings from choosing correctly — right-sizing from a frontier model to a model a few tiers cheaper can cut per-request cost by a large multiple across every request you will ever serve, which is exactly why evaluation pays for itself. And all of it is credit-eligible: model inference on Bedrock, the judge-model calls, the embeddings and vector store behind a RAG evaluation, and the S3 storage are covered, and AWS credits apply automatically against your bill until exhausted.
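The inference part of that bill is simple arithmetic, sketched below. All prices and token counts are hypothetical placeholders, not Bedrock rates. The assumption that the judge reads the prompt plus the answer and writes a rationale of about 100 tokens is likewise illustrative, and human-workforce fees and S3 storage are left out.

```python
def inference_cost(num_calls, in_tokens, out_tokens, in_price_per_1k, out_price_per_1k):
    """Dollar cost of num_calls requests with the given average token counts."""
    per_call = (in_tokens * in_price_per_1k + out_tokens * out_price_per_1k) / 1000
    return num_calls * per_call


def evaluation_cost(num_prompts, candidate_prices, avg_in, avg_out,
                    judge_price=None, judge_out_tokens=100):
    """Every candidate answers every prompt; a judge, if used, reads each answer."""
    total = sum(
        inference_cost(num_prompts, avg_in, avg_out, p_in, p_out)
        for p_in, p_out in candidate_prices
    )
    if judge_price is not None:
        judge_in, judge_out = judge_price
        judged_outputs = num_prompts * len(candidate_prices)
        # Assumption: the judge reads prompt + answer and writes a short rationale.
        total += inference_cost(judged_outputs, avg_in + avg_out,
                                judge_out_tokens, judge_in, judge_out)
    return total


# Hypothetical (input, output) prices per 1,000 tokens for four candidates.
prices = [(0.003, 0.015), (0.001, 0.005), (0.0005, 0.0015), (0.0002, 0.0006)]
cost = evaluation_cost(600, prices, avg_in=500, avg_out=300,
                       judge_price=(0.003, 0.015))
print(f"estimated inference spend: ${cost:.2f}")
```

Running the estimate before the job starts makes it easy to see how the bill scales: doubling the dataset or adding a candidate scales the total linearly, and the judge pass grows with prompts times candidates.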

The relevant pools are AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups), a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed specifically at proving out a GenAI use case, and the competitive Generative AI Accelerator (awards up to $1M for a small cohort of AI-first startups). A model evaluation is precisely the kind of proof the POC pool targets: it is the step that de-risks which model and architecture you commit to. Funding the evaluation with credits means you can compare candidates thoroughly, including the expensive frontier models, without spending runway just to find out which one you actually need.

The practical mechanic is that most of these pools are partner-filed — requested through the AWS Partner Network (the ACE program), not a public self-serve form — which is why teams route through an AWS partner rather than applying alone. That is the gap CloudRoute fills. CloudRoute matches you to the right credit pool for your stage and to a vetted AWS ML partner who both files the credit application and builds the evaluation: assembling a representative JSONL dataset, choosing the right job types and metrics for your task, running the head-to-head across candidate models and configurations, evaluating your RAG pipeline on retrieval and generation, and turning the scores into an honest model choice and right-sizing recommendation. The customer pays $0 — AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice. Related: AWS credits for generative-AI startups and Bedrock POC funding.

pick the right job type

Automatic vs human vs LLM-as-a-judge vs RAG evaluation

The headline decision, on one screen. Match the row to what you are actually measuring — whether there is a checkable answer, how subjective the quality is, how much scale you need, and whether you are testing a model or a whole retrieval pipeline. Representative 2026 guidance, not quotes.

Method | Best when… | Captures subjective quality | Scale | Cost shape | Reach for it…
Automatic | There is a checkable right answer | Weak | Large | Inference only (cheapest) | First, to screen candidates cheaply
Human (your team) | High-stakes, domain-specific quality | Strongest | Small | Expert time (highest) | On finalists where nuance is costly
Human (AWS-managed) | Subjective quality, no panel of your own | Strong | Medium | Managed-workforce fees | When you lack in-house reviewers
LLM-as-a-judge | Subjective quality at scale and speed | Good (approximates human) | Large | Judge-model inference | For open-ended quality at volume
RAG evaluation | Answers come from a Knowledge Base | Good (judge under the hood) | Large | Inference + embeddings/vector store | To diagnose retrieval vs. generation
These combine. A common shape is an automatic pass to narrow the field, an LLM-as-a-judge pass on the finalists, and a small human sample to validate the judge — plus a dedicated RAG evaluation if the system answers from your documents. Pick the lightest combination that captures what "good" means for your task.
a recent match

An evaluation that right-sized a frontier model down two tiers — built on $0 — anonymized

inquiry · Series-A customer-support AI SaaS, Toronto
Series-A customer-support AI SaaS, 22 people, building a support assistant on AWS over their help-centre docs

Situation: The team had shipped a RAG support assistant on the most capable (and most expensive) frontier model on Bedrock "to be safe," and inference cost was eating their margin as volume grew. They suspected a cheaper model would be fine but had no evidence — and they were also seeing occasional confidently-wrong answers they could not explain, unsure whether the fix was a better model or better retrieval.

What CloudRoute did: CloudRoute matched them in under 24 hours to a Canadian AWS ML partner. The partner built a Bedrock evaluation harness: a JSONL dataset of ~600 real support questions (including the gnarly long-tail ones), an automatic pass for accuracy, an LLM-as-a-judge pass for helpfulness and faithfulness, and a dedicated RAG evaluation scoring retrieval and generation separately. The partner ran four candidate models head-to-head, plus a "smaller model + improved retrieval" configuration, and filed a Bedrock POC credit application plus an Activate Portfolio application to fund the whole build and the evaluation inference.

Outcome: The RAG split showed the confidently-wrong answers were a retrieval problem (poor chunking), not a model problem. After re-chunking, a model two tiers cheaper cleared the same faithfulness and helpfulness bar as the frontier model on their prompts. They right-sized to the cheaper model plus better retrieval, cutting per-request inference cost substantially with no measurable quality loss. The evaluation inference, embeddings, vector store, and judge-model calls were all covered by the approved credits, so the team paid $0 during the build and proof-out. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.

method: automatic + LLM-judge + RAG eval · outcome: right-sized down two model tiers · out-of-pocket during build: $0

faq

Common questions

What is Amazon Bedrock model evaluation?
It is a managed feature in Amazon Bedrock for measuring how well a model — or a RAG system built on a Knowledge Base — performs on your task using your own data, so you can choose between candidates with evidence instead of intuition. You create an evaluation job, supply a JSONL prompt dataset in S3, pick the models and metrics you care about, and Bedrock runs the candidates, scores them, and writes the results (per-metric scores plus the underlying generations) back to S3. It supports three job types — automatic, human, and LLM-as-a-judge — and a dedicated RAG evaluation.
What is the difference between automatic, human, and LLM-as-a-judge evaluation?
Automatic evaluation scores outputs with built-in algorithms and reference-based metrics (accuracy, robustness, toxicity) — fast, cheap, repeatable, and best for tasks with a checkable right answer. Human evaluation puts real people (your own team or an AWS-managed workforce) in the loop to score against a rubric — the gold standard for subjective, high-stakes, domain-specific quality, but slower and more expensive. LLM-as-a-judge uses a capable model to grade another model's outputs against a rubric — approximating human judgement on open-ended quality at a fraction of the cost and at machine scale. Many evaluations combine them: automatic to screen, human or LLM-judge on the finalists.
How does Bedrock evaluate a RAG system?
Bedrock's RAG (Knowledge Base) evaluation scores the two halves of the pipeline separately. On the retrieval side it measures whether the retrieved passages are relevant to the question and cover the information needed to answer. On the generation side it measures faithfulness/groundedness (does the answer stay true to the retrieved context or hallucinate), relevance to the question, completeness, and correctness against a reference where available — often using an LLM-as-a-judge for the open-ended generation metrics. Splitting the scores lets you tell whether a bad answer came from retrieval (fix chunking/embeddings/retrieval settings) or generation (fix the prompt/model).
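As a rough illustration of how the split scores can drive triage, here is a minimal sketch. It assumes you have exported per-example retrieval and generation scores in the range 0 to 1 from the results. The field names (`context_relevance`, `context_coverage`, `faithfulness`, `answer_relevance`), the `diagnose` helper, and the 0.7 bar are all hypothetical and are not Bedrock's output schema.

```python
from dataclasses import dataclass


@dataclass
class RagScores:
    """Per-example scores in [0, 1]; field names are illustrative assumptions."""
    context_relevance: float  # retrieval: are the passages on-topic?
    context_coverage: float   # retrieval: do they contain what is needed?
    faithfulness: float       # generation: does the answer stick to the context?
    answer_relevance: float   # generation: does the answer address the question?


def diagnose(s: RagScores, bar: float = 0.7) -> str:
    """Label an example by which half of the pipeline fell below the bar."""
    retrieval_ok = min(s.context_relevance, s.context_coverage) >= bar
    generation_ok = min(s.faithfulness, s.answer_relevance) >= bar
    if retrieval_ok and generation_ok:
        return "ok"
    if not retrieval_ok:
        # Generation scores are hard to interpret when the context was poor,
        # so a retrieval miss is reported first.
        return "retrieval"
    return "generation"
```

Counting the labels across the whole dataset gives a quick read on whether to spend effort on chunking and retrieval settings or on the prompt and model.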
What metrics does Bedrock model evaluation report?
Common metrics include accuracy/correctness (how often the output matches the expected answer), robustness (stability under small input perturbations), and toxicity (harmful-content likelihood) for direct model evaluation; and for RAG, faithfulness/groundedness, relevance, and completeness/coverage on retrieval and generation. Open-ended quality dimensions — helpfulness, coherence, tone, harmlessness — are scored by humans or an LLM-as-a-judge against your rubric rather than a built-in algorithm. Choose the two or three metrics that map to how your product can fail rather than scoring on all of them. The exact metric catalog changes; confirm it on the AWS Bedrock docs.
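To make two of these definitions concrete, the sketch below computes a simple exact-match accuracy and a simple robustness rate. It is a simplified stand-in for intuition only and is not the algorithm Bedrock uses. The `perturb` function and the `model` callable are hypothetical placeholders for a real perturbation strategy and a real model call.

```python
from typing import Callable


def exact_match_accuracy(outputs: list[str], references: list[str]) -> float:
    """Share of outputs equal to the reference after trimming and lowercasing."""
    if len(outputs) != len(references) or not outputs:
        raise ValueError("outputs and references must be non-empty and equal length")
    hits = sum(o.strip().lower() == r.strip().lower()
               for o, r in zip(outputs, references))
    return hits / len(outputs)


def perturb(prompt: str) -> str:
    """A deliberately small perturbation: uppercase the text and pad it."""
    return "  " + prompt.upper() + "  "


def robustness(model: Callable[[str], str], prompts: list[str]) -> float:
    """Share of prompts whose answer is unchanged by the perturbation."""
    if not prompts:
        raise ValueError("prompts must be non-empty")
    stable = sum(
        model(p).strip().lower() == model(perturb(p)).strip().lower()
        for p in prompts
    )
    return stable / len(prompts)
```

A model that changes its answer when only casing or whitespace changes will score below 1.0 on `robustness`, which is the kind of instability the metric is meant to surface.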
How do I build a good evaluation dataset?
Supply a JSONL file in Amazon S3 with one example per line — at minimum the input prompt, plus a reference/expected output for reference-based metrics. The single most important property is representativeness: the prompts must mirror real production inputs in phrasing, length, and messiness, and must deliberately include the hard cases (ambiguous, edge, adversarial, out-of-scope) where models actually diverge. Balance the categories you care about, favour coverage and quality over raw count (a few hundred well-chosen prompts often beat thousands of near-duplicates), and keep the evaluation set separate from any fine-tuning training data so you measure performance, not memorization.
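A minimal sketch of writing and sanity-checking such a file follows. The record keys `prompt`, `referenceResponse`, and `category` reflect my understanding of the Bedrock prompt-dataset format and should be confirmed against the current docs. The helpers `to_jsonl` and `check_dataset` are hypothetical names introduced for illustration.

```python
import json
from collections import Counter


def to_jsonl(records: list[dict]) -> str:
    """Serialize one JSON object per line, the layout JSONL requires."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


def check_dataset(records: list[dict], training_prompts: set[str]) -> dict:
    """Report missing prompts, near-verbatim duplicates, and training leakage.

    training_prompts is expected to be already stripped and lowercased.
    """
    normalized = [r.get("prompt", "").strip().lower() for r in records]
    counts = Counter(normalized)
    return {
        "missing_prompt": sum(1 for p in normalized if not p),
        "duplicates": sum(c - 1 for p, c in counts.items() if p and c > 1),
        "leaked_from_training": sum(
            1 for p in set(normalized) if p in training_prompts
        ),
        "per_category": dict(
            Counter(r.get("category", "uncategorized") for r in records)
        ),
    }
```

The `per_category` counts make imbalance visible at a glance, and a non-zero `leaked_from_training` means the set would partly measure memorization rather than performance.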
How do I use evaluation results to choose and right-size a model?
Read the results as a trade-off, not a ranking. Set your quality bar before looking at cost ("we need ≥ X faithfulness and ≤ Y toxicity for this task"), then pick the cheapest, fastest candidate that clears it. The frontier model usually tops the quality scores, but if a model two or three tiers cheaper also clears the bar on your prompts, it is normally the right production choice because per-token cost and latency compound across every request. Compare configurations too (base + RAG vs. larger model, fine-tuned vs. base), and re-run your saved evaluation set when new models ship so migrations are evidence-led. This right-sizing is one of the biggest levers on an AI product's unit economics.
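The "set the bar first, then pick the cheapest candidate that clears it" rule can be sketched in a few lines. The `Candidate` fields and the `right_size` helper are hypothetical, and any numbers you feed in would come from your own evaluation results and pricing, not from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str
    faithfulness: float          # higher is better
    toxicity: float              # lower is better
    cost_per_1k_requests: float  # in whatever currency you budget in
    p95_latency_ms: float


def right_size(
    candidates: list[Candidate],
    min_faithfulness: float,
    max_toxicity: float,
) -> Optional[Candidate]:
    """Return the cheapest candidate that clears the bar, or None if none does."""
    passing = [
        c for c in candidates
        if c.faithfulness >= min_faithfulness and c.toxicity <= max_toxicity
    ]
    if not passing:
        return None
    # Cost decides first; latency breaks ties between equally priced models.
    return min(passing, key=lambda c: (c.cost_per_1k_requests, c.p95_latency_ms))
```

Note that quality scores are used only as a pass/fail filter here. Once a candidate clears the bar, a higher score does not win over a lower price, which is the point of treating the results as a trade-off.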
Is LLM-as-a-judge reliable enough to trust?
Used carefully, yes — it has become the pragmatic default for scoring subjective quality at scale because it is far cheaper and faster than human review and far more nuanced than algorithmic metrics. But judge models have biases (they can favour longer or more confident answers, and a model can be lenient judging itself), so validate the judge against a sample of human-labelled outputs before trusting it broadly, write a clear and specific rubric, and avoid having a model judge its own outputs. Treat it as a calibrated instrument, not an oracle: a validated judge on thousands of outputs plus a small human spot-check is a strong, affordable combination.
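One way to validate a judge against human labels is to compute raw agreement and Cohen's kappa on a shared sample. The sketch below assumes both the humans and the judge assign one categorical label per output (for example "pass" or "fail"). The function names are illustrative.

```python
from collections import Counter


def agreement(human: list[str], judge: list[str]) -> float:
    """Fraction of items where the judge label equals the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    p_observed = agreement(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label at random,
    # given each rater's own label frequencies.
    p_expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)
```

Raw agreement can look high simply because most outputs pass, so kappa is the more honest number when labels are imbalanced. What counts as "good enough" agreement is a judgment call for your task.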
How much does Bedrock model evaluation cost?
The cost is mostly the inference the evaluation runs: every candidate model processes every prompt in your dataset, and an LLM-as-a-judge run adds the judge model's tokens on top. Add any AWS-managed human-workforce fees and the S3 storage for datasets and results. It is usually small relative to the savings from choosing correctly: right-sizing from a frontier model to a cheaper one cuts per-request cost across every request you will ever serve. Rates vary by model and change over time; confirm current rates on the AWS Bedrock pricing page.
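The arithmetic can be sketched as follows. Every price passed in is a placeholder to be replaced with current rates, and the token counts are averages you would estimate from your own dataset. The sketch assumes the judge reads the prompt plus the candidate's answer once per generation and ignores rubric tokens, human-workforce fees, and storage.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Price:
    """USD per 1,000 tokens; any figure passed in is a placeholder."""
    input_per_1k: float
    output_per_1k: float


def run_cost(price: Price, calls: int, in_tokens: int, out_tokens: int) -> float:
    """Cost of `calls` requests with the given average token counts."""
    per_call = in_tokens * price.input_per_1k + out_tokens * price.output_per_1k
    return calls * per_call / 1000


def evaluation_cost(
    candidates: list[Price],
    n_prompts: int,
    in_tokens: int,
    out_tokens: int,
    judge: Optional[Price] = None,
    judge_out_tokens: int = 0,
) -> float:
    """Every candidate answers every prompt; the judge grades every answer."""
    total = sum(run_cost(p, n_prompts, in_tokens, out_tokens) for p in candidates)
    if judge is not None:
        judge_calls = n_prompts * len(candidates)
        total += run_cost(judge, judge_calls, in_tokens + out_tokens, judge_out_tokens)
    return total
```

Because the judge is called once per candidate per prompt, the judge's share grows with the number of candidates, which is one reason to screen with automatic metrics before judging finalists.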
Can AWS credits cover model evaluation?
Yes — the model inference an evaluation runs, the judge-model calls for LLM-as-a-judge, the embeddings and vector store behind a RAG evaluation, and the S3 storage are all credit-eligible, and credits apply automatically against your AWS bill. The relevant pools are AWS Activate (up to $100K), a Bedrock/GenAI POC pool ($10K–$50K) — a model evaluation is exactly the kind of proof-of-concept that pool funds — and the GenAI Accelerator (up to $1M). These are largely partner-filed via the AWS Partner Network. CloudRoute routes you to the right pool and a vetted AWS ML partner who files the application and builds the evaluation (dataset, job types, metrics, head-to-head, RAG split, right-sizing recommendation) — customer pays $0, AWS funds it.

Choose your model on evidence, on AWS's budget

Picking the wrong model — or over-paying for a frontier model a cheaper one would have aced — is one of the most expensive mistakes in an AI build. A proper Bedrock evaluation fixes it, and AWS credits cover the evaluation runs and the inference behind them. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS ML partner who builds the eval harness, runs the head-to-head, and turns the scores into an honest model choice. Customer pays $0.

matched within
< 24h
GenAI credit ceiling
up to $1M
cost to you
$0