bedrock model distillation · inference cost · 2026

Amazon Bedrock model distillation — big-model quality at small-model cost.

A complete, neutral reference for model distillation on Amazon Bedrock: what distillation is and how it transfers a large "teacher" model's quality into a smaller, cheaper, faster "student"; how Bedrock's managed distillation actually works (generate training data from the teacher, then fine-tune the student); when distillation beats prompt caching, fine-tuning, and RAG on cost; the quality-versus-cost tradeoff and where the quality tax really comes from; the Provisioned Throughput note for hosting the distilled model; and a worked savings example. Plus how AWS credits fund the distillation run so the build costs you $0.

what it cuts
per-token inference cost
teacher → student
quality transfer
best for
high-volume narrow tasks
cost with credits
$0
TL;DR
  • Model distillation transfers the quality of a large, expensive "teacher" model into a small, cheap, fast "student" model on a specific task. You pay the teacher's high per-token price once, during training-data generation — then run the workload forever on the student's much lower price and latency. The payoff is entirely at inference time.
  • On Bedrock distillation is managed in two steps: (1) generate a labelled training dataset by running prompts through a strong teacher model (Bedrock can synthesize these completions for you), then (2) fine-tune a smaller student model on that teacher-generated data. The result is a private custom student model in your account that imitates the teacher on your task.
  • Distillation wins specifically for high-volume, narrow, stable workloads where frontier-model quality is needed but frontier-model per-token cost is not affordable at scale — the regime where prompt caching, RAG, and plain fine-tuning each leave money on the table. Note the catch: the distilled student is a custom model, so serving it needs Provisioned Throughput (a standing hourly cost). AWS credits (Activate up to $100K, Bedrock/GenAI POC $10K–$50K, GenAI Accelerator up to $1M) cover the teacher generation, the training, and the hosting — CloudRoute routes you to the pool and a vetted AWS partner, so you pay $0.
definition

IWhat model distillation on Amazon Bedrock actually is

Distillation transfers what a large, capable "teacher" model knows about a specific task into a smaller, cheaper, faster "student" model. You use the expensive model to teach the cheap model, then run your workload on the cheap one. On Bedrock the whole process is managed — you provide prompts, AWS uses a teacher model to generate the training data, fine-tunes the student for you, and hands back a private custom model in your account.

Large foundation models are expensive and slow precisely because they are large: more parameters means more compute per token, which means higher cost and higher latency on every single request. Most production tasks, though, are narrow — extracting fields from an invoice, classifying a support ticket, summarizing a call transcript into a fixed shape. On a narrow task you do not need the full breadth of a frontier model; you need its quality on that one task. Distillation is how you get exactly that: the small model never learns everything the big model knows, only how to imitate it on the slice of work you care about.

The mechanism is teacher–student transfer. A strong teacher model (a frontier-class model) produces high-quality outputs for your prompts. Those teacher outputs become the training targets for a much smaller student model, which is fine-tuned to reproduce them. After training, the student answers your production traffic on its own — at its own much lower per-token price and much lower latency — while staying close to the teacher's quality on the task it was distilled for. You have, in effect, paid the teacher's high price once to manufacture training data, and bought a permanently cheaper way to run the workload.

On Bedrock this is delivered as a managed distillation capability so you do not build the pipeline by hand. You select a teacher model and a student model, supply prompts (and optionally let Bedrock synthesize the teacher's responses for you, or use your own production logs as the seed), and Bedrock orchestrates generating the teacher data and fine-tuning the student. As with any customization on Bedrock, your prompts, the generated data, and the resulting student model stay private to your AWS account and region — they are not used to train base models or shared with the model provider. The output is a private custom student model registered in your account.

It is worth separating distillation from its neighbours up front. It is not RAG: RAG supplies fresh facts at request time and changes nothing about the model's size or cost. It is not ordinary fine-tuning on hand-labelled data: in distillation the labels come from a teacher model, and the goal is specifically to shrink the model you run, not just to adjust a base model's behaviour. And it is not prompt caching: caching makes a given model cheaper on repeated context but does not change which model you run. The amazon-bedrock-fine-tuning and amazon-bedrock-prompt-caching siblings cover those; the rest of this page is about distillation.

One caveat, stated once and meant throughout: the exact list of teacher/student pairs supported, precise pricing, and feature availability change frequently on Bedrock. The numbers and pairings here are representative as of 2026 to convey relative cost and the shape of the work. Always confirm the current distillation support matrix and pricing on the official AWS Bedrock documentation and pricing pages before committing.

the one-line definition

Distillation = use a big teacher model to generate training data, then fine-tune a small student model to imitate it on your task. You pay the teacher's high price once (to make the data) and run the workload forever on the student's low cost and latency. The output is a private custom student model — which then needs Provisioned Throughput to serve.

the mechanics

IIHow distillation works on Bedrock — teacher generates, student learns

Bedrock's managed distillation collapses into two conceptual steps: generate a teacher-quality training dataset, then fine-tune the student on it. Understanding the two steps separately is what lets you reason about cost, quality, and where things go wrong.

Distillation is, under the hood, a specialised use of fine-tuning — the difference is where the training labels come from. In ordinary fine-tuning you bring hand-written prompt/completion pairs; in distillation the completions are produced by the teacher model. That single substitution is what makes distillation powerful (you can manufacture far more high-quality labelled data than you could write by hand) and what makes its cost profile distinctive (you pay frontier-model token prices to generate that data).

Step 1 — generate training data from the teacher

You provide a set of prompts that represent your task — ideally drawn from real production traffic so the data matches what the student will actually see. Bedrock runs those prompts through the chosen teacher model to produce high-quality completions, assembling a labelled dataset of prompt → teacher-output pairs. Bedrock's managed distillation can synthesize this data for you, including generating variations to broaden coverage, so you are not limited to the exact prompts you supplied. This step is where the teacher's expensive per-token price is incurred — but only once, as a one-time data-generation cost, not on every future request. The richer and more representative this dataset, the better the student will be, so curating good seed prompts matters as much here as it does in any fine-tune.

Step 2 — fine-tune the student on the teacher's outputs

Bedrock then runs a fine-tuning job on the smaller student model using the teacher-generated dataset as training data, nudging the student's weights toward reproducing the teacher's outputs on your prompts. This is mechanically the same managed training job described on the amazon-bedrock-fine-tuning page — you are choosing a base (student) model, pointing the job at data, and getting back a private custom model — except the dataset was authored by the teacher rather than by you. When the job completes, the distilled student model is registered in your account, ready to evaluate against both the teacher (for quality) and the original small base model (to confirm the distillation actually added value).

Then: serve the student (and mind the hosting)

From this point the distilled student behaves like any custom model on Bedrock. The intended payoff — cheap, fast inference — is real on a per-token basis, but there is a hosting wrinkle covered in detail in §VI: a custom model is served on Provisioned Throughput, a standing hourly charge, not the shared on-demand path. For the high-volume workloads distillation targets that reserved capacity is usually busy enough to be worth it, but it must be in the math. The honest evaluation is therefore three-way: is the distilled student close enough to the teacher on quality, clearly better than the plain small base model, and cheap enough at your volume to beat just running the teacher on-demand?

distillation = fine-tuning with teacher-written labels

The only structural difference between distillation and ordinary fine-tuning is the source of the training labels: a teacher model writes them instead of a human. Everything downstream — a managed fine-tuning job, a private custom student model, Provisioned Throughput to serve it — is the standard fine-tuning path.

the economics

IIIWhy a distilled student is cheaper — and faster

The savings are not a discount or a pricing trick; they are structural. A smaller model does less arithmetic per token, so it costs less and responds faster on every request — and distillation is how you get a small model that is good enough to use.

Foundation-model inference cost scales with model size. A frontier model has many more parameters than a small one, so generating each token requires far more computation — and AWS prices that through, charging more per 1,000 input and output tokens for larger, more capable models. A small model is cheaper to run for the same reason it is less capable in general: there is simply less of it doing work per token. Distillation's whole point is to make a small model capable enough on your specific task that its lower price becomes usable rather than a false economy.

The cost difference between model tiers is large — frontier models commonly cost many times more per token than the small models in the same family (the gap is often roughly an order of magnitude, though exact ratios vary by family and change over time — see amazon-bedrock-pricing and amazon-nova-pricing for the current numbers). When a workload runs millions or billions of tokens a month, that multiple is the difference between an affordable feature and one that quietly dominates the cloud bill. Distillation lets you keep most of the quality while moving the recurring spend onto the cheaper tier.

Latency improves for the same structural reason: fewer computations per token means each response is produced faster. For interactive features (a chat reply, an autocomplete, a real-time classification) the student's lower latency is often as valuable as its lower cost — a smaller model can hit response-time targets a frontier model cannot, which can be the actual reason to distill rather than the savings alone.

The cost you are trading away is the one-time teacher-generation charge plus the fine-tuning training charge — both paid once — against a permanently lower per-token run rate. That is why distillation's value grows with volume: the fixed cost of producing the student is amortised across every future request, so the more traffic you run on the student, the more the upfront teacher cost is dwarfed by the inference you avoided paying frontier prices for. At low volume the upfront cost never pays back; at high volume it pays back fast. The worked example in §VII makes this concrete.

the decision

IVWhen distillation beats prompt caching, fine-tuning, and RAG on cost

Distillation is one of several ways to cut Bedrock spend, and it is the right one only in a specific regime. The key is that each cost lever fixes a different problem — match the lever to where your money is actually going, not to which technique sounds most advanced.

Every cost-reduction technique on Bedrock targets a different driver of the bill. Diagnose which driver dominates your workload, then read across to the right tool:

  • Your cost is dominated by a large repeated context (same long system prompt / documents on every call) → prompt caching — If most of your tokens are the same fixed preamble re-sent on every request, prompt caching cuts the cost of that repeated context directly, with no training and no new model. It does not change which model you run, so it does not help if the model itself is simply too expensive per token. See amazon-bedrock-prompt-caching.
  • The model is wrong because it lacks your facts → RAG (not a cost play at all) — RAG fixes accuracy by retrieving your documents at request time; it adds tokens rather than removing them and does not make the model cheaper. If your problem is "the answers are wrong," RAG is the fix — but it is not the tool for "inference is too expensive." See rag-on-aws and amazon-bedrock-knowledge-bases.
  • You need a behaviour/format locked in and already run a capable-enough small model → plain fine-tuning — If a small base model is nearly good enough and you mainly need a consistent output shape or style, ordinary fine-tuning on a modest hand-labelled set can get there without involving a teacher. Distillation is the heavier move you make when the small model is not good enough on its own and you need to pull quality down from a much larger model.
  • A frontier model is the only thing accurate enough, but its per-token cost is unaffordable at your volume → distillation — This is the distillation regime: high, steady volume on a narrow task where you genuinely need frontier-class quality but cannot pay frontier-class prices per token forever. Distil the frontier teacher into a small student and run the volume on the student. This is the case the rest of this page is about.
  • Volume is low or spiky → probably none of the above; just run the teacher on-demand — If you do not have the volume, the upfront teacher-generation and training cost (and the standing Provisioned Throughput hosting) never pay back. Run the strong model on-demand, add prompt caching if context is repetitive, and revisit distillation only when volume grows.
  • Combine them — these are not mutually exclusive — A mature production stack often layers several: distil for the cheap student, prompt caching on whatever fixed context remains, and RAG feeding the student facts. Reach for the lightest combination that hits your cost and quality bar, not the heaviest single technique.
the rule of thumb

Match the lever to the cost driver. Prompt caching cuts repeated-context cost on the same model. RAG fixes missing facts (and adds tokens — it is not a savings tool). Plain fine-tuning locks behaviour into an already-good-enough small model. Distillation is for when only a frontier model is accurate enough but its per-token price is unaffordable at high volume — it moves the quality down into a small, cheap student.

the honest tradeoff

VThe quality-versus-cost tradeoff — what you give up

Distillation is a trade, not a free lunch. The student is smaller, so on average it will not match the teacher everywhere — the engineering goal is to make the gap negligible <em>on your task</em> while keeping the cost gap large. Knowing where the quality tax shows up is how you decide whether the trade is acceptable.

A distilled student approaches but rarely perfectly equals the teacher. Because it has far fewer parameters, it has less capacity to represent every nuance and edge case the teacher handles. On the narrow task it was distilled for, a well-built student can come strikingly close to teacher quality — close enough that the difference is immaterial for the application. The further a request drifts from the distribution of the distillation data, though, the more the student's smaller capacity shows: it generalizes less well to inputs unlike its training, and it lacks the teacher's broad reasoning and world knowledge for anything off-task. Distillation buys task-specific quality, not general capability.

This is exactly why distillation suits narrow, stable workloads and not broad or fast-changing ones. If the task is well-scoped and its input distribution is steady, you can saturate the distillation data with representative examples and the student will be reliable across the cases you actually see. If the task is open-ended, or its inputs shift over time, the student's blind spots multiply and you either keep re-distilling or accept degraded quality. The narrower and more stable the task, the smaller the quality tax — and the better the trade.

The quality of the teacher data caps the student. A student cannot exceed the teacher it learned from, and it will faithfully reproduce the teacher's mistakes, biases, and formatting quirks — distillation copies behaviour indiscriminately. So the teacher must be genuinely strong on the task, the seed prompts must be representative, and it is worth filtering or lightly curating the teacher's outputs before training rather than trusting them wholesale. Garbage from the teacher becomes garbage baked into the student.

The discipline that makes the tradeoff safe is measurement, covered as evaluation throughout this cluster: run the same held-out prompts through teacher, student, and the plain small base model, and score them on the metric that matters for the task (exact-match or schema-validity for extraction, a rubric or LLM-as-judge score for quality, accuracy/F1 for classification). The decision rule is concrete — adopt the student only if it is (a) close enough to the teacher that the quality loss is acceptable for the use case and (b) clearly better than the plain small base model, so the distillation actually earned its keep. If the student cannot clear both bars, the task may simply need the teacher, or a different lever.

what you trade

You trade a slice of quality — mostly general capability and robustness on off-task or out-of-distribution inputs — for a large, permanent cut in per-token cost and latency. On a narrow, stable task a good student's quality loss can be negligible; off that task it is real. The student can never exceed its teacher and inherits the teacher's flaws, so a strong teacher and representative data are non-negotiable.

the hosting catch

VIHosting the distilled student — the Provisioned Throughput note

The savings story is about per-token price, but the distilled student is a custom model — and custom models on Bedrock are not served on the cheap shared on-demand path. To run yours, you buy Provisioned Throughput, a standing hourly cost. For distillation's high-volume use case this usually pencils out, but it must be in the calculation from the start.

A distilled student is, mechanically, a fine-tuned custom model, so it carries the same hosting requirement as any custom model on Bedrock: you cannot call it on the shared, per-token, on-demand path that base models use. Serving a custom model requires Provisioned Throughput — you purchase dedicated model capacity (measured in "model units") and pay a flat hourly rate for it continuously, the entire time the model is deployed, regardless of how many requests you send. An idle student on Provisioned Throughput still bills every hour. This is the same mechanic the amazon-bedrock-fine-tuning and amazon-bedrock-provisioned-throughput siblings cover in depth.

Why this is usually fine for distillation specifically: the entire reason you distil is high, steady volume. At that volume the reserved capacity is busy — you are getting work out of every hour you pay for — which is exactly the condition under which Provisioned Throughput is efficient rather than wasteful. The standing hourly bill that makes casual fine-tuning a bad idea is, for a genuinely high-throughput distilled workload, simply the cost of the dedicated lane you are keeping full. And because the student is small, its per-hour Provisioned Throughput rate is far lower than hosting a large model would be — a second structural saving on top of the per-token one.

The number to compute is the crossover: at your traffic, does (teacher-generation + training, one-time) + (student Provisioned Throughput, hourly × hours) come out below running the teacher on-demand at your token volume? Below some monthly volume the on-demand teacher is cheaper and you should not distil; above it the distilled student wins, and the margin widens as volume grows because the on-demand teacher cost scales linearly with tokens while the student's reserved capacity is a fixed hourly rate you have already saturated. If you can commit to a 1- or 6-month Provisioned Throughput term, the hourly rate drops and the crossover moves in distillation's favour. The worked example in the next section runs these numbers.

the cost that surprises everyone

The distilled student is a custom model, so it can only be served on Provisioned Throughput — a flat hourly charge that accrues 24/7 whether or not the model is used. For distillation's high-volume target this reserved capacity is usually busy enough to be worth it (and a small student's hourly rate is low), but the standing hosting bill must be in the math alongside the one-time teacher and training charges — not just the per-token savings.

the math, concretely

VIIA worked savings example

An illustrative walk-through of the crossover, with round, clearly-labelled representative numbers — not a quote. The point is the <em>shape</em> of the savings and when distillation flips from a loss to a large win, not these specific figures. Confirm current per-token and Provisioned Throughput rates on the AWS Bedrock pricing page.

Take a document-extraction feature that needs frontier-class accuracy and runs at high, steady volume — say 500 million tokens per month of combined input and output, the kind of load a busy back-office automation generates. Assume, purely to illustrate the arithmetic, a frontier teacher priced around $10 per million tokens (blended) and a small student in the same family priced around $1 per million tokens blended — roughly the order-of-magnitude gap typical between frontier and small tiers (your real ratio will differ; check amazon-bedrock-pricing).

Read the table as a crossover, not a fixed discount. The one-time teacher-generation and fine-tuning costs are a fixed investment; the recurring saving is the gap between the teacher's and the student's run rates (here ~$4,500/month) minus the Provisioned Throughput hosting. At 500M tokens/month the investment pays back almost immediately and the workload then runs roughly an order of magnitude cheaper every month thereafter. At 5M tokens/month the same fixed costs and the standing hosting bill would never pay back — which is the whole reason distillation is a high-volume tool. Before committing, compute your own crossover with current rates and your real traffic; a vetted AWS partner will model this for you (see §VIII), and AWS credits cover the entire experiment either way.

illustrative monthly cost — teacher on-demand vs distilled student · representative 2026 figures, not a quote
Line itemRun the teacher on-demandDistil + run the studentNotes
Monthly inference volume500M tokens500M tokensSame workload either way
Per-token rate (blended)~$10 / M tokens~$1 / M tokensFrontier vs small tier — illustrative
Monthly inference cost~$5,000 / mo~$500 / moStudent runs the volume ~10× cheaper
One-time teacher data generation~$200–$1,000 oncePay frontier price once to label the dataset
One-time student fine-tuning~tens–low-hundreds onceA normal Bedrock training charge
Student hosting (Provisioned Throughput)— (on-demand, no PT)flat hourly, 24/7Standing cost; small student = low hourly rate
Approx. all-in month 1~$5,000~$1,500–$2,500**incl. one-time setup + a month of PT
Approx. steady-state monthly~$5,000~$500 + PT hoursOne-time costs gone; only run-rate + hosting
Illustrative only. The pattern is what matters: the on-demand teacher cost scales linearly with tokens (~$5,000/mo here and rising with volume), while the distilled path front-loads a one-time teacher-generation + training cost and then runs the same volume an order of magnitude cheaper, plus a fixed Provisioned Throughput hourly bill the high volume keeps busy. Distillation is a loss at low volume and a large, widening win at high volume. Confirm all rates on the AWS Bedrock pricing page.
how it becomes $0

VIIIFinOps and AWS credits — funding the distillation run

Everything above prices distillation if you pay AWS directly. For most startups and many companies the relevant number is different, because AWS will frequently fund the work with credits — and every line item in a distillation project draws those credits down before it ever touches your card.

Distillation is, in FinOps terms, a textbook unit-economics move: it attacks the recurring cost-per-request of an AI feature rather than trimming around the edges, which is exactly the kind of structural saving a cloud-cost practice exists to find. The irony teams hit is that the move itself has an upfront cost — the teacher's frontier-priced data generation, the fine-tuning job, and a month or more of Provisioned Throughput to evaluate the student — spent before the savings arrive, often out of scarce runway, just to learn whether the distilled student clears the quality bar. That upfront hump is what stops teams from trying distillation even when it would obviously pay off.

AWS credits remove the hump. The teacher-generation tokens, the student fine-tuning charge, the Provisioned Throughput hosting during the build and proof-out, the embeddings and vector store of any RAG you pair with the student, and the S3 storage for the datasets are all credit-eligible, and AWS credits apply automatically against your bill until exhausted. The relevant pools are AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups), a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed squarely at proving out a GenAI use case — and a distillation experiment is precisely that — and the competitive Generative AI Accelerator (awards up to $1M for a small cohort of AI-first startups). That means you can generate the teacher data, distil the student, run a proper three-way evaluation, and only commit real money to ongoing hosting once the savings are proven.

The practical mechanic is that most of these pools are partner-filed — requested through the AWS Partner Network (the ACE program), not a public self-serve form — which is why teams route through an AWS partner rather than applying alone. That is the gap CloudRoute fills. CloudRoute matches you to the right credit pool for your stage and to a vetted AWS ML partner who both files the credit application and does the work: choosing the teacher/student pair, curating seed prompts, generating and filtering the teacher dataset, running the distillation fine-tune, modelling the crossover honestly, and telling you when distillation is not the right lever (when prompt caching, RAG, or plain fine-tuning would do). The customer pays $0 — AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice. Related: AWS credits for generative-AI startups and Bedrock POC funding.

pick the right cost lever

Distillation vs prompt caching vs fine-tuning vs RAG — on cost

The headline decision, on one screen. Each lever attacks a different driver of the Bedrock bill; match the row to where your money actually goes. The pattern to notice: only distillation changes <em>which model you run</em> — the others make a given model cheaper, more accurate, or better-behaved. Representative 2026 guidance, not quotes.

LeverWhat it attacksChanges which model runs?EffortCost shapeReach for it when…
Prompt cachingCost of repeated context tokensNoLowestPay-per-use; cheaper repeated inputA long fixed prompt/context repeats on every call
RAG (Knowledge Bases)Accuracy / missing facts (not cost)NoLow–mediumAdds tokens; embeddings + vector storeAnswers are wrong for lack of your facts
Plain fine-tuningBehaviour / format on an already-OK small modelNo (tunes the one you have)MediumOne-time training + standing PT hostingSmall model nearly good enough; need format locked in
Model distillationPer-token price of needing a big modelYes — big → smallMedium–highOne-time teacher gen + training + standing PT hostingOnly a frontier model is accurate enough but unaffordable at high volume
Just run the teacher on-demandNothing — baselineNoNonePure pay-per-token, no hostingVolume is low or spiky; nothing to amortise
These combine: a mature stack often distils for a cheap student, adds prompt caching on whatever fixed context remains, and feeds the student facts via RAG. Reach for the lightest combination that clears your cost and quality bar — and remember distillation produces a custom student model that needs a standing Provisioned Throughput hosting bill (§VI), which only a high-volume workload justifies.
before you pay frontier prices on every request
Get AWS credits that cover the teacher generation, the training, AND Provisioned Throughput hosting — and a partner to build it (you pay $0)
Get matched in 24h →
a recent match

A frontier-priced feature distilled to one-tenth the run rate — built on $0 — anonymized

inquiry · Series-A document-automation SaaS, Amsterdam
Series-A document-automation SaaS, 22 people, running a high-volume extraction feature on AWS

Situation: Their core feature — pulling structured fields out of customer documents — needed frontier-class accuracy, so they ran every request through a top-tier model on-demand. At a few hundred million tokens a month the Bedrock bill for that one feature had become the largest line in their cloud spend and was scaling linearly with usage, threatening the unit economics of the product. They suspected a smaller model could do it but were worried a cheap model would drop accuracy, and could not spare runway to find out.

What CloudRoute did: CloudRoute matched them in under 24 hours to a Netherlands-based AWS ML partner. The partner confirmed the workload was the textbook distillation case — narrow, stable, high-volume, frontier-quality-needed — and ran a managed Bedrock distillation: seeding from the team's real production prompts, using the existing frontier model as the teacher to generate a labelled dataset (with Bedrock synthesizing additional variations), then fine-tuning a small student model in the same family on that data. They hosted the distilled student on a single Provisioned Throughput model unit and ran a three-way evaluation — teacher vs student vs plain small base model — on a held-out set. They filed a Bedrock POC credit application plus an Activate Portfolio application to fund the entire experiment.

Outcome: On the held-out evaluation the distilled student landed within a small margin of the teacher on extraction accuracy while costing roughly an order of magnitude less per token and responding noticeably faster — comfortably clearing both bars (close to the teacher, clearly better than the plain small model). The teacher-generation tokens, the fine-tuning, and the Provisioned Throughput hosting during the build were all covered by the approved credits, so the team paid $0 during the build and proof-out, and moved the feature's steady-state run rate down by roughly 90%. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.

method: distil frontier teacher → small student · quality: within a small margin of teacher · steady-state run rate: ~10× cheaper · out-of-pocket during build: $0

faq

Common questions

What is model distillation on Amazon Bedrock?
Model distillation transfers a large "teacher" model's quality on a specific task into a smaller, cheaper, faster "student" model. On Bedrock it is a managed two-step process: a strong teacher model generates a high-quality labelled training dataset from your prompts (Bedrock can synthesize this data for you), then a smaller student model is fine-tuned on that teacher-generated data. The result is a private custom student model in your account that imitates the teacher on your task — letting you pay the teacher's high per-token price once, during data generation, and then run the workload on the student's much lower cost and latency. Your prompts, the generated data, and the student model stay private to your account and region.
How is distillation different from fine-tuning?
Mechanically, distillation IS a kind of fine-tuning — the difference is where the training labels come from. In ordinary fine-tuning you supply hand-written prompt/completion pairs to adjust a base model's behaviour. In distillation the completions are generated by a teacher model, and the goal is specifically to shrink the model you run — pulling a big model's task quality into a small one. So plain fine-tuning is the right move when a small model is already nearly good enough and you just need a behaviour or format locked in; distillation is the heavier move when the small model is not good enough on its own and you need to transfer quality down from a much larger model.
When does distillation beat prompt caching, RAG, or plain fine-tuning on cost?
Match the lever to the cost driver. Prompt caching wins when most of your tokens are a long fixed context repeated on every call — it cuts that repeated cost without changing the model. RAG is for accuracy (missing facts), not cost — it adds tokens. Plain fine-tuning suits an already-good-enough small model that just needs a format locked in. Distillation is for the specific regime where only a frontier model is accurate enough but its per-token price is unaffordable at high volume: it moves the quality into a small, cheap student you run instead. If volume is low or spiky, none of these pay back and you should just run the strong model on-demand.
What is the quality-versus-cost tradeoff with distillation?
You trade a slice of quality — mostly general capability and robustness on off-task or out-of-distribution inputs — for a large, permanent cut in per-token cost and latency. On the narrow, stable task it was distilled for, a well-built student can come strikingly close to the teacher; the further inputs drift from the distillation data, the more the student's smaller capacity shows. Two hard limits: a student can never exceed its teacher, and it faithfully reproduces the teacher's mistakes and biases — so a genuinely strong teacher and representative training data are essential. The safe way to decide is measurement: adopt the student only if it is close enough to the teacher for the use case AND clearly better than the plain small base model.
Does a distilled model still need Provisioned Throughput to run?
Yes. A distilled student is a custom (fine-tuned) model, and custom models on Bedrock cannot be served on the shared on-demand, per-token path that base models use — they require Provisioned Throughput, dedicated capacity billed at a flat hourly rate for as long as the model is deployed, regardless of traffic. For distillation this is usually acceptable because the whole point is high, steady volume, which keeps the reserved capacity busy and efficient; and because the student is small, its per-hour rate is far lower than hosting a large model. But the standing hosting bill must be in your math alongside the one-time teacher-generation and training costs, and a 1- or 6-month commitment lowers the hourly rate.
How much can distillation actually save?
It depends entirely on volume, because distillation trades a one-time cost (teacher data generation + student fine-tuning) for a permanently lower per-token run rate plus a standing Provisioned Throughput hosting bill. Frontier models commonly cost roughly an order of magnitude more per token than small models in the same family, so at high volume a distilled student can cut a feature's steady-state inference run rate by something like 80–90% — but the upfront and hosting costs mean it is a net loss at low volume. Compute your own crossover with current AWS Bedrock rates: distillation pays off above some monthly token volume and never pays back below it. Figures are representative for 2026 — confirm on the AWS Bedrock pricing page.
Which models can be used as teacher and student on Bedrock?
Distillation pairs a strong teacher (a frontier-class model) with a smaller student in a supported pairing, and the supported teacher/student matrix changes as providers add and retire support. The durable pattern is that distillation works within model families and toward Amazon's own small, low-cost models (such as the Amazon Nova family) and other small open-weight models that support fine-tuning. Because the student is a fine-tuned custom model, your chosen student must be one Bedrock can fine-tune. Always confirm the current distillation support matrix and the eligible teacher/student pairs in the AWS Bedrock documentation before designing around a specific pairing — this is the most common place a distillation plan breaks.
How do I know the distilled student is good enough to ship?
Run a three-way evaluation on a held-out set the student never trained on: the same prompts through the teacher, the distilled student, and the plain small base model, scored on a task-appropriate metric (exact-match or schema-validity for structured extraction, a rubric or LLM-as-judge score for quality, accuracy/F1 for classification). Bedrock's model evaluation tooling can systematize this. The decision rule is concrete: ship the student only if it is close enough to the teacher that the quality loss is acceptable for the use case AND clearly better than the plain small base model — proving the distillation actually added value. If it cannot clear both bars, the task may genuinely need the teacher, or a different cost lever.
Can AWS credits cover a distillation project?
Yes — the teacher data-generation tokens, the student fine-tuning charge, the Provisioned Throughput hosting during the build and proof-out, plus any embeddings, vector store, and S3 storage around it are all credit-eligible, and credits apply automatically against your AWS bill. The relevant pools are AWS Activate (up to $100K), a Bedrock/GenAI POC pool ($10K–$50K) — a distillation experiment is exactly the kind of GenAI proof-out it funds — and the competitive GenAI Accelerator (up to $1M). These are largely partner-filed via the AWS Partner Network. CloudRoute routes you to the right pool and a vetted AWS ML partner who files the application and does the work (teacher/student selection, dataset generation, the distillation fine-tune, evaluation, and the crossover analysis) — customer pays $0, AWS funds it.

Distil on AWS's budget, not your runway

Distillation cuts a high-volume feature's per-token cost by paying the frontier teacher once and running the workload on a cheap student forever — but the teacher generation, training, and Provisioned Throughput hosting are an upfront hump spent before the savings arrive. AWS credits cover all of it. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS ML partner who runs the distillation, models the crossover, and tells you honestly whether to distil at all. Customer pays $0.

matched within< 24h
GenAI credit ceilingup to $1M
cost to you$0
Amazon Bedrock model distillation — cut inference cost (2026) · CloudRoute