A complete, neutral reference for model distillation on Amazon Bedrock: what distillation is and how it transfers a large "teacher" model's quality into a smaller, cheaper, faster "student"; how Bedrock's managed distillation actually works (generate training data from the teacher, then fine-tune the student); when distillation beats prompt caching, fine-tuning, and RAG on cost; the quality-versus-cost tradeoff and where the quality tax really comes from; the Provisioned Throughput note for hosting the distilled model; and a worked savings example. Plus how AWS credits fund the distillation run so the build costs you $0.
Distillation transfers what a large, capable "teacher" model knows about a specific task into a smaller, cheaper, faster "student" model. You use the expensive model to teach the cheap model, then run your workload on the cheap one. On Bedrock the whole process is managed — you provide prompts, AWS uses a teacher model to generate the training data, fine-tunes the student for you, and hands back a private custom model in your account.
Large foundation models are expensive and slow precisely because they are large: more parameters means more compute per token, which means higher cost and higher latency on every single request. Most production tasks, though, are narrow — extracting fields from an invoice, classifying a support ticket, summarizing a call transcript into a fixed shape. On a narrow task you do not need the full breadth of a frontier model; you need its quality on that one task. Distillation is how you get exactly that: the small model never learns everything the big model knows, only how to imitate it on the slice of work you care about.
The mechanism is teacher–student transfer. A strong teacher model (a frontier-class model) produces high-quality outputs for your prompts. Those teacher outputs become the training targets for a much smaller student model, which is fine-tuned to reproduce them. After training, the student answers your production traffic on its own — at its own much lower per-token price and much lower latency — while staying close to the teacher's quality on the task it was distilled for. You have, in effect, paid the teacher's high price once to manufacture training data, and bought a permanently cheaper way to run the workload.
On Bedrock this is delivered as a managed distillation capability so you do not build the pipeline by hand. You select a teacher model and a student model, supply prompts (and optionally let Bedrock synthesize the teacher's responses for you, or use your own production logs as the seed), and Bedrock orchestrates generating the teacher data and fine-tuning the student. As with any customization on Bedrock, your prompts, the generated data, and the resulting student model stay private to your AWS account and region — they are not used to train base models or shared with the model provider. The output is a private custom student model registered in your account.
It is worth separating distillation from its neighbours up front. It is not RAG: RAG supplies fresh facts at request time and changes nothing about the model's size or cost. It is not ordinary fine-tuning on hand-labelled data: in distillation the labels come from a teacher model, and the goal is specifically to shrink the model you run, not just to adjust a base model's behaviour. And it is not prompt caching: caching makes a given model cheaper on repeated context but does not change which model you run. The amazon-bedrock-fine-tuning and amazon-bedrock-prompt-caching siblings cover those; the rest of this page is about distillation.
One caveat, stated once and meant throughout: the exact list of teacher/student pairs supported, precise pricing, and feature availability change frequently on Bedrock. The numbers and pairings here are representative as of 2026 to convey relative cost and the shape of the work. Always confirm the current distillation support matrix and pricing on the official AWS Bedrock documentation and pricing pages before committing.
Distillation = use a big teacher model to generate training data, then fine-tune a small student model to imitate it on your task. You pay the teacher's high price once (to make the data) and run the workload forever on the student's low cost and latency. The output is a private custom student model — which then needs Provisioned Throughput to serve.
Bedrock's managed distillation collapses into two conceptual steps: generate a teacher-quality training dataset, then fine-tune the student on it. Understanding the two steps separately is what lets you reason about cost, quality, and where things go wrong.
Distillation is, under the hood, a specialised use of fine-tuning — the difference is where the training labels come from. In ordinary fine-tuning you bring hand-written prompt/completion pairs; in distillation the completions are produced by the teacher model. That single substitution is what makes distillation powerful (you can manufacture far more high-quality labelled data than you could write by hand) and what makes its cost profile distinctive (you pay frontier-model token prices to generate that data).
You provide a set of prompts that represent your task — ideally drawn from real production traffic so the data matches what the student will actually see. Bedrock runs those prompts through the chosen teacher model to produce high-quality completions, assembling a labelled dataset of prompt → teacher-output pairs. Bedrock's managed distillation can synthesize this data for you, including generating variations to broaden coverage, so you are not limited to the exact prompts you supplied. This step is where the teacher's expensive per-token price is incurred — but only once, as a one-time data-generation cost, not on every future request. The richer and more representative this dataset, the better the student will be, so curating good seed prompts matters as much here as it does in any fine-tune.
Bedrock then runs a fine-tuning job on the smaller student model using the teacher-generated dataset as training data, nudging the student's weights toward reproducing the teacher's outputs on your prompts. This is mechanically the same managed training job described on the amazon-bedrock-fine-tuning page — you are choosing a base (student) model, pointing the job at data, and getting back a private custom model — except the dataset was authored by the teacher rather than by you. When the job completes, the distilled student model is registered in your account, ready to evaluate against both the teacher (for quality) and the original small base model (to confirm the distillation actually added value).
From this point the distilled student behaves like any custom model on Bedrock. The intended payoff — cheap, fast inference — is real on a per-token basis, but there is a hosting wrinkle covered in detail in §VI: a custom model is served on Provisioned Throughput, a standing hourly charge, not the shared on-demand path. For the high-volume workloads distillation targets that reserved capacity is usually busy enough to be worth it, but it must be in the math. The honest evaluation is therefore three-way: is the distilled student close enough to the teacher on quality, clearly better than the plain small base model, and cheap enough at your volume to beat just running the teacher on-demand?
The only structural difference between distillation and ordinary fine-tuning is the source of the training labels: a teacher model writes them instead of a human. Everything downstream — a managed fine-tuning job, a private custom student model, Provisioned Throughput to serve it — is the standard fine-tuning path.
The savings are not a discount or a pricing trick; they are structural. A smaller model does less arithmetic per token, so it costs less and responds faster on every request — and distillation is how you get a small model that is good enough to use.
Foundation-model inference cost scales with model size. A frontier model has many more parameters than a small one, so generating each token requires far more computation — and AWS prices that through, charging more per 1,000 input and output tokens for larger, more capable models. A small model is cheaper to run for the same reason it is less capable in general: there is simply less of it doing work per token. Distillation's whole point is to make a small model capable enough on your specific task that its lower price becomes usable rather than a false economy.
The cost difference between model tiers is large — frontier models commonly cost many times more per token than the small models in the same family (the gap is often roughly an order of magnitude, though exact ratios vary by family and change over time — see amazon-bedrock-pricing and amazon-nova-pricing for the current numbers). When a workload runs millions or billions of tokens a month, that multiple is the difference between an affordable feature and one that quietly dominates the cloud bill. Distillation lets you keep most of the quality while moving the recurring spend onto the cheaper tier.
Latency improves for the same structural reason: fewer computations per token means each response is produced faster. For interactive features (a chat reply, an autocomplete, a real-time classification) the student's lower latency is often as valuable as its lower cost — a smaller model can hit response-time targets a frontier model cannot, which can be the actual reason to distill rather than the savings alone.
The cost you are trading away is the one-time teacher-generation charge plus the fine-tuning training charge — both paid once — against a permanently lower per-token run rate. That is why distillation's value grows with volume: the fixed cost of producing the student is amortised across every future request, so the more traffic you run on the student, the more the upfront teacher cost is dwarfed by the inference you avoided paying frontier prices for. At low volume the upfront cost never pays back; at high volume it pays back fast. The worked example in §VII makes this concrete.
Distillation is one of several ways to cut Bedrock spend, and it is the right one only in a specific regime. The key is that each cost lever fixes a different problem — match the lever to where your money is actually going, not to which technique sounds most advanced.
Every cost-reduction technique on Bedrock targets a different driver of the bill. Diagnose which driver dominates your workload, then read across to the right tool:
Match the lever to the cost driver. Prompt caching cuts repeated-context cost on the same model. RAG fixes missing facts (and adds tokens — it is not a savings tool). Plain fine-tuning locks behaviour into an already-good-enough small model. Distillation is for when only a frontier model is accurate enough but its per-token price is unaffordable at high volume — it moves the quality down into a small, cheap student.
Distillation is a trade, not a free lunch. The student is smaller, so on average it will not match the teacher everywhere — the engineering goal is to make the gap negligible <em>on your task</em> while keeping the cost gap large. Knowing where the quality tax shows up is how you decide whether the trade is acceptable.
A distilled student approaches but rarely perfectly equals the teacher. Because it has far fewer parameters, it has less capacity to represent every nuance and edge case the teacher handles. On the narrow task it was distilled for, a well-built student can come strikingly close to teacher quality — close enough that the difference is immaterial for the application. The further a request drifts from the distribution of the distillation data, though, the more the student's smaller capacity shows: it generalizes less well to inputs unlike its training, and it lacks the teacher's broad reasoning and world knowledge for anything off-task. Distillation buys task-specific quality, not general capability.
This is exactly why distillation suits narrow, stable workloads and not broad or fast-changing ones. If the task is well-scoped and its input distribution is steady, you can saturate the distillation data with representative examples and the student will be reliable across the cases you actually see. If the task is open-ended, or its inputs shift over time, the student's blind spots multiply and you either keep re-distilling or accept degraded quality. The narrower and more stable the task, the smaller the quality tax — and the better the trade.
The quality of the teacher data caps the student. A student cannot exceed the teacher it learned from, and it will faithfully reproduce the teacher's mistakes, biases, and formatting quirks — distillation copies behaviour indiscriminately. So the teacher must be genuinely strong on the task, the seed prompts must be representative, and it is worth filtering or lightly curating the teacher's outputs before training rather than trusting them wholesale. Garbage from the teacher becomes garbage baked into the student.
The discipline that makes the tradeoff safe is measurement, covered as evaluation throughout this cluster: run the same held-out prompts through teacher, student, and the plain small base model, and score them on the metric that matters for the task (exact-match or schema-validity for extraction, a rubric or LLM-as-judge score for quality, accuracy/F1 for classification). The decision rule is concrete — adopt the student only if it is (a) close enough to the teacher that the quality loss is acceptable for the use case and (b) clearly better than the plain small base model, so the distillation actually earned its keep. If the student cannot clear both bars, the task may simply need the teacher, or a different lever.
You trade a slice of quality — mostly general capability and robustness on off-task or out-of-distribution inputs — for a large, permanent cut in per-token cost and latency. On a narrow, stable task a good student's quality loss can be negligible; off that task it is real. The student can never exceed its teacher and inherits the teacher's flaws, so a strong teacher and representative data are non-negotiable.
The savings story is about per-token price, but the distilled student is a custom model — and custom models on Bedrock are not served on the cheap shared on-demand path. To run yours, you buy Provisioned Throughput, a standing hourly cost. For distillation's high-volume use case this usually pencils out, but it must be in the calculation from the start.
A distilled student is, mechanically, a fine-tuned custom model, so it carries the same hosting requirement as any custom model on Bedrock: you cannot call it on the shared, per-token, on-demand path that base models use. Serving a custom model requires Provisioned Throughput — you purchase dedicated model capacity (measured in "model units") and pay a flat hourly rate for it continuously, the entire time the model is deployed, regardless of how many requests you send. An idle student on Provisioned Throughput still bills every hour. This is the same mechanic the amazon-bedrock-fine-tuning and amazon-bedrock-provisioned-throughput siblings cover in depth.
Why this is usually fine for distillation specifically: the entire reason you distil is high, steady volume. At that volume the reserved capacity is busy — you are getting work out of every hour you pay for — which is exactly the condition under which Provisioned Throughput is efficient rather than wasteful. The standing hourly bill that makes casual fine-tuning a bad idea is, for a genuinely high-throughput distilled workload, simply the cost of the dedicated lane you are keeping full. And because the student is small, its per-hour Provisioned Throughput rate is far lower than hosting a large model would be — a second structural saving on top of the per-token one.
The number to compute is the crossover: at your traffic, does (teacher-generation + training, one-time) + (student Provisioned Throughput, hourly × hours) come out below running the teacher on-demand at your token volume? Below some monthly volume the on-demand teacher is cheaper and you should not distil; above it the distilled student wins, and the margin widens as volume grows because the on-demand teacher cost scales linearly with tokens while the student's reserved capacity is a fixed hourly rate you have already saturated. If you can commit to a 1- or 6-month Provisioned Throughput term, the hourly rate drops and the crossover moves in distillation's favour. The worked example in the next section runs these numbers.
The distilled student is a custom model, so it can only be served on Provisioned Throughput — a flat hourly charge that accrues 24/7 whether or not the model is used. For distillation's high-volume target this reserved capacity is usually busy enough to be worth it (and a small student's hourly rate is low), but the standing hosting bill must be in the math alongside the one-time teacher and training charges — not just the per-token savings.
An illustrative walk-through of the crossover, with round, clearly-labelled representative numbers — not a quote. The point is the <em>shape</em> of the savings and when distillation flips from a loss to a large win, not these specific figures. Confirm current per-token and Provisioned Throughput rates on the AWS Bedrock pricing page.
Take a document-extraction feature that needs frontier-class accuracy and runs at high, steady volume — say 500 million tokens per month of combined input and output, the kind of load a busy back-office automation generates. Assume, purely to illustrate the arithmetic, a frontier teacher priced around $10 per million tokens (blended) and a small student in the same family priced around $1 per million tokens blended — roughly the order-of-magnitude gap typical between frontier and small tiers (your real ratio will differ; check amazon-bedrock-pricing).
Read the table as a crossover, not a fixed discount. The one-time teacher-generation and fine-tuning costs are a fixed investment; the recurring saving is the gap between the teacher's and the student's run rates (here ~$4,500/month) minus the Provisioned Throughput hosting. At 500M tokens/month the investment pays back almost immediately and the workload then runs roughly an order of magnitude cheaper every month thereafter. At 5M tokens/month the same fixed costs and the standing hosting bill would never pay back — which is the whole reason distillation is a high-volume tool. Before committing, compute your own crossover with current rates and your real traffic; a vetted AWS partner will model this for you (see §VIII), and AWS credits cover the entire experiment either way.
| Line item | Run the teacher on-demand | Distil + run the student | Notes |
|---|---|---|---|
| Monthly inference volume | 500M tokens | 500M tokens | Same workload either way |
| Per-token rate (blended) | ~$10 / M tokens | ~$1 / M tokens | Frontier vs small tier — illustrative |
| Monthly inference cost | ~$5,000 / mo | ~$500 / mo | Student runs the volume ~10× cheaper |
| One-time teacher data generation | — | ~$200–$1,000 once | Pay frontier price once to label the dataset |
| One-time student fine-tuning | — | ~tens–low-hundreds once | A normal Bedrock training charge |
| Student hosting (Provisioned Throughput) | — (on-demand, no PT) | flat hourly, 24/7 | Standing cost; small student = low hourly rate |
| Approx. all-in month 1 | ~$5,000 | ~$1,500–$2,500* | *incl. one-time setup + a month of PT |
| Approx. steady-state monthly | ~$5,000 | ~$500 + PT hours | One-time costs gone; only run-rate + hosting |
Everything above prices distillation if you pay AWS directly. For most startups and many companies the relevant number is different, because AWS will frequently fund the work with credits — and every line item in a distillation project draws those credits down before it ever touches your card.
Distillation is, in FinOps terms, a textbook unit-economics move: it attacks the recurring cost-per-request of an AI feature rather than trimming around the edges, which is exactly the kind of structural saving a cloud-cost practice exists to find. The irony teams hit is that the move itself has an upfront cost — the teacher's frontier-priced data generation, the fine-tuning job, and a month or more of Provisioned Throughput to evaluate the student — spent before the savings arrive, often out of scarce runway, just to learn whether the distilled student clears the quality bar. That upfront hump is what stops teams from trying distillation even when it would obviously pay off.
AWS credits remove the hump. The teacher-generation tokens, the student fine-tuning charge, the Provisioned Throughput hosting during the build and proof-out, the embeddings and vector store of any RAG you pair with the student, and the S3 storage for the datasets are all credit-eligible, and AWS credits apply automatically against your bill until exhausted. The relevant pools are AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups), a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed squarely at proving out a GenAI use case — and a distillation experiment is precisely that — and the competitive Generative AI Accelerator (awards up to $1M for a small cohort of AI-first startups). That means you can generate the teacher data, distil the student, run a proper three-way evaluation, and only commit real money to ongoing hosting once the savings are proven.
The practical mechanic is that most of these pools are partner-filed — requested through the AWS Partner Network (the ACE program), not a public self-serve form — which is why teams route through an AWS partner rather than applying alone. That is the gap CloudRoute fills. CloudRoute matches you to the right credit pool for your stage and to a vetted AWS ML partner who both files the credit application and does the work: choosing the teacher/student pair, curating seed prompts, generating and filtering the teacher dataset, running the distillation fine-tune, modelling the crossover honestly, and telling you when distillation is not the right lever (when prompt caching, RAG, or plain fine-tuning would do). The customer pays $0 — AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice. Related: AWS credits for generative-AI startups and Bedrock POC funding.
The headline decision, on one screen. Each lever attacks a different driver of the Bedrock bill; match the row to where your money actually goes. The pattern to notice: only distillation changes <em>which model you run</em> — the others make a given model cheaper, more accurate, or better-behaved. Representative 2026 guidance, not quotes.
| Lever | What it attacks | Changes which model runs? | Effort | Cost shape | Reach for it when… |
|---|---|---|---|---|---|
| Prompt caching | Cost of repeated context tokens | No | Lowest | Pay-per-use; cheaper repeated input | A long fixed prompt/context repeats on every call |
| RAG (Knowledge Bases) | Accuracy / missing facts (not cost) | No | Low–medium | Adds tokens; embeddings + vector store | Answers are wrong for lack of your facts |
| Plain fine-tuning | Behaviour / format on an already-OK small model | No (tunes the one you have) | Medium | One-time training + standing PT hosting | Small model nearly good enough; need format locked in |
| Model distillation | Per-token price of needing a big model | Yes — big → small | Medium–high | One-time teacher gen + training + standing PT hosting | Only a frontier model is accurate enough but unaffordable at high volume |
| Just run the teacher on-demand | Nothing — baseline | No | None | Pure pay-per-token, no hosting | Volume is low or spiky; nothing to amortise |
Situation: Their core feature — pulling structured fields out of customer documents — needed frontier-class accuracy, so they ran every request through a top-tier model on-demand. At a few hundred million tokens a month the Bedrock bill for that one feature had become the largest line in their cloud spend and was scaling linearly with usage, threatening the unit economics of the product. They suspected a smaller model could do it but were worried a cheap model would drop accuracy, and could not spare runway to find out.
What CloudRoute did: CloudRoute matched them in under 24 hours to a Netherlands-based AWS ML partner. The partner confirmed the workload was the textbook distillation case — narrow, stable, high-volume, frontier-quality-needed — and ran a managed Bedrock distillation: seeding from the team's real production prompts, using the existing frontier model as the teacher to generate a labelled dataset (with Bedrock synthesizing additional variations), then fine-tuning a small student model in the same family on that data. They hosted the distilled student on a single Provisioned Throughput model unit and ran a three-way evaluation — teacher vs student vs plain small base model — on a held-out set. They filed a Bedrock POC credit application plus an Activate Portfolio application to fund the entire experiment.
Outcome: On the held-out evaluation the distilled student landed within a small margin of the teacher on extraction accuracy while costing roughly an order of magnitude less per token and responding noticeably faster — comfortably clearing both bars (close to the teacher, clearly better than the plain small model). The teacher-generation tokens, the fine-tuning, and the Provisioned Throughput hosting during the build were all covered by the approved credits, so the team paid $0 during the build and proof-out, and moved the feature's steady-state run rate down by roughly 90%. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.
method: distil frontier teacher → small student · quality: within a small margin of teacher · steady-state run rate: ~10× cheaper · out-of-pocket during build: $0
Distillation cuts a high-volume feature's per-token cost by paying the frontier teacher once and running the workload on a cheap student forever — but the teacher generation, training, and Provisioned Throughput hosting are an upfront hump spent before the savings arrive. AWS credits cover all of it. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS ML partner who runs the distillation, models the crossover, and tells you honestly whether to distil at all. Customer pays $0.