A complete, neutral reference for customizing models on Amazon Bedrock: when to fine-tune versus use RAG, prompt engineering, continued pre-training, or model distillation; which models support fine-tuning; how to format and prepare JSONL training data; how to run a fine-tuning job; the Provisioned Throughput requirement to host a custom model (and the standing cost it implies); how to evaluate the result; and what the whole thing costs. Plus how AWS credits fund the training and the hosting so the build costs you $0.
Fine-tuning takes a pre-trained foundation model and continues training it on your own labelled examples so it learns a specific style, format, or narrow task better than the base model does out of the box. On Bedrock the whole process is managed — you supply data, AWS runs the training, and you get a private custom model that lives in your own account.
A foundation model arrives knowing a great deal about language in general but nothing about your domain conventions, your output format, your tone of voice, or the specific way you want a task done. Fine-tuning closes that gap by showing the model many examples of the input it will see and the output you want, and nudging its weights toward producing that output. The result is a custom model: a private copy, derived from the base model, that behaves the way your examples taught it to.
On Bedrock this is delivered as a managed service. You do not provision GPUs, write a training loop, or manage checkpoints. You upload a training dataset to Amazon S3, point a fine-tuning job at a supported base model, set a few hyperparameters, and Bedrock handles the rest. When the job finishes, the custom model appears in your account — and, crucially, your training data and the resulting model stay private to your AWS account and region; they are not used to train the base model or shared with the model provider. That privacy posture is one of the main reasons teams fine-tune on Bedrock rather than on a general-purpose API.
It is important to separate fine-tuning from two things it is often confused with. It is not retrieval-augmented generation (RAG): RAG gives a model new facts at request time by retrieving documents and putting them in the prompt, while fine-tuning changes the model's behaviour and is poor at teaching fresh facts. And it is not the same as continued pre-training: fine-tuning learns from labelled prompt/response pairs (supervised), whereas continued pre-training learns from large volumes of unlabelled domain text. The next two sections lay out the full menu and how to choose.
One caveat, stated once and meant throughout: exact dollar figures, the precise list of fine-tunable models, and feature availability change frequently on Bedrock. The numbers here are representative as of 2026 to convey relative cost and the shape of the work. Always confirm the current fine-tunable model list and pricing on the official AWS Bedrock documentation and pricing pages before committing.
Fine-tuning = continue training a base model on your labelled examples to change how it behaves (style, format, narrow skill). It does not reliably teach new facts — that is RAG's job. The output is a private custom model in your account, which then needs Provisioned Throughput to serve.
Fine-tuning is one tool among five. Picking the wrong one is the most common and most expensive mistake teams make — fine-tuning to add facts (should have been RAG), or fine-tuning to fix a prompt (should have been prompt engineering). Here is the full menu, cheapest and lightest first.
Read these as a ladder. In most projects you climb it: start with prompt engineering, add RAG when the model needs your facts, and only fine-tune when behaviour still is not right after both. Continued pre-training and distillation are specialist rungs you reach for in specific situations, not defaults.
The lightest lever: improve the instructions, add few-shot examples, structure the system prompt, and constrain the output format. No training, no custom model, no hosting cost — you are simply asking the base model more precisely. With modern models, careful prompting plus a handful of in-context examples solves a surprising share of "we need to fine-tune" requests. Always exhaust this first; it is free to iterate and instant to deploy. On Bedrock, Prompt Management and the Converse API's system/tool fields are the tools here, and prompt caching keeps a long fixed prompt cheap.
Retrieval-augmented generation retrieves relevant chunks from your own documents and inserts them into the prompt so the model answers from your knowledge. This is the correct tool whenever the problem is "the model does not know our content / our docs / our latest data." On Bedrock, Knowledge Bases provide managed RAG (ingestion, chunking, embeddings, a vector store, and retrieval) so you do not build the pipeline yourself. RAG adds no training cost and no standing hosting bill — you pay for embeddings, the vector store, and the extra input tokens of the retrieved context. See the rag-on-aws and amazon-bedrock-knowledge-bases siblings.
Supervised fine-tuning trains the base model on labelled prompt/completion pairs to lock in a behaviour: a consistent JSON output shape, a brand voice, a domain-specific classification, a structured extraction the base model keeps getting subtly wrong. It is the right tool when you have good labelled examples of the exact task and prompting alone cannot make the behaviour reliable enough. It carries a one-time training cost and — the part to plan for — a standing hosting cost via Provisioned Throughput. The rest of this page is mostly about this rung.
Continued pre-training feeds the model large volumes of unlabelled domain text (legal corpora, medical literature, internal documentation) so it absorbs the vocabulary, patterns, and register of a specialist field. Unlike fine-tuning it does not need labelled input/output pairs — just a lot of representative text. It is heavier and more expensive than fine-tuning and is worth it when a domain's language is genuinely different from general text and you have a large corpus. Many teams pair it with a subsequent fine-tune (first teach the domain, then teach the task).
Distillation uses a large, capable "teacher" model to generate high-quality outputs that train a smaller, cheaper "student" model to mimic it on a specific task. The payoff is at inference time: you get close to the big model's quality on a narrow task at the small model's cost and latency. Bedrock offers managed distillation that can even synthesize training data from your prompts. This is the right tool for a high-volume, narrow workload where frontier-model quality is needed but frontier-model per-token cost is not affordable at scale.
The single most useful thing on this page is a clear decision rule. Match the method to the <em>kind</em> of problem you have — not to which technique sounds most sophisticated. Fine-tuning is rarely the first answer and almost never the only one.
Diagnose by asking what is actually wrong with the base model's output, then read across to the right tool:
Climb the ladder: prompt engineering → RAG → fine-tuning → continued pre-training → distillation. Each rung up costs more and adds operational weight. Fine-tune only when the cheaper rungs cannot make the behaviour reliable — and never to add facts (that is RAG).
The same five methods, lined up against the dimensions that drive the decision: what each one actually changes, whether it needs labelled data, what it costs to run, and whether it leaves you with a standing hosting bill.
| Method | What it changes | Data needed | Teaches new facts? | Run cost | Standing hosting cost? |
|---|---|---|---|---|---|
| Prompt engineering | Behaviour, via instructions | None (maybe few-shot examples) | Only what you paste in | Free to iterate | No |
| RAG (Knowledge Bases) | Facts available at request time | Your documents (unlabelled) | Yes — your retrieved content | Embeddings + vector store + extra input tokens | No (pay-per-use) |
| Fine-tuning | Behaviour: style, format, narrow skill | Labelled prompt/completion pairs (JSONL) | Poorly — not its job | One-time training charge | Yes — Provisioned Throughput |
| Continued pre-training | Domain language / register | Large unlabelled domain corpus | Somewhat (absorbs domain text) | Higher one-time training charge | Yes — Provisioned Throughput |
| Model distillation | Small model imitates a big one | Prompts (teacher can synthesize) | Inherits teacher's behaviour | Training charge (teacher + student) | Depends on how the student is served |
Not every model on Bedrock is fine-tunable, and the set changes as providers add and retire support. The durable rule: fine-tuning support is most common on Amazon's own models and several open-weight families; for some third-party frontier models, customization is offered through different mechanisms or not at all.
As a practical 2026 guide, fine-tuning (and in several cases continued pre-training) has typically been available for Amazon's own models — the Amazon Nova family and the Amazon Titan text models — and for open-weight families such as Meta Llama and Cohere models. Support for these is the most stable bet. Amazon-built models also tend to expose both supervised fine-tuning and continued pre-training, which is why they are a frequent starting point when a team specifically wants to own a customized model.
For some third-party frontier models, the provider may not expose weight-level fine-tuning on Bedrock at all, or may offer customization only through a separate managed path. In those cases the right move is usually to get the behaviour you need through prompting, RAG, and (where supported) distillation rather than classic fine-tuning. If your plan depends on tuning a specific named model, verify it is on the current fine-tunable list before you design around it — this is the most common place a customization plan breaks.
There is also a capacity dimension. Because a fine-tuned model is served on Provisioned Throughput (covered in §VII), the model you choose to tune affects the hosting cost: a small, efficient base model costs less per hour to host than a large one. For high-volume narrow tasks this is a reason to tune (or distill into) a smaller model rather than a frontier one — you get the customized behaviour and a far cheaper standing bill.
The fine-tunable model set changes. Amazon Nova, Amazon Titan, and open-weight families (Llama, Cohere) have been the most reliable to fine-tune on Bedrock; some frontier models offer customization only via other paths. Confirm your target model is currently fine-tunable on the AWS docs before building your plan around it.
Fine-tuning quality is mostly a data problem. The model can only learn what your examples demonstrate, so the format and the curation of the dataset matter more than any hyperparameter. Bedrock expects training data as a JSONL file in Amazon S3 — one labelled example per line.
The format is JSONL — "JSON Lines" — a plain-text file where each line is a single, self-contained JSON object describing one training example. For supervised fine-tuning of a text model, each line is a prompt/completion pair: the input the model will see and the exact output you want it to produce. Conceptually each line looks like {"prompt": "<the input text>", "completion": "<the desired output>"} (the exact field names and structure depend on the model and the task — chat-formatted models use a messages-style schema; check the AWS docs for the schema your chosen model expects). You upload the file (and usually a separate validation file) to an S3 bucket in the same region as the job.
The examples must mirror production. The single biggest determinant of a good fine-tune is that the training prompts look like the prompts the model will actually receive, and the completions look exactly like the output you want back — same format, same length profile, same tone. If production prompts include a system instruction and retrieved context, your training examples should too. A clean dataset of a few hundred to a few thousand high-quality, representative pairs typically beats a much larger noisy one.
Curation and hygiene matter. De-duplicate examples, remove contradictions (two near-identical prompts with different desired outputs confuse the model), balance the classes or formats you care about, and strip anything you would not want the model to imitate. Hold out a validation set the model does not train on so you can measure generalization rather than memorization. And because the data leaves your hands into a training process, scrub or tokenize PII and secrets you do not want baked into a model.
Practically, getting from "raw logs / spreadsheets / documents" to a clean JSONL dataset is where most of the human effort in a fine-tuning project goes. It is also exactly the kind of work a vetted AWS ML partner does efficiently — and, because the engagement is credit-funded, without the customer paying for it (see §IX).
With a clean dataset in S3, running the job itself is straightforward. The part that determines whether fine-tuning is economically sensible is what happens after the job succeeds: to actually serve your custom model, Bedrock requires you to buy Provisioned Throughput — a standing hourly cost.
In the Bedrock console (or via API/SDK) you create a model customization job: choose the base model, point it at your training (and validation) data in S3, name the output custom model, set a few hyperparameters — typically the number of epochs (passes over the data), learning-rate multiplier, and batch size — and give the job an IAM role that can read your S3 data and write the result. Bedrock provisions the training infrastructure, runs the job, and reports training and (if you supplied validation data) validation loss metrics when it finishes. Jobs run from minutes to hours depending on dataset size and epochs. The output is a private custom model registered in your account.
Here is the part to internalize before you start: you cannot call a fine-tuned model on the cheap on-demand, per-token path the way you call base models. To serve a custom model, Bedrock requires Provisioned Throughput — you purchase dedicated model capacity (measured in "model units") and pay a flat hourly rate for it continuously, the entire time the model is deployed, regardless of how many requests you send. A custom model sitting idle on Provisioned Throughput still bills every hour.
The consequence is stark: the fine-tuning training charge might be tens or low-hundreds of dollars one time, but hosting the result can cost far more per month than that — and far more than the equivalent on-demand inference would have cost on a base model. This single fact flips the economics of most casual fine-tuning ideas. It is the reason the honest default is: do not fine-tune-and-host unless the volume and quality gains clearly justify a standing hourly bill. For low or spiky traffic, a base model with good prompting and RAG is almost always cheaper overall.
Where fine-tuning does pay off, it is usually high, steady volume on a narrow task: at that point you are running enough traffic that (a) the quality/consistency gain is worth a lot and (b) the reserved capacity is busy enough to be efficient. If you can commit to a 1- or 6-month Provisioned Throughput term, the hourly rate drops, improving the math further. See the amazon-bedrock-provisioned-throughput sibling for the capacity mechanics, and amazon-bedrock-pricing for how PT sits among the four pricing modes.
Training a custom model is a small one-time charge. Hosting it is the real cost: a fine-tuned model can only be served on Provisioned Throughput, a flat hourly charge that accrues 24/7 whether or not the model is used. Budget for the standing hosting bill — not just the training — and only fine-tune-and-host for high, steady volume.
A finished fine-tune is a hypothesis, not a result. Before you put a standing Provisioned Throughput bill behind it, prove it actually beats the base model on your task — and that the improvement justifies the cost and the operational weight.
Start with the training and validation loss the job reports: validation loss falling alongside training loss is a healthy sign; training loss falling while validation loss rises means it is overfitting (memorizing rather than generalizing) — usually a cue to reduce epochs or get more/cleaner data. But loss numbers are only a proxy. The real test is task performance on a held-out evaluation set the model has never seen.
Run a head-to-head: the same prompts through the base model and through your custom model, scored on the metric that matters for your task — exact-match or schema-validity for structured extraction, a rubric or LLM-as-judge score for style/quality, accuracy/F1 for classification. Bedrock's model evaluation tooling can run automated and human-in-the-loop evaluations to make this systematic. The bar to clear is not "is the custom model good" — it is "is it enough better than the base model (with good prompting and RAG) to justify a standing hosting cost."
That framing is the honest test of whether fine-tuning is worth it at all. Tally the full cost: one-time training + ongoing Provisioned Throughput hosting + the human effort to build and maintain the dataset (a fine-tune drifts as your task evolves and may need re-training). Set it against the measured quality and consistency gain over the cheaper alternatives. Fine-tuning wins cleanly when the task is narrow, stable, high-volume, and format/behaviour-sensitive; it loses when traffic is low or spiky, the task keeps changing, or the real gap was missing facts (RAG) or a weak prompt (prompt engineering) all along.
Everything above prices fine-tuning if you pay AWS directly. For most startups and many companies the relevant number is different, because AWS will frequently fund the work with credits — and both the one-time training charge and the ongoing Provisioned Throughput hosting draw those credits down before they ever touch your card.
Fine-tuning, continued pre-training, custom-model hosting on Provisioned Throughput, the embeddings and vector store behind any RAG you pair with it, and the S3 storage for your datasets are all credit-eligible, and AWS credits apply automatically against your bill until exhausted. The relevant pools are AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups), a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed specifically at proving out a GenAI use case — which is exactly what a fine-tuning experiment is — and the competitive Generative AI Accelerator (awards up to $1M for a small cohort of AI-first startups).
This matters more for fine-tuning than for plain inference precisely because of the standing hosting cost. The line item that makes teams hesitate to fine-tune — Provisioned Throughput running 24/7 — is fully covered by credits during the build and proof-out period. That changes the calculus: you can stand up a custom model, run a proper head-to-head evaluation against the base model, and only commit real money to hosting once you have proven the gain and the volume justify it.
The practical mechanic is that most of these pools are partner-filed — requested through the AWS Partner Network (the ACE program), not a public self-serve form — which is why teams route through an AWS partner rather than applying alone. That is the gap CloudRoute fills. CloudRoute matches you to the right credit pool for your stage and to a vetted AWS ML partner who both files the credit application and does the work: curating the JSONL dataset, running the fine-tuning job, setting up evaluation, and deciding honestly whether fine-tuning or a cheaper RAG/prompt approach is the right answer. The customer pays $0 — AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice. Related: AWS credits for generative-AI startups and Bedrock POC funding.
The headline decision, on one screen. Match the row to the problem you actually have. The pattern to notice: the lighter tools (prompt engineering, RAG) have no standing model-hosting bill, while fine-tuning and continued pre-training do. Representative 2026 guidance, not quotes.
| Method | Best when… | Effort | Adds facts? | Changes behaviour? | Cost shape | Reach for it… |
|---|---|---|---|---|---|---|
| Prompt engineering | The prompt/examples just are not good enough yet | Lowest | Only inline | Yes (via instructions) | Free to iterate; no hosting | First, always |
| RAG (Knowledge Bases) | The model lacks your facts / docs / latest data | Low–medium | Yes | No | Embeddings + vector store + tokens; no model hosting | Second, for any knowledge gap |
| Fine-tuning | Need a locked-in style/format/narrow skill, prompting unreliable | Medium | No | Yes | One-time training + standing PT hosting | Third, with good labelled data |
| Continued pre-training | Whole domain language is alien; large unlabelled corpus | High | Somewhat | Yes (domain register) | Higher one-time training + standing PT hosting | Specialist domains |
| Model distillation | Need big-model quality at small-model cost, high volume | Medium–high | Inherits teacher | Yes (mimics teacher) | Training; cheap student inference | High-volume narrow tasks |
Situation: The team was convinced they needed to fine-tune a model on their contract corpus because the base model "did not know their clauses." They had budgeted for a fine-tuning project and were worried about the cost of training and hosting a custom model — and about spending scarce runway to find out whether it would even work.
What CloudRoute did: CloudRoute matched them in under 24 hours to a German AWS ML partner. On a discovery call the partner diagnosed the real problem: the model lacked <em>facts</em> (the clauses), not the right <em>behaviour</em> — a RAG problem, not a fine-tuning one. The partner built a Bedrock Knowledge Base over the contract corpus (managed RAG), then added a small fine-tune purely to lock the output into the firm's structured review format, hosted on a single Provisioned Throughput unit. They filed a Bedrock POC credit application plus an Activate Portfolio application to fund the whole build.
Outcome: The assistant shipped with RAG carrying the facts and a narrow fine-tune carrying the format — far cheaper and more accurate than fine-tuning everything would have been. Training, the PT hosting, embeddings, and the vector store were all covered by the approved credits, so the team paid $0 during the build and proof-out. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.
method: RAG for facts + narrow fine-tune for format · credits secured: POC + Activate · out-of-pocket during build: $0
Training is cheap; hosting a custom model on Provisioned Throughput is the standing cost that makes teams hesitate. AWS credits cover both. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS ML partner who preps the data, runs the job, and tells you honestly whether to fine-tune at all. Customer pays $0.