A neutral, end-to-end reference for fine-tuning a large language model on AWS — across all three paths. First the decision that should come before any of them: fine-tune vs RAG vs prompt engineering. Then the three ways to actually do it — Amazon Bedrock fine-tuning (fully managed), Amazon SageMaker training (full control), and SageMaker JumpStart (the guided middle) — with how data prep works, the workflow on each path, the GPU and AWS Trainium cost of the training run, how you host the tuned model afterwards (Bedrock Provisioned Throughput vs a SageMaker endpoint), how to evaluate it, and a decision table across all three. Plus how AWS credits fund the GPU/Trainium training and the hosting so the build costs you $0.
Fine-tuning continues training a pre-trained LLM on your own labelled examples so it learns a specific style, format, or narrow skill better than the base model does out of the box. The most valuable thing to settle before picking an AWS path is whether you need fine-tuning at all — because most teams who think they do actually have a RAG or prompt problem.
A foundation model arrives fluent in language in general but knowing nothing about your output format, your tone, your domain conventions, or the precise way you want a task done. Fine-tuning closes that gap by showing the model many examples of the input it will see and the output you want, and nudging its weights toward producing that output. The result is a custom model: a private derivative of the base model that behaves the way your examples taught it to. Critically, fine-tuning changes how the model behaves, not what facts it knows.
That distinction drives the whole decision. If your model gives wrong answers because it does not know your content — your docs, your product, last night's data — that is a facts problem, and the fix is retrieval-augmented generation (RAG): retrieve the relevant passages and put them in the prompt at request time. Fine-tuning is poor at teaching facts and goes stale the moment your data changes. If the model can clearly do the task but does it inconsistently, that is usually a prompt problem — better instructions and a few in-context examples — which is free and instant to iterate. Only when you have good labelled examples and a behaviour that prompting still cannot make reliable does fine-tuning become the right tool.
When fine-tuning is right, it tends to be for a narrow, stable, behaviour-sensitive task: a strict JSON output schema the base model keeps subtly breaking, a consistent brand voice, a domain-specific classification or extraction. Those are the cases where locking the behaviour into the weights pays off. The sibling page rag-on-aws covers the retrieval path in depth; the rest of this page assumes you have decided to fine-tune and focuses on how to do it on AWS across the three paths.
One caveat, stated once and meant throughout: exact dollar figures, the precise list of fine-tunable models, the available training instances, and feature details change frequently across Bedrock and SageMaker. Every number here is representative as of 2026 to convey relative cost and the shape of the work. Always confirm the current model list, instance types, and pricing on the official AWS documentation and pricing pages before committing.
Missing facts → RAG. Weak instructions → prompt engineering. A behaviour (style/format/narrow skill) that prompting cannot make reliable, and you have good labelled examples → fine-tune. Climb the ladder; fine-tune last, and never to add facts.
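The ladder can be written down as a tiny rule function. This is only a sketch: the function name `choose_approach` and its four boolean inputs are illustrative assumptions, the order of the checks mirrors the rule above, and the fallback (keep prompting while you gather labelled examples) is an assumption about what to do when no rung clearly applies.

```python
def choose_approach(missing_facts: bool,
                    weak_prompt: bool,
                    behaviour_unreliable: bool,
                    has_labelled_examples: bool) -> str:
    """Return the first rung of the ladder that applies (illustrative only)."""
    if missing_facts:
        # The model lacks your content: retrieval, not training.
        return "rag"
    if weak_prompt:
        # The model can do the task but the instructions are not good enough yet.
        return "prompt-engineering"
    if behaviour_unreliable and has_labelled_examples:
        # Prompting cannot make the behaviour reliable and good data exists.
        return "fine-tune"
    # Assumed fallback: keep iterating on the prompt and collect examples.
    return "prompt-engineering"
```

Note that fine-tuning is only reachable when both earlier checks are false, which is the "fine-tune last, and never to add facts" part of the rule.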
AWS gives you three distinct ways to fine-tune an LLM, trading control for convenience. Picking the right one for your team and task is most of the work: the wrong choice means either fighting a managed service that cannot do what you need, or hand-building infrastructure you did not need to build. Here they are, most-managed first.
The honest framing: start as managed as your requirements allow, and move toward full control only when a specific need forces it. Most teams should try Bedrock fine-tuning or a JumpStart recipe before writing a custom SageMaker training job, because those reach a working custom model in a fraction of the time. Conversely, teams with a hard requirement — a model Bedrock cannot tune, a custom training objective, or full ownership of the artifact — will be happier going to SageMaker from the start than fighting the managed path.
The least-effort path. You upload a JSONL training dataset to Amazon S3, point a Bedrock model customization job at a supported base model, set a few hyperparameters, and AWS runs the entire training process on managed infrastructure; you never see a GPU, a training loop, or a checkpoint. The output is a private custom model in your account, and your training data and the resulting model stay in your account and region (they are not used to train the base model). Choose this when: your target model is on Bedrock's fine-tunable list (Amazon Nova, Amazon Titan, and open-weight families like Llama and Cohere have been the most reliably supported), and you want the shortest path to a working custom model. The trade is that you get the least control and that serving the result requires Provisioned Throughput (covered in §VI). The amazon-bedrock-fine-tuning sibling is the deep dive.
The most flexible path. You write (or adapt) a training script — typically using Hugging Face Transformers, PyTorch, and parameter-efficient methods like LoRA/QLoRA — and run it as a SageMaker training job on GPU instances (the ml.p and ml.g families) or AWS Trainium instances (ml.trn, via the Neuron SDK). You control the model, the objective, the data pipeline, distributed-training strategy, and every hyperparameter; you own the resulting weights as an artifact in S3. Choose this when: you need a model or technique Bedrock does not offer, full ownership of the weights, a custom training objective, or large-scale distributed training. The trade is real ML engineering effort. The amazon-sagemaker sibling covers the platform; aws-trainium covers the cheaper-than-GPU training silicon.
The middle path: most of SageMaker's flexibility, far less of the boilerplate. JumpStart is a hub of pre-trained open models (Llama, Mistral, Falcon, and many more) with built-in fine-tuning recipes: you select a model, point the recipe at your dataset, set a handful of parameters, and JumpStart handles the training script, the instance selection, and the deployment scaffolding for you. You get a SageMaker endpoint hosting your tuned open model without writing the training loop yourself. Choose this when: the model you want is a popular open one available in JumpStart, and you want more control and model choice than Bedrock offers but do not want to hand-write a training job. It is the pragmatic default for fine-tuning open-weight LLMs on AWS.
Need it managed and your model is on Bedrock's list → Bedrock fine-tuning. Want an open model with minimal code → SageMaker JumpStart. Need full control, custom techniques, or to own the weights → SageMaker training. Move toward control only when a concrete requirement forces it.
The same three paths, lined up against the dimensions that actually drive the choice: how much you build, how much control and model choice you get, what silicon the training runs on, and — the part teams overlook — how you host the result and what that standing cost looks like.
| Dimension | Bedrock fine-tuning (managed) | SageMaker JumpStart (guided) | SageMaker training (full control) |
|---|---|---|---|
| Effort | Lowest — upload JSONL, run a job | Low–medium — pick a recipe, set params | Highest — write/adapt a training script |
| Control | Minimal (a few hyperparameters) | Moderate (recipe parameters) | Total (objective, data, distribution) |
| Model choice | Bedrock fine-tunable list (Nova, Titan, Llama, Cohere…) | JumpStart open models (Llama, Mistral, Falcon…) | Any open model you can train |
| Training silicon | Managed (abstracted away) | GPU or Trainium (ml.p/ml.g/ml.trn) | GPU or Trainium, your choice (ml.p/ml.g/ml.trn) |
| Methods | Managed SFT (+ continued pre-training on some models) | Recipe-driven (often LoRA/QLoRA) | Anything — full SFT, LoRA/QLoRA, custom |
| Who owns the weights | Private custom model in Bedrock | Model artifact in your S3 + endpoint | Model artifact in your S3 — fully yours |
| How you host it | Provisioned Throughput (hourly) | SageMaker endpoint (hourly instance) | SageMaker endpoint (hourly instance) |
| Best for | Fastest path; managed models | Open models with minimal code | Custom needs, full ownership, scale |
Fine-tuning quality is mostly a data problem, and this is true on all three paths. The model can only learn what your examples demonstrate, so curation and format matter more than any hyperparameter. The shape differs slightly by path, but the discipline is identical.
The format. On Bedrock, training data is a JSONL file ("JSON Lines") in Amazon S3 — one self-contained JSON object per line, each a prompt/completion pair for supervised fine-tuning (chat-formatted models use a messages-style schema; the exact field names depend on the model). On SageMaker training and JumpStart, the format is whatever your script or the chosen recipe expects — commonly JSONL or CSV with an instruction/response (or chat) structure, also staged in S3. In all cases each example pairs the input the model will see with the exact output you want it to produce.
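As a minimal sketch of the JSON Lines shape described above, the snippet below writes two prompt/completion pairs to a local file and reads them back. The field names `prompt` and `completion`, the example tickets, and the helper names are all illustrative assumptions; the schema your chosen model or recipe expects may differ, so check its documentation before preparing real data.

```python
import json
import os
import tempfile

# Hypothetical examples; "prompt"/"completion" are placeholder field names.
examples = [
    {"prompt": "Classify the ticket: 'App crashes on login'", "completion": "bug"},
    {"prompt": "Classify the ticket: 'Please add dark mode'", "completion": "feature-request"},
]

def write_jsonl(records, path):
    """Write one self-contained JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Parse a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_jsonl(examples, path)
round_trip = read_jsonl(path)
```

Reading the file back is a cheap sanity check that every line parses on its own before the file is staged in S3.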
The examples must mirror production. The single biggest determinant of a good fine-tune is that the training prompts look like the prompts the model will actually receive, and the completions look exactly like the output you want back — same format, same length profile, same tone. If production prompts include a system instruction and retrieved context, your training examples should too. A clean dataset of a few hundred to a few thousand high-quality, representative pairs typically beats a much larger noisy one.
Curation and hygiene. De-duplicate examples, remove contradictions (two near-identical prompts with different desired outputs confuse the model), balance the classes or formats you care about, and strip anything you would not want the model to imitate. Always hold out a validation set the model does not train on, so you can measure generalization rather than memorization. And because the data feeds a training process that bakes patterns into weights, scrub or tokenize PII and secrets you do not want learned. This work is path-independent — it matters exactly as much whether you fine-tune on Bedrock, JumpStart, or a custom SageMaker job.
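The de-duplication, contradiction check, and held-out split can be sketched in a few lines. This is an illustration under simplifying assumptions: records use hypothetical `prompt` and `completion` fields, "near-identical" is approximated by ignoring case and whitespace only, and PII scrubbing and class balancing are not covered here.

```python
import random

def clean_and_split(records, validation_fraction=0.1, seed=0):
    """De-duplicate, drop contradictory prompts, then hold out a validation set."""
    def norm(text):
        # Collapse whitespace and case so trivially different prompts compare equal.
        return " ".join(text.lower().split())

    completions_by_prompt = {}
    first_record = {}
    for r in records:
        key = norm(r["prompt"])
        completions_by_prompt.setdefault(key, set()).add(r["completion"].strip())
        first_record.setdefault(key, r)

    # Keep one record per prompt; discard prompts that map to conflicting outputs.
    kept = [first_record[k] for k, outs in completions_by_prompt.items() if len(outs) == 1]
    contradictions = [k for k, outs in completions_by_prompt.items() if len(outs) > 1]

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(kept)
    n_val = max(1, int(len(kept) * validation_fraction)) if len(kept) > 1 else 0
    return kept[n_val:], kept[:n_val], contradictions

records = [
    {"prompt": "Summarise: A", "completion": "a"},
    {"prompt": "summarise:  a", "completion": "a"},   # duplicate after normalisation
    {"prompt": "Summarise: B", "completion": "b"},
    {"prompt": "Summarise: B", "completion": "c"},    # contradicts the line above
    {"prompt": "Summarise: C", "completion": "c"},
    {"prompt": "Summarise: D", "completion": "d"},
]
train, val, contradictions = clean_and_split(records)
```

Dropping contradictory prompts outright is one possible policy; in practice a reviewer would usually decide which completion is correct and keep that one.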
Practically, getting from raw logs, spreadsheets, and documents to a clean training set is where most of the human effort in any fine-tuning project goes — far more than running the job itself. It is also exactly the kind of work a vetted AWS ML partner does efficiently, and, because the engagement is credit-funded, without the customer paying for it (see §VIII).
With a clean dataset in S3, the act of running the fine-tune differs by path: a managed job on Bedrock, a recipe on JumpStart, or a training job you script on SageMaker. Here is what each looks like, end to end.
In the Bedrock console (or via API/SDK) you create a model customization job: choose the base model, point it at your training (and validation) data in S3, name the output custom model, set a few hyperparameters (typically epochs, meaning passes over the data, plus a learning-rate multiplier and batch size), and supply an IAM role that Bedrock can assume to read your S3 data and write the result. Bedrock provisions the training infrastructure, runs the job, and reports training and validation loss when it finishes. Jobs run from minutes to hours depending on dataset size and epochs. The output is a private custom model registered in your account, ready to evaluate. No infrastructure to manage.
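For the SDK route, the job described above maps onto the boto3 `bedrock` client's `create_model_customization_job` call. The sketch below only assembles the request so it can be inspected without AWS access. Every identifier in it (job name, role ARN, bucket, base model ID) is a hypothetical placeholder, and the hyperparameter key names are an assumption: they vary by base model, so confirm them in the Bedrock documentation for the model you pick.

```python
def build_customization_job(job_name, custom_model_name, role_arn, base_model_id,
                            train_s3_uri, validation_s3_uri, output_s3_uri,
                            epochs=2, batch_size=1, learning_rate=1e-5):
    """Assemble keyword arguments for bedrock.create_model_customization_job."""
    return {
        "jobName": job_name,
        "customModelName": custom_model_name,
        "roleArn": role_arn,
        "baseModelIdentifier": base_model_id,
        "customizationType": "FINE_TUNING",
        "trainingDataConfig": {"s3Uri": train_s3_uri},
        "validationDataConfig": {"validators": [{"s3Uri": validation_s3_uri}]},
        "outputDataConfig": {"s3Uri": output_s3_uri},
        # Hyperparameter values are passed as strings; the key names below
        # are an assumption and differ between base models.
        "hyperParameters": {
            "epochCount": str(epochs),
            "batchSize": str(batch_size),
            "learningRate": str(learning_rate),
        },
    }

def submit(job_kwargs, region="us-east-1"):
    """Submit the job. Needs boto3 and AWS credentials, so it is not called here."""
    import boto3
    bedrock = boto3.client("bedrock", region_name=region)
    return bedrock.create_model_customization_job(**job_kwargs)["jobArn"]

# All values below are hypothetical placeholders.
job_kwargs = build_customization_job(
    job_name="demo-ft-job",
    custom_model_name="demo-custom-model",
    role_arn="arn:aws:iam::123456789012:role/ExampleBedrockFineTuneRole",
    base_model_id="example-base-model-id",
    train_s3_uri="s3://example-bucket/train.jsonl",
    validation_s3_uri="s3://example-bucket/validation.jsonl",
    output_s3_uri="s3://example-bucket/output/",
)
```

Keeping the request builder separate from the submit call makes it easy to review or log the exact job configuration before any billable work starts.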
In SageMaker Studio's JumpStart hub you select a fine-tunable open model (e.g. a Llama or Mistral variant), point the built-in fine-tuning recipe at your dataset in S3, choose a training instance (a GPU ml.g/ml.p or a Trainium ml.trn type, depending on the recipe), and set the exposed parameters — epochs, learning rate, and usually whether to use a parameter-efficient method like LoRA. JumpStart supplies the training script and orchestration; you launch the job and it produces a tuned model artifact plus the scaffolding to deploy it to a SageMaker endpoint. You get most of SageMaker's flexibility without writing the training loop.
For full control you write or adapt a training script (Hugging Face Transformers + PyTorch is the common stack), choose the instance type and count — GPU (ml.p/ml.g) or Trainium (ml.trn) — and launch a SageMaker training job that spins the cluster up, runs your script against your S3 data, writes the model artifact back to S3, and tears the cluster down. You decide the technique: full supervised fine-tuning or, far more commonly for LLMs, a parameter-efficient method (LoRA / QLoRA) that trains a small set of adapter weights instead of the whole model — dramatically cheaper in memory and compute, and the standard way to fine-tune large open models affordably. For multi-GPU or multi-node runs you configure distributed training. The resulting weights are entirely yours.
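One way to picture the scripted path is the request a training job boils down to. The sketch below builds keyword arguments in the shape of the boto3 `sagemaker` client's `create_training_job` call, without submitting anything. The image URI, role ARN, bucket paths, volume size, and hyperparameter names (which your own training script would define) are hypothetical placeholders, and many teams would use the higher-level SageMaker Python SDK instead.

```python
def build_training_job_request(job_name, image_uri, role_arn, train_s3_uri,
                               output_s3_uri, instance_type="ml.g5.2xlarge",
                               instance_count=1, max_runtime_s=4 * 3600,
                               hyperparameters=None):
    """Assemble keyword arguments for sagemaker.create_training_job (boto3)."""
    return {
        "TrainingJobName": job_name,
        "RoleArn": role_arn,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,       # container holding your training script
            "TrainingInputMode": "File",
        },
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 200,            # assumed size, adjust to the model
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_s},
        # Hyperparameters reach the container as strings.
        "HyperParameters": {k: str(v) for k, v in (hyperparameters or {}).items()},
    }

# All values below are hypothetical placeholders.
request = build_training_job_request(
    job_name="demo-qlora-job",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/example-train:latest",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    train_s3_uri="s3://example-bucket/train/",
    output_s3_uri="s3://example-bucket/artifacts/",
    hyperparameters={"epochs": 3, "lora_r": 16, "learning_rate": 2e-4},
)
```

Switching between GPU and Trainium is, at this level, a change of `instance_type` plus a container and script built for that silicon.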
Full fine-tuning updates every weight in the model — expensive in GPU memory and time for a large LLM. Parameter-efficient fine-tuning (LoRA, and quantized QLoRA) trains a small set of adapter weights instead, cutting training cost and memory by a large factor while keeping most of the quality. On SageMaker and JumpStart it is the default way to fine-tune large open models on modest hardware; it is a big reason the GPU/Trainium bill for a fine-tune is often far smaller than people expect.
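A quick count shows where the saving comes from. LoRA replaces the update to a weight matrix of shape d_out x d_in with two small matrices of rank r, so the trainable parameters for that matrix drop from d_in * d_out to r * (d_in + d_out). The matrix size and rank below are hypothetical values chosen for illustration, not figures for any particular model.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Adapter parameters for one weight matrix: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_in + d_out)

def full_params(d_in, d_out):
    """Parameters updated by full fine-tuning of the same matrix."""
    return d_in * d_out

d_in = d_out = 4096   # hypothetical projection size
rank = 16             # hypothetical LoRA rank

adapter = lora_trainable_params(d_in, d_out, rank)   # 16 * 8192 = 131072
full = full_params(d_in, d_out)                      # 4096 * 4096 = 16777216
fraction = adapter / full                            # 1/128 = 0.0078125
```

Under these assumptions the adapter trains under 1% of the parameters of that matrix, so gradients and optimizer state are needed only for that small slice; QLoRA additionally stores the frozen base weights in a quantized form.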
Fine-tuning has two very different costs, and confusing them is the most common budgeting mistake. The training run is a one-time charge driven by the silicon and how long it runs. Hosting the result is an ongoing charge that, on every path, usually dwarfs the training.
A fine-tuning job runs on accelerated compute, and you pay for the instance-time it consumes. On SageMaker and JumpStart you choose the instance and pay its hourly rate for the duration of the job: GPU instances (the ml.p family — high-end NVIDIA for the largest jobs; the ml.g family — cheaper GPUs for smaller fine-tunes) or AWS Trainium instances (ml.trn), AWS's custom training silicon, which is typically meaningfully cheaper per unit of training throughput than comparable GPUs and is accessed via the Neuron SDK (well-supported by JumpStart recipes and Hugging Face). On Bedrock the training infrastructure is abstracted away — you are billed for the customization itself (commonly priced per 1,000 training tokens × epochs) rather than for instance-hours.
The encouraging part: for most fine-tunes — especially LoRA/QLoRA on an open model, or a managed Bedrock job on a typical dataset — the training run is a modest one-time cost, frequently tens to low-hundreds of dollars. Two levers cut it further: parameter-efficient methods (LoRA/QLoRA) slash the compute needed, and Trainium or EC2 Spot-backed training capacity lowers the hourly rate. Training is rarely the line item that makes or breaks the economics.
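The two billing models reduce to simple arithmetic. The sketch below uses made-up rates purely to show the shape of the calculation; none of the numbers are AWS prices, and the function names are illustrative.

```python
def instance_training_cost(hourly_rate_usd, instance_count, hours):
    """SageMaker/JumpStart style: pay per instance-hour for the job duration."""
    return hourly_rate_usd * instance_count * hours

def token_training_cost(training_tokens, epochs, price_per_1k_tokens_usd):
    """Bedrock style: pay per 1,000 training tokens, multiplied by epochs."""
    return training_tokens / 1000 * epochs * price_per_1k_tokens_usd

# Hypothetical inputs, not real pricing.
sagemaker_cost = instance_training_cost(hourly_rate_usd=1.50, instance_count=1, hours=3)
bedrock_cost = token_training_cost(training_tokens=2_000_000, epochs=3,
                                   price_per_1k_tokens_usd=0.008)
```

With these placeholder inputs the instance-based run comes to 1.50 x 1 x 3 = 4.50 and the token-based run to 2,000 x 3 x 0.008 = 48, both comfortably inside the "tens to low-hundreds of dollars" range described above.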
The real cost is serving the model, and the mechanism differs by path. A Bedrock custom model cannot be called on the cheap on-demand per-token path that base models use — to serve it you must buy Provisioned Throughput: dedicated capacity billed at a flat hourly rate, continuously, the entire time the model is deployed, regardless of traffic. A SageMaker or JumpStart tuned model is deployed to a real-time endpoint — one or more instances (often a GPU ml.g/ml.p) that bill by the hour for as long as the endpoint is up, whether or not requests arrive. Either way, a tuned model sitting idle still bills every hour.
The consequence is stark: the training might be a few hundred dollars once, but hosting the result can cost far more than that every month, and far more than equivalent on-demand inference on a base model would have. This single fact flips the economics of most casual fine-tuning ideas. The honest default: do not fine-tune-and-host unless the volume and quality gains clearly justify a standing hourly bill. For low or spiky traffic, a base model with good prompting and RAG is almost always cheaper overall. Where fine-tuning does pay off, it is high, steady volume on a narrow task: enough traffic that the quality gain is worth a lot and the reserved capacity stays busy. (Bedrock 1- or 6-month Provisioned Throughput commitments, and SageMaker Savings Plans or endpoint auto-scaling, reduce the hosting cost.) The amazon-bedrock-provisioned-throughput sibling covers the Bedrock side; amazon-sagemaker-pricing covers endpoints.
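To make the "standing bill versus on-demand" comparison concrete, here is a rough break-even sketch. The hourly rate, token price, and tokens per request are hypothetical placeholders rather than AWS prices, and the model deliberately ignores the quality gain, capacity limits, and commitment discounts.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)

def monthly_hosting_cost(hourly_rate_usd, units=1):
    """Always-on capacity bills every hour, regardless of traffic."""
    return hourly_rate_usd * units * HOURS_PER_MONTH

def on_demand_monthly_cost(requests_per_month, tokens_per_request, price_per_1k_tokens_usd):
    """Pay-per-token inference on a base model scales with usage."""
    return requests_per_month * tokens_per_request / 1000 * price_per_1k_tokens_usd

def break_even_requests(hourly_rate_usd, tokens_per_request, price_per_1k_tokens_usd, units=1):
    """Monthly request volume at which always-on hosting matches on-demand spend."""
    per_request = tokens_per_request / 1000 * price_per_1k_tokens_usd
    return monthly_hosting_cost(hourly_rate_usd, units) / per_request

# Hypothetical inputs, not real pricing.
hosting = monthly_hosting_cost(hourly_rate_usd=2.00)
threshold = break_even_requests(2.00, tokens_per_request=1000,
                                price_per_1k_tokens_usd=0.002)
```

With these placeholder numbers the hosted model costs 2.00 x 730 = 1,460 per month, which on-demand spend only reaches at roughly 730,000 requests a month. Below that volume the base model is cheaper on cost alone.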
Training a tuned LLM is a small one-time charge — smaller still with LoRA/QLoRA and Trainium. Hosting it is the real cost: Bedrock custom models need Provisioned Throughput; SageMaker/JumpStart models need an always-on endpoint — both flat hourly charges that accrue 24/7 whether or not the model is used. Budget for the standing hosting bill, and only fine-tune-and-host for high, steady volume.
A finished fine-tune is a hypothesis, not a result. Before you put a standing hosting bill behind it on any path, prove it actually beats the base model on your task — and that the improvement justifies the cost and the operational weight.
Start with the training and validation loss the job reports (Bedrock and SageMaker both surface these): validation loss falling alongside training loss is healthy; training loss falling while validation loss rises means overfitting (memorizing rather than generalizing) — a cue to reduce epochs or get more/cleaner data. But loss is only a proxy. The real test is task performance on a held-out evaluation set the model has never seen.
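The loss pattern described above can be checked mechanically. The function below is a simple heuristic sketch, not something Bedrock or SageMaker provides: the name `diagnose_loss`, the `patience` threshold, and the three labels are illustrative assumptions, and the input is whatever per-epoch losses your job reports.

```python
def diagnose_loss(train_loss, val_loss, patience=2):
    """Flag overfitting when validation loss rises for `patience` consecutive
    epochs while training loss keeps falling (a simple heuristic)."""
    if len(train_loss) != len(val_loss):
        raise ValueError("loss curves must have the same length")
    if not val_loss:
        return "inconclusive"
    streak = 0
    for i in range(1, len(val_loss)):
        train_falling = train_loss[i] < train_loss[i - 1]
        val_rising = val_loss[i] > val_loss[i - 1]
        # Count consecutive epochs where the two curves diverge.
        streak = streak + 1 if (train_falling and val_rising) else 0
        if streak >= patience:
            return "overfitting"
    return "healthy" if val_loss[-1] < val_loss[0] else "inconclusive"

# Hypothetical per-epoch losses.
healthy_run = diagnose_loss([1.0, 0.8, 0.6, 0.5], [1.1, 0.9, 0.8, 0.75])
overfit_run = diagnose_loss([1.0, 0.7, 0.5, 0.3], [1.1, 0.9, 1.0, 1.2])
```

A result of "overfitting" is the cue mentioned above to reduce epochs or improve the data; either way, the held-out task evaluation remains the real test.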
Run a head-to-head: the same prompts through the base model and through your tuned model, scored on the metric that matters for your task — exact-match or schema-validity for structured extraction, a rubric or LLM-as-judge score for style/quality, accuracy/F1 for classification. On the Bedrock path, Bedrock model evaluation can run automated and human-in-the-loop evaluations; on the SageMaker path, SageMaker Clarify / model-evaluation tooling and open-source frameworks do the same. The bar to clear is not "is the tuned model good" — it is "is it enough better than the base model (with good prompting and RAG) to justify a standing hosting cost."
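A head-to-head for a structured-output task can be as small as the sketch below, which scores both models on schema validity and exact match over the same held-out prompts. The outputs are hard-coded hypothetical strings standing in for real model responses, and the required key `label` is an assumed schema; rubric or LLM-as-judge scoring would need a different harness.

```python
import json

def schema_valid(output_text, required_keys):
    """True if the output parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(output_text)
    except ValueError:
        return False
    return isinstance(parsed, dict) and set(required_keys).issubset(parsed)

def head_to_head(base_outputs, tuned_outputs, references, required_keys):
    """Score both models against the same held-out references."""
    def score(outputs):
        n = len(references)
        valid = sum(schema_valid(o, required_keys) for o in outputs)
        exact = sum(o.strip() == ref.strip() for o, ref in zip(outputs, references))
        return {"schema_validity": valid / n, "exact_match": exact / n}
    return {"base": score(base_outputs), "tuned": score(tuned_outputs)}

# Hypothetical held-out references and model outputs.
references = ['{"label": "bug"}', '{"label": "feature"}']
base_outputs = ['Sure! The label is bug.', '{"label": "feature"}']
tuned_outputs = ['{"label": "bug"}', '{"label": "feature"}']

report = head_to_head(base_outputs, tuned_outputs, references, required_keys={"label"})
```

The number that matters is the gap between the `tuned` and `base` rows, measured with the base model given its best prompt, since that gap is what has to justify the hosting bill.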
That framing is the honest test of whether fine-tuning was worth it at all. Tally the full cost: one-time GPU/Trainium (or Bedrock) training + ongoing hosting (Provisioned Throughput or endpoint) + the human effort to build and maintain the dataset (a fine-tune drifts as the task evolves and may need re-training). Set it against the measured gain over the cheaper alternatives. Fine-tuning wins cleanly when the task is narrow, stable, high-volume, and format/behaviour-sensitive; it loses when traffic is low or spiky, the task keeps changing, or the real gap was missing facts (RAG) or a weak prompt (prompt engineering) all along.
Everything above prices a fine-tune if you pay AWS directly. For most startups and many companies the relevant number is different, because AWS will frequently fund the work with credits — and the GPU/Trainium training run, the Bedrock customization charge, and the ongoing hosting all draw those credits down before they ever touch your card.
Across all three paths the cost lines are credit-eligible: Bedrock model customization and Provisioned Throughput hosting; SageMaker training jobs on GPU or Trainium instances and the real-time endpoints that host the result; the S3 storage for your datasets and artifacts; and the embeddings and vector store behind any RAG you pair with it. AWS credits apply automatically against your bill until exhausted. The relevant pools are AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups), a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed at proving out a GenAI use case — which is exactly what a fine-tuning experiment is — and the competitive Generative AI Accelerator (awards up to $1M for a small cohort of AI-first startups).
This matters more for fine-tuning than for plain inference precisely because of the standing hosting cost. The line item that makes teams hesitate — Provisioned Throughput or an always-on endpoint running 24/7 — is fully covered by credits during the build and proof-out period. That changes the calculus: you can run the GPU/Trainium training, stand up the tuned model, run a proper head-to-head evaluation against the base, and only commit real money to hosting once you have proven the gain and the volume justify it.
The practical mechanic is that most of these pools are partner-filed — requested through the AWS Partner Network (the ACE program), not a public self-serve form — which is why teams route through an AWS partner rather than applying alone. That is the gap CloudRoute fills. CloudRoute matches you to the right credit pool for your stage and to a vetted AWS ML partner who both files the credit application and does the work: choosing the path (Bedrock vs JumpStart vs SageMaker), curating the dataset, running the fine-tune on GPU or Trainium, setting up evaluation, and deciding honestly whether fine-tuning — or a cheaper RAG/prompt approach — is the right answer at all. The customer pays $0 — AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice. Related: AWS credits for generative-AI startups and Bedrock POC funding.
Two decisions on one screen. First (top three rows): should you fine-tune at all, or is this really a RAG or prompt problem? Then (bottom three rows): if fine-tuning is genuinely right, which of the three AWS paths fits. Representative 2026 guidance, not quotes.
| Approach | Best when… | Effort | Teaches facts? | Changes behaviour? | Ongoing cost shape |
|---|---|---|---|---|---|
| Prompt engineering | The prompt/examples just are not good enough yet | Lowest | Only inline | Yes (via instructions) | None — free to iterate, no hosting |
| RAG (Knowledge Bases) | The model lacks your facts / docs / latest data | Low–medium | Yes | No | Embeddings + vector store + tokens; no model hosting |
| Fine-tuning (any path below) | Need a locked-in style/format/skill; prompting unreliable; good labelled data | Medium–high | No | Yes | One-time training + standing hosting |
| ↳ Bedrock fine-tuning | Fastest path; model is on Bedrock's list; want it managed | Low | No | Yes | Training charge + Provisioned Throughput (hourly) |
| ↳ SageMaker JumpStart | Popular open model; want control with minimal code | Medium | No | Yes | GPU/Trainium training + endpoint (hourly) |
| ↳ SageMaker training | Custom technique/model; full ownership; scale | High | No | Yes | GPU/Trainium training + endpoint (hourly) |
Situation: The team needed an open LLM that produced code-review comments in their exact house format and rubric — a behaviour problem prompting had not made reliable. They wanted to fine-tune an open model (not a Bedrock-only one), owned end to end, but had no ML-platform engineer, were nervous about GPU training cost, and worried a standing hosting bill would burn runway before they knew it even worked.
What CloudRoute did: CloudRoute matched them in under 24 hours to a North-American AWS ML partner. The partner diagnosed it as a genuine behaviour/format problem (not facts), chose the SageMaker path for full ownership of an open model, and fine-tuned it with QLoRA on Trainium (ml.trn) instances to keep the training run cheap. They paired it with a small Bedrock Knowledge Base so the assistant could cite the team's style guide, deployed the tuned model to a SageMaker endpoint with auto-scaling, and built an evaluation harness scoring schema-validity and a review rubric against the base model. They filed a Bedrock/GenAI POC credit application plus an Activate Portfolio application to fund the whole build.
Outcome: The tuned model cleared the team's rubric and schema-validity bar head-to-head against the base; QLoRA-on-Trainium kept the training run to a small one-time cost. The training, the endpoint hosting through the proof-out period, S3, and the RAG embeddings were all covered by the approved credits, so the team paid $0 during the build. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.
path: SageMaker · method: QLoRA on Trainium + small RAG · credits secured: POC + Activate · out-of-pocket during build: $0
Training on GPU or Trainium is cheap; hosting a tuned model — Provisioned Throughput on Bedrock or an always-on SageMaker endpoint — is the standing cost that makes teams hesitate. AWS credits cover both. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS ML partner who picks the path (Bedrock, JumpStart, or SageMaker), preps the data, runs the fine-tune, and tells you honestly whether to fine-tune at all. Customer pays $0.