SageMaker training · the complete 2026 guide

Amazon SageMaker training — how to train and fine-tune models on managed AWS compute, explained.

Q: How do I launch a training job on SageMaker?

The common way is the SageMaker Python SDK. You create an Estimator (a framework estimator like PyTorch, TensorFlow, HuggingFace, or XGBoost, or a generic one) configured with your entry-point script, instance type and count, hyperparameters, IAM role, and output path, then call estimator.fit({'train': s3_uri}) with your data channels — that single call creates the training job. You can also launch jobs from the AWS SDK (boto3 create_training_job), the CLI, or as a step inside a SageMaker Pipeline for automated, repeatable training in an MLOps flow.

Q: What are the three ways to bring training code to SageMaker?

Built-in algorithms (AWS-provided implementations like XGBoost, linear learner, k-means, and DeepAR — you write no training code, just point the algorithm at your data); script mode (you write a normal PyTorch/TensorFlow/Hugging Face/scikit-learn script and run it inside an AWS-maintained framework container — the common path for custom models and fine-tuning); and bring-your-own-container (you build a Docker image conforming to SageMaker's training contract for full control over the environment). Most teams use script mode; built-in algorithms suit classical/tabular ML; bring-your-own-container is for unusual frameworks or strict reproducibility needs.

Q: What is the difference between data-parallel and model-parallel training?

Data parallelism replicates the whole model on every GPU, splits the training batch across them, and synchronizes (all-reduces) gradients each step — use it when the model fits on one GPU but you want more throughput or to train on more data; scaling out adds GPUs and reduces wall-clock roughly proportionally. Model parallelism splits the model itself across multiple GPUs because it is too large to fit in one device's memory (via tensor parallelism, pipeline parallelism, and sharded techniques) — use it when the model cannot fit on a single GPU. Large foundation-model training is usually hybrid: model-parallel within a node, data-parallel across nodes.

Q: How does managed Spot training save money, and what is the catch?

Managed Spot training runs your job on spare EC2 capacity at a steep discount — commonly up to about 90% off on-demand — by accepting interruptibility: AWS can reclaim the capacity, and the job resumes when capacity returns. You enable it with use_spot_instances=True and bound it with max_wait and max_run. The catch is interruptions, and the fix is checkpointing: set checkpoint_s3_uri so training state is snapshotted to S3 and the run resumes from the last checkpoint instead of restarting from zero. With checkpoints in place, Spot is the single biggest cost lever for the training bill after instance choice, and the worst case of an interruption is a delay, not lost work.

Q: Why do I need checkpoints for SageMaker training?

A checkpoint is a periodic snapshot of training state (weights, optimizer state, and the step/epoch counter) written to a local path that SageMaker syncs to checkpoint_s3_uri in S3. Checkpoints let a run resume from where it left off after an interruption or failure rather than starting over — which is mandatory for managed Spot training (because Spot capacity can be reclaimed mid-run) and strongly advised for any long, multi-hour or multi-day job. They also let you keep the best-validation checkpoint rather than just the final state. Tune the checkpoint frequency to balance redone-work-on-interruption against the time and I/O of writing snapshots.

Q: What are SageMaker Warm Pools and when should I use them?

Normally every training job provisions a fresh cluster, adding minutes of start-up latency before your code runs. Warm Pools (set via keep_alive_period_in_seconds) keep the provisioned cluster alive for a configured window after a job finishes, so the next matching job reuses it and starts almost immediately, skipping re-provisioning and container download. Use them for iterative work — rapid experimentation, hyperparameter sweeps, and debugging with repeated short runs — where per-job start-up overhead would otherwise dominate. You pay for the kept-alive time, so size the window to your iteration cadence. Note Spot and Warm Pools serve opposite goals: Spot for cheap one-off long runs, Warm Pools for fast repeated short runs.

Q: What is hyperparameter tuning (HPO) on SageMaker?

Automatic Model Tuning (HPO) is a tuning job that launches many training jobs across a hyperparameter search space and converges on the best configuration against a chosen objective metric, instead of hand-tuning. It supports Bayesian optimization (learns from prior trials), random and grid search, and Hyperband (early-stops weak trials to concentrate compute on promising ones). Because HPO multiplies the number of training runs, it multiplies cost, so run tuning jobs on managed Spot with checkpoints, or keep the trials on-demand and use Warm Pools so the many short trials skip re-provisioning (AWS documents that warm pools cannot be combined with Spot instances, so choose one per job), and prefer Hyperband to kill unpromising trials early. SageMaker Experiments tracks each run so the trials stay comparable and reproducible.

Q: When should I use SageMaker training instead of Bedrock fine-tuning?

Use Bedrock fine-tuning when your target is a foundation model Bedrock already supports and dataset-level customization is enough — you provide a labeled dataset, Bedrock produces a private customized version you call through the same managed API, and you manage no infrastructure. Use a SageMaker training job when you need to train from scratch, fine-tune an open-weights model Bedrock does not host, fine-tune more deeply than managed customization allows (full fine-tuning, LoRA, continued pre-training), run classical/tabular ML that is not a foundation model at all, or control the training method and hardware (instances, distribution, Trainium). Control and breadth versus zero-ops convenience is the trade; many teams do both.

Q: Can AWS credits cover SageMaker training and GPU costs?

Yes. AWS credits apply to SageMaker training compute (training jobs, including GPU and multi-node clusters), storage, and features just like any other AWS service, auto-applying to your monthly bill until exhausted. Eligible programs include Activate Portfolio (up to about $100K), Bedrock/GenAI PoC funding ($10K–$50K), and the Generative AI Accelerator (up to $1M). Credits also stack on top of managed Spot savings and AWS Trainium savings, so disciplined cost management makes them last far longer. CloudRoute routes you to a vetted AWS partner who both runs the training and files the credit application; the customer pays $0 because AWS funds the pool and the partner pays CloudRoute a routing commission.

A SageMaker training job is managed, ephemeral compute that runs your training code on the instances you choose, writes the model artifact to S3, and tears the cluster down, so you pay only for the seconds it runs. This guide covers training jobs and the SDK estimators; the three ways to bring your code (built-in algorithms, script mode, your own container); distributed training (data- and model-parallel); Spot training and Warm Pools for cost; instance and GPU selection; checkpoints; Experiments and hyperparameter tuning; and when SageMaker training beats Bedrock fine-tuning. It also covers how AWS credits can fund the GPU bill so you pay $0.

compute model: ephemeral
billing granularity: per second
Spot saving: up to ~90%
credits to fund it: up to $1M

TL;DR

A SageMaker training job is managed, ephemeral compute for model training: you specify the instance type and count, the framework container, the data in S3, and the hyperparameters; SageMaker provisions the cluster, runs your code, writes the model artifact back to S3, and shuts the cluster down. You pay per second for the time the cluster exists, then it disappears — no idle infrastructure, unlike an inference endpoint.
There are three ways to bring your training code: built-in algorithms (AWS-provided, point at data and go), script mode (your PyTorch/TensorFlow/Hugging Face script in a managed framework container — the common case), and bring-your-own-container (your own Docker image for full control). Distributed training (data-parallel for big datasets, model-parallel for models too large for one GPU), managed Spot training (up to ~90% cheaper, interruptible, checkpoint-protected), Warm Pools, Experiments, and automatic hyperparameter tuning (HPO) sit on top of all three.
Use SageMaker training when you train from scratch, fine-tune open-weights models deeply (full or parameter-efficient), or run classical/tabular ML — full control over data, method, and hardware. Use Bedrock fine-tuning when you want to customize a supported foundation model with no infrastructure to manage. AWS credits (Activate up to $100K, Bedrock/GenAI PoC $10K–$50K, GenAI Accelerator up to $1M) cover the GPU training bill; CloudRoute routes you to the partner who both runs the training and files the credit application — you pay $0, AWS funds it.

definition

I. What a SageMaker training job actually is

A SageMaker training job is the managed, ephemeral compute that runs your training code on AWS-provisioned instances, produces a trained model artifact in S3, and then disappears — the build side of the ML lifecycle, where data plus code becomes a model.

The cleanest one-line definition: a SageMaker training job is a managed compute run that spins up the cluster you specify, executes your training code against your data in S3, writes the resulting model artifact back to S3, and tears the cluster down when it finishes — so you pay only for the seconds it runs. Where a raw EC2 GPU instance is a bare box you must launch, configure, secure, and remember to turn off, a training job is a declarative request: you describe what you want run and on what, and SageMaker handles provisioning, execution, log capture, and teardown.

This ephemerality is the defining property and the main contrast with an inference endpoint. An endpoint persists and bills for uptime; a training job spikes and vanishes and bills only for its runtime. There is no idle training cost — when the job ends, the compute is gone and the meter stops. That makes it safe to launch large, expensive clusters for a bounded run: a 16-GPU job that runs for three hours costs three hours of 16 GPUs and not a cent more.

Mechanically, you hand SageMaker a small set of inputs and it does the rest. You specify the instance type and count (one CPU box, one GPU, or a multi-node GPU cluster), the container/framework that holds your training environment, the input data channels (locations in S3, plus how the data is fed in), the hyperparameters, and an output path in S3 for the artifact. SageMaker provisions the instances, pulls the container, streams your data in, runs your code, captures stdout/stderr and metrics to CloudWatch, uploads the artifact, and shuts everything down. The same primitive trains a gradient-boosted tree on one CPU and a multi-billion-parameter transformer across dozens of GPUs.

A note on what a training job is not: it is not where you serve predictions (that is an endpoint, a separate and separately-billed step), and it is not how you fine-tune the managed foundation models on Amazon Bedrock with zero infrastructure (that is Bedrock fine-tuning, covered in section VIII). A SageMaker training job trains or fine-tunes your model (from scratch, from an open-weights checkpoint, or with a classical algorithm) on hardware you choose and control.

training job vs endpoint

A training job is ephemeral: it provisions compute, runs, writes a model artifact to S3, and tears the cluster down — billing only for the seconds it runs, with no idle cost. An endpoint is persistent: it stays up to serve predictions and bills for uptime. The artifact a training job produces is what you later deploy to an endpoint. Two separate steps, two separate bills.

how you launch one

II. The SDK and estimators: how you actually launch a training job

In practice you rarely click through the console to train. You launch training jobs from the SageMaker Python SDK using an Estimator — the object that bundles everything a job needs into a single fit() call.

The Estimator is the central abstraction. You instantiate one with the choices that define the run — the container/framework, the entry-point script, the instance type and count, the hyperparameters, the IAM role, and the output location — and then call estimator.fit({...}) with your S3 data channels. That single call creates the training job: SageMaker provisions the cluster, runs your code, and returns when the artifact is in S3. The estimator is, in effect, the training job expressed as code.
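As a rough illustration of the paragraph above, the sketch below collects the estimator choices in one place and then makes the single fit() call. The script name, source folder, instance type, framework version, and hyperparameter values are all assumptions chosen for the example; the SDK import is deferred so the helper can be read and exercised without the sagemaker package or AWS credentials.

```python
def build_estimator_kwargs(role_arn, output_s3_uri):
    """Collect the choices that define a run. Every value here is illustrative."""
    return {
        "entry_point": "train.py",        # hypothetical script name
        "source_dir": "src",              # hypothetical local folder holding train.py
        "role": role_arn,
        "instance_type": "ml.g5.xlarge",  # assumed single-GPU instance
        "instance_count": 1,
        "framework_version": "2.1",       # assumed; pick a version your SDK supports
        "py_version": "py310",
        "hyperparameters": {"epochs": 3, "lr": 1e-4},
        "output_path": output_s3_uri,
    }


def launch_training(role_arn, train_s3_uri, output_s3_uri):
    """Create the training job. Requires the sagemaker package and AWS credentials."""
    from sagemaker.pytorch import PyTorch  # deferred import: only needed at launch time

    estimator = PyTorch(**build_estimator_kwargs(role_arn, output_s3_uri))
    estimator.fit({"train": train_s3_uri})  # waits for the job to finish by default
    return estimator.model_data             # S3 URI of the resulting model artifact
```

Keeping the arguments in a plain dictionary is only a convenience for this sketch: it makes the decisions (hardware, script, hyperparameters, output) easy to review or diff before any compute is provisioned.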

The SDK ships framework-specific estimators that wrap the right managed container for you. A PyTorch, TensorFlow, HuggingFace, XGBoost, or SKLearn estimator points at the corresponding AWS-maintained Deep Learning Container so you do not build or manage the image yourself — you supply your training script and the estimator supplies the environment. There is also a generic Estimator for built-in algorithms (you pass the algorithm's container URI) and for bring-your-own-container (you pass your own image URI). Which estimator you reach for maps directly onto the three code paths in the next section.

A few estimator arguments carry most of the cost-and-behaviour weight. instance_type and instance_count set the hardware and the cluster size (and so the bill). use_spot_instances, max_run, and max_wait turn on and bound managed Spot training. checkpoint_s3_uri wires up checkpointing (essential for Spot and for long runs). distribution configures data- or model-parallel training across a multi-node cluster. keep_alive_period_in_seconds enables Warm Pools to cut start-up latency between back-to-back jobs. Each of these is a section of this guide; the estimator is where they all come together.

You can also launch the same jobs from the AWS SDK (boto3 create_training_job), the CLI, or as a step inside a SageMaker Pipeline for automated, repeatable training in an MLOps flow. The underlying training job is identical regardless of how it is launched — the SDK estimator is simply the most ergonomic front door for day-to-day work and for the experimentation that precedes production.
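For comparison with the estimator route, here is a minimal sketch of the lower-level request that boto3's create_training_job accepts. The instance type, volume size, runtime limit, and hyperparameter are placeholder assumptions, and the image URI must be supplied by the caller; nothing is sent to AWS unless submit() is called.

```python
def build_training_job_request(job_name, image_uri, role_arn, train_s3_uri, output_s3_uri):
    """Build the request body for create_training_job. Values are illustrative."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",  # assumed CPU instance
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        "HyperParameters": {"max_depth": "6"},  # the API expects string values
    }


def submit(request):
    """Send the request. Requires boto3 and AWS credentials."""
    import boto3  # deferred import: only needed when actually submitting

    return boto3.client("sagemaker").create_training_job(**request)
```

The fields line up with the estimator arguments (instance type and count, data channels, output path, hyperparameters), which is one way to see that the job is the same whichever front door launches it.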

the estimator in one line

Instantiate an Estimator (framework container + entry-point script + instance_type/instance_count + hyperparameters + IAM role + output path), then call estimator.fit({'train': s3_uri}). That one call is the training job. The arguments you pass are the decisions — instance choice, Spot, checkpoints, distribution, Warm Pools — covered through the rest of this guide.

bringing your code

III. Three ways to bring your training code

There are three distinct ways to get your training logic onto SageMaker, trading convenience for control. Knowing which one fits a given project saves a lot of wasted setup — most teams live in the middle option.

The three paths differ in how much of the training environment you own. With built-in algorithms AWS owns everything and you just supply data; with script mode you own the training script and AWS owns the container; with bring-your-own-container you own the entire image. They are not mutually exclusive across a team — a project might use a built-in algorithm for a baseline and script mode for the real model.

Built-in algorithms (AWS owns the code)

What it is: SageMaker provides a library of optimized, pre-implemented algorithms — XGBoost, linear learner, k-means, DeepAR forecasting, image classification, object detection, BlazingText, and more. You do not write training code at all: you point the algorithm's container at your data in S3, set its hyperparameters, and run. AWS has tuned these to scale and to use GPU/distributed compute where applicable.

When to use it: when a built-in algorithm already matches your problem — especially classical/tabular ML (XGBoost for fraud/churn/ranking, linear learner, k-means clustering, DeepAR for time-series forecasting). It is the fastest path to a trained model because there is nothing to implement or containerize. The trade-off is flexibility: you get the algorithm as AWS built it, with no room for custom architecture or training-loop changes.

Script mode (you own the script, AWS owns the container)

What it is: you write a normal training script in PyTorch, TensorFlow, JAX, Hugging Face, scikit-learn, or XGBoost, and run it inside an AWS-maintained framework container (a Deep Learning Container) via the matching estimator. SageMaker injects your data, hyperparameters, and environment configuration; your script trains and writes the model to the expected output path. You bring the modeling code; AWS brings a maintained, optimized, security-patched environment.

When to use it: this is the default and the most common path for custom models and for fine-tuning open-weights models. You get full freedom over the architecture and training loop while skipping the work of building and maintaining a Docker image — the framework, CUDA, and dependencies are AWS's responsibility. Most fine-tuning of Llama/Mistral/Falcon-class models, most custom deep-learning training, and most "I have a PyTorch script and a dataset" work lands here.
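To make "SageMaker injects your data, hyperparameters, and environment configuration" concrete, here is a minimal sketch of a script-mode entry point. It assumes the usual framework-container conventions: hyperparameters arrive as command-line flags, and the SM_MODEL_DIR and SM_CHANNEL_TRAIN environment variables point at the model output directory and the train channel. The flag names and the placeholder "model" file are hypothetical.

```python
import argparse
import json
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters set on the estimator are passed to the script as flags.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--lr", type=float, default=1e-3)
    # Paths default to the environment variables set inside the container.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    # Ignore flags this sketch does not know about instead of failing.
    args, _unknown = parser.parse_known_args(argv)
    return args


def main(argv=None):
    args = parse_args(argv)
    # A real script would load data from args.train and run the training loop here.
    os.makedirs(args.model_dir, exist_ok=True)
    with open(os.path.join(args.model_dir, "model.json"), "w") as f:
        json.dump({"epochs": args.epochs, "lr": args.lr}, f)  # placeholder artifact
    return args
```

In an actual entry-point file you would call main() under the usual `if __name__ == "__main__":` guard. Whatever the script leaves in the model directory is what gets packaged and uploaded as the model artifact.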

Bring-your-own-container (you own the whole image)

What it is: you build a Docker image that conforms to SageMaker's training contract (where it reads inputs, where it writes the model and checkpoints, how it surfaces metrics) and register it; SageMaker runs your container as the training job exactly as it would a built-in one. You control the OS, the framework version, every dependency, and any custom system-level setup.

When to use it: when script mode's managed containers do not fit — an unusual or bleeding-edge framework, specific native dependencies or compiled extensions, a proprietary training stack, or strict reproducibility requirements that demand you pin the entire image. It is the most work (you own image builds, patching, and CUDA/driver compatibility) but the most control. Reach for it only when the managed containers genuinely cannot accommodate the workload — otherwise script mode is less to maintain.
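The "training contract" mentioned above is mostly a directory layout under /opt/ml. The sketch below shows what a container's train program might do under that layout: read hyperparameters.json, read the train channel, write the model, and record a failure file on error. The function name and the placeholder model file are hypothetical, and the prefix is a parameter only so the logic can be tried outside a container.

```python
import json
import os
import traceback


def run_training(prefix="/opt/ml"):
    """Follow the container layout under prefix. Returns a process exit code."""
    hp_path = os.path.join(prefix, "input", "config", "hyperparameters.json")
    train_dir = os.path.join(prefix, "input", "data", "train")
    model_dir = os.path.join(prefix, "model")
    failure_path = os.path.join(prefix, "output", "failure")
    try:
        with open(hp_path) as f:
            hyperparameters = json.load(f)  # values arrive as strings
        files = sorted(os.listdir(train_dir))
        # A real image would run its training loop over the channel files here.
        os.makedirs(model_dir, exist_ok=True)
        with open(os.path.join(model_dir, "model.json"), "w") as f:
            json.dump({"hyperparameters": hyperparameters, "files_seen": files}, f)
        return 0
    except Exception:
        # Writing the traceback to the failure file surfaces the reason for the failure.
        os.makedirs(os.path.dirname(failure_path), exist_ok=True)
        with open(failure_path, "w") as f:
            f.write(traceback.format_exc())
        return 1
```

In a real image this logic would sit behind an executable named train, since that is the argument the container is started with, and the process would exit with the returned code.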

which path fits

Problem matches an AWS algorithm (XGBoost, k-means, DeepAR, classical/tabular) → built-in algorithm (no code). Custom model or fine-tuning a standard framework → script mode (your script, AWS's container — the common case). Unusual framework, native deps, or full reproducibility control → bring-your-own-container (your image, most work). When in doubt, start with script mode.

training at scale

IV. Distributed training: data-parallel and model-parallel

When a model or dataset outgrows a single GPU, you train across many. SageMaker supports two complementary strategies — data parallelism and model parallelism — and large modern training runs often combine both.

The reason to distribute is one of two pressures, and they call for different answers. Either the dataset is too big to train in reasonable time on one device (you want more throughput), or the model itself is too big to fit in a single GPU's memory (you have no choice but to split it). Data parallelism addresses the first; model parallelism addresses the second; very large foundation-model training combines them.

Data parallelism (replicate the model, split the data)

What it is: the model is replicated on every GPU, the training batch is split across them, each GPU computes gradients on its shard, and the gradients are synchronized (all-reduced) every step so all replicas stay identical. More GPUs means more samples processed per step — near-linear throughput scaling when communication is efficient. SageMaker's distributed data-parallel library is optimized for AWS networking to keep that gradient synchronization fast, and standard PyTorch DDP / torchrun also work.

When to use it: the common case — the model fits on one GPU but you want to train faster or on more data. You scale out by raising instance_count (and using multi-GPU instances), and the per-epoch wall-clock falls roughly in proportion to the number of GPUs, up to the point communication overhead dominates.
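A hedged sketch of how that scale-out is typically expressed on a PyTorch estimator: the distribution argument selects either plain torchrun-style launching or SageMaker's data-parallel library, and instance_count sets the number of nodes. The instance type is an assumption (an 8-GPU instance), and the helper names are hypothetical.

```python
def data_parallel_kwargs(instance_count, use_sagemaker_library=False):
    """Estimator arguments for data-parallel training. Values are illustrative."""
    if use_sagemaker_library:
        # SageMaker's distributed data-parallel library.
        distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}
    else:
        # Standard PyTorch launching via torchrun.
        distribution = {"torch_distributed": {"enabled": True}}
    return {
        "instance_type": "ml.p4d.24xlarge",  # assumed 8-GPU instance
        "instance_count": instance_count,
        "distribution": distribution,
    }


def global_batch_size(per_gpu_batch, gpus_per_instance, instance_count):
    """Samples processed per step once the batch is split across every GPU."""
    return per_gpu_batch * gpus_per_instance * instance_count
```

For example, a per-GPU batch of 32 on two 8-GPU instances gives a global batch of 512 per step. The learning rate usually needs revisiting when the global batch grows this way.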

Model parallelism (split the model itself)

What it is: the model is partitioned across multiple GPUs because it is too large to fit in one device's memory. SageMaker's model-parallel library shards parameters, gradients, and optimizer state across devices and coordinates the forward/backward passes — supporting tensor parallelism, pipeline parallelism, and sharded-data-parallel techniques (in the spirit of ZeRO) so multi-billion-parameter models can train where they otherwise would not fit. It integrates with the framework so you adapt rather than rewrite your training code.

When to use it: training or deeply fine-tuning large models — large language and foundation models especially — whose parameters plus optimizer state exceed a single GPU's memory. In practice large runs are hybrid: model-parallel to fit the model across a node, data-parallel across nodes for throughput. This is how billion-parameter-scale training is done on SageMaker.
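A back-of-the-envelope check can tell you which regime you are in. The sketch below assumes mixed-precision training with Adam, using the commonly cited figure of about 16 bytes of state per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer state). Activations and framework overhead are deliberately ignored, so treat the result as a lower bound, not a sizing guarantee.

```python
# Assumed bytes of training state per parameter (mixed precision with Adam):
# fp16 weights + fp16 gradients + fp32 master weights, momentum, and variance.
BYTES_PER_PARAM = 2 + 2 + 12


def training_state_gb(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Approximate weights + gradients + optimizer state, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


def fits_on_one_gpu(num_params, gpu_memory_gb):
    """True if the training state alone fits; activations are not counted."""
    return training_state_gb(num_params) <= gpu_memory_gb


def per_gpu_state_gb_when_sharded(num_params, num_gpus):
    """State per device if it is sharded evenly across num_gpus."""
    return training_state_gb(num_params) / num_gpus
```

Under these assumptions a 7-billion-parameter model carries about 112 GB of training state, which does not fit on an 80 GB GPU, while sharding it evenly across 8 GPUs leaves about 14 GB per device before activations.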

data- vs model-parallel in one line

Data parallel = replicate the model, split the batch — use it when the model fits on one GPU but you want more speed/throughput. Model parallel = split the model across GPUs — use it when the model is too big to fit on one GPU. Large foundation-model runs combine both (model-parallel within a node, data-parallel across nodes).

training cost

V. Cutting training cost: Spot training, Warm Pools, instance choice

Training compute is the dominant cost of building a model, and GPU instance choice is the single biggest lever. On top of that, managed Spot training and Warm Pools cut the bill and the wait without changing your code much.

Three levers, in rough order of impact: which instance you train on, whether you use Spot, and how you handle repeated/iterative runs. Get the first two right and you have addressed most of a training bill; Warm Pools then save wall-clock (and a little cost) on iterative work.

Instance and GPU selection

GPU choice dominates the bill — a high-end multi-GPU accelerator instance can cost many multiples per hour of a single smaller GPU or a CPU box. The discipline is to match the instance to the job: train classical/tabular models (XGBoost, linear learner) on CPU instances; use a single mid-range GPU for modest deep-learning and for most fine-tuning experiments; reserve large multi-GPU instances and multi-node clusters for genuinely large models or large datasets. Because a training job is ephemeral, a brief run on a big instance is often cheaper overall than a long run on an undersized one — and finishing sooner frees the GPUs.
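The claim that a short run on a big instance can beat a long run on a small one is simple per-second arithmetic. The hourly rates and runtimes below are invented for illustration and are not AWS prices.

```python
def job_cost(hourly_rate_usd, instance_count, runtime_seconds):
    """Cost of an ephemeral job billed per second of runtime."""
    return hourly_rate_usd * instance_count * runtime_seconds / 3600


# Hypothetical comparison: same training work on two instance sizes.
small_long = job_cost(hourly_rate_usd=1.50, instance_count=1, runtime_seconds=10 * 3600)
big_short = job_cost(hourly_rate_usd=10.00, instance_count=1, runtime_seconds=1 * 3600)
```

With these made-up numbers the small instance costs $15 over ten hours and the large one $10 over one hour. The comparison only holds if the larger instance really delivers the speed-up, so measure a short run before committing.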

AWS silicon is the other axis. AWS Trainium (the trn1/trn2 instances, via the Neuron SDK) is AWS's purpose-built training accelerator, positioned as cheaper per unit of training throughput than equivalent GPU instances for supported models. The migration effort is real (your stack must run on Neuron), but for large, repeated training it can compound into a meaningful saving — see the dedicated Trainium page.

Managed Spot training

What it is: SageMaker runs your training job on spare EC2 capacity (Spot) at a steep discount — commonly up to ~90% off on-demand — in exchange for interruptibility: AWS can reclaim the capacity, and the job resumes from its last checkpoint when capacity returns. You enable it with use_spot_instances=True and bound the wait with max_wait (total time including waiting for Spot) versus max_run (compute time).

The catch and the fix: because Spot can be interrupted, checkpointing is essential — without it an interruption restarts the run from zero. Wire up checkpoint_s3_uri so progress is saved to S3 and resumed automatically. With checkpoints in place, Spot is close to free money for most training: the worst case is a delay, not lost work. It is the single biggest cost lever for the training bill after instance choice, and it applies to almost any non-deadline-critical run.
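A minimal sketch of the estimator arguments involved, with a check that the wait limit is at least the run limit. The default durations are assumptions for the example. The savings helper uses the training-time and billable-time figures that a completed job reports.

```python
def spot_kwargs(checkpoint_s3_uri, max_run_seconds=4 * 3600, max_wait_seconds=8 * 3600):
    """Estimator arguments for managed Spot training. Durations are illustrative."""
    if max_wait_seconds < max_run_seconds:
        # max_wait covers waiting for capacity plus running, so it cannot be smaller.
        raise ValueError("max_wait must be at least max_run")
    return {
        "use_spot_instances": True,
        "max_run": max_run_seconds,
        "max_wait": max_wait_seconds,
        "checkpoint_s3_uri": checkpoint_s3_uri,
    }


def spot_savings_percent(training_seconds, billable_seconds):
    """Savings relative to paying for every training second at the on-demand rate."""
    return 100 * (training_seconds - billable_seconds) / training_seconds
```

These arguments would be merged into the estimator's keyword arguments alongside the instance and script settings. A job that trained for 1,000 seconds and was billed for 300 would show 70% savings.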

Warm Pools

What it is: normally every training job provisions a fresh cluster, which adds minutes of start-up latency before your code runs. Warm Pools (via keep_alive_period_in_seconds) keep the provisioned cluster alive for a configured window after a job finishes, so the next matching job reuses it and starts almost immediately — skipping re-provisioning and container download.

When to use it: iterative work — rapid experimentation, hyperparameter sweeps, debugging a training script with repeated short runs — where the per-job start-up overhead would otherwise dominate. You pay for the kept-alive time, so size the window to your iteration cadence; the payoff is much faster feedback loops during active development. Note that Spot (interruptible) and Warm Pools (held capacity) serve opposite goals: Spot for cheap one-off long runs, Warm Pools for fast repeated short runs.
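Sizing the window is a small trade-off calculation. The sketch below builds the single estimator argument and estimates the worst-case idle charge if no follow-up job arrives; the hourly rate in the example is hypothetical.

```python
def warm_pool_kwargs(keep_alive_seconds):
    """Estimator argument that keeps the cluster warm after a job finishes."""
    if keep_alive_seconds <= 0:
        raise ValueError("keep_alive_seconds must be positive")
    return {"keep_alive_period_in_seconds": keep_alive_seconds}


def worst_case_idle_cost(hourly_rate_usd, instance_count, keep_alive_seconds):
    """Cost if the pool sits unused for the whole keep-alive window."""
    return hourly_rate_usd * instance_count * keep_alive_seconds / 3600
```

At an assumed $4 per hour for one instance, a 15-minute window risks at most $1 of idle time per job, which is usually a fair price for skipping several minutes of provisioning on each iteration.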

the training cost stack

Right-size the instance/GPU first (CPU for classical, one GPU for most fine-tuning, multi-GPU only for large models; consider Trainium for big repeated runs). Then turn on managed Spot (up to ~90% off) with checkpoints for any non-urgent run. Use Warm Pools to kill start-up latency on iterative work. Then credits cover what remains — and disciplined sizing makes them last far longer.

resilience

VI. Checkpoints: why long and Spot training depend on them

A checkpoint is a periodic snapshot of training state saved to S3 so a run can resume from where it left off instead of starting over. For long runs and for Spot, checkpointing is not optional — it is what makes them safe.

During a training job, your code can write checkpoints — model weights plus optimizer state and the step/epoch counter — at intervals to a local path that SageMaker continuously syncs to checkpoint_s3_uri in S3. If the job is interrupted (a reclaimed Spot instance) or fails partway, a restart reads the latest checkpoint from S3 and continues from that point rather than from scratch. On a multi-hour or multi-day run, that is the difference between losing minutes and losing the whole job.
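The resume behaviour has to be implemented by your training script: SageMaker syncs the directory, but the code decides what to write and how to restart. Below is a framework-free sketch of that pattern, using JSON files as stand-ins for real weight and optimizer snapshots. The file naming and the fake "training step" are assumptions; /opt/ml/checkpoints is the default local checkpoint path.

```python
import json
import os

CHECKPOINT_DIR = "/opt/ml/checkpoints"  # default local path synced to checkpoint_s3_uri


def save_checkpoint(state, step, checkpoint_dir=CHECKPOINT_DIR):
    """Write a snapshot, renaming into place so readers never see a partial file."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    tmp_path = os.path.join(checkpoint_dir, "tmp.partial")
    final_path = os.path.join(checkpoint_dir, f"step-{step:08d}.json")
    with open(tmp_path, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, final_path)
    return final_path


def load_latest_checkpoint(checkpoint_dir=CHECKPOINT_DIR):
    """Return the newest snapshot, or None when starting fresh."""
    if not os.path.isdir(checkpoint_dir):
        return None
    names = sorted(n for n in os.listdir(checkpoint_dir)
                   if n.startswith("step-") and n.endswith(".json"))
    if not names:
        return None
    with open(os.path.join(checkpoint_dir, names[-1])) as f:
        return json.load(f)


def train(total_steps, checkpoint_every, checkpoint_dir=CHECKPOINT_DIR):
    """Resume from the latest snapshot if one exists, then continue training."""
    checkpoint = load_latest_checkpoint(checkpoint_dir)
    step = checkpoint["step"] if checkpoint else 0
    state = checkpoint["state"] if checkpoint else {"loss": None}
    while step < total_steps:
        step += 1
        state = {"loss": 1.0 / step}  # stand-in for a real optimizer step
        if step % checkpoint_every == 0:
            save_checkpoint(state, step, checkpoint_dir)
    return step, state
```

With a real framework you would replace the JSON payload with the model and optimizer state (for example via the framework's own save and load functions), but the shape stays the same: look for the latest snapshot first, and only start from step zero when none exists.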

This is exactly why checkpointing and Spot training go together. Managed Spot's discount comes from accepting interruptions; checkpoints make interruptions cheap, because resumption is automatic and only the work since the last checkpoint is repeated. Enable both and you get most of the cost saving with little of the risk. The tuning question is checkpoint frequency: too infrequent and an interruption costs more redone work; too frequent and you spend time and I/O writing snapshots — a sensible cadence balances the two for your run length.
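One way to put a number on that balance is Young's approximation, a classic rule of thumb from the fault-tolerance literature and not something this guide or AWS prescribes: the interval that minimizes expected lost time is roughly the square root of twice the checkpoint write cost times the mean time between interruptions. Both inputs are estimates you have to supply.

```python
import math


def young_checkpoint_interval(checkpoint_cost_seconds, mean_time_between_interruptions_seconds):
    """Young's approximation: interval ~ sqrt(2 * write cost * mean time between failures)."""
    return math.sqrt(2 * checkpoint_cost_seconds * mean_time_between_interruptions_seconds)


def expected_redone_seconds(interval_seconds):
    """On average an interruption lands halfway through an interval."""
    return interval_seconds / 2
```

For instance, if a snapshot takes about 60 seconds to write and you guess an interruption every 6 hours, the formula suggests checkpointing roughly every 27 minutes (about 1,610 seconds). Treat it as a starting point and adjust once you see real interruption rates.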

Checkpoints serve a second purpose beyond resilience: they let you keep the best model, not just the last one. Saving checkpoints across epochs means you can select the checkpoint with the best validation metric (early-stopping in spirit) rather than whatever state training happened to end on, and you retain intermediate artifacts for analysis or warm-starting a later run. For any non-trivial training job — and unconditionally for Spot — wire checkpointing in from the start.

checkpoints in one line

Set checkpoint_s3_uri so training state is snapshotted to S3 and a run resumes from the last checkpoint after an interruption or failure instead of restarting. It is mandatory for managed Spot (which can be interrupted) and strongly advised for any long run — and it lets you keep the best-validation checkpoint, not just the final state.

finding the best model

VII. Experiments and hyperparameter tuning (HPO)

Training is rarely one run — it is many, and you need to keep them comparable and find the best configuration. SageMaker Experiments tracks the runs; Automatic Model Tuning (HPO) searches the hyperparameter space for you.

These two features address the iterative reality of model building: you will train the same model dozens of times with different settings, and doing that by hand — both the bookkeeping and the search — is slow and error-prone. Experiments solves the bookkeeping; HPO automates the search.

SageMaker Experiments (track and compare runs)

What it is: automatic logging and organization of training runs — each run's hyperparameters, metrics, inputs, and resulting artifacts are recorded and grouped, so you can compare runs side by side rather than losing results in scattered notebook cells or filenames. It captures the lineage of which data and which settings produced which model.

Why it matters: reproducibility and comparability. When run #47 is the best, Experiments tells you exactly what made it best and lets you reproduce it — and it feeds the governance trail (which artifact, from which data and hyperparameters) that the Model Registry and pipelines rely on. It is the difference between disciplined iteration and guesswork.

Automatic Model Tuning / HPO

What it is: a tuning job that launches many training jobs across a hyperparameter search space and converges on the best configuration against a chosen objective metric, instead of you hand-tuning. SageMaker supports several search strategies — Bayesian optimization (learns from prior trials to pick promising next ones), random and grid search, and Hyperband (early-stops weak trials to spend compute on promising ones). You define the ranges and the objective; the tuner orchestrates the search.
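A minimal sketch of defining the ranges and the objective with the SDK's tuner. The search space, metric name, log regex, and job counts are all assumptions for the example; the SDK imports are deferred so the plain-Python parts can be exercised without the sagemaker package.

```python
# Hypothetical search space, kept as plain data until the SDK is needed.
SEARCH_SPACE = {
    "lr": ("continuous", 1e-5, 1e-2),
    "epochs": ("integer", 1, 10),
    "batch_size": ("categorical", [16, 32, 64]),
}


def to_sdk_ranges(space):
    """Convert the plain description into SDK parameter objects."""
    from sagemaker.tuner import CategoricalParameter, ContinuousParameter, IntegerParameter

    ranges = {}
    for name, spec in space.items():
        if spec[0] == "continuous":
            ranges[name] = ContinuousParameter(spec[1], spec[2], scaling_type="Logarithmic")
        elif spec[0] == "integer":
            ranges[name] = IntegerParameter(spec[1], spec[2])
        else:
            ranges[name] = CategoricalParameter(spec[1])
    return ranges


def build_tuner(estimator, space=SEARCH_SPACE, max_jobs=20, max_parallel_jobs=4):
    """Create a tuning job definition. Requires the sagemaker package."""
    from sagemaker.tuner import HyperparameterTuner

    return HyperparameterTuner(
        estimator=estimator,
        objective_metric_name="validation:loss",  # assumed metric name
        objective_type="Minimize",
        hyperparameter_ranges=to_sdk_ranges(space),
        # The regex must match whatever your script actually prints.
        metric_definitions=[{"Name": "validation:loss", "Regex": "val_loss=([0-9\\.]+)"}],
        strategy="Bayesian",
        max_jobs=max_jobs,
        max_parallel_jobs=max_parallel_jobs,
    )


def sequential_waves(max_jobs, max_parallel_jobs):
    """How many rounds of trials run back to back at full parallelism."""
    return -(-max_jobs // max_parallel_jobs)  # ceiling division
```

Calling fit() on the returned tuner with the same data channels you would pass to the estimator starts the search. With 20 trials at 4 in parallel there are 5 back-to-back waves, which gives a rough sense of both wall-clock time and how many training runs you are paying for.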

The cost note: HPO multiplies training runs, so it multiplies cost, which is exactly why it pairs with the levers from section V. Run tuning jobs on Spot (with checkpoints) to slash the per-trial cost, or keep the trials on-demand and use Warm Pools so the many short trials skip re-provisioning (AWS documents that warm pools cannot be combined with Spot instances, so choose one per job), and prefer Hyperband to kill unpromising trials early rather than running every combination to completion. Tuning is where good cost discipline pays off most, because it is where the run count explodes.

experiments + HPO together

Experiments records every run so you can compare and reproduce them; HPO (Automatic Model Tuning) launches many runs to find the best hyperparameters automatically (Bayesian, random, grid, or Hyperband). Because HPO multiplies runs, pair it with Spot + checkpoints (or Warm Pools for on-demand trials) and Hyperband early-stopping to keep the tuning bill in check.

the key distinction

VIII. SageMaker training vs Bedrock fine-tuning: when to use each

A fair question before you launch a training job at all: should you be training a model yourself, or customizing a managed foundation model through Amazon Bedrock fine-tuning? The two solve different problems, and the right answer depends on how much control you actually need.

Bedrock fine-tuning is "customize a managed model with no infrastructure." You take a supported foundation model on Bedrock, provide a labeled dataset, and Bedrock produces a customized private version of that model that you then call through the same managed API — paying for the customization plus storage/throughput for the custom model. You never see a GPU, a container, or a training cluster; you cannot change the architecture or the training method, but you also do not have to manage any of it. It is the shortest path to a model that speaks your domain or format, on top of an existing foundation model.

SageMaker training is "train or fine-tune the model yourself, with full control." You choose the model (including from-scratch architectures and any open-weights checkpoint), the framework, the training method (full fine-tuning, parameter-efficient methods like LoRA, continued pre-training, or training from zero), the instance type, and the distribution strategy — and you own the operational and cost decisions in return. It is the only option for classical/tabular ML (which is not a foundation model at all), for proprietary architectures, and for deep customization that goes beyond what Bedrock's managed fine-tuning exposes.

The deciding questions: is your target a foundation model Bedrock already supports, and is dataset-level fine-tuning enough? If yes — you want a Claude/Llama/Nova-class model adapted to your domain, with zero infrastructure — Bedrock fine-tuning is the shorter path. If no — you need to train from scratch, fine-tune an open-weights model Bedrock does not host, control the training method or hardware, run classical ML, or fine-tune more deeply than managed customization allows — SageMaker training is the right tool. Control and breadth versus convenience and zero-ops is the trade.

And, as with serving, the two coexist cleanly. A team might fine-tune a foundation model on Bedrock for its customer-facing generative features while training proprietary models on SageMaker (the recommender, the forecaster, a domain model on open weights) in the same account. You can also fine-tune an open foundation model on SageMaker when you want full control over an open-weights model rather than Bedrock's managed path. The comparison table below lays the two side by side; the dedicated Bedrock fine-tuning and Bedrock vs SageMaker pages go deeper.

the training decision in one line

Customizing a supported foundation model with a dataset and no infrastructure → Bedrock fine-tuning. Training from scratch, fine-tuning open-weights models deeply, running classical/tabular ML, or needing control over method and hardware → a SageMaker training job. Many teams do both — Bedrock fine-tunes the FM, SageMaker trains the proprietary models.
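
The one-line rule above can be written down as a small function. This is only an illustrative sketch: the `Workload` fields and the `choose_path` name are hypothetical and are not part of any AWS API.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a training need (not an AWS type)."""
    is_foundation_model: bool            # False for classical/tabular ML
    model_supported_on_bedrock: bool     # Bedrock offers customization for this model
    dataset_finetuning_is_enough: bool   # no custom method, architecture, or hardware needed
    train_from_scratch: bool = False


def choose_path(w: Workload) -> str:
    """Return 'bedrock' or 'sagemaker' following the rule in the text."""
    if w.train_from_scratch or not w.is_foundation_model:
        return "sagemaker"
    if w.model_supported_on_bedrock and w.dataset_finetuning_is_enough:
        return "bedrock"
    return "sagemaker"


print(choose_path(Workload(True, True, True)))     # supported FM, dataset-level tuning
print(choose_path(Workload(False, False, False)))  # e.g. a tabular XGBoost model
```

Anything that falls outside the narrow Bedrock case defaults to SageMaker, which mirrors the "control and breadth" side of the trade.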

the decision that matters most

SageMaker training vs Bedrock fine-tuning — side by side

The clearest way to choose: line the two up on the dimensions that actually drive the decision. Bedrock fine-tuning optimizes for zero-ops convenience on supported foundation models; SageMaker training optimizes for control, breadth, and the ability to train anything.

Dimension	Bedrock fine-tuning	SageMaker training
What it is	Managed customization of a supported foundation model	Run your own training/fine-tuning on compute you choose
You manage infrastructure?	No — no GPUs, containers, or clusters	Yes — you pick instances, containers, distribution
Train from scratch?	No	Yes
Fine-tune open-weights models?	Only models Bedrock supports for customization	Yes (any open-weights checkpoint, full or parameter-efficient)
Classical / tabular ML?	No (foundation models only)	Yes (XGBoost, linear learner, k-means, etc.)
Control over training method	Limited (managed dataset fine-tuning)	Full (full FT, LoRA, continued pre-training, scratch)
Distributed / large-model training	Handled for you, not exposed	Data- and model-parallel, Trainium, multi-node
Pricing basis	Per customization + custom-model storage/throughput	Per instance-second of training compute (Spot-eligible)
Best for	Adapting an existing FM, zero ops	Custom/proprietary models, deep control, classical ML

Not mutually exclusive. A common pattern fine-tunes a foundation model on Bedrock for generative features and trains proprietary/classical models on SageMaker in the same account. See the dedicated Bedrock fine-tuning and Bedrock vs SageMaker pages for the deep comparisons.

a recent match

A from-scratch training bill cut and taken to $0 — anonymized

inquiry · seed-stage computer-vision AI, Germany

Seed-stage computer-vision startup, 11 people, training a proprietary defect-detection model on AWS

Situation: Their core product was a custom vision model trained on their own labeled imagery — not something an off-the-shelf foundation model or Bedrock fine-tuning could produce, so they needed full SageMaker training. The model was large enough to need multi-GPU training, and the team was running a wide hyperparameter sweep on always-on on-demand GPU instances. Training compute had climbed past ~$9K/month during the build, most of it from on-demand GPU time and an HPO sweep that ran every combination to completion — more than the seed budget could absorb.

What CloudRoute did: Routed within 21 hours to an EU-Central partner with an ML / SageMaker training track record. The partner moved the training jobs onto managed Spot instances with checkpointing to S3 (so interruptions resumed automatically), switched the hyperparameter tuning to Hyperband with Warm Pools (early-stopping weak trials and skipping re-provisioning between the many short runs), and right-sized the cluster with data-parallel training — then, in parallel, filed an Activate Portfolio credit application plus a GenAI PoC application for the workload.

Outcome: Spot plus the HPO and Warm-Pool changes cut the training run-rate from ~$9K to ~$3K/month (Spot discount on the long runs, Hyperband killing weak trials early, right-sized data-parallel cluster). Credits approved within 15 days then covered the remaining bill — taking effective training cost to ~$0 through the credit window, during which the team finished training and shipped the model. CloudRoute's commission was paid by the partner from AWS engagement funding; the customer paid $0.

training bill cut: ~65% before credits · then to ~$0 on credits · matched in: < 24h · cost to customer: $0

faq

Common questions

What is a SageMaker training job?

A SageMaker training job is managed, ephemeral compute that runs your training code on AWS-provisioned instances. You specify the instance type and count, the framework container, the data location in S3, and the hyperparameters; SageMaker provisions the cluster, runs your code, writes the trained model artifact back to S3, captures logs and metrics to CloudWatch, and tears the cluster down when it finishes. You pay per second only for the time the cluster exists — unlike an inference endpoint, there is no idle training cost, because the compute disappears when the job ends.
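
As a rough illustration, the pieces listed above map onto the request that boto3's `create_training_job` expects. The helper name, the bucket, image, and role values are hypothetical placeholders; the field names follow the CreateTrainingJob API. The sketch only builds the request and does not call AWS.

```python
def build_training_job_request(job_name, image_uri, role_arn, train_s3_uri,
                               output_s3_uri, instance_type="ml.g5.xlarge",
                               instance_count=1, hyperparameters=None,
                               max_runtime_s=3600):
    """Assemble keyword arguments for the SageMaker create_training_job call."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,       # the framework or custom container
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 50,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_s},
        # The API expects every hyperparameter value as a string.
        "HyperParameters": {k: str(v) for k, v in (hyperparameters or {}).items()},
    }


# Placeholder values for illustration only.
request = build_training_job_request(
    job_name="demo-job",
    image_uri="123456789012.dkr.ecr.eu-central-1.amazonaws.com/demo:latest",
    role_arn="arn:aws:iam::123456789012:role/DemoSageMakerRole",
    train_s3_uri="s3://demo-bucket/train/",
    output_s3_uri="s3://demo-bucket/output/",
    hyperparameters={"epochs": 3, "lr": 0.001},
)
print(sorted(request))
```

With credentials configured, the request would be submitted with `boto3.client("sagemaker").create_training_job(**request)`.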

How do I launch a training job on SageMaker?

The common way is the SageMaker Python SDK. You create an Estimator (a framework estimator like PyTorch, TensorFlow, HuggingFace, or XGBoost, or a generic one) configured with your entry-point script, instance type and count, hyperparameters, IAM role, and output path, then call estimator.fit({'train': s3_uri}) with your data channels — that single call creates the training job. You can also launch jobs from the AWS SDK (boto3 create_training_job), the CLI, or as a step inside a SageMaker Pipeline for automated, repeatable training in an MLOps flow.
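
A minimal sketch of that flow, assuming a version of the SageMaker Python SDK that ships the `Estimator` classes and a `train.py` entry point in the working directory. The framework and Python versions shown are assumptions; check which container versions are available in your region. The SDK import sits inside `launch` so the settings can be inspected without the SDK or AWS credentials.

```python
def estimator_settings(role_arn, output_s3_uri):
    """Keyword arguments for a PyTorch framework estimator (values are examples)."""
    return {
        "entry_point": "train.py",         # your script-mode training script
        "role": role_arn,
        "instance_type": "ml.g5.xlarge",
        "instance_count": 1,
        "framework_version": "2.1",        # assumed version pair, verify before use
        "py_version": "py310",
        "hyperparameters": {"epochs": 3, "lr": 1e-4},
        "output_path": output_s3_uri,
    }


def launch(settings, train_s3_uri):
    """Create the estimator and start the training job (needs the SDK and credentials)."""
    from sagemaker.pytorch import PyTorch  # imported lazily on purpose

    estimator = PyTorch(**settings)
    estimator.fit({"train": train_s3_uri})  # this call creates the training job
    return estimator


settings = estimator_settings(
    "arn:aws:iam::123456789012:role/DemoSageMakerRole",  # placeholder role
    "s3://demo-bucket/output/",                          # placeholder bucket
)
print(settings["instance_type"], settings["hyperparameters"])
```

Calling `launch(settings, "s3://demo-bucket/train/")` would start a real, billable job, so the sketch stops at printing the settings.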

What are the three ways to bring training code to SageMaker?

Built-in algorithms (AWS-provided implementations like XGBoost, linear learner, k-means, and DeepAR — you write no training code, just point the algorithm at your data); script mode (you write a normal PyTorch/TensorFlow/Hugging Face/scikit-learn script and run it inside an AWS-maintained framework container — the common path for custom models and fine-tuning); and bring-your-own-container (you build a Docker image conforming to SageMaker's training contract for full control over the environment). Most teams use script mode; built-in algorithms suit classical/tabular ML; bring-your-own-container is for unusual frameworks or strict reproducibility needs.
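
For script mode, the training script is ordinary Python that reads its inputs from the command line and from environment variables the framework container sets. The sketch below assumes the usual `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR` variables and falls back to local paths so it also runs on a laptop; the argument names are illustrative.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided paths for a script-mode job."""
    parser = argparse.ArgumentParser()
    # Hyperparameters are passed to the script as command-line flags.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--lr", type=float, default=1e-3)
    # Inside the container these environment variables point at the
    # downloaded 'train' channel and the directory whose contents are
    # uploaded to S3 as the model artifact.
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "data/train"))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    return parser.parse_args(argv)


args = parse_args(["--epochs", "3", "--lr", "0.0005"])
print(args.epochs, args.lr, args.train, args.model_dir)
```

Because the paths have local defaults, the same script can be smoke-tested outside SageMaker before it is submitted as a job.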

What is the difference between data-parallel and model-parallel training?

Data parallelism replicates the whole model on every GPU, splits each training batch across them, and synchronizes (all-reduces) gradients every step. Use it when the model fits on one GPU but you want more throughput or need to train on more data; adding GPUs cuts wall-clock time close to proportionally until gradient-synchronization overhead starts to dominate. Model parallelism splits the model itself across multiple GPUs (via tensor parallelism, pipeline parallelism, and sharded techniques) because it is too large to fit in one device's memory. Large foundation-model training is usually hybrid: model-parallel within a node, data-parallel across nodes.
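
To make the data-parallel step concrete, here is a toy, framework-free sketch: a one-parameter model `y = w * x`, with the batch split evenly across simulated workers and the per-worker gradients averaged. It is not how SageMaker's distributed libraries are implemented; it only shows why averaging shard gradients reproduces the full-batch gradient when shards are equal in size.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_gradient(w, batch, num_workers):
    """Split the batch evenly, compute per-worker gradients, then average them."""
    if len(batch) % num_workers != 0:
        raise ValueError("this sketch assumes equal-sized shards")
    shard = len(batch) // num_workers
    shards = [batch[i * shard:(i + 1) * shard] for i in range(num_workers)]
    local = [gradient(w, s) for s in shards]  # every worker holds a full model copy
    return sum(local) / num_workers           # the all-reduce (mean) step


batch = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 samples of y = 3x
full = gradient(0.5, batch)
parallel = data_parallel_gradient(0.5, batch, num_workers=4)
print(abs(full - parallel) < 1e-9)
```

Model parallelism has no equally small analogue: there the parameters themselves are partitioned, so no single worker could evaluate `gradient` on its own.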

How does managed Spot training save money, and what is the catch?

Managed Spot training runs your job on spare EC2 capacity at a steep discount — commonly up to about 90% off on-demand — by accepting interruptibility: AWS can reclaim the capacity, and the job resumes when capacity returns. You enable it with use_spot_instances=True and bound it with max_wait and max_run. The catch is interruptions, and the fix is checkpointing: set checkpoint_s3_uri so training state is snapshotted to S3 and the run resumes from the last checkpoint instead of restarting from zero. With checkpoints in place, Spot is the single biggest cost lever for the training bill after instance choice, and the worst case of an interruption is a delay, not lost work.
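
A hedged sketch of the Spot settings named above, plus the savings arithmetic. The helper names are hypothetical; the keyword names match the estimator parameters in the text, and the savings formula assumes the training and billable durations that SageMaker reports for a finished job.

```python
def spot_settings(checkpoint_s3_uri, max_run_s, max_wait_s):
    """Estimator keyword arguments for managed Spot training."""
    # max_wait covers run time plus time spent waiting for Spot capacity,
    # so it must not be shorter than max_run.
    if max_wait_s < max_run_s:
        raise ValueError("max_wait must be at least max_run")
    return {
        "use_spot_instances": True,
        "max_run": max_run_s,
        "max_wait": max_wait_s,
        "checkpoint_s3_uri": checkpoint_s3_uri,  # lets an interrupted run resume
    }


def spot_savings_pct(training_seconds, billable_seconds):
    """Percentage saved, from total training time vs. billed time."""
    return round(100 * (1 - billable_seconds / training_seconds), 1)


settings = spot_settings("s3://demo-bucket/checkpoints/", max_run_s=4 * 3600, max_wait_s=8 * 3600)
print(settings)
print(spot_savings_pct(training_seconds=3600, billable_seconds=1080))
```

The bucket name and durations are placeholders; the resulting dictionary would be merged into the estimator's other arguments.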

Why do I need checkpoints for SageMaker training?

A checkpoint is a periodic snapshot of training state (weights, optimizer state, and the step/epoch counter) written to a local path that SageMaker syncs to checkpoint_s3_uri in S3. Checkpoints let a run resume from where it left off after an interruption or failure rather than starting over. That makes them effectively mandatory for managed Spot training (Spot capacity can be reclaimed mid-run, and without a checkpoint the job restarts from zero) and strongly advised for any long, multi-hour or multi-day job. They also let you keep the best-validation checkpoint rather than just the final state. Tune the checkpoint frequency to balance redone work after an interruption against the time and I/O of writing snapshots.
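
The resume logic itself lives in your training script. Below is a minimal, framework-free sketch of the pattern: write a snapshot every N steps, and on start-up load the newest one if it exists. In a real job the directory would be the local checkpoint path that SageMaker syncs (by default `/opt/ml/checkpoints`) and the state would be model and optimizer tensors; here a temporary directory and a single float stand in, and all function names are illustrative.

```python
import json
import os
import tempfile


def save_checkpoint(ckpt_dir, step, state):
    """Write one snapshot; the rename keeps half-written files from being loaded."""
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"step-{step:08d}.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)


def load_latest(ckpt_dir):
    """Return (step, state) from the newest snapshot, or (0, None) if there is none."""
    if not os.path.isdir(ckpt_dir):
        return 0, None
    files = sorted(f for f in os.listdir(ckpt_dir)
                   if f.startswith("step-") and f.endswith(".json"))
    if not files:
        return 0, None
    with open(os.path.join(ckpt_dir, files[-1])) as f:
        data = json.load(f)
    return data["step"], data["state"]


def train(ckpt_dir, total_steps, every=10):
    """Resume from the last snapshot if present, then run to total_steps."""
    step, state = load_latest(ckpt_dir)
    state = state or {"weight": 0.0}
    while step < total_steps:
        state["weight"] += 0.1  # stand-in for one optimizer step
        step += 1
        if step % every == 0:
            save_checkpoint(ckpt_dir, step, state)
    return step, state


with tempfile.TemporaryDirectory() as demo_dir:
    train(demo_dir, total_steps=25)          # "interrupted" after step 25
    resumed_from, _ = load_latest(demo_dir)  # newest snapshot on disk
    final_step, _ = train(demo_dir, total_steps=30)
    print(resumed_from, final_step)
```

The gap between the interruption point and the newest snapshot is the work that gets redone, which is the quantity the checkpoint frequency trades against write cost.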

What are SageMaker Warm Pools and when should I use them?

Normally every training job provisions a fresh cluster, adding minutes of start-up latency before your code runs. Warm Pools (set via keep_alive_period_in_seconds) keep the provisioned cluster alive for a configured window after a job finishes, so the next matching job reuses it and starts almost immediately, skipping re-provisioning and container download. Use them for iterative work (rapid experimentation, hyperparameter sweeps, and debugging with repeated short runs) where per-job start-up overhead would otherwise dominate. You pay for the kept-alive time, so size the window to your iteration cadence. Note that Spot and Warm Pools serve opposite goals and cannot be combined on the same job: Spot for cheap one-off long runs, Warm Pools for fast repeated short runs.
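
A back-of-the-envelope sketch of why this matters for short runs. The helper names and the numbers are hypothetical, and the model is deliberately simple: jobs run one after another, each follow-up job matches the pool and starts inside the keep-alive window, and a warm start is treated as free. The 3600-second cap reflects the documented upper limit for the keep-alive period at the time of writing.

```python
def warm_pool_settings(keep_alive_s):
    """Estimator keyword argument that keeps the cluster alive after a job."""
    if not 0 <= keep_alive_s <= 3600:
        raise ValueError("keep-alive period must be between 0 and 3600 seconds")
    return {"keep_alive_period_in_seconds": keep_alive_s}


def sweep_wall_clock_s(num_jobs, run_s, startup_s, warm):
    """Total wall-clock time for sequential jobs, with or without a warm pool."""
    startups = 1 if warm else num_jobs  # only the first job provisions when warm
    return num_jobs * run_s + startups * startup_s


cold = sweep_wall_clock_s(num_jobs=20, run_s=120, startup_s=240, warm=False)
warm = sweep_wall_clock_s(num_jobs=20, run_s=120, startup_s=240, warm=True)
print(warm_pool_settings(600), cold, warm)
```

With two-minute runs and four-minute start-ups, provisioning is most of the cold total; the warm figure leaves out the idle keep-alive time you are billed for, which is the cost side of the trade.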

What is hyperparameter tuning (HPO) on SageMaker?

Automatic Model Tuning (HPO) is a tuning job that launches many training jobs across a hyperparameter search space and converges on the best configuration against a chosen objective metric, replacing hand-tuning. It supports Bayesian optimization (learns from prior trials), random and grid search, and Hyperband (early-stops weak trials to concentrate compute on promising ones). Because HPO multiplies the number of training runs, it multiplies cost. Run longer trials on managed Spot with checkpoints, use Warm Pools for sweeps of many short on-demand trials so they skip re-provisioning (Warm Pools cannot be combined with Spot on the same job), and prefer Hyperband to stop unpromising trials early. SageMaker Experiments tracks each run so the trials stay comparable and reproducible.
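
As an illustration, a Hyperband tuning job can be described with the `HyperParameterTuningJobConfig` structure that boto3's `create_hyper_parameter_tuning_job` accepts. The helper name, the metric name, and the ranges below are made-up examples; the field names follow the API, where range bounds are strings and the Hyperband resources are integers (typically epochs). The sketch only builds the structure.

```python
def hyperband_tuning_config(metric_name, max_jobs, max_parallel,
                            min_epochs, max_epochs):
    """Build a HyperParameterTuningJobConfig that uses the Hyperband strategy."""
    return {
        "Strategy": "Hyperband",
        "StrategyConfig": {
            "HyperbandStrategyConfig": {
                "MinResource": min_epochs,  # budget before a trial may be stopped
                "MaxResource": max_epochs,  # budget given to the best trials
            }
        },
        "HyperParameterTuningJobObjective": {
            "Type": "Minimize",
            "MetricName": metric_name,
        },
        "ResourceLimits": {
            "MaxNumberOfTrainingJobs": max_jobs,
            "MaxParallelTrainingJobs": max_parallel,
        },
        "ParameterRanges": {
            "ContinuousParameterRanges": [{
                "Name": "lr", "MinValue": "1e-5", "MaxValue": "1e-2",
                "ScalingType": "Logarithmic",
            }],
            "IntegerParameterRanges": [{
                "Name": "batch_size", "MinValue": "16", "MaxValue": "128",
                "ScalingType": "Linear",
            }],
        },
    }


config = hyperband_tuning_config("validation:loss", max_jobs=40, max_parallel=4,
                                 min_epochs=1, max_epochs=20)
print(config["Strategy"], config["ResourceLimits"])
```

The cap on total and parallel jobs is the direct control on the tuning bill; Hyperband then decides how much of each trial's budget is actually spent.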

When should I use SageMaker training instead of Bedrock fine-tuning?

Use Bedrock fine-tuning when your target is a foundation model Bedrock already supports and dataset-level customization is enough — you provide a labeled dataset, Bedrock produces a private customized version you call through the same managed API, and you manage no infrastructure. Use a SageMaker training job when you need to train from scratch, fine-tune an open-weights model Bedrock does not host, fine-tune more deeply than managed customization allows (full fine-tuning, LoRA, continued pre-training), run classical/tabular ML that is not a foundation model at all, or control the training method and hardware (instances, distribution, Trainium). Control and breadth versus zero-ops convenience is the trade; many teams do both.

Can AWS credits cover SageMaker training and GPU costs?

Yes. AWS credits apply to SageMaker training compute (training jobs, including GPU and multi-node clusters), storage, and features just like any other AWS service, auto-applying to your monthly bill until exhausted. Eligible programs include Activate Portfolio (up to about $100K), Bedrock/GenAI PoC funding ($10K–$50K), and the Generative AI Accelerator (up to $1M). Credits also stack on top of managed Spot savings and AWS Trainium savings, so disciplined cost management makes them last far longer. CloudRoute routes you to a vetted AWS partner who both runs the training and files the credit application; the customer pays $0 because AWS funds the pool and the partner pays CloudRoute a routing commission.

Train it on SageMaker — funded by AWS credits

CloudRoute connects ML and data-science teams with vetted AWS partners who run SageMaker training and fine-tuning, optimize the GPU bill (Spot, Trainium, distributed training), and file the credit applications that fund it. Customer pays $0 — AWS funds it.

matched within: < 24h · credit ceiling: up to $1M · cost to you: $0