A SageMaker training job is managed, ephemeral compute that runs your training code on the instances you choose, writes the model artifact to S3, and tears the cluster down, so you pay only for the seconds it runs. This guide covers training jobs and the SDK estimators; the three ways to bring your code (built-in algorithms, script mode, your own container); distributed training (data- and model-parallel); Spot training and Warm Pools for cost; instance and GPU selection; checkpoints; Experiments and hyperparameter tuning; and when SageMaker training beats Bedrock fine-tuning. It also explains how AWS credits fund the GPU bill so you pay $0.
A SageMaker training job is the managed, ephemeral compute that runs your training code on AWS-provisioned instances, produces a trained model artifact in S3, and then disappears — the build side of the ML lifecycle, where data plus code becomes a model.
The cleanest one-line definition: a SageMaker training job is a managed compute run that spins up the cluster you specify, executes your training code against your data in S3, writes the resulting model artifact back to S3, and tears the cluster down when it finishes — so you pay only for the seconds it runs. Where a raw EC2 GPU instance is a bare box you must launch, configure, secure, and remember to turn off, a training job is a declarative request: you describe what you want run and on what, and SageMaker handles provisioning, execution, log capture, and teardown.
This ephemerality is the defining property and the main contrast with an inference endpoint. An endpoint persists and bills for uptime; a training job spikes and vanishes and bills only for its runtime. There is no idle training cost — when the job ends, the compute is gone and the meter stops. That makes it safe to launch large, expensive clusters for a bounded run: a 16-GPU job that runs for three hours costs three hours of 16 GPUs and not a cent more.
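The billing model above reduces to one multiplication. The sketch below works it through for a 16-GPU, three-hour run; the hourly price is a made-up placeholder, not an AWS list price, and `training_job_cost` is a hypothetical helper name.

```python
def training_job_cost(instance_count: int, price_per_instance_hour: float, runtime_seconds: int) -> float:
    """Cost of one job billed per second of runtime, with no idle charge."""
    return instance_count * price_per_instance_hour * runtime_seconds / 3600


# Hypothetical: 2 instances with 8 GPUs each (16 GPUs in total) at an
# assumed $30.00 per instance-hour, running for exactly three hours.
cost = training_job_cost(instance_count=2, price_per_instance_hour=30.0, runtime_seconds=3 * 3600)
print(f"${cost:.2f}")
```

Once the job ends the runtime stops growing, so the same function returns the final bill however long you wait afterwards.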
Mechanically, you hand SageMaker a small set of inputs and it does the rest. You specify the instance type and count (one CPU box, one GPU, or a multi-node GPU cluster), the container/framework that holds your training environment, the input data channels (locations in S3, plus how the data is fed in), the hyperparameters, and an output path in S3 for the artifact. SageMaker provisions the instances, pulls the container, streams your data in, runs your code, captures stdout/stderr and metrics to CloudWatch, uploads the artifact, and shuts everything down. The same primitive trains a gradient-boosted tree on one CPU and a multi-billion-parameter transformer across dozens of GPUs.
A note on what a training job is not: it is not where you serve predictions (that is an endpoint, a separate and separately billed step), and it is not how you customize a managed foundation model with zero infrastructure (that is Bedrock fine-tuning, covered in section VIII). A SageMaker training job trains or fine-tunes your own model, whether from scratch, from an open-weights checkpoint, or with a classical algorithm, on hardware you choose and control.
A training job is ephemeral: it provisions compute, runs, writes a model artifact to S3, and tears the cluster down — billing only for the seconds it runs, with no idle cost. An endpoint is persistent: it stays up to serve predictions and bills for uptime. The artifact a training job produces is what you later deploy to an endpoint. Two separate steps, two separate bills.
In practice you rarely click through the console to train. You launch training jobs from the SageMaker Python SDK using an Estimator — the object that bundles everything a job needs into a single fit() call.
The Estimator is the central abstraction. You instantiate one with the choices that define the run — the container/framework, the entry-point script, the instance type and count, the hyperparameters, the IAM role, and the output location — and then call estimator.fit({...}) with your S3 data channels. That single call creates the training job: SageMaker provisions the cluster, runs your code, and returns when the artifact is in S3. The estimator is, in effect, the training job expressed as code.
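As a rough illustration of the pattern just described, the sketch below collects the estimator arguments in a plain dictionary and passes them to the SageMaker Python SDK's `PyTorch` estimator (v2-style API). The role ARN, bucket, script name, instance type, and framework/Python versions are all placeholder assumptions; the SDK import sits inside `launch` so the configuration can be inspected without the SDK installed.

```python
def build_estimator_kwargs(role_arn: str, bucket: str) -> dict:
    """Collect the choices that define one training run (all values are placeholders)."""
    return {
        "entry_point": "train.py",       # hypothetical training script
        "source_dir": "src",             # hypothetical local folder holding train.py
        "framework_version": "2.2",      # assumed PyTorch container version
        "py_version": "py310",           # assumed to match that container
        "role": role_arn,
        "instance_type": "ml.g5.xlarge",
        "instance_count": 1,
        "hyperparameters": {"epochs": 3, "lr": 1e-4},
        "output_path": f"s3://{bucket}/models/",
    }


def launch(role_arn: str, bucket: str) -> None:
    """Create the training job. Needs the sagemaker package and AWS credentials."""
    from sagemaker.pytorch import PyTorch  # lazy import: only required when launching

    estimator = PyTorch(**build_estimator_kwargs(role_arn, bucket))
    estimator.fit({"train": f"s3://{bucket}/data/train/"})  # this call is the training job


kwargs = build_estimator_kwargs("arn:aws:iam::123456789012:role/ExampleSageMakerRole", "example-bucket")
print(kwargs["instance_type"], kwargs["output_path"])
```

Keeping the arguments in a dictionary is only a convenience for this sketch; passing them directly as keyword arguments is equivalent.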
The SDK ships framework-specific estimators that wrap the right managed container for you. A PyTorch, TensorFlow, HuggingFace, XGBoost, or SKLearn estimator points at the corresponding AWS-maintained Deep Learning Container so you do not build or manage the image yourself — you supply your training script and the estimator supplies the environment. There is also a generic Estimator for built-in algorithms (you pass the algorithm's container URI) and for bring-your-own-container (you pass your own image URI). Which estimator you reach for maps directly onto the three code paths in the next section.
A few estimator arguments carry most of the cost-and-behaviour weight. instance_type and instance_count set the hardware and the cluster size (and so the bill). use_spot_instances, max_run, and max_wait turn on and bound managed Spot training. checkpoint_s3_uri wires up checkpointing (essential for Spot and for long runs). distribution configures data- or model-parallel training across a multi-node cluster. keep_alive_period_in_seconds enables Warm Pools to cut start-up latency between back-to-back jobs. Each of these is a section of this guide; the estimator is where they all come together.
You can also launch the same jobs from the AWS SDK (boto3 create_training_job), the CLI, or as a step inside a SageMaker Pipeline for automated, repeatable training in an MLOps flow. The underlying training job is identical regardless of how it is launched — the SDK estimator is simply the most ergonomic front door for day-to-day work and for the experimentation that precedes production.
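For comparison, here is a minimal sketch of the request that the lower-level `create_training_job` call expects, built as a plain dictionary. The job name, image URI, role, bucket, and hyperparameter values are hypothetical; the field names follow the CreateTrainingJob API as I understand it, so check them against the current API reference before relying on them.

```python
def build_training_job_request(job_name: str, image_uri: str, role_arn: str, bucket: str) -> dict:
    """Assemble a CreateTrainingJob request; nothing is sent to AWS here."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {"TrainingImage": image_uri, "TrainingInputMode": "File"},
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": f"s3://{bucket}/data/train/",
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": f"s3://{bucket}/models/"},
        "ResourceConfig": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 50},
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        "HyperParameters": {"max_depth": "6", "eta": "0.2"},  # the API takes string values
    }


request = build_training_job_request(
    job_name="example-xgboost-job",
    image_uri="<algorithm-or-custom-image-uri>",  # placeholder, not a real image
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    bucket="example-bucket",
)
# To submit it (requires boto3 and credentials):
#   boto3.client("sagemaker").create_training_job(**request)
print(sorted(request))
```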
Instantiate an Estimator (framework container + entry-point script + instance_type/instance_count + hyperparameters + IAM role + output path), then call estimator.fit({'train': s3_uri}). That one call is the training job. The arguments you pass are the decisions — instance choice, Spot, checkpoints, distribution, Warm Pools — covered through the rest of this guide.
There are three distinct ways to get your training logic onto SageMaker, trading convenience for control. Knowing which one fits a given project saves a lot of wasted setup — most teams live in the middle option.
The three paths differ in how much of the training environment you own. With built-in algorithms AWS owns everything and you just supply data; with script mode you own the training script and AWS owns the container; with bring-your-own-container you own the entire image. They are not mutually exclusive across a team — a project might use a built-in algorithm for a baseline and script mode for the real model.
What it is: SageMaker provides a library of optimized, pre-implemented algorithms — XGBoost, linear learner, k-means, DeepAR forecasting, image classification, object detection, BlazingText, and more. You do not write training code at all: you point the algorithm's container at your data in S3, set its hyperparameters, and run. AWS has tuned these to scale and to use GPU/distributed compute where applicable.
When to use it: when a built-in algorithm already matches your problem — especially classical/tabular ML (XGBoost for fraud/churn/ranking, linear learner, k-means clustering, DeepAR for time-series forecasting). It is the fastest path to a trained model because there is nothing to implement or containerize. The trade-off is flexibility: you get the algorithm as AWS built it, with no room for custom architecture or training-loop changes.
What it is: you write a normal training script in PyTorch, TensorFlow, JAX, Hugging Face, scikit-learn, or XGBoost, and run it inside an AWS-maintained framework container (a Deep Learning Container) via the matching estimator. SageMaker injects your data, hyperparameters, and environment configuration; your script trains and writes the model to the expected output path. You bring the modeling code; AWS brings a maintained, optimized, security-patched environment.
When to use it: this is the default and the most common path for custom models and for fine-tuning open-weights models. You get full freedom over the architecture and training loop while skipping the work of building and maintaining a Docker image — the framework, CUDA, and dependencies are AWS's responsibility. Most fine-tuning of Llama/Mistral/Falcon-class models, most custom deep-learning training, and most "I have a PyTorch script and a dataset" work lands here.
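To make the script-mode contract concrete, here is a minimal sketch of how an entry-point script might pick up what SageMaker injects: hyperparameters arrive as command-line arguments, and data and output locations arrive in environment variables such as `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR`. The argument names and defaults are illustrative assumptions, and `parse_job_config` is a hypothetical helper.

```python
import argparse
import os


def parse_job_config(argv=None, environ=None):
    """Read hyperparameters from argv and data/model locations from the environment."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--model-dir", default=environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train-dir", default=environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    # Ignore arguments this sketch does not declare instead of failing on them.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate what the container would pass for hyperparameters={"epochs": 5}.
cfg = parse_job_config(["--epochs", "5"], {"SM_MODEL_DIR": "/tmp/model"})
print(cfg.epochs, cfg.lr, cfg.model_dir, cfg.train_dir)
```

A real script would go on to load data from `cfg.train_dir`, train, and save the model under `cfg.model_dir`, which SageMaker uploads to S3 when the job ends.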
What it is: you build a Docker image that conforms to SageMaker's training contract (where it reads inputs, where it writes the model and checkpoints, how it surfaces metrics) and register it; SageMaker runs your container as the training job exactly as it would a built-in one. You control the OS, the framework version, every dependency, and any custom system-level setup.
When to use it: when script mode's managed containers do not fit — an unusual or bleeding-edge framework, specific native dependencies or compiled extensions, a proprietary training stack, or strict reproducibility requirements that demand you pin the entire image. It is the most work (you own image builds, patching, and CUDA/driver compatibility) but the most control. Reach for it only when the managed containers genuinely cannot accommodate the workload — otherwise script mode is less to maintain.
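The training contract is mostly a directory layout under `/opt/ml`. The sketch below shows the shape of a container's training program under that layout: hyperparameters from `input/config/hyperparameters.json`, channel data from `input/data/<channel>`, and the artifact written to `model`. The "training" step is a placeholder, `run_training` is a hypothetical name, and the demo runs against a temporary directory rather than a real container.

```python
import json
import tempfile
from pathlib import Path


def run_training(prefix: str = "/opt/ml") -> dict:
    """Read inputs under prefix and write the model under prefix/model."""
    root = Path(prefix)
    hp_file = root / "input" / "config" / "hyperparameters.json"
    # Hyperparameter values are delivered as strings, so convert explicitly.
    hyperparameters = json.loads(hp_file.read_text()) if hp_file.exists() else {}
    epochs = int(hyperparameters.get("epochs", "1"))
    train_dir = root / "input" / "data" / "train"
    train_files = sorted(p.name for p in train_dir.glob("*") if p.is_file())

    model_dir = root / "model"
    model_dir.mkdir(parents=True, exist_ok=True)
    # Placeholder "model": a real container would run its training loop here.
    artifact = {"epochs": epochs, "trained_on": train_files}
    (model_dir / "model.json").write_text(json.dumps(artifact))
    return artifact


# Demo against a temporary directory that mimics the container layout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "input" / "config").mkdir(parents=True)
    (Path(tmp) / "input" / "data" / "train").mkdir(parents=True)
    (Path(tmp) / "input" / "config" / "hyperparameters.json").write_text('{"epochs": "3"}')
    (Path(tmp) / "input" / "data" / "train" / "part-0.csv").write_text("x,y\n1,2\n")
    result = run_training(tmp)
print(result)
```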
Problem matches an AWS algorithm (XGBoost, k-means, DeepAR, classical/tabular) → built-in algorithm (no code). Custom model or fine-tuning a standard framework → script mode (your script, AWS's container — the common case). Unusual framework, native deps, or full reproducibility control → bring-your-own-container (your image, most work). When in doubt, start with script mode.
When a model or dataset outgrows a single GPU, you train across many. SageMaker supports two complementary strategies — data parallelism and model parallelism — and large modern training runs often combine both.
The reason to distribute is one of two pressures, and they call for different answers. Either the dataset is too big to train in reasonable time on one device (you want more throughput), or the model itself is too big to fit in a single GPU's memory (you have no choice but to split it). Data parallelism addresses the first; model parallelism addresses the second; very large foundation-model training combines them.
What it is: the model is replicated on every GPU, the training batch is split across them, each GPU computes gradients on its shard, and the gradients are synchronized (all-reduced) every step so all replicas stay identical. More GPUs means more samples processed per step — near-linear throughput scaling when communication is efficient. SageMaker's distributed data-parallel library is optimized for AWS networking to keep that gradient synchronization fast, and standard PyTorch DDP / torchrun also work.
When to use it: the common case — the model fits on one GPU but you want to train faster or on more data. You scale out by raising instance_count (and using multi-GPU instances), and the per-epoch wall-clock falls roughly in proportion to the number of GPUs, up to the point communication overhead dominates.
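The synchronization step is easier to see in a toy simulation. The sketch below is plain Python, not SageMaker's library or PyTorch DDP: two simulated workers compute gradients on equal shards of a batch, and an averaging "all-reduce" leaves both holding the same gradient a single device would have computed on the full batch. The data, the one-parameter model, and the function names are invented for illustration.

```python
def local_gradient(w: float, shard: list) -> float:
    """Gradient of mean squared error for y ~ w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values: list) -> list:
    """Stand-in for all-reduce: every worker ends up with the same mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
shards = [batch[:2], batch[2:]]  # two equally sized shards, one per simulated GPU
w = 0.5

per_worker = [local_gradient(w, shard) for shard in shards]
synced = all_reduce_mean(per_worker)
full_batch = local_gradient(w, batch)
print(per_worker, synced[0], full_batch)
```

The equality between the synchronized gradient and the full-batch gradient holds here because the shards are the same size; with uneven shards the average would need to be weighted.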
What it is: the model is partitioned across multiple GPUs because it is too large to fit in one device's memory. SageMaker's model-parallel library shards parameters, gradients, and optimizer state across devices and coordinates the forward/backward passes — supporting tensor parallelism, pipeline parallelism, and sharded-data-parallel techniques (in the spirit of ZeRO) so multi-billion-parameter models can train where they otherwise would not fit. It integrates with the framework so you adapt rather than rewrite your training code.
When to use it: training or deeply fine-tuning large models — large language and foundation models especially — whose parameters plus optimizer state exceed a single GPU's memory. In practice large runs are hybrid: model-parallel to fit the model across a node, data-parallel across nodes for throughput. This is how billion-parameter-scale training is done on SageMaker.
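A back-of-the-envelope memory estimate shows when a model stops fitting on one device. The sketch assumes roughly 16 bytes of training state per parameter for mixed-precision Adam (half-precision weights and gradients plus full-precision master weights and two optimizer moments), ignores activations entirely, and uses a hypothetical 80 GB GPU; treat the numbers as an order-of-magnitude guide only.

```python
def training_state_gb(num_params: float, bytes_per_param: int = 16) -> float:
    """Approximate weights + gradients + optimizer state, in GB (activations excluded)."""
    return num_params * bytes_per_param / 1e9


def fits_on_gpu(num_params: float, gpu_memory_gb: float, shards: int = 1) -> bool:
    """True if the training state, split evenly over `shards` devices, fits on each."""
    return training_state_gb(num_params) / shards <= gpu_memory_gb


params_7b = 7e9
print(training_state_gb(params_7b))             # state for a 7-billion-parameter model
print(fits_on_gpu(params_7b, 80.0))             # one assumed 80 GB GPU
print(fits_on_gpu(params_7b, 80.0, shards=8))   # state sharded across 8 such GPUs
```

Under these assumptions the 7B model needs about 112 GB of state, which is why it has to be sharded even though its half-precision weights alone would fit.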
Data parallel = replicate the model, split the batch — use it when the model fits on one GPU but you want more speed/throughput. Model parallel = split the model across GPUs — use it when the model is too big to fit on one GPU. Large foundation-model runs combine both (model-parallel within a node, data-parallel across nodes).
Training compute is the dominant cost of building a model, and GPU instance choice is the single biggest lever. On top of that, managed Spot training and Warm Pools cut the bill and the wait without changing your code much.
Three levers, in rough order of impact: which instance you train on, whether you use Spot, and how you handle repeated/iterative runs. Get the first two right and you have addressed most of a training bill; Warm Pools then save wall-clock (and a little cost) on iterative work.
GPU choice dominates the bill: a high-end multi-GPU instance can cost many times more per hour than a single smaller GPU instance or a CPU box. The discipline is to match the instance to the job: train classical/tabular models (XGBoost, linear learner) on CPU instances; use a single mid-range GPU for modest deep-learning and for most fine-tuning experiments; reserve large multi-GPU instances and multi-node clusters for genuinely large models or large datasets. Because a training job is ephemeral and billed by runtime, a brief run on a big instance is cheaper overall than a long run on an undersized one whenever the speed-up exceeds the price ratio, and finishing sooner frees the GPUs.
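Whether the bigger instance wins comes down to comparing its speed-up with its price ratio. The sketch below uses invented prices and run times (not AWS list prices) and hypothetical helper names to show the comparison.

```python
def run_cost(price_per_hour: float, hours: float) -> float:
    """Total cost of one run billed by runtime."""
    return price_per_hour * hours


def bigger_instance_is_cheaper(price_ratio: float, speedup: float) -> bool:
    """The larger instance costs less overall when it is faster by more than it is pricier."""
    return speedup > price_ratio


# Assumed numbers: the big instance is 8x the hourly price but finishes 10x sooner.
small = run_cost(price_per_hour=1.50, hours=20)
big = run_cost(price_per_hour=12.00, hours=2)
print(small, big, bigger_instance_is_cheaper(price_ratio=12.00 / 1.50, speedup=20 / 2))
```

If the speed-up were only 4x at the same 8x price, the smaller instance would be the cheaper choice, so it is worth measuring throughput before scaling up.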
AWS silicon is the other axis. AWS Trainium (the trn1/trn2 instances, via the Neuron SDK) is AWS's purpose-built training accelerator, positioned as cheaper per unit of training throughput than equivalent GPU instances for supported models. The migration effort is real (your stack must run on Neuron), but for large, repeated training it can compound into a meaningful saving — see the dedicated Trainium page.
What it is: SageMaker runs your training job on spare EC2 capacity (Spot) at a steep discount — commonly up to ~90% off on-demand — in exchange for interruptibility: AWS can reclaim the capacity, and the job resumes from its last checkpoint when capacity returns. You enable it with use_spot_instances=True and bound the wait with max_wait (total time including waiting for Spot) versus max_run (compute time).
The catch and the fix: because Spot can be interrupted, checkpointing is essential; without it, an interruption restarts the run from zero. Wire up checkpoint_s3_uri so progress is saved to S3 and resumed automatically. With checkpoints in place, Spot is a low-risk saving for most training: the worst case is a delay plus the work done since the last checkpoint, not a lost run. It is the biggest cost lever for the training bill after instance choice, and it applies to almost any run that is not deadline-critical.
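The Spot arguments fit together as sketched below: `max_wait` has to cover `max_run` plus whatever time you are willing to spend waiting for capacity, and the saving actually achieved can be read off a finished job by comparing billable seconds with training seconds. The helper names, bucket, and time values are illustrative assumptions.

```python
def spot_settings(max_run_seconds: int, extra_wait_seconds: int, checkpoint_s3_uri: str) -> dict:
    """Estimator keyword arguments for managed Spot training with checkpointing."""
    if extra_wait_seconds < 0:
        raise ValueError("max_wait must be at least max_run")
    return {
        "use_spot_instances": True,
        "max_run": max_run_seconds,                        # compute time limit
        "max_wait": max_run_seconds + extra_wait_seconds,  # compute time plus waiting for Spot
        "checkpoint_s3_uri": checkpoint_s3_uri,
    }


def spot_savings_percent(training_seconds: int, billable_seconds: int) -> float:
    """Saving relative to on-demand, from a finished job's training and billable time."""
    return (1 - billable_seconds / training_seconds) * 100


settings = spot_settings(4 * 3600, 2 * 3600, "s3://example-bucket/checkpoints/run-1/")
print(settings["max_run"], settings["max_wait"])
print(round(spot_savings_percent(training_seconds=1000, billable_seconds=300), 1))
```

These keyword arguments can be merged into the estimator's other arguments; the two time values for the savings calculation correspond to the training-time and billable-time fields reported when you describe a completed job.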
What it is: normally every training job provisions a fresh cluster, which adds minutes of start-up latency before your code runs. Warm Pools (via keep_alive_period_in_seconds) keep the provisioned cluster alive for a configured window after a job finishes, so the next matching job reuses it and starts almost immediately — skipping re-provisioning and container download.
When to use it: iterative work, such as rapid experimentation, hyperparameter sweeps, or debugging a training script with repeated short runs, where the per-job start-up overhead would otherwise dominate. You pay for the kept-alive time, so size the window to your iteration cadence; the payoff is much faster feedback loops during active development. Note that Spot (interruptible) and Warm Pools (held capacity) serve opposite goals, and a given job uses one or the other: managed Warm Pools are not available for jobs that run on Spot instances. Use Spot for cheap one-off long runs and Warm Pools for fast repeated short runs.
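A quick estimate shows why start-up overhead matters for short, repeated runs. The sketch assumes that with a warm pool only the first job pays the cold start and every later job begins inside the keep-alive window with negligible delay; the run counts and minute values are invented.

```python
def total_minutes(runs: int, train_minutes: float, startup_minutes: float, warm_pool: bool) -> float:
    """Wall-clock for a series of back-to-back jobs, with or without a warm pool."""
    # Simplification: a warm pool removes the cold start from every job after the first.
    cold_starts = 1 if warm_pool else runs
    return runs * train_minutes + cold_starts * startup_minutes


without_pool = total_minutes(runs=10, train_minutes=5, startup_minutes=4, warm_pool=False)
with_pool = total_minutes(runs=10, train_minutes=5, startup_minutes=4, warm_pool=True)
print(without_pool, with_pool)
```

The estimate covers wall-clock only; the kept-alive time between jobs is billed, so the cost side depends on how quickly the next job follows.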
Right-size the instance/GPU first (CPU for classical, one GPU for most fine-tuning, multi-GPU only for large models; consider Trainium for big repeated runs). Then turn on managed Spot (up to ~90% off) with checkpoints for any non-urgent run. Use Warm Pools to kill start-up latency on iterative work. Then credits cover what remains — and disciplined sizing makes them last far longer.
A checkpoint is a periodic snapshot of training state saved to S3 so a run can resume from where it left off instead of starting over. For long runs and for Spot, checkpointing is not optional — it is what makes them safe.
During a training job, your code can write checkpoints — model weights plus optimizer state and the step/epoch counter — at intervals to a local path that SageMaker continuously syncs to checkpoint_s3_uri in S3. If the job is interrupted (a reclaimed Spot instance) or fails partway, a restart reads the latest checkpoint from S3 and continues from that point rather than from scratch. On a multi-hour or multi-day run, that is the difference between losing minutes and losing the whole job.
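The save-and-resume cycle can be sketched without any ML framework. In the toy loop below the "weights" are just a running sum and checkpoints are JSON files; in a real job the directory would be the local checkpoint path that SageMaker syncs to checkpoint_s3_uri (`/opt/ml/checkpoints` by default), and the state would hold model and optimizer tensors. The function names and the `stop_after` interruption switch are inventions for the demo.

```python
import json
import tempfile
from pathlib import Path


def latest_checkpoint(ckpt_dir: Path):
    """Return the newest saved state, or None when starting fresh."""
    files = sorted(ckpt_dir.glob("step-*.json"))  # zero-padded names sort by step
    return json.loads(files[-1].read_text()) if files else None


def train(ckpt_dir: Path, total_steps: int, save_every: int, stop_after=None):
    """Toy training loop; stop_after simulates an interruption after that many steps."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    state = latest_checkpoint(ckpt_dir) or {"step": 0, "weights": 0.0}
    steps_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and steps_run == stop_after:
            return state, steps_run  # interrupted: unsaved progress is lost
        state["step"] += 1
        state["weights"] += 0.1      # stand-in for an optimizer update
        steps_run += 1
        if state["step"] % save_every == 0:
            path = ckpt_dir / f"step-{state['step']:06d}.json"
            path.write_text(json.dumps(state))
    return state, steps_run


with tempfile.TemporaryDirectory() as tmp:
    ckpts = Path(tmp) / "checkpoints"
    _, first_steps = train(ckpts, total_steps=10, save_every=3, stop_after=7)
    resumed_from = latest_checkpoint(ckpts)["step"]
    final_state, second_steps = train(ckpts, total_steps=10, save_every=3)
print(first_steps, resumed_from, final_state["step"], second_steps)
```

The first run is cut off at step 7, the restart picks up from the step-6 checkpoint, and only one step of work is repeated.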
This is exactly why checkpointing and Spot training go together. Managed Spot's discount comes from accepting interruptions; checkpoints make interruptions cheap, because resumption is automatic and only the work since the last checkpoint is repeated. Enable both and you get most of the cost saving with little of the risk. The tuning question is checkpoint frequency: too infrequent and an interruption costs more redone work; too frequent and you spend time and I/O writing snapshots — a sensible cadence balances the two for your run length.
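One common starting point for the frequency question is Young's first-order approximation, which sets the checkpoint interval to the square root of twice the checkpoint write time multiplied by the mean time between interruptions. It is a general rule of thumb rather than SageMaker guidance, and the write time and interruption rate below are assumed values you would need to measure for your own runs.

```python
import math


def young_interval_seconds(checkpoint_write_seconds: float, mean_seconds_between_interruptions: float) -> float:
    """Young's approximation for the checkpoint interval that minimizes expected lost time."""
    return math.sqrt(2 * checkpoint_write_seconds * mean_seconds_between_interruptions)


# Assumed: a checkpoint takes 30 s to write and interruptions average one every 4 hours.
interval = young_interval_seconds(30, 4 * 3600)
print(f"checkpoint roughly every {interval / 60:.1f} minutes")
```

The square root means the interval is fairly forgiving: a fourfold error in the estimated interruption rate changes the suggested cadence by only a factor of two.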
Checkpoints serve a second purpose beyond resilience: they let you keep the best model, not just the last one. Saving checkpoints across epochs means you can select the checkpoint with the best validation metric (early-stopping in spirit) rather than whatever state training happened to end on, and you retain intermediate artifacts for analysis or warm-starting a later run. For any non-trivial training job — and unconditionally for Spot — wire checkpointing in from the start.
Set checkpoint_s3_uri so training state is snapshotted to S3 and a run resumes from the last checkpoint after an interruption or failure instead of restarting. It is mandatory for managed Spot (which can be interrupted) and strongly advised for any long run — and it lets you keep the best-validation checkpoint, not just the final state.
Training is rarely one run — it is many, and you need to keep them comparable and find the best configuration. SageMaker Experiments tracks the runs; Automatic Model Tuning (HPO) searches the hyperparameter space for you.
These two features address the iterative reality of model building: you will train the same model dozens of times with different settings, and doing that by hand — both the bookkeeping and the search — is slow and error-prone. Experiments solves the bookkeeping; HPO automates the search.
What it is: automatic logging and organization of training runs — each run's hyperparameters, metrics, inputs, and resulting artifacts are recorded and grouped, so you can compare runs side by side rather than losing results in scattered notebook cells or filenames. It captures the lineage of which data and which settings produced which model.
Why it matters: reproducibility and comparability. When run #47 is the best, Experiments tells you exactly what made it best and lets you reproduce it — and it feeds the governance trail (which artifact, from which data and hyperparameters) that the Model Registry and pipelines rely on. It is the difference between disciplined iteration and guesswork.
What it is: a tuning job that launches many training jobs across a hyperparameter search space and converges on the best configuration against a chosen objective metric, instead of you hand-tuning. SageMaker supports several search strategies — Bayesian optimization (learns from prior trials to pick promising next ones), random and grid search, and Hyperband (early-stops weak trials to spend compute on promising ones). You define the ranges and the objective; the tuner orchestrates the search.
The cost note: HPO multiplies training runs, so it multiplies cost, which is exactly why it pairs with the levers from section V. Run tuning trials on Spot (with checkpoints) to cut the per-trial cost, or, for short on-demand trials, use Warm Pools so they skip re-provisioning (a given job uses one or the other, not both), and prefer Hyperband to kill unpromising trials early rather than running every combination to completion. Tuning is where good cost discipline pays off most, because it is where the run count explodes.
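The early-stopping idea behind Hyperband can be sketched with a single round of successive halving, which is the building block Hyperband repeats across several brackets. This is a toy in plain Python, not SageMaker's tuner: the scoring function is a deterministic stand-in for a validation metric, each round is counted as training from scratch, and the candidate learning rates are invented.

```python
def successive_halving(configs, score_fn, min_epochs=1, eta=2):
    """Keep the top 1/eta of configs each round and give survivors eta times the budget."""
    survivors = list(configs)
    epochs = min_epochs
    total_epochs = 0
    while len(survivors) > 1:
        total_epochs += epochs * len(survivors)  # budget spent in this round
        ranked = sorted(survivors, key=lambda cfg: score_fn(cfg, epochs), reverse=True)
        survivors = ranked[: max(1, len(survivors) // eta)]
        epochs *= eta
    return survivors[0], total_epochs


def toy_score(learning_rate: float, epochs: int) -> float:
    """Deterministic stand-in for a validation metric; best at learning_rate = 0.1."""
    return epochs / (epochs + 1) - abs(learning_rate - 0.1)


candidates = [0.001, 0.01, 0.05, 0.1, 0.3, 0.5, 0.8, 1.0]
best, spent = successive_halving(candidates, toy_score)
exhaustive = len(candidates) * 4  # every candidate trained for the final 4-epoch budget
print(best, spent, exhaustive)
```

In this toy setup the halving schedule spends 24 epochs against 32 for training every candidate to the full budget; the gap widens as the number of candidates and the maximum budget grow.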
Experiments records every run so you can compare and reproduce them; HPO (Automatic Model Tuning) launches many runs to find the best hyperparameters automatically (Bayesian, random, grid, or Hyperband). Because HPO multiplies runs, pair it with Spot + checkpoints (or Warm Pools for short on-demand trials) and Hyperband early-stopping to keep the tuning bill in check.
A fair question before you launch a training job at all: should you be training a model yourself, or customizing a managed foundation model through Amazon Bedrock fine-tuning? The two solve different problems, and the right answer depends on how much control you actually need.
Bedrock fine-tuning is "customize a managed model with no infrastructure." You take a supported foundation model on Bedrock, provide a labeled dataset, and Bedrock produces a customized private version of that model that you then call through the same managed API — paying for the customization plus storage/throughput for the custom model. You never see a GPU, a container, or a training cluster; you cannot change the architecture or the training method, but you also do not have to manage any of it. It is the shortest path to a model that speaks your domain or format, on top of an existing foundation model.
SageMaker training is "train or fine-tune the model yourself, with full control." You choose the model (including from-scratch architectures and any open-weights checkpoint), the framework, the training method (full fine-tuning, parameter-efficient methods like LoRA, continued pre-training, or training from zero), the instance type, and the distribution strategy — and you own the operational and cost decisions in return. It is the only option for classical/tabular ML (which is not a foundation model at all), for proprietary architectures, and for deep customization that goes beyond what Bedrock's managed fine-tuning exposes.
The deciding questions: is your target a foundation model Bedrock already supports, and is dataset-level fine-tuning enough? If yes — you want a Claude/Llama/Nova-class model adapted to your domain, with zero infrastructure — Bedrock fine-tuning is the shorter path. If no — you need to train from scratch, fine-tune an open-weights model Bedrock does not host, control the training method or hardware, run classical ML, or fine-tune more deeply than managed customization allows — SageMaker training is the right tool. Control and breadth versus convenience and zero-ops is the trade.
And, as with serving, the two coexist cleanly. A team might fine-tune a foundation model on Bedrock for its customer-facing generative features while training proprietary models on SageMaker (the recommender, the forecaster, a domain model on open weights) in the same account. You can also fine-tune an open foundation model on SageMaker when you want full control over an open-weights model rather than Bedrock's managed path. The comparison table below lays the two side by side; the dedicated Bedrock fine-tuning and Bedrock vs SageMaker pages go deeper.
Customizing a supported foundation model with a dataset and no infrastructure → Bedrock fine-tuning. Training from scratch, fine-tuning open-weights models deeply, running classical/tabular ML, or needing control over method and hardware → a SageMaker training job. Many teams do both — Bedrock fine-tunes the FM, SageMaker trains the proprietary models.
The clearest way to choose: line the two up on the dimensions that actually drive the decision. Bedrock fine-tuning optimizes for zero-ops convenience on supported foundation models; SageMaker training optimizes for control, breadth, and the ability to train anything.
| Dimension | Bedrock fine-tuning | SageMaker training |
|---|---|---|
| What it is | Managed customization of a supported foundation model | Run your own training/fine-tuning on compute you choose |
| You manage infrastructure? | No — no GPUs, containers, or clusters | Yes — you pick instances, containers, distribution |
| Train from scratch? | No | Yes |
| Fine-tune open-weights models? | Only those Bedrock offers for customization | Yes (any framework, full or parameter-efficient) |
| Classical / tabular ML? | No (foundation models only) | Yes (XGBoost, linear learner, k-means, etc.) |
| Control over training method | Limited (managed dataset fine-tuning) | Full (full FT, LoRA, continued pre-training, scratch) |
| Distributed / large-model training | Handled for you, not exposed | Data- and model-parallel, Trainium, multi-node |
| Pricing basis | Per customization + custom-model storage/throughput | Per instance-second of training compute (Spot-eligible) |
| Best for | Adapting an existing FM, zero ops | Custom/proprietary models, deep control, classical ML |
Situation: Their core product was a custom vision model trained on their own labeled imagery — not something an off-the-shelf foundation model or Bedrock fine-tuning could produce, so they needed full SageMaker training. The model was large enough to need multi-GPU training, and the team was running a wide hyperparameter sweep on always-on on-demand GPU instances. Training compute had climbed past ~$9K/month during the build, most of it from on-demand GPU time and an HPO sweep that ran every combination to completion — more than the seed budget could absorb.
What CloudRoute did: Routed within 21 hours to an EU-Central partner with an ML / SageMaker training track record. The partner moved the training jobs onto managed Spot instances with checkpointing to S3 (so interruptions resumed automatically), switched the hyperparameter tuning to Hyperband with Warm Pools (early-stopping weak trials and skipping re-provisioning between the many short runs), and right-sized the cluster with data-parallel training — then, in parallel, filed an Activate Portfolio credit application plus a GenAI PoC application for the workload.
Outcome: Spot plus the HPO and Warm-Pool changes cut the training run-rate from ~$9K to ~$3K/month (Spot discount on the long runs, Hyperband killing weak trials early, right-sized data-parallel cluster). Credits approved within 15 days then covered the remaining bill — taking effective training cost to ~$0 through the credit window, during which the team finished training and shipped the model. CloudRoute's commission was paid by the partner from AWS engagement funding; the customer paid $0.
training bill cut: ~67% before credits · then to ~$0 on credits · matched in: < 24h · cost to customer: $0
CloudRoute connects ML and data-science teams with vetted AWS partners who run SageMaker training and fine-tuning, optimize the GPU bill (Spot, Trainium, distributed training), and file the credit applications that fund it. Customer pays $0 — AWS funds it.