A working ML system is not a model — it is a loop. Data prep, training, evaluation, deployment, monitoring, and retraining have to run as one governed pipeline or the model quietly rots in production. This is the definitive 2026 guide to building that loop on AWS: the SageMaker stack stage by stage, the Pipelines + Model Registry backbone that holds it together, and the honest line between custom ML and Amazon Bedrock.
The single most expensive misconception in applied ML is that shipping a model is the finish line. It is the starting line. A model is a function fit to a snapshot of the world, and the world keeps moving — so the real engineering problem is not training a good model once, it is keeping a good model good.
Picture the naive version first, because almost every team starts here. A data scientist pulls a dataset, opens a notebook, engineers some features, trains a model, gets an AUC they are happy with, exports a pickle file, and hands it to engineering to "put behind an API." Six weeks later the model is live. Everyone moves on. This is the moment the trouble starts, not the moment it ends.
Within a few months, three things have quietly happened. The input data has drifted — what users send in production no longer matches the training snapshot. The relationship between inputs and outcome has shifted too (concept drift), because the business changed, a competitor launched, prices moved, or a season turned. And nobody can reproduce the model: the notebook has been edited, the dataset overwritten, the exact features a mystery. The model is now a liability no one can confidently retrain or roll back.
The mature version treats ML as a closed loop with feedback. Data flows in and is prepared the same way every time. Training is a job that runs from versioned code against versioned data and produces a versioned model. Evaluation is a gate, not a vibe. Deployment is controlled and reversible. Monitoring watches both the infrastructure and the statistics of what the model sees and predicts. And when monitoring fires, the loop closes — retraining kicks off, produces a candidate, evaluates it against the incumbent, and promotes it only if it wins. Every arrow in that loop is automated, logged, and reproducible.
AWS's entire ML platform is organized around making that loop cheap to build. SageMaker is not one product; it is roughly a dozen services that each own one stage of the loop, plus two services — Pipelines and the Model Registry — whose only job is to connect the stages into something repeatable. The rest of this guide walks the loop stage by stage, names the AWS service that owns each stage, and is honest about where the seams are and what teams get wrong.
MLOps is the practice of making every stage of the ML lifecycle — data, training, evaluation, deployment, monitoring, retraining — reproducible, automated, and governed, so that a model can be rebuilt, audited, and rolled back without heroics. If you cannot answer "exactly which data and code produced the model currently serving traffic?" you do not yet have MLOps.
More ML projects fail on data plumbing than on modeling. The two failures that matter most are training/serving skew (the features at training time are computed differently than at inference time) and unreproducible feature engineering (nobody can recreate the exact inputs to a past model). AWS gives you three tools that map cleanly to those problems.
Start with raw data movement and transformation at scale. AWS Glue is the serverless ETL workhorse — it crawls sources, infers schemas into a central Data Catalog, and runs Spark or Python-shell jobs to clean, join, and reshape data into the curated tables that feed everything downstream. For a team standardizing a lake, Glue is where the heavy scheduled transformation lives, and its Data Catalog becomes the schema authority that Athena, EMR, and SageMaker all read against.
For the more interactive, ML-specific phase of preparation, SageMaker Data Wrangler sits closer to the data scientist. It is a visual flow for importing data, profiling it, applying 300-plus built-in transforms, detecting leakage and bias, and — critically — exporting the entire flow as a reproducible processing job or a step inside a pipeline. The point is not the GUI; the point is that the exploratory work you do by hand becomes a versioned artifact you can run unattended, which is exactly the property the naive notebook lacks.
The third tool solves the most under-appreciated problem in production ML. SageMaker Feature Store is a repository for curated features with two synchronized tiers: an offline store (in S3, for building training datasets and backfills) and an online store (low-latency, for real-time inference). It exists because of training/serving skew — if your "30-day rolling average order value" is computed one way in a training notebook and another way in the serving API, the model silently degrades. Feature Store lets you define the feature once, materialize it to both tiers, and guarantee training and serving read the identical definition, with time-travel semantics to reconstruct what a feature looked like at the moment of a historical event.
Glue — large-scale, scheduled ETL; building and maintaining the curated lake; schema governance via the Data Catalog. This is data-engineering territory and usually predates the ML team.
Data Wrangler — the ML-specific transform and feature-engineering layer; profiling, leakage detection, and turning ad-hoc cleaning into a repeatable processing step. This is where data science and engineering meet.
Feature Store — the system of record for features that are reused across models and that must be identical at training and serving time. Adopt it the moment a feature is shared by more than one model or computed in more than one place.
Training/serving skew is the most common silent killer of production models. It rarely shows up as an error — the model just gets quietly worse. A Feature Store is the structural fix because the feature definition lives in exactly one place. If you take one architectural decision from this section, it is: shared features get a Feature Store, not a copy-pasted SQL snippet.
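The time-travel semantics described above for Feature Store can be illustrated with a small point-in-time lookup. This is a minimal sketch in plain Python, not the Feature Store API: the `HISTORY` data and the `value_as_of` helper are hypothetical names introduced only for illustration. The idea is that a training row for a historical event must see the feature value that was current at that moment, never a later one.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple

# Hypothetical history of one feature for one customer, sorted by event time.
# In a real system these records would come from the offline store.
HISTORY: List[Tuple[datetime, float]] = [
    (datetime(2026, 1, 1), 40.0),
    (datetime(2026, 1, 15), 55.0),
    (datetime(2026, 2, 1), 62.5),
]


def value_as_of(history: List[Tuple[datetime, float]], as_of: datetime) -> Optional[float]:
    """Return the latest feature value recorded at or before `as_of`."""
    times = [event_time for event_time, _ in history]
    idx = bisect_right(times, as_of)  # number of records with event_time <= as_of
    return history[idx - 1][1] if idx else None


# A label observed on Jan 20 must be joined to the Jan 15 value, not the Feb 1 one.
print(value_as_of(HISTORY, datetime(2026, 1, 20)))  # 55.0
```

Joining on the latest value instead (62.5 here) would leak future information into training, which is one of the ways skew creeps in unnoticed.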
Training on AWS has three on-ramps depending on how much of the model you actually need to build yourself: ephemeral training jobs for your own code, JumpStart for pre-built and foundation models you fine-tune, and Trainium when the economics of large-scale training start to hurt. The unifying idea is that training is a job — it spins up, runs, writes an artifact, and tears down — never a long-lived server you babysit.
The core primitive is the SageMaker training job. You hand it a container (a built-in framework image for PyTorch/TensorFlow/XGBoost, or your own), point it at input data in S3, declare an instance type and count, and SageMaker provisions the cluster, runs your script, streams logs and metrics to CloudWatch, writes the model artifact to S3, and tears the cluster down so you stop paying. Because the cluster is ephemeral, you pay for training by the second of actual compute, not for an always-on box. This is where managed Spot training earns its keep: for interruption-tolerant jobs with checkpointing, Spot can cut training compute cost by well over half, and AWS cites savings of up to 90% over on-demand instances.
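As a rough illustration of what such a job looks like as a request, the sketch below builds the keyword arguments for the SageMaker `CreateTrainingJob` API with managed Spot and checkpointing enabled. It only constructs a dictionary and makes no AWS call. The function name, the default instance type, the volume size, and the time limits are assumptions chosen for the example, and every ARN or S3 URI would be your own.

```python
def spot_training_request(job_name, image_uri, role_arn, train_s3, output_s3,
                          checkpoint_s3, max_runtime_s=3600, max_wait_s=7200):
    """Build kwargs for CreateTrainingJob with managed Spot training enabled."""
    # With Spot enabled, the total wait budget (including time spent waiting
    # for capacity) must be at least as long as the runtime budget.
    if max_wait_s < max_runtime_s:
        raise ValueError("max_wait_s must be >= max_runtime_s")
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3},
        # Instance choice is a placeholder, not a recommendation.
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "EnableManagedSpotTraining": True,
        # Checkpoints let an interrupted job resume instead of restarting.
        "CheckpointConfig": {"S3Uri": checkpoint_s3},
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_runtime_s,
            "MaxWaitTimeInSeconds": max_wait_s,
        },
    }
```

With credentials configured, the result could be passed to `boto3.client("sagemaker").create_training_job(**request)`. Your training script still has to write and reload checkpoints itself for Spot interruptions to be cheap.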
Hyperparameter tuning and distributed training are first-class here, not afterthoughts. Automatic Model Tuning runs a search (Bayesian or otherwise) across hyperparameter ranges as a managed fleet of training jobs and returns the best configuration. For models too big for one accelerator, SageMaker's distributed training libraries handle data-parallel and model-parallel sharding, and SageMaker Experiments tracks every run's parameters, metrics, and artifacts so the search is auditable rather than a folder of mystery checkpoints.
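To make the shape of a tuning run concrete, here is a minimal random-search sketch in plain Python. It is deliberately the simplest strategy, not the Bayesian search mentioned above and not the Automatic Model Tuning API. The `random_search` function and the toy objective are hypothetical. In the managed service each trial would be a full training job rather than a function call.

```python
import random


def random_search(objective, ranges, trials=20, seed=0):
    """Sample hyperparameters uniformly from `ranges` and keep the best trial.

    `ranges` maps a hyperparameter name to a (low, high) tuple.
    Returns (best_score, best_params); higher scores are better.
    """
    rng = random.Random(seed)  # fixed seed keeps the search reproducible
    best_score, best_params = None, None
    for _ in range(trials):
        params = {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}
        score = objective(params)
        if best_score is None or score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


def toy_objective(params):
    """Toy stand-in for validation accuracy: peaks at learning_rate = 0.1."""
    return -(params["learning_rate"] - 0.1) ** 2


score, params = random_search(toy_objective, {"learning_rate": (0.001, 0.5)})
print(params, score)
```

Logging every trial's parameters and score, as Experiments does, is what makes the final choice auditable.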
When you do not want to write training code at all, SageMaker JumpStart is the catalog of pre-trained models and end-to-end solution templates — hundreds of open and proprietary models you can deploy as-is or fine-tune on your own data with a few parameters. For many teams JumpStart is the fastest honest path: take a strong open foundation or vision model, fine-tune it on proprietary data, and skip months of from-scratch training. It is the bridge between "use someone else's model" and "train your own."
The third on-ramp is about cost and scale. AWS Trainium is Amazon's purpose-built training accelerator (Trn1/Trn2 instances), designed to improve the price-performance of large-scale training relative to general-purpose GPU instances. You reach for Trainium when training runs are large and frequent enough that the per-job GPU bill becomes a line item leadership notices; pre-training or heavy fine-tuning of large models is the canonical case. The tradeoff is the software path: workloads run through the AWS Neuron SDK, so there is integration work, and the win is largest for sustained, large training rather than occasional small jobs. The companion chip for the serving side, Inferentia, shows up in the deployment section.
Bring your own training code → SageMaker training jobs. You have proprietary data and a model architecture that matters. Ephemeral clusters, Spot for cost, Experiments for tracking.
Start from a strong existing model → JumpStart. Fine-tune an open or proprietary foundation/vision model instead of training from scratch. Fastest path to a good-enough custom model.
Training cost is hurting → Trainium. Large, frequent training where GPU price-performance is the constraint. Real Neuron-SDK integration work; biggest payoff at sustained scale.
Evaluation is where most pipelines are weakest, because in a notebook it is a single cell someone eyeballs. In production MLOps it has to be a deterministic gate: a step that computes metrics against a held-out set, compares the candidate against the current production model, checks for fairness and regressions, and emits a pass/fail that the pipeline acts on automatically.
The mechanical part is straightforward — a SageMaker Processing job runs your evaluation script against a held-out dataset and writes a structured evaluation report (metrics as JSON) to S3. The discipline is in what the gate checks. A serious evaluation step does at least four things: it measures the headline metrics on a frozen test set the model never saw; it compares the candidate to the incumbent production model on the same data so you are measuring improvement, not absolute numbers in a vacuum; it checks slice-level performance so a model that improves on average but collapses on an important subgroup gets caught; and it runs bias and fairness checks.
AWS gives you two named tools here. The first is SageMaker Clarify, which computes bias metrics (pre-training data bias and post-training model bias across sensitive groups) and feature-importance explanations, so "is this model fair and explainable?" produces an artifact rather than a shrug. That matters enormously once governance and regulators enter the picture. The second is the Model Registry, which the evaluation report feeds: the metrics travel with the model version, so an approver can later see exactly how a model scored before it was promoted.
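For intuition about what a post-training bias metric measures, the sketch below computes a difference in positive prediction rates between two groups, modeled on the "difference in positive proportions in predicted labels" (DPPL) that Clarify reports. This is a from-scratch illustration, not Clarify's implementation. The function names and the sample predictions are hypothetical.

```python
def positive_rate(predictions):
    """Fraction of predictions that are the positive class (1)."""
    return sum(predictions) / len(predictions)


def dppl(preds_group_a, preds_group_d):
    """Positive-rate gap between group a and group d, in [-1, 1].

    A value near 0 means both groups receive positive predictions at a
    similar rate; the sign shows which group is favoured.
    """
    return positive_rate(preds_group_a) - positive_rate(preds_group_d)


# Hypothetical binary predictions for two groups.
group_a = [1, 1, 0, 1]   # 75% positive
group_d = [1, 0, 0, 0]   # 25% positive
print(dppl(group_a, group_d))  # 0.5
```

A gate would compare a value like this against a threshold agreed with whoever owns fairness policy, rather than a number picked by the modeler.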
The architectural move that separates teams who ship safely from teams who ship and pray is wiring the evaluation result into a conditional pipeline step. The pipeline computes the report, then a condition step reads it: if the candidate beats the incumbent and clears the fairness and minimum-quality thresholds, register the model and (optionally) trigger deployment; if not, stop and alert. The gate is code. It runs the same way every time. Nobody promotes a worse model because they were in a hurry on a Friday.
The number that matters is rarely "the candidate scored 0.91." It is "the candidate scored 0.91 versus the production model's 0.89 on the identical held-out set, with no fairness regression." Champion/challenger comparison baked into the evaluation gate is what stops silent quality drift across retrains.
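The gate described above can be sketched as a pure function that takes the candidate's and incumbent's evaluation reports and returns a pass/fail plus reasons. This is a minimal sketch under stated assumptions: the `EvalReport` fields, the slice names, and all thresholds are hypothetical. In a SageMaker pipeline the equivalent logic would read the JSON report written by the evaluation Processing job.

```python
from dataclasses import dataclass, field


@dataclass
class EvalReport:
    auc: float                                       # headline metric on the frozen test set
    slice_auc: dict = field(default_factory=dict)    # metric per important subgroup
    bias_gap: float = 0.0                            # absolute fairness gap, lower is better


def evaluation_gate(candidate, incumbent, min_auc=0.80, max_slice_drop=0.02,
                    max_bias_gap=0.10, bias_tolerance=0.01):
    """Return (passed, reasons). Thresholds here are illustrative only."""
    reasons = []
    if candidate.auc < min_auc:
        reasons.append("below minimum quality")
    if candidate.auc <= incumbent.auc:
        reasons.append("does not beat incumbent")
    # A model that improves on average but collapses on one slice must fail.
    for name, incumbent_score in incumbent.slice_auc.items():
        candidate_score = candidate.slice_auc.get(name, 0.0)
        if incumbent_score - candidate_score > max_slice_drop:
            reasons.append(f"slice regression: {name}")
    if (candidate.bias_gap > max_bias_gap
            or candidate.bias_gap > incumbent.bias_gap + bias_tolerance):
        reasons.append("fairness regression")
    return (not reasons, reasons)


champion = EvalReport(0.89, {"new_users": 0.85, "enterprise": 0.90}, 0.04)
challenger = EvalReport(0.91, {"new_users": 0.86, "enterprise": 0.91}, 0.03)
print(evaluation_gate(challenger, champion))  # (True, [])
```

Both reports must be computed on the identical held-out set for the comparison to mean anything. The function assumes that has already been done upstream.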
Deployment is where ML meets real traffic, and AWS deliberately offers four serving modes because "deploy a model" means very different things for a fraud check (milliseconds, always on), a nightly scoring run (millions of rows, no latency need), an occasional internal tool (mostly idle), and a document-processing job (minutes per request). Picking the wrong one is the most common source of either a runaway bill or a missed SLA.
Real-time endpoints are the default mental model: a persistent HTTPS endpoint backed by one or more instances, with autoscaling, low and predictable latency, and a constant baseline cost because the instances are always warm. Use them for synchronous, latency-sensitive predictions — fraud scoring, recommendations, anything in a user-facing request path. The cost discipline here is right-sizing and autoscaling, because an over-provisioned always-on endpoint is the single most common line of ML waste on an AWS bill.
Serverless inference removes the always-on instance entirely: SageMaker provisions capacity per request and scales to zero when idle, so you pay only for inference duration. This is the right call for spiky or intermittent traffic and for internal tools that sit idle most of the day — the classic case where a real-time endpoint would bleed money doing nothing. The tradeoff is cold starts, so it is a poor fit for strict, constant low-latency SLAs.
Asynchronous inference queues requests and processes them in the background, returning results to S3, and is built for large payloads and long-running inferences (large documents, audio, big batches per call) where holding open a synchronous connection makes no sense; it can also scale to zero when the queue is empty. Batch transform is the fourth mode and is not an endpoint at all — it is a job that scores an entire dataset in S3 and shuts down, ideal for periodic offline scoring of millions of records with no real-time requirement.
Two cross-cutting capabilities matter regardless of mode. Multi-model and multi-container endpoints host many models behind one endpoint to amortize infrastructure across individually low-traffic models — a major lever for teams serving dozens of small models. And on hardware, AWS Inferentia (Inf2 instances, the inference counterpart to Trainium) targets high-throughput, lower-cost-per-inference serving via the Neuron SDK; for steady, high-volume real-time inference it is one of the strongest cost levers available, with the same caveat that it needs compilation work and pays off most at sustained scale.
| Mode | Shape of workload | Latency | Scales to zero? | Pay for | Canonical use |
|---|---|---|---|---|---|
| Real-time endpoint | Steady, synchronous | Low + predictable | No (always warm) | Provisioned instance-time | Fraud, recommendations, user-facing |
| Serverless inference | Spiky / intermittent | Higher (cold starts) | Yes | Inference duration only | Internal tools, bursty traffic |
| Async inference | Large payloads, long jobs | Background (queued) | Yes | Processing time | Big documents, audio, large batches/call |
| Batch transform | Whole-dataset, offline | N/A (not an endpoint) | N/A (job ends) | Job compute | Nightly scoring of millions of rows |
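The table above reduces to a short decision procedure. The sketch below encodes it as a plain Python function. The parameter names are hypothetical, and real decisions also depend on payload limits, model size, and hardware needs that this simplification ignores.

```python
def choose_serving_mode(needs_endpoint, large_or_long_requests,
                        strict_latency_sla, steady_traffic):
    """Map workload traits to one of the four SageMaker serving modes."""
    if not needs_endpoint:
        # Whole-dataset offline scoring: a job, not an endpoint.
        return "batch transform"
    if large_or_long_requests:
        # Big payloads or minutes-long inference: queue it, results land in S3.
        return "async inference"
    if strict_latency_sla or steady_traffic:
        # Always-warm instances give low, predictable latency.
        return "real-time endpoint"
    # Spiky or mostly idle traffic that can tolerate cold starts.
    return "serverless inference"


# Fraud check in a user-facing request path.
print(choose_serving_mode(True, False, True, True))     # real-time endpoint
# Internal tool that sits idle most of the day.
print(choose_serving_mode(True, False, False, False))   # serverless inference
```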
Monitoring is the stage that closes the loop, and it has two layers that teams routinely confuse. Operational monitoring (is the endpoint up, fast, and not erroring?) is necessary but not sufficient. The layer that makes it ML monitoring is statistical: is the data the model sees, and the predictions it makes, still behaving like they did at training time?
The operational layer is standard cloud telemetry — CloudWatch tracks endpoint latency, error rates, invocation counts, and instance utilization, and you alarm on them like any other service. This catches outages and saturation. It tells you nothing about whether the model is still correct, which is the failure mode unique to ML and the one that does the most quiet damage.
The ML layer is SageMaker Model Monitor, which captures an endpoint's live inputs and outputs and compares them, on a schedule, against a baseline computed from your training data. It watches four things: data-quality drift (have feature statistics moved away from the training distribution — means, ranges, missing-value rates, new categories?); model-quality drift (once ground-truth labels arrive, has actual accuracy/precision/recall degraded?); bias drift (have fairness metrics shifted in production, via Clarify?); and feature-attribution drift (has the relative importance of features changed, which often signals concept drift before headline accuracy moves?).
The conceptual distinction worth internalizing is data drift versus concept drift. Data drift is the inputs changing — your users are different from your training population. Concept drift is the relationship between inputs and outcome changing — the same input now implies a different result because the world shifted underneath the model. Data drift you can often see immediately from inputs alone; concept drift typically only reveals itself once labels come back, which is why closing the loop with ground-truth capture matters so much. A model that monitors inputs but never ingests outcomes is half-blind.
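To make data drift measurable, the sketch below computes the population stability index (PSI) between a baseline sample and a live sample of one numeric feature. PSI is a widely used drift statistic chosen here for illustration. It is not the statistic Model Monitor uses internally, and the bin edges, sample data, and `psi` helper are hypothetical.

```python
import math


def psi(baseline, live, edges, eps=1e-6):
    """Population stability index of `live` against `baseline`.

    `edges` are sorted bin boundaries; values fall into len(edges) + 1 bins.
    Empty bins are floored at `eps` so the logarithm stays defined.
    """
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1  # index of the bin holding v
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(baseline), proportions(live)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


training_sample = list(range(10))                    # values 0..9
shifted_sample = [v + 4 for v in training_sample]    # same shape, moved right
edges = [2.5, 5, 7.5]

print(psi(training_sample, training_sample, edges))  # 0.0: identical distributions
print(psi(training_sample, shifted_sample, edges) > 0.2)  # True: clear drift
```

The 0.2 cut-off is a commonly quoted rule of thumb, not a universal constant. Note that this only sees inputs, so it can flag data drift but says nothing about concept drift until labels arrive.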
Monitoring is not the end — it is the trigger. The whole point of detecting drift is that a threshold breach becomes a signal: it raises a CloudWatch alarm, which can notify a human, open a ticket, or — in a mature setup — directly kick off the retraining pipeline. That arrow, from "Model Monitor detected drift" back to "training job starts," is the line that closes the loop and turns a static deployment into a living system.
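One way to wire that arrow is a small Lambda-style handler subscribed to the alarm's SNS topic, which starts the retraining pipeline when the alarm enters the `ALARM` state. This is a sketch under assumptions: the pipeline name `churn-retrain` and the `TriggerReason` pipeline parameter are hypothetical and would need to exist in your pipeline definition. The client is injectable so the logic can be exercised without AWS access.

```python
import json


def handle_alarm(event, sagemaker_client=None, pipeline_name="churn-retrain"):
    """Start a SageMaker pipeline execution for each SNS record in ALARM state."""
    if sagemaker_client is None:
        import boto3  # imported lazily so the function is testable offline
        sagemaker_client = boto3.client("sagemaker")

    started = []
    for record in event.get("Records", []):
        # CloudWatch alarm notifications arrive as a JSON string inside SNS.
        message = json.loads(record["Sns"]["Message"])
        if message.get("NewStateValue") != "ALARM":
            continue  # ignore OK / INSUFFICIENT_DATA transitions
        response = sagemaker_client.start_pipeline_execution(
            PipelineName=pipeline_name,
            # Hypothetical parameter; it must be declared by the pipeline.
            PipelineParameters=[
                {"Name": "TriggerReason", "Value": message["AlarmName"]},
            ],
        )
        started.append(response["PipelineExecutionArn"])
    return started
```

Starting the pipeline is safe to automate precisely because the pipeline itself contains the evaluation gate. The trigger only produces a candidate, never a deployment.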
Operational monitoring (CloudWatch: latency, errors, utilization) tells you the endpoint is alive. Model Monitor (drift on data, quality, bias, attribution) tells you the model is still right. Plenty of teams have a green dashboard over a model that has been silently wrong for a quarter. You need both layers, and the statistical one is the one most teams are missing.
Retraining is the same lifecycle run again — but the maturity question is what pulls the trigger and what guards the promotion. Done well, retraining is a button nobody has to press and a gate nothing bad gets through. Done badly, it is a manual scramble every time someone notices the model is off.
There are three honest retraining strategies, and the right one depends on how fast your world moves. Scheduled retraining runs the pipeline on a fixed cadence (weekly, monthly) regardless of drift — simple, predictable, and fine for slowly changing domains. Triggered retraining fires when Model Monitor detects drift past a threshold — more efficient because you only retrain when reality has actually moved, and the gold standard for fast-moving domains. Continuous/online approaches retrain on a near-constant stream and are rare, justified only when the environment shifts genuinely fast (some fraud and recommendation systems). Most teams should run scheduled-plus-triggered: a baseline cadence so the model never goes stale, with drift-based triggers in between.
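The scheduled-plus-triggered policy recommended above fits in a few lines. The sketch below is illustrative only: the function name and the 30-day default cadence are assumptions, and in practice the schedule would usually live in a scheduler rule rather than application code.

```python
from datetime import date, timedelta


def retrain_reason(today, last_trained, drift_alarm, cadence_days=30):
    """Return why a retrain should start now, or None if it should not."""
    if drift_alarm:
        return "triggered"    # reality moved: do not wait for the schedule
    if today - last_trained >= timedelta(days=cadence_days):
        return "scheduled"    # baseline cadence so the model never goes stale
    return None


print(retrain_reason(date(2026, 3, 10), date(2026, 3, 1), drift_alarm=True))   # triggered
print(retrain_reason(date(2026, 4, 5), date(2026, 3, 1), drift_alarm=False))   # scheduled
print(retrain_reason(date(2026, 3, 10), date(2026, 3, 1), drift_alarm=False))  # None
```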
The non-negotiable rule of automated retraining is that a fresh model never goes straight to production. Retraining produces a candidate, and the candidate runs the same evaluation gate as the original — scored on a held-out set, compared champion-versus-challenger against the incumbent, checked for fairness regressions. Only if it wins does it get registered and promoted; if it loses, the incumbent stays and the pipeline alerts a human. Automated retraining without an automated evaluation gate is how teams confidently deploy a worse model on a schedule.
Two practical guards matter. First, retraining can amplify bias if last period's predictions feed back into this period's training data (feedback loops), so the same Clarify checks belong in the retrain gate. Second, promotion should remain reversible — the Model Registry keeps every prior version, so if a freshly promoted model misbehaves in ways evaluation did not catch, rolling back is selecting a previous approved version, not an archaeology project. This is exactly why the next section treats the Registry and Pipelines as the backbone of everything above.
Everything above describes individual stages. What turns six stages into MLOps — reproducible, automated, governed — is two services whose entire purpose is connective tissue: SageMaker Pipelines orchestrates the stages into one DAG, and the Model Registry is the versioned, approval-gated source of truth for every model the pipeline produces.
SageMaker Pipelines is the orchestrator. You define the lifecycle as a directed acyclic graph of typed steps: a Processing step for data prep, a Training step, a Processing step for evaluation, a Condition step that reads the evaluation report, a RegisterModel step, and optionally a deployment step. The definition is code, versioned in git, parameterized so the same pipeline runs on new data with new hyperparameters, and every execution is logged with full lineage: which data, which code, which parameters, which metrics, which model. This is the single biggest leap from "experiments" to "MLOps," because it makes the loop reproducible by construction: you can re-run any historical execution against the same data, code, and parameters and, provided the training code itself is deterministic (fixed seeds, pinned dependencies), get the same model.
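The structure of such a pipeline can be shown with a toy DAG runner in plain Python. This is a conceptual sketch, not the SageMaker Pipelines SDK: the step bodies are hypothetical one-liners, and `simulated_auc` stands in for a real evaluation result. It shows the three properties that matter: steps run in dependency order, the run is parameterized, and registration only happens if the condition passes.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps it depends on.
DEPENDENCIES = {
    "prep": set(),
    "train": {"prep"},
    "evaluate": {"train"},
    "condition": {"evaluate"},
    "register": {"condition"},
}

# Hypothetical step bodies: each reads the shared context and returns new keys.
STEPS = {
    "prep": lambda ctx: {"dataset": f"{ctx['input_uri']}/curated"},
    "train": lambda ctx: {"model": f"model(lr={ctx['learning_rate']})"},
    "evaluate": lambda ctx: {"auc": ctx["simulated_auc"]},
    "condition": lambda ctx: {"passed": ctx["auc"] >= ctx["min_auc"]},
    "register": lambda ctx: {"status": "PendingManualApproval"},
}


def run_pipeline(parameters):
    """Run steps in dependency order; skip registration if the gate fails."""
    context = dict(parameters)
    executed = []
    for name in TopologicalSorter(DEPENDENCIES).static_order():
        if name == "register" and not context["passed"]:
            continue  # gate failed: the candidate is never registered
        context.update(STEPS[name](context))
        executed.append(name)
    return context, executed


params = {"input_uri": "s3://example-bucket/raw", "learning_rate": 0.1,
          "simulated_auc": 0.91, "min_auc": 0.85}
context, executed = run_pipeline(params)
print(executed)  # ['prep', 'train', 'evaluate', 'condition', 'register']
```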
The Model Registry is the source of truth. Every model the pipeline produces is registered as a versioned entry in a model group, carrying its evaluation metrics, its lineage back to the exact training run and data, and an approval status (PendingManualApproval / Approved / Rejected). That status is the governance gate: deployment automation only promotes models in the Approved state, so a human — or an automated rule that checked the metrics — signs off before anything reaches production. It is also what makes rollback trivial, since every prior approved version is right there.
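The approval gate and the rollback property can be illustrated with an in-memory stand-in for a model group. This sketch is not the Model Registry API. The `ModelGroup` class and its methods are hypothetical, and only the three status names come from the text above.

```python
VALID_STATUSES = {"PendingManualApproval", "Approved", "Rejected"}


class ModelGroup:
    """Toy versioned registry: every version is kept, only Approved ones deploy."""

    def __init__(self):
        self.versions = []

    def register(self, artifact_uri, metrics):
        """Add a new version; it always starts as pending approval."""
        version = len(self.versions) + 1
        self.versions.append({
            "version": version,
            "artifact": artifact_uri,
            "metrics": metrics,          # evaluation results travel with the model
            "status": "PendingManualApproval",
        })
        return version

    def set_status(self, version, status):
        if status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {status}")
        self.versions[version - 1]["status"] = status

    def deployable(self):
        """Latest Approved version, which is what deployment should serve."""
        for entry in reversed(self.versions):
            if entry["status"] == "Approved":
                return entry
        return None


group = ModelGroup()
v1 = group.register("s3://example-bucket/models/v1", {"auc": 0.89})
group.set_status(v1, "Approved")
v2 = group.register("s3://example-bucket/models/v2", {"auc": 0.91})
print(group.deployable()["version"])  # 1: v2 is still pending approval
group.set_status(v2, "Approved")
print(group.deployable()["version"])  # 2
group.set_status(v2, "Rejected")      # rollback: v2 misbehaved in production
print(group.deployable()["version"])  # 1
```

Because nothing is ever deleted, rolling back is a status change on one entry rather than a search for an old artifact.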
Put the two together and the loop becomes mechanical: a pipeline runs on a schedule or a drift trigger → trains a candidate → evaluates it through the conditional gate → registers the winner as pending approval → an approval flips it to Approved → a deployment step or EventBridge rule rolls it out → Model Monitor watches it → drift fires an alarm → the pipeline runs again. That is the entire end-to-end ML lifecycle on AWS, expressed as two services wrapping the other ten. If you remember one thing from this guide: Pipelines + Model Registry are the backbone, and everything else hangs off them.
The reason this backbone matters beyond reproducibility is that governance is a property it emits for free, and a nightmare to retrofit if it is missing. For regulated teams (finance, healthcare, anything touching personal data) it is the difference between an ML platform you can put in front of an auditor and one you cannot. Three things make ML governable on AWS, all of which fall out of the architecture above.
Lineage. SageMaker ML Lineage Tracking plus the Pipelines/Registry metadata record the full provenance of every model — data, code, hyperparameters, metrics, and the approval trail — so "prove which data and code produced this model, and who approved it" becomes a query, not an investigation.
Access control with an approval gate. IAM scopes who can train, who can register, and — most importantly — who can approve a model for production, giving you separation of duties: the person who trains a model is not the person who promotes it. SageMaker Model Cards add the documentation layer (intended use, evaluation, limitations) kept with the model.
The strategic point: governance and velocity are only opposed when you retrofit governance onto a pile of notebooks. When the backbone carries the metadata automatically, governance is mostly a reporting view over data you already have. The litmus test — can you, within an hour, produce any production model's exact data, code, metrics, fairness checks, and approver? If yes, you are governable; if it would take a week of archaeology, you have a notebook problem dressed up as an ML platform.
No 2026 MLOps guide is honest if it pretends every problem needs a training pipeline. For a large and growing class of problems — language, reasoning, summarization, classification of text, image understanding — a foundation model already does the job, and the right move is to skip the entire training half of the lifecycle. Amazon Bedrock is the managed front door to that path. The architectural skill is knowing which problems belong on which side.
Amazon Bedrock serves leading foundation models (Anthropic's Claude, Meta Llama, Mistral, Amazon's own Nova and Titan, and others) behind a single managed API, with no infrastructure to run and consumption-based pricing. For a language or reasoning task, this collapses the lifecycle dramatically: there is no data-labeling-for-training, no training job, no Trainium decision, no model artifact to version — you prompt a hosted model, optionally ground it in your data with retrieval (RAG via Knowledge Bases), optionally fine-tune it on your examples, and you ship. The half of this guide about training largely disappears.
But Bedrock does not make the lifecycle vanish — it relocates it. You still prepare data (RAG corpora, fine-tuning sets), still evaluate rigorously (now over prompts, outputs, hallucination rate, and task success — harder to measure than a clean AUC), still deploy and monitor (latency, cost-per-call, output quality, input drift), and still govern (which model version, which prompts, what guardrails). The stages persist; what changes is that you operate on top of a model someone else trained instead of training one yourself.
The decision rule is about the nature of the problem and the data. Reach for Bedrock when the task is general intelligence — language, summarization, extraction, conversational agents, reasoning — and a foundation model already does it well; when time-to-market dominates; and when you have no proprietary signal a custom model would learn that a prompt cannot convey. Reach for custom ML on SageMaker when you own proprietary structured/tabular data and the model needs to learn patterns specific to your business (churn, fraud, demand, pricing, recommendation, risk scoring); when you need a small, cheap, low-latency model at high volume where per-call foundation-model pricing would be punishing; when explainability or full control of the model is a hard requirement; or when the task simply is not something a language model does.
In practice, mature organizations run both and the interesting design work is at the seam. A fraud system might use a custom SageMaker gradient-boosted model for the real-time risk score (proprietary tabular data, millisecond latency, high volume — squarely custom) while a Bedrock model drafts the human-readable case summary for the analyst (language — squarely Bedrock). Treat it as a portfolio: route each problem to the side that fits, and let the SageMaker backbone govern the custom models while Bedrock's managed surface handles the foundation-model ones.
The whole loop on one page. Read it top to bottom as the path a model travels, and note the last column — every service exists to prevent a specific, expensive failure mode, and skipping a stage means accepting that failure.
| Stage | Primary AWS service(s) | What it produces | The failure it prevents |
|---|---|---|---|
| 1 · Prepare data | Glue · Data Wrangler · Feature Store | Curated, reproducible features | Training/serving skew; unrepeatable inputs |
| 2 · Train | SageMaker Training · JumpStart · Trainium | A versioned model artifact | Unreproducible, unaffordable training |
| 3 · Evaluate | Processing jobs · Clarify | A pass/fail evaluation report | Promoting a worse or unfair model |
| 4 · Deploy | Endpoints (real-time / serverless / async) · Batch transform | A served model behind the right mode | Runaway cost or a missed latency SLA |
| 5 · Monitor | Model Monitor · CloudWatch · Clarify | Drift + operational signals | A silently-wrong model on a green dashboard |
| 6 · Retrain | Pipelines (triggered by Monitor) | A re-evaluated candidate model | Models that rot; manual retraining scrambles |
| Backbone | SageMaker Pipelines + Model Registry | Orchestration + governed source of truth | Experiments that never became a system |
Situation: Three valuable models, all trained in notebooks and served behind hand-rolled real-time endpoints that ran 24/7 even though two of them took intermittent traffic. No Feature Store (features were re-implemented in the serving code, and the team suspected skew), no Model Registry, no drift monitoring — a key churn model had quietly degraded for a quarter before anyone noticed. Nobody could confidently retrain or roll back. Leadership wanted governance ahead of a SOC 2 cycle and the AWS bill flagged the always-on endpoints.
What CloudRoute did: Routed within 22 hours to a vetted AWS partner with a SageMaker MLOps and FinOps track record. The partner stood up the backbone first: Feature Store for the shared churn/usage features (killing the skew), a SageMaker Pipeline per model (prep → train → evaluate-with-Clarify → conditional register), and a Model Registry with manual approval as the promotion gate. The two intermittent models moved from always-on real-time endpoints to serverless inference; the high-volume one stayed real-time but was right-sized with autoscaling. Model Monitor was wired to CloudWatch alarms that trigger the retraining pipeline on drift.
Outcome: Inference spend on the ML workloads dropped by roughly 55% (serverless + right-sizing), the skew bug was eliminated, and every production model now has full lineage, an approval trail, and drift alarms feeding automated retraining — which carried the SOC 2 ML-governance evidence cleanly. Scoped under AWS Well-Architected / POC funding, so the engagement was AWS-funded and the customer paid $0; CloudRoute was paid by the partner.
engagement window: 7 weeks · models migrated to the loop: 3 · ML inference spend: roughly −55% · governance: audit-ready · cost to customer: $0