MLOps on AWS · end-to-end · 2026

The ML lifecycle on AWS, end to end (2026).

A working ML system is not a model — it is a loop. Data prep, training, evaluation, deployment, monitoring, and retraining have to run as one governed pipeline or the model quietly rots in production. This is the definitive 2026 guide to building that loop on AWS: the SageMaker stack stage by stage, the Pipelines + Model Registry backbone that holds it together, and the honest line between custom ML and Amazon Bedrock.

Lifecycle stages: 6 · Orchestration: Pipelines · Source of truth: Model Registry · The goal: a loop, not a model
TL;DR
  • The ML lifecycle is a closed loop with six stages — prepare data, train, evaluate, deploy, monitor, retrain — and the discipline that makes it production-grade is treating every stage as code that runs the same way twice. On AWS, that loop is built with the SageMaker family (Data Wrangler, Feature Store, Training, Endpoints, Model Monitor) stitched together by SageMaker Pipelines.
  • Two artifacts hold the whole thing together: the SageMaker Model Registry (the versioned, approval-gated source of truth for every model) and SageMaker Pipelines (the DAG that turns "a notebook that trained a model once" into a repeatable, auditable workflow). Without those two, you have experiments, not MLOps.
  • Bedrock and custom ML are not competitors — they are different tools for different problems. If your task is language, reasoning, or summarization and a foundation model already does it well, Bedrock skips the entire training half of the lifecycle. If you have proprietary tabular or domain data and need a model that is yours, SageMaker is the path. Most mature teams run both, and the architectural question is which problems belong on which side.
the shape of the problem

I. Why the ML lifecycle is a loop, not a pipeline

The single most expensive misconception in applied ML is that shipping a model is the finish line. It is the starting line. A model is a function fit to a snapshot of the world, and the world keeps moving — so the real engineering problem is not training a good model once, it is keeping a good model good.

Picture the naive version first, because almost every team starts here. A data scientist pulls a dataset, opens a notebook, engineers some features, trains a model, gets an AUC they are happy with, exports a pickle file, and hands it to engineering to "put behind an API." Six weeks later the model is live. Everyone moves on. This is the moment the trouble starts, not the moment it ends.

Within a few months, three things have quietly happened. The input data has drifted — what users send in production no longer matches the training snapshot. The relationship between inputs and outcome has shifted too (concept drift), because the business changed, a competitor launched, prices moved, or a season turned. And nobody can reproduce the model: the notebook has been edited, the dataset overwritten, the exact features a mystery. The model is now a liability no one can confidently retrain or roll back.

The mature version treats ML as a closed loop with feedback. Data flows in and is prepared the same way every time. Training is a job that runs from versioned code against versioned data and produces a versioned model. Evaluation is a gate, not a vibe. Deployment is controlled and reversible. Monitoring watches both the infrastructure and the statistics of what the model sees and predicts. And when monitoring fires, the loop closes — retraining kicks off, produces a candidate, evaluates it against the incumbent, and promotes it only if it wins. Every arrow in that loop is automated, logged, and reproducible.

AWS's entire ML platform is organized around making that loop cheap to build. SageMaker is not one product; it is roughly a dozen services that each own one stage of the loop, plus two services — Pipelines and the Model Registry — whose only job is to connect the stages into something repeatable. The rest of this guide walks the loop stage by stage, names the AWS service that owns each stage, and is honest about where the seams are and what teams get wrong.

the one-sentence definition of MLOps

MLOps is the practice of making every stage of the ML lifecycle — data, training, evaluation, deployment, monitoring, retraining — reproducible, automated, and governed, so that a model can be rebuilt, audited, and rolled back without heroics. If you cannot answer "exactly which data and code produced the model currently serving traffic?" you do not yet have MLOps.
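
One way to make that question answerable is to store a fingerprint of the inputs next to every model. The sketch below is a minimal illustration, assuming a hypothetical `LineageRecord` structure and `make_record` helper (neither is a SageMaker type; the S3 URI and commit string are placeholders). It shows only the idea: identical data, code, and hyperparameters yield an identical fingerprint.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class LineageRecord:
    """Hypothetical record of what produced a model (not a SageMaker type)."""
    data_uri: str            # e.g. a versioned S3 prefix for the training data
    data_sha256: str         # checksum of the training data snapshot
    code_commit: str         # git commit of the training code
    hyperparameters: tuple   # sorted (name, value) pairs

    def fingerprint(self) -> str:
        # Serialize deterministically so identical inputs hash identically.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def make_record(data_uri, data_bytes, code_commit, hyperparameters):
    """Build a record from raw inputs; hyperparameter order does not matter."""
    return LineageRecord(
        data_uri=data_uri,
        data_sha256=hashlib.sha256(data_bytes).hexdigest(),
        code_commit=code_commit,
        hyperparameters=tuple(sorted(hyperparameters.items())),
    )
```

If the fingerprint of the model serving traffic can be looked up and matched to a record like this, the question above has an answer; if not, the gap is visible.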

stage 1 — data

II. Data preparation: Glue, Data Wrangler, and Feature Store

More ML projects fail on data plumbing than on modeling. The two failures that matter most are training/serving skew (the features at training time are computed differently than at inference time) and unreproducible feature engineering (nobody can recreate the exact inputs to a past model). AWS gives you three tools that map cleanly to those problems.

Start with raw data movement and transformation at scale. AWS Glue is the serverless ETL workhorse — it crawls sources, infers schemas into a central Data Catalog, and runs Spark or Python-shell jobs to clean, join, and reshape data into the curated tables that feed everything downstream. For a team standardizing a lake, Glue is where the heavy scheduled transformation lives, and its Data Catalog becomes the schema authority that Athena, EMR, and SageMaker all read against.

For the more interactive, ML-specific phase of preparation, SageMaker Data Wrangler sits closer to the data scientist. It is a visual flow for importing data, profiling it, applying 300-plus built-in transforms, detecting leakage and bias, and — critically — exporting the entire flow as a reproducible processing job or a step inside a pipeline. The point is not the GUI; the point is that the exploratory work you do by hand becomes a versioned artifact you can run unattended, which is exactly the property the naive notebook lacks.

The third tool solves the most under-appreciated problem in production ML. SageMaker Feature Store is a repository for curated features with two synchronized tiers: an offline store (in S3, for building training datasets and backfills) and an online store (low-latency, for real-time inference). It exists because of training/serving skew — if your "30-day rolling average order value" is computed one way in a training notebook and another way in the serving API, the model silently degrades. Feature Store lets you define the feature once, materialize it to both tiers, and guarantee training and serving read the identical definition, with time-travel semantics to reconstruct what a feature looked like at the moment of a historical event.
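
The time-travel idea can be sketched as a point-in-time lookup: given the history of a feature, return the value that was current when a historical event happened, never a later one. This is a plain-Python illustration with a hypothetical `feature_as_of` helper and made-up values, not the Feature Store API.

```python
from bisect import bisect_right
from datetime import datetime


def feature_as_of(history, event_time):
    """Return the latest feature value written at or before event_time.

    history: list of (write_time, value) tuples sorted by write_time.
    Returns None if no value existed yet at event_time.
    """
    write_times = [t for t, _ in history]
    idx = bisect_right(write_times, event_time)
    return history[idx - 1][1] if idx > 0 else None


# Hypothetical history of one customer's 30-day average order value.
avg_order_value = [
    (datetime(2026, 1, 1), 40.0),
    (datetime(2026, 2, 1), 55.0),
    (datetime(2026, 3, 1), 48.0),
]

# Building a training row for an event on 15 Feb must use the February value,
# not the March value that was written later.
value_at_event = feature_as_of(avg_order_value, datetime(2026, 2, 15))
```

Using the value as of the event time is what keeps future information from leaking into a training set.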

When to use which

Glue — large-scale, scheduled ETL; building and maintaining the curated lake; schema governance via the Data Catalog. This is data-engineering territory and usually predates the ML team.

Data Wrangler — the ML-specific transform and feature-engineering layer; profiling, leakage detection, and turning ad-hoc cleaning into a repeatable processing step. This is where data science and engineering meet.

Feature Store — the system of record for features that are reused across models and that must be identical at training and serving time. Adopt it the moment a feature is shared by more than one model or computed in more than one place.

the skew trap

Training/serving skew is the most common silent killer of production models. It rarely shows up as an error — the model just gets quietly worse. A Feature Store is the structural fix because the feature definition lives in exactly one place. If you take one architectural decision from this section, it is: shared features get a Feature Store, not a copy-pasted SQL snippet.
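
The "one definition" principle is easy to show in miniature. In this sketch (all function and field names are hypothetical, and it stands in for what a shared feature definition gives you rather than showing Feature Store itself), the training path and the serving path both call the same function, so the feature cannot be computed two different ways.

```python
from datetime import date, timedelta


def rolling_avg_order_value(orders, as_of, window_days=30):
    """Average order amount over the window_days ending at as_of (inclusive).

    orders: iterable of (order_date, amount). Returns 0.0 for an empty window.
    """
    start = as_of - timedelta(days=window_days)
    amounts = [amount for day, amount in orders if start < day <= as_of]
    return sum(amounts) / len(amounts) if amounts else 0.0


def build_training_row(orders, label_date, label):
    # Training path: compute the feature as of the label date.
    return {
        "avg_order_value_30d": rolling_avg_order_value(orders, label_date),
        "label": label,
    }


def build_serving_features(orders, today):
    # Serving path: the same function, so the definition cannot diverge.
    return {"avg_order_value_30d": rolling_avg_order_value(orders, today)}
```

The copy-pasted SQL snippet is the opposite pattern: two implementations that agree on day one and drift apart with every edit.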

stage 2 — train

III. Training: SageMaker training jobs, JumpStart, and Trainium

Training on AWS has three on-ramps depending on how much of the model you actually need to build yourself: ephemeral training jobs for your own code, JumpStart for pre-built and foundation models you fine-tune, and Trainium when the economics of large-scale training start to hurt. The unifying idea is that training is a job — it spins up, runs, writes an artifact, and tears down — never a long-lived server you babysit.

The core primitive is the SageMaker training job. You hand it a container (a built-in framework image for PyTorch/TensorFlow/XGBoost, or your own), point it at input data in S3, declare an instance type and count, and SageMaker provisions the cluster, runs your script, streams logs and metrics to CloudWatch, writes the model artifact to S3, and tears the cluster down so you stop paying. Because the cluster is ephemeral, you pay for training by the second of actual compute, not for an always-on box. This is where managed Spot training earns its keep: for interruption-tolerant jobs with checkpointing, Spot can cut training compute cost substantially, often by well over half (AWS cites savings of up to 90% over on-demand).
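
The billing arithmetic is worth doing once by hand. The sketch below uses made-up numbers (the hourly rates, the 35% Spot ratio, and the 10% interruption overhead are all assumptions for illustration, not AWS prices) to show how per-second billing and Spot combine.

```python
def training_job_cost(billable_seconds, instance_count, hourly_rate_usd):
    """Cost of an ephemeral job billed per second of instance time."""
    return billable_seconds * instance_count * hourly_rate_usd / 3600.0


def spot_savings_fraction(on_demand_cost, spot_cost):
    """Fraction of the on-demand cost saved by running on Spot."""
    return 1.0 - spot_cost / on_demand_cost


# Hypothetical job: 4 instances for 90 minutes at $4.00 per instance-hour.
on_demand = training_job_cost(billable_seconds=5400, instance_count=4, hourly_rate_usd=4.00)

# Assume Spot runs at 35% of the on-demand rate, and that interruptions plus
# checkpoint restores add 10% to the billable runtime.
spot = training_job_cost(billable_seconds=5940, instance_count=4, hourly_rate_usd=1.40)

savings = spot_savings_fraction(on_demand, spot)
```

Under these assumptions the job costs about $24.00 on-demand and about $9.24 on Spot, a saving of roughly 61% even after paying for the extra runtime. The real numbers depend on instance type, region, and how often the job is interrupted.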

Hyperparameter tuning and distributed training are first-class here, not afterthoughts. Automatic Model Tuning runs a search (Bayesian or otherwise) across hyperparameter ranges as a managed fleet of training jobs and returns the best configuration. For models too big for one accelerator, SageMaker's distributed training libraries handle data-parallel and model-parallel sharding, and SageMaker Experiments tracks every run's parameters, metrics, and artifacts so the search is auditable rather than a folder of mystery checkpoints.

When you do not want to write training code at all, SageMaker JumpStart is the catalog of pre-trained models and end-to-end solution templates — hundreds of open and proprietary models you can deploy as-is or fine-tune on your own data with a few parameters. For many teams JumpStart is the fastest honest path: take a strong open foundation or vision model, fine-tune it on proprietary data, and skip months of from-scratch training. It is the bridge between "use someone else's model" and "train your own."

The third on-ramp is about cost and scale. AWS Trainium is Amazon's purpose-built training accelerator (Trn1/Trn2 instances), designed to bring down the price-performance of large-scale training relative to general-purpose GPU instances. You reach for Trainium when training runs are large and frequent enough that the per-job GPU bill becomes a line item leadership notices — pre-training or heavy fine-tuning of large models is the canonical case. The tradeoff is the software path: workloads run through the AWS Neuron SDK, so there is integration work, and the win is largest for sustained, large training rather than occasional small jobs. The companion chip for the serving side, Inferentia, shows up in the deployment section.

Three on-ramps, one decision

Bring your own training code → SageMaker training jobs. You have proprietary data and a model architecture that matters. Ephemeral clusters, Spot for cost, Experiments for tracking.

Start from a strong existing model → JumpStart. Fine-tune an open or proprietary foundation/vision model instead of training from scratch. Fastest path to a good-enough custom model.

Training cost is hurting → Trainium. Large, frequent training where GPU price-performance is the constraint. Real Neuron-SDK integration work; biggest payoff at sustained scale.

stage 3 — evaluate

IV. Evaluation: making "is it good enough?" a gate, not an opinion

Evaluation is where most pipelines are weakest, because in a notebook it is a single cell someone eyeballs. In production MLOps it has to be a deterministic gate: a step that computes metrics against a held-out set, compares the candidate against the current production model, checks for fairness and regressions, and emits a pass/fail that the pipeline acts on automatically.

The mechanical part is straightforward — a SageMaker Processing job runs your evaluation script against a held-out dataset and writes a structured evaluation report (metrics as JSON) to S3. The discipline is in what the gate checks. A serious evaluation step does at least four things: it measures the headline metrics on a frozen test set the model never saw; it compares the candidate to the incumbent production model on the same data so you are measuring improvement, not absolute numbers in a vacuum; it checks slice-level performance so a model that improves on average but collapses on an important subgroup gets caught; and it runs bias and fairness checks.

AWS gives you two named tools here. SageMaker Clarify computes bias metrics (pre-training data bias and post-training model bias across sensitive groups) and feature-importance explanations, so "is this model fair and explainable?" produces an artifact rather than a shrug — which matters enormously once governance and regulators enter the picture. And the evaluation report feeds the Model Registry: the metrics travel with the model version, so an approver later can see exactly how a model scored before it was promoted.

The architectural move that separates teams who ship safely from teams who ship and pray is wiring the evaluation result into a conditional pipeline step. The pipeline computes the report, then a condition step reads it: if the candidate beats the incumbent and clears the fairness and minimum-quality thresholds, register the model and (optionally) trigger deployment; if not, stop and alert. The gate is code. It runs the same way every time. Nobody promotes a worse model because they were in a hurry on a Friday.

evaluate against the incumbent, not against zero

The number that matters is rarely "the candidate scored 0.91." It is "the candidate scored 0.91 versus the production model's 0.89 on the identical held-out set, with no fairness regression." Champion/challenger comparison baked into the evaluation gate is what stops silent quality drift across retrains.
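
A champion/challenger gate of this kind fits in a few lines. The sketch below is an illustration under assumptions: the metric names, the report layout, and every threshold are hypothetical, and in a real pipeline the inputs would come from the evaluation report rather than hand-built dictionaries.

```python
def evaluation_gate(candidate, incumbent, min_auc=0.80,
                    max_slice_drop=0.02, max_bias_gap=0.10):
    """Return (passed, reasons) for a champion/challenger comparison.

    candidate / incumbent: dicts with "auc", "slice_auc" (slice name -> AUC)
    and "bias_gap" (one fairness gap metric), all computed on the same
    held-out set. Thresholds are illustrative defaults.
    """
    reasons = []
    if candidate["auc"] < min_auc:
        reasons.append("below minimum quality")
    if candidate["auc"] <= incumbent["auc"]:
        reasons.append("does not beat incumbent")
    for name, incumbent_auc in incumbent["slice_auc"].items():
        # A missing slice counts as a regression.
        if candidate["slice_auc"].get(name, 0.0) < incumbent_auc - max_slice_drop:
            reasons.append(f"slice regression: {name}")
    if candidate["bias_gap"] > max_bias_gap:
        reasons.append("fairness threshold exceeded")
    return (not reasons, reasons)
```

Returning the reasons along with the verdict matters in practice: when the gate blocks a promotion, the alert can say why, and a candidate that wins on the headline metric but collapses on one slice is rejected with that slice named.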

stage 4 — deploy

V. Deployment: the four SageMaker serving modes

Deployment is where ML meets real traffic, and AWS deliberately offers four serving modes because "deploy a model" means very different things for a fraud check (milliseconds, always on), a nightly scoring run (millions of rows, no latency need), an occasional internal tool (mostly idle), and a document-processing job (minutes per request). Picking the wrong one is the most common source of either a runaway bill or a missed SLA.

Real-time endpoints are the default mental model: a persistent HTTPS endpoint backed by one or more instances, with autoscaling, low and predictable latency, and a constant baseline cost because the instances are always warm. Use them for synchronous, latency-sensitive predictions — fraud scoring, recommendations, anything in a user-facing request path. The cost discipline here is right-sizing and autoscaling, because an over-provisioned always-on endpoint is the single most common line of ML waste on an AWS bill.

Serverless inference removes the always-on instance entirely: SageMaker provisions capacity per request and scales to zero when idle, so you pay only for inference duration. This is the right call for spiky or intermittent traffic and for internal tools that sit idle most of the day — the classic case where a real-time endpoint would bleed money doing nothing. The tradeoff is cold starts, so it is a poor fit for strict, constant low-latency SLAs.

Asynchronous inference queues requests and processes them in the background, returning results to S3, and is built for large payloads and long-running inferences (large documents, audio, big batches per call) where holding open a synchronous connection makes no sense; it can also scale to zero when the queue is empty. Batch transform is the fourth mode and is not an endpoint at all — it is a job that scores an entire dataset in S3 and shuts down, ideal for periodic offline scoring of millions of records with no real-time requirement.

Two cross-cutting capabilities matter regardless of mode. Multi-model and multi-container endpoints host many models behind one endpoint to amortize infrastructure across individually low-traffic models — a major lever for teams serving dozens of small models. And on hardware, AWS Inferentia (Inf2 instances, the inference counterpart to Trainium) targets high-throughput, lower-cost-per-inference serving via the Neuron SDK; for steady, high-volume real-time inference it is one of the strongest cost levers available, with the same caveat that it needs compilation work and pays off most at sustained scale.

SageMaker serving modes · how to choose · 2026
Mode | Shape of workload | Latency | Scales to zero? | Pay for | Canonical use
Real-time endpoint | Steady, synchronous | Low + predictable | No (always warm) | Provisioned instance-time | Fraud, recommendations, user-facing
Serverless inference | Spiky / intermittent | Higher (cold starts) | Yes | Inference duration only | Internal tools, bursty traffic
Async inference | Large payloads, long jobs | Background (queued) | Yes | Processing time | Big documents, audio, large batches/call
Batch transform | Whole-dataset, offline | N/A (not an endpoint) | N/A (job ends) | Job compute | Nightly scoring of millions of rows
Choosing the serving mode is a cost-and-SLA decision before it is a technical one. The most expensive mistake is putting intermittent or offline workloads on always-on real-time endpoints — and the second is putting strict-latency traffic on serverless and getting bitten by cold starts.
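
The table can be read as a small decision procedure. The function below is one possible encoding of it, with hypothetical parameter names; the order of the checks and the fallback for mixed cases are judgment calls, not AWS guidance.

```python
def choose_serving_mode(needs_online_response, latency_sensitive,
                        steady_traffic, large_payload_or_long_running):
    """Map the shape of a workload to one of the four serving modes."""
    if not needs_online_response:
        return "batch transform"          # whole-dataset, offline scoring
    if large_payload_or_long_running:
        return "asynchronous inference"   # queue the request, result lands in S3
    if latency_sensitive and steady_traffic:
        return "real-time endpoint"       # always warm, predictable latency
    if not steady_traffic and not latency_sensitive:
        return "serverless inference"     # scales to zero, tolerates cold starts
    # Mixed cases (for example spiky but latency-critical traffic) need a
    # judgment call; this sketch defaults to real-time to protect the SLA.
    return "real-time endpoint"
```

For example, a nightly churn-scoring run maps to batch transform, while an internal tool that is idle most of the day maps to serverless inference.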
stage 5 — monitor

VI. Monitoring: Model Monitor, drift, and the feedback signal

Monitoring is the stage that closes the loop, and it has two layers that teams routinely confuse. Operational monitoring (is the endpoint up, fast, and not erroring?) is necessary but not sufficient. The layer that makes it ML monitoring is statistical: is the data the model sees, and the predictions it makes, still behaving like they did at training time?

The operational layer is standard cloud telemetry — CloudWatch tracks endpoint latency, error rates, invocation counts, and instance utilization, and you alarm on them like any other service. This catches outages and saturation. It tells you nothing about whether the model is still correct, which is the failure mode unique to ML and the one that does the most quiet damage.

The ML layer is SageMaker Model Monitor, which captures an endpoint's live inputs and outputs and compares them, on a schedule, against a baseline computed from your training data. It watches four things: data-quality drift (have feature statistics moved away from the training distribution — means, ranges, missing-value rates, new categories?); model-quality drift (once ground-truth labels arrive, has actual accuracy/precision/recall degraded?); bias drift (have fairness metrics shifted in production, via Clarify?); and feature-attribution drift (has the relative importance of features changed, which often signals concept drift before headline accuracy moves?).

The conceptual distinction worth internalizing is data drift versus concept drift. Data drift is the inputs changing: your users are different from your training population. Concept drift is the relationship between inputs and outcome changing: the same input now implies a different result because the world shifted underneath the model. Data drift you can often see immediately from inputs alone; concept drift typically only reveals itself once labels come back, which is why closing the loop with ground-truth capture matters so much. A monitoring setup that watches inputs but never ingests outcomes is half-blind.
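
To make "feature statistics have moved" concrete, here is a minimal sketch of one common drift statistic, the Population Stability Index (PSI), computed for a single numeric feature. PSI is used here as an illustration only; it is not a claim about which statistics Model Monitor computes internally, and the bin edges are assumptions you would choose from the training baseline.

```python
import math


def population_stability_index(baseline, live, bin_edges, eps=1e-6):
    """PSI between two non-empty samples of one numeric feature.

    bin_edges: ascending interior cut points; values fall into
    len(bin_edges) + 1 bins shared by both samples.
    """
    def bin_fractions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # The bin index is the number of cut points at or below v.
            idx = sum(1 for edge in bin_edges if v >= edge)
            counts[idx] += 1
        total = len(values)
        # eps avoids log(0) when a bin is empty in one of the samples.
        return [max(c / total, eps) for c in counts]

    p = bin_fractions(baseline)
    q = bin_fractions(live)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))
```

Identical distributions give a PSI of zero, and the value grows as the live sample moves away from the baseline. A frequently quoted rule of thumb treats values above roughly 0.25 as a significant shift, but the threshold is a convention to tune, not a law. Note that this only measures data drift: it says nothing about concept drift until labels arrive.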

Monitoring is not the end — it is the trigger. The whole point of detecting drift is that a threshold breach becomes a signal: it raises a CloudWatch alarm, which can notify a human, open a ticket, or — in a mature setup — directly kick off the retraining pipeline. That arrow, from "Model Monitor detected drift" back to "training job starts," is the line that closes the loop and turns a static deployment into a living system.

two layers, do not skip the second

Operational monitoring (CloudWatch: latency, errors, utilization) tells you the endpoint is alive. Model Monitor (drift on data, quality, bias, attribution) tells you the model is still right. Plenty of teams have a green dashboard over a model that has been silently wrong for a quarter. You need both layers, and the statistical one is the one most teams are missing.

stage 6 — retrain

VII. Retraining: closing the loop automatically

Retraining is the same lifecycle run again — but the maturity question is what pulls the trigger and what guards the promotion. Done well, retraining is a button nobody has to press and a gate nothing bad gets through. Done badly, it is a manual scramble every time someone notices the model is off.

There are three honest retraining strategies, and the right one depends on how fast your world moves. Scheduled retraining runs the pipeline on a fixed cadence (weekly, monthly) regardless of drift — simple, predictable, and fine for slowly changing domains. Triggered retraining fires when Model Monitor detects drift past a threshold — more efficient because you only retrain when reality has actually moved, and the gold standard for fast-moving domains. Continuous/online approaches retrain on a near-constant stream and are rare, justified only when the environment shifts genuinely fast (some fraud and recommendation systems). Most teams should run scheduled-plus-triggered: a baseline cadence so the model never goes stale, with drift-based triggers in between.
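
The scheduled-plus-triggered policy reduces to two checks. In this sketch the function name, the 30-day cadence, and the drift threshold are all hypothetical; `drift_score` stands for whatever statistic your monitoring emits.

```python
from datetime import datetime, timedelta


def should_retrain(now, last_trained, drift_score,
                   max_age=timedelta(days=30), drift_threshold=0.25):
    """Return "triggered", "scheduled", or None under a combined policy."""
    if drift_score >= drift_threshold:
        return "triggered"   # reality moved: retrain now
    if now - last_trained >= max_age:
        return "scheduled"   # baseline cadence so the model never goes stale
    return None              # fresh enough and no drift: do nothing
```

Checking drift first means a drifting model is retrained immediately rather than waiting for its next scheduled slot, while the age check covers the case where drift detection itself is blind.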

The non-negotiable rule of automated retraining is that a fresh model never goes straight to production. Retraining produces a candidate, and the candidate runs the same evaluation gate as the original — scored on a held-out set, compared champion-versus-challenger against the incumbent, checked for fairness regressions. Only if it wins does it get registered and promoted; if it loses, the incumbent stays and the pipeline alerts a human. Automated retraining without an automated evaluation gate is how teams confidently deploy a worse model on a schedule.

Two practical guards matter. First, retraining can amplify bias if last period's predictions feed back into this period's training data (feedback loops), so the same Clarify checks belong in the retrain gate. Second, promotion should remain reversible — the Model Registry keeps every prior version, so if a freshly promoted model misbehaves in ways evaluation did not catch, rolling back is selecting a previous approved version, not an archaeology project. This is exactly why the next section treats the Registry and Pipelines as the backbone of everything above.

the backbone

VIII. SageMaker Pipelines + Model Registry: the spine that holds it together

Everything above describes individual stages. What turns six stages into MLOps — reproducible, automated, governed — is two services whose entire purpose is connective tissue: SageMaker Pipelines orchestrates the stages into one DAG, and the Model Registry is the versioned, approval-gated source of truth for every model the pipeline produces.

SageMaker Pipelines is the orchestrator. You define the lifecycle as a directed acyclic graph of typed steps: a Processing step for data prep, a Training step, a Processing step for evaluation, a Condition step that reads the evaluation report, a RegisterModel step, and optionally a deployment step. The definition is code, versioned in git, parameterized so the same pipeline runs on new data with new hyperparameters, and every execution is logged with full lineage: which data, which code, which parameters, which metrics, which model. This is the single biggest leap from "experiments" to "MLOps," because it makes the loop reproducible by construction: you can re-run any historical execution against the same data, code, and parameters and get an equivalent model (bit-for-bit identical only if the training code itself is deterministic).
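
The shape of such a pipeline can be sketched without any AWS dependency. The example below is a toy orchestrator in plain Python, using the standard library's `graphlib`; it is not the SageMaker Python SDK, and the step names and the `run_pipeline` helper are hypothetical. It shows the three properties the paragraph describes: steps run in dependency order, every execution is logged, and the condition step can stop the run before registration.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on. Names are illustrative.
PIPELINE = {
    "prepare_data": set(),
    "train": {"prepare_data"},
    "evaluate": {"train", "prepare_data"},
    "condition": {"evaluate"},
    "register_model": {"condition"},
}


def run_pipeline(steps, dag, params):
    """Run step callables in dependency order and log what each produced.

    steps: step name -> callable(params, outputs_so_far).
    """
    log = []
    outputs = {}
    for name in TopologicalSorter(dag).static_order():
        outputs[name] = steps[name](params, outputs)
        log.append((name, outputs[name]))
        if name == "condition" and not outputs[name]:
            break  # the gate failed: do not register the model
    return log
```

Because the parameters are passed in rather than hard-coded, the same definition can be re-run on new data or new thresholds, and the returned log is a miniature version of the execution lineage.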

The Model Registry is the source of truth. Every model the pipeline produces is registered as a versioned entry in a model group, carrying its evaluation metrics, its lineage back to the exact training run and data, and an approval status (PendingManualApproval / Approved / Rejected). That status is the governance gate: deployment automation only promotes models in the Approved state, so a human — or an automated rule that checked the metrics — signs off before anything reaches production. It is also what makes rollback trivial, since every prior approved version is right there.
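
The approval gate and the rollback story follow from a very small data model. The class below is an in-memory stand-in written for illustration (the `ModelGroup` class and its methods are hypothetical, and the S3 URIs in the usage are placeholders); only the three status strings come from the text above. In the real service, the status change corresponds to the `UpdateModelPackage` API with its `ModelApprovalStatus` field.

```python
class ModelGroup:
    """In-memory stand-in for a model group; not the SageMaker API."""

    STATUSES = {"PendingManualApproval", "Approved", "Rejected"}

    def __init__(self):
        self.versions = []  # list of dicts; version number == index + 1

    def register(self, artifact_uri, metrics):
        """Add a new version in the pending state and return its number."""
        self.versions.append({
            "artifact": artifact_uri,
            "metrics": metrics,
            "status": "PendingManualApproval",
        })
        return len(self.versions)

    def set_status(self, version, status):
        if status not in self.STATUSES:
            raise ValueError(f"unknown status: {status}")
        self.versions[version - 1]["status"] = status

    def latest_approved(self):
        """Version that deployment automation should serve, or None."""
        for number in range(len(self.versions), 0, -1):
            if self.versions[number - 1]["status"] == "Approved":
                return number
        return None
```

Deployment only ever asks for `latest_approved()`, so a pending version is never served, and rolling back is a status change on the bad version rather than a rebuild.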

Put the two together and the loop becomes mechanical: a pipeline runs on a schedule or a drift trigger → trains a candidate → evaluates it through the conditional gate → registers the winner as pending approval → an approval flips it to Approved → a deployment step or EventBridge rule rolls it out → Model Monitor watches it → drift fires an alarm → the pipeline runs again. That is the entire end-to-end ML lifecycle on AWS, expressed as two services wrapping all the others. If you remember one thing from this guide: Pipelines + Model Registry are the backbone, and everything else hangs off them.

Governance falls out of the backbone

The reason this backbone matters beyond reproducibility is that governance is a property it emits for free, and a nightmare to retrofit if it is missing. For regulated teams (finance, healthcare, anything touching personal data) it is the difference between an ML platform you can put in front of an auditor and one you cannot. Three things make ML governable on AWS, and all of them fall out of the architecture above.

Lineage. SageMaker ML Lineage Tracking plus the Pipelines/Registry metadata record the full provenance of every model — data, code, hyperparameters, metrics, and the approval trail — so "prove which data and code produced this model, and who approved it" becomes a query, not an investigation.

Access control with an approval gate. IAM scopes who can train, who can register, and, most importantly, who can approve a model for production, giving you separation of duties: the person who trains a model is not the person who promotes it.

Documentation. SageMaker Model Cards add the documentation layer (intended use, evaluation, limitations) kept with the model.

The strategic point: governance and velocity are only opposed when you retrofit governance onto a pile of notebooks. When the backbone carries the metadata automatically, governance is mostly a reporting view over data you already have. The litmus test — can you, within an hour, produce any production model's exact data, code, metrics, fairness checks, and approver? If yes, you are governable; if it would take a week of archaeology, you have a notebook problem dressed up as an ML platform.

the build-vs-buy line

IX. Where Bedrock fits vs custom ML

No 2026 MLOps guide is honest if it pretends every problem needs a training pipeline. For a large and growing class of problems — language, reasoning, summarization, classification of text, image understanding — a foundation model already does the job, and the right move is to skip the entire training half of the lifecycle. Amazon Bedrock is the managed front door to that path. The architectural skill is knowing which problems belong on which side.

Amazon Bedrock serves leading foundation models (Anthropic's Claude, Meta Llama, Mistral, Amazon's own Nova and Titan, and others) behind a single managed API, with no infrastructure to run and consumption-based pricing. For a language or reasoning task, this collapses the lifecycle dramatically: there is no data-labeling-for-training, no training job, no Trainium decision, no model artifact to version — you prompt a hosted model, optionally ground it in your data with retrieval (RAG via Knowledge Bases), optionally fine-tune it on your examples, and you ship. The half of this guide about training largely disappears.

But Bedrock does not make the lifecycle vanish — it relocates it. You still prepare data (RAG corpora, fine-tuning sets), still evaluate rigorously (now over prompts, outputs, hallucination rate, and task success — harder to measure than a clean AUC), still deploy and monitor (latency, cost-per-call, output quality, input drift), and still govern (which model version, which prompts, what guardrails). The stages persist; what changes is that you operate on top of a model someone else trained instead of training one yourself.

The decision rule is about the nature of the problem and the data. Reach for Bedrock when the task is general intelligence — language, summarization, extraction, conversational agents, reasoning — and a foundation model already does it well; when time-to-market dominates; and when you have no proprietary signal a custom model would learn that a prompt cannot convey. Reach for custom ML on SageMaker when you own proprietary structured/tabular data and the model needs to learn patterns specific to your business (churn, fraud, demand, pricing, recommendation, risk scoring); when you need a small, cheap, low-latency model at high volume where per-call foundation-model pricing would be punishing; when explainability or full control of the model is a hard requirement; or when the task simply is not something a language model does.
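
Written as code, the rule is a short series of disqualifiers. This sketch uses hypothetical parameter names and compresses the paragraph above into booleans, so treat it as a checklist rather than a policy engine; time-to-market, which the rule also weighs, is left out because it is a matter of degree.

```python
def route_problem(task_is_language_or_reasoning,
                  has_proprietary_tabular_signal,
                  needs_explainability_or_full_control,
                  high_volume_low_latency):
    """Return "bedrock" or "sagemaker" for a single problem."""
    if (has_proprietary_tabular_signal
            or needs_explainability_or_full_control
            or high_volume_low_latency
            or not task_is_language_or_reasoning):
        return "sagemaker"   # custom ML: the model must be yours
    return "bedrock"         # a foundation model already does the job
```

Applied per problem rather than per team, a rule like this naturally produces a portfolio with some workloads on each side.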

In practice, mature organizations run both and the interesting design work is at the seam. A fraud system might use a custom SageMaker gradient-boosted model for the real-time risk score (proprietary tabular data, millisecond latency, high volume — squarely custom) while a Bedrock model drafts the human-readable case summary for the analyst (language — squarely Bedrock). Treat it as a portfolio: route each problem to the side that fits, and let the SageMaker backbone govern the custom models while Bedrock's managed surface handles the foundation-model ones.

the lifecycle at a glance

Six stages, the AWS service that owns each, and the failure it prevents

The whole loop on one page. Read it top to bottom as the path a model travels, and note the last column — every service exists to prevent a specific, expensive failure mode, and skipping a stage means accepting that failure.

Stage | Primary AWS service(s) | What it produces | The failure it prevents
1 · Prepare data | Glue · Data Wrangler · Feature Store | Curated, reproducible features | Training/serving skew; unrepeatable inputs
2 · Train | SageMaker Training · JumpStart · Trainium | A versioned model artifact | Unreproducible, unaffordable training
3 · Evaluate | Processing jobs · Clarify | A pass/fail evaluation report | Promoting a worse or unfair model
4 · Deploy | Endpoints (real-time / serverless / async) · Batch transform | A served model behind the right mode | Runaway cost or a missed latency SLA
5 · Monitor | Model Monitor · CloudWatch · Clarify | Drift + operational signals | A silently-wrong model on a green dashboard
6 · Retrain | Pipelines (triggered by Monitor) | A re-evaluated candidate model | Models that rot; manual retraining scrambles
Backbone | SageMaker Pipelines + Model Registry | Orchestration + governed source of truth | Experiments that never became a system
A team can use every individual service in column two and still not have MLOps — what makes it MLOps is the backbone row stitching the stages into a reproducible, governed, automatically-closing loop.
a recent match

From notebooks to a closed loop — anonymized

inquiry · series-b vertical-SaaS, data/AI team, remote-US
Series-B vertical SaaS, 9-person data team, ~$9K/month AWS, three models in production scoring customer churn and usage risk

Situation: Three valuable models, all trained in notebooks and served behind hand-rolled real-time endpoints that ran 24/7 even though two of them took intermittent traffic. No Feature Store (features were re-implemented in the serving code, and the team suspected skew), no Model Registry, no drift monitoring — a key churn model had quietly degraded for a quarter before anyone noticed. Nobody could confidently retrain or roll back. Leadership wanted governance ahead of a SOC 2 cycle and the AWS bill flagged the always-on endpoints.

What CloudRoute did: Routed within 22 hours to a vetted AWS partner with a SageMaker MLOps and FinOps track record. The partner stood up the backbone first: Feature Store for the shared churn/usage features (killing the skew), a SageMaker Pipeline per model (prep → train → evaluate-with-Clarify → conditional register), and a Model Registry with manual approval as the promotion gate. The two intermittent models moved from always-on real-time endpoints to serverless inference; the high-volume one stayed real-time but was right-sized with autoscaling. Model Monitor was wired to CloudWatch alarms that trigger the retraining pipeline on drift.

Outcome: Inference spend on the ML workloads dropped by roughly 55% (serverless + right-sizing), the skew bug was eliminated, and every production model now has full lineage, an approval trail, and drift alarms feeding automated retraining, which supplied the ML-governance evidence for the SOC 2 cycle. Scoped under AWS Well-Architected / POC funding, so the engagement was AWS-funded and the customer paid $0; CloudRoute was paid by the partner.

engagement window: 7 weeks · models migrated to the loop: 3 · ML inference spend: −~55% · governance: audit-ready · cost to customer: $0

faq

Common questions

What is the difference between the ML lifecycle and MLOps?
The ML lifecycle is the sequence of stages a model passes through — prepare data, train, evaluate, deploy, monitor, retrain. MLOps is the engineering practice that makes those stages reproducible, automated, and governed, so the loop runs reliably and auditably rather than as a one-off notebook effort. Put simply: the lifecycle is the what; MLOps is the how. On AWS, MLOps is largely the work of wiring the lifecycle stages together with SageMaker Pipelines and governing the output with the Model Registry.
Do I need SageMaker Pipelines, or can I just use a notebook?
A notebook is fine for experimentation. It is not fine for production, because it cannot reliably reproduce a model, cannot gate promotion on evaluation, and leaves no lineage. SageMaker Pipelines turns the lifecycle into a versioned, parameterized DAG where every execution is logged and re-runnable — which is the single biggest step from "we trained a model once" to "we operate models." If you have more than one model in production or any governance requirement, you need an orchestrator, and Pipelines is the native one.
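To make the contrast with a notebook concrete, here is a minimal plain-Python sketch of what an orchestrator adds: steps run in a fixed order, parameters are explicit, registration is conditional on evaluation, and every execution leaves a record. This is not the SageMaker Pipelines SDK. `run_pipeline`, the toy step functions, and the `min_score` parameter are all hypothetical stand-ins chosen for illustration.

```python
from collections import Counter

def prepare(params):
    # Stand-in for a processing job: the "dataset" is passed in directly.
    return list(params["labels"])

def train(labels, params):
    # Stand-in for a training job: the "model" predicts the majority label.
    return Counter(labels).most_common(1)[0][0]

def evaluate(model, labels):
    # Stand-in for an evaluation job: accuracy of the constant prediction.
    # (A real pipeline would score on held-out data, not the training data.)
    return sum(1 for y in labels if y == model) / len(labels)

def run_pipeline(params, execution_log):
    """Run prepare -> train -> evaluate -> conditional register and log the run."""
    record = {"params": dict(params), "steps": []}
    labels = prepare(params)
    record["steps"].append("prepare")
    model = train(labels, params)
    record["steps"].append("train")
    record["score"] = evaluate(model, labels)
    record["steps"].append("evaluate")
    # The gate: registration happens only if the metric clears the threshold.
    record["registered"] = record["score"] >= params["min_score"]
    if record["registered"]:
        record["steps"].append("register")
    execution_log.append(record)
    return record

log = []
passed = run_pipeline({"labels": [1, 1, 1, 0], "min_score": 0.7}, log)
failed = run_pipeline({"labels": [1, 1, 1, 0], "min_score": 0.9}, log)
print(passed["registered"], failed["registered"], len(log))
```

The same inputs always produce the same record, and the run that misses the threshold never reaches the register step. Those two properties are what a notebook cannot guarantee.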
When should I use Bedrock instead of training a custom model?
Use Bedrock when the task is general intelligence — language, summarization, extraction, reasoning, conversational agents — and a foundation model already does it well, especially when time-to-market matters and there is no proprietary signal a custom model would learn. Train a custom model on SageMaker when you own proprietary structured/tabular data (churn, fraud, demand, pricing, risk), need a small cheap low-latency model at high volume, or have explainability and full control as hard requirements. Most mature teams run both and route each problem to the side that fits.
What is the difference between data drift and concept drift?
Data drift is the input distribution changing: the data the model sees in production no longer matches its training data (new value ranges, new categories, shifted means). Concept drift is the relationship between inputs and the outcome changing: the same input now implies a different result because the world shifted. Data drift is often detectable from inputs alone; concept drift usually only reveals itself once ground-truth labels arrive, which is why capturing outcomes and feeding them back is essential. SageMaker Model Monitor covers both: its data-quality monitor compares production inputs against a training baseline, and its model-quality monitor scores predictions against ground-truth labels as they arrive, which is where concept drift shows up as falling accuracy. It also monitors bias drift and feature-attribution drift.
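As an illustration of detecting data drift from inputs alone, the sketch below computes a Population Stability Index (PSI) for one numeric feature. PSI is a generic drift statistic used here for illustration; it is not a description of how Model Monitor computes its checks. The function name, the bin edges, and the sample values are assumptions.

```python
import math

def psi(baseline, production, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bin edges."""
    def fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges that the value is greater than or equal to.
            counts[sum(1 for e in edges if v >= e)] += 1
        total = len(values)
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / total, eps) for c in counts]

    expected = fractions(baseline)
    actual = fractions(production)
    # Each term is non-negative, so PSI is 0 for identical distributions.
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))

# Hypothetical feature values: a training baseline and a shifted production sample.
baseline = [10, 20, 30, 40, 50, 60, 70, 80]
shifted = [v + 40 for v in baseline]
edges = [25, 50, 75]

print(psi(baseline, baseline, edges))
print(psi(baseline, shifted, edges))
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but the threshold is a convention to tune per feature, not a standard. Note that a statistic like this says nothing about concept drift, which needs labels.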
Which SageMaker endpoint type should I use?
Real-time endpoints for steady, latency-sensitive, synchronous traffic (fraud, recommendations) — always warm, so always costing. Serverless inference for spiky or intermittent traffic and idle-most-of-the-day internal tools — scales to zero, but has cold starts. Asynchronous inference for large payloads and long-running jobs (big documents, audio) processed in the background. Batch transform — not an endpoint — for scoring an entire dataset offline on a schedule. The choice is a cost-and-SLA decision first; the classic mistake is putting intermittent or offline work on always-on real-time endpoints.
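The decision above can be sketched as a small function. The `Workload` fields and their names are hypothetical, and the order of the checks is one reasonable reading of the answer, not an AWS rule.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    offline_full_dataset: bool           # score an entire dataset on a schedule
    large_payload_or_long_running: bool  # big documents, audio, slow inference
    intermittent_traffic: bool           # spiky or idle most of the day
    tolerates_cold_starts: bool          # occasional slow first request is acceptable

def choose_serving_mode(w):
    """Map workload traits to one of the four serving modes."""
    if w.offline_full_dataset:
        return "batch transform"
    if w.large_payload_or_long_running:
        return "asynchronous inference"
    if w.intermittent_traffic and w.tolerates_cold_starts:
        return "serverless inference"
    # Steady or latency-critical synchronous traffic stays on a warm endpoint.
    return "real-time endpoint"

internal_tool = Workload(False, False, True, True)
fraud_scoring = Workload(False, False, False, False)
print(choose_serving_mode(internal_tool))
print(choose_serving_mode(fraud_scoring))
```

Putting the offline and large-payload checks first reflects the answer's point that the classic mistake is defaulting to a real-time endpoint before asking whether the work needs one.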
How do I reduce the cost of training and inference on AWS?
On training: use ephemeral training jobs (you pay only for the seconds of compute), add managed Spot training with checkpointing for interruption-tolerant jobs, and consider AWS Trainium for large, frequent training where GPU price-performance is the constraint. On inference: match the serving mode to the workload (serverless or batch instead of always-on real-time where appropriate), right-size and autoscale real-time endpoints, host low-traffic models behind multi-model endpoints, and consider AWS Inferentia for steady high-volume serving. Over-provisioned always-on endpoints are the most common source of ML waste on an AWS bill.
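A back-of-envelope way to check whether an endpoint is over-provisioned is to compare a flat always-on cost with a pay-per-use cost at your actual request volume. The sketch below is a simplification under stated assumptions: every rate is a made-up placeholder, not an AWS price, and real serverless pricing also depends on configured memory and data processed.

```python
HOURS_PER_MONTH = 730  # common approximation for monthly billing estimates

def always_on_monthly_cost(hourly_rate, instance_count=1):
    """Cost of endpoints that are billed every hour regardless of traffic."""
    return hourly_rate * HOURS_PER_MONTH * instance_count

def pay_per_use_monthly_cost(requests, seconds_per_request, rate_per_second):
    """Cost when only the compute seconds actually used are billed."""
    return requests * seconds_per_request * rate_per_second

def break_even_requests(hourly_rate, seconds_per_request, rate_per_second):
    """Monthly request volume at which the two modes cost the same."""
    return always_on_monthly_cost(hourly_rate) / (seconds_per_request * rate_per_second)

# Placeholder inputs for illustration only. Substitute rates from your own bill.
HOURLY_RATE = 0.20        # hypothetical instance price per hour
SECONDS_PER_REQUEST = 0.1 # hypothetical inference duration
RATE_PER_SECOND = 0.0001  # hypothetical pay-per-use price per compute second

print(always_on_monthly_cost(HOURLY_RATE))
print(pay_per_use_monthly_cost(1_000_000, SECONDS_PER_REQUEST, RATE_PER_SECOND))
print(break_even_requests(HOURLY_RATE, SECONDS_PER_REQUEST, RATE_PER_SECOND))
```

Below the break-even volume the pay-per-use mode is cheaper; above it, a right-sized always-on endpoint wins. The useful output is the shape of the comparison, not the placeholder numbers.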
What does ML governance on AWS actually require?
Three things, all of which fall out of a well-built backbone: lineage (full provenance of every model — data, code, hyperparameters, metrics, approvals — via Pipelines, the Model Registry, and ML Lineage Tracking); access control with an approval gate (IAM scoping who can train, register, and approve a model for production, ideally with separation of duties); and documentation (SageMaker Model Cards recording intended use, evaluation, and limitations). The litmus test is whether you can produce, within an hour, the exact data, code, metrics, fairness checks, and approver for any production model.
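The litmus test can be expressed as a completeness check over one model version's evidence. This is a hypothetical sketch: the field names and the example record are invented for illustration and do not correspond to a Model Registry or Model Cards schema.

```python
# Evidence the answer above says you should be able to produce for any production model.
REQUIRED_EVIDENCE = (
    "data_uri",         # exact training data
    "code_commit",      # exact code
    "hyperparameters",
    "metrics",          # evaluation results
    "fairness_report",  # fairness checks
    "approved_by",      # who approved promotion
)

def missing_evidence(record):
    """Return the evidence fields that are absent or empty for one model version."""
    # Empty strings, empty dicts, and None all count as missing.
    return [name for name in REQUIRED_EVIDENCE if not record.get(name)]

# Hypothetical record; every value here is a placeholder.
record = {
    "data_uri": "s3://example-bucket/churn/train/",
    "code_commit": "abc1234",
    "hyperparameters": {"max_depth": 6},
    "metrics": {"auc": 0.84},
    "fairness_report": "s3://example-bucket/churn/clarify-report/",
    "approved_by": "",
}
print(missing_evidence(record))
```

A record that comes back with an empty list is audit-ready in the sense described above; anything else names exactly what the backbone failed to capture.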
How does automated retraining avoid deploying a worse model?
By never letting a retrained model go straight to production. Retraining produces a candidate that runs the identical evaluation gate as the original — scored on a held-out set, compared champion-versus-challenger against the incumbent, and checked for fairness regressions with Clarify. Only a candidate that wins is registered and promoted; a candidate that loses leaves the incumbent in place and alerts a human. Because the Model Registry keeps every prior approved version, promotion stays reversible. Automated retraining without an automated evaluation gate is exactly how teams deploy a worse model on a schedule.
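A minimal sketch of such a gate, assuming both models were scored on the same held-out set. The `Evaluation` fields, the `bias_gap` measure, and the tolerance parameters are hypothetical; in practice the fairness number would come from whatever bias metric your Clarify report produces.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Evaluation:
    version: str
    holdout_metric: float  # higher is better, e.g. AUC on the shared held-out set
    bias_gap: float        # lower is better; stand-in for a fairness metric

def gate(champion, challenger, min_gain=0.0, bias_tolerance=0.0):
    """Promote the challenger only if it beats the champion without a fairness regression."""
    reasons = []
    if challenger.holdout_metric - champion.holdout_metric <= min_gain:
        reasons.append("no metric gain over champion")
    if challenger.bias_gap - champion.bias_gap > bias_tolerance:
        reasons.append("fairness regression")
    if reasons:
        # The incumbent keeps serving and a human is told why.
        return {"promote": False, "serving": champion.version, "alert": reasons}
    return {"promote": True, "serving": challenger.version, "alert": []}

champion = Evaluation("v7", holdout_metric=0.84, bias_gap=0.03)
better = Evaluation("v8", holdout_metric=0.86, bias_gap=0.03)
biased = Evaluation("v8", holdout_metric=0.90, bias_gap=0.08)

print(gate(champion, better))
print(gate(champion, biased))
```

The second case is the important one: a challenger with a higher headline metric is still rejected because it regresses on fairness, and the champion stays in place.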

Want the SageMaker MLOps loop built for you — and AWS-funded?

CloudRoute routes ML and data teams to vetted AWS partners who build the end-to-end lifecycle — Feature Store, Pipelines, Model Registry, drift monitoring, cost-right serving. Often scoped under AWS funding (POC / Well-Architected / MAP), so the customer pays $0.

matched within: < 24h
backbone: Pipelines + Registry
cost to you: often $0