SageMaker FinOps · 2026 playbook

Cutting SageMaker inference cost — every lever, ranked by impact (2026).

SageMaker inference bills are dominated by one number: instance-hours that ran whether or not a request arrived. The biggest savings almost never come from a cheaper instance — they come from choosing the right endpoint type, sizing it honestly, letting it scale to zero, and packing more models onto fewer GPUs. This guide walks every lever — real-time vs serverless vs async vs batch transform, right-sizing, autoscaling and scale-to-zero, multi-model and multi-container endpoints, Inferentia and Graviton, Savings Plans, the idle-endpoint trap — each with the underlying mechanism, the typical savings, and where it ranks. We close with the honest build-vs-buy line: when to stay on SageMaker and when to move to Bedrock managed instead.

levers covered

top-3 combined savings: 50–90%

Inferentia2 $/token vs GPU: 40–60% less

scale-to-zero idle savings: 100%

TL;DR

The single largest SageMaker inference cost is the idle-endpoint trap: a real-time endpoint bills per instance-hour 24/7 whether or not traffic arrives. A $1.408/hr ml.g5.xlarge left running is ~$1,028/month, most of it serving zero requests in dev, staging, and abandoned demos. Find and kill idle endpoints first; it is the highest-ROI move and costs nothing to do.
After idle cleanup, the order of impact is: pick the right endpoint type for the traffic shape (serverless or batch transform can cut spend 60–90% for spiky or offline workloads), right-size + autoscale (often 30–50%), pack models with multi-model / multi-container endpoints (50–80% on fragmented fleets), move to Inferentia2 or Graviton (40–60% better $/token), then commit a steady-state baseline to a SageMaker Savings Plan (up to ~64%).
There is a build-vs-buy ceiling. If you are self-hosting a foundation model on SageMaker purely to serve inference and your utilization is low or bursty, a managed API like Amazon Bedrock is usually cheaper per token and removes the idle problem entirely — you pay per token, not per instance-hour. Keep SageMaker for custom models, fine-tunes you must own, strict in-VPC isolation, or sustained high utilization where reserved GPUs beat per-token pricing.

the cost model

I. Where SageMaker inference money actually goes

Before any optimization, you have to understand what SageMaker charges for. Almost every surprising inference bill traces back to a single fact: real-time endpoints bill per provisioned instance-hour, not per request. The instance is the meter, and it runs whether traffic is zero or saturated.

A SageMaker real-time endpoint is a managed, always-on fleet of EC2 instances behind a load balancer and an HTTPS invoke API. You choose an instance type and an initial instance count, and from the moment the endpoint reaches InService until the moment you delete it, you pay the per-hour rate multiplied by the instance count — continuously, by the second, with no free tier and no automatic shutdown. There is no "pause." An endpoint that served ten requests last week and zero this week costs exactly the same as one under load.

The dollar figures are not small. As a 2026 on-demand reference in us-east-1, at roughly 730 hours per month: a CPU ml.m5.xlarge runs about $0.23/hr (~$168/month if left on), a single-GPU ml.g5.xlarge about $1.408/hr (~$1,028/month), the inference-optimized ml.inf2.xlarge (Inferentia2) about $0.99/hr (~$723/month), and a multi-GPU ml.g5.12xlarge about $7.09/hr (~$5,175/month). Rates vary by region and move over time, so always confirm against the live pricing page, but the shape is what matters: a single idle GPU endpoint is a four-figure monthly line item, and most fleets have several.
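The monthly figures above are just the hourly rate multiplied out. A minimal sketch of that arithmetic, assuming a 730-hour month (8,760 hours / 12) and using the reference rates quoted in this section; the function name is illustrative, and the rates should be re-checked against the live pricing page:

```python
# Average hours in a month: 8,760 hours per year / 12.
HOURS_PER_MONTH = 730


def monthly_cost(hourly_rate: float, instance_count: int = 1) -> float:
    """Cost of leaving an always-on endpoint running for a full month."""
    return hourly_rate * instance_count * HOURS_PER_MONTH


# Reference on-demand rates quoted in the text (us-east-1, subject to change).
reference_rates = {
    "ml.m5.xlarge": 0.23,
    "ml.g5.xlarge": 1.408,
    "ml.inf2.xlarge": 0.99,
    "ml.g5.12xlarge": 7.09,
}

for instance_type, rate in reference_rates.items():
    print(f"{instance_type}: ${monthly_cost(rate):,.0f}/month if left on")
```

Multiply by the instance count and by the number of forgotten endpoints to see how quickly an idle fleet adds up.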

On top of instance-hours, SageMaker bills a few smaller things that occasionally matter: provisioned storage attached to the endpoint (per GB-month), data processing, and inter-AZ transfer. But there is no separate per-invocation charge on real-time endpoints — which is exactly why a busy endpoint and an idle one cost the same. The corollary is the whole game: your bill is set by how many instance-hours you provision, not by how many predictions you serve.

This reframes the entire optimization problem. The question is never "how do I make each prediction cheaper" — it is "how do I stop paying for instance-hours that did not serve traffic." Every lever in this guide is a different answer to that question: a cheaper meter (Inferentia, Graviton), a meter that turns off when idle (serverless, scale-to-zero, batch), fewer meters running the same work (multi-model endpoints), or a discounted rate on the meters you genuinely keep busy (Savings Plans).

the one-sentence cost model

A SageMaker real-time endpoint bills per provisioned instance-hour, 24/7, regardless of traffic. Every optimization is a variation on the same move: reduce paid-for instance-hours that did not serve a request — or get a discount on the ones that did.

lever #1 — do this first

II. The idle-endpoint trap (and why it is lever #1)

The highest-ROI action in SageMaker FinOps is not an architecture change. It is finding the endpoints that are running and serving nobody, and deleting them. This is free, takes an afternoon, and on most accounts removes the largest single source of waste.

Idle endpoints accumulate for entirely human reasons. A data scientist deploys a model to test a notebook and never deletes it. A demo endpoint goes up for a customer call and stays up for a year. A staging endpoint mirrors production but receives a request a day. A "temporary" A/B variant is left at one instance long after the test ended. None of these throw an error. None of them page anyone. They simply bill, quietly, every hour, forever.

The diagnostic is straightforward. For each endpoint, pull the CloudWatch Invocations metric over the last 7–14 days. Any endpoint with near-zero invocations is either dead (delete it) or genuinely low-traffic (a candidate for serverless or scale-to-zero, covered next). Cross-reference with the InvocationsPerInstance and CPU/GPU utilization metrics: an endpoint sitting at single-digit percent utilization is paying for capacity it is not using even when it does get traffic.

A practical sweep, in order: (1) list every endpoint in every region — waste hides in regions nobody checks; (2) tag each with an owner and an environment; (3) delete anything with zero invocations and no owner; (4) move low-but-nonzero traffic endpoints to serverless or scale-to-zero; (5) put a guardrail in place so the trap does not refill — a scheduled Lambda that flags any endpoint below an invocations threshold, or simply a policy that dev/staging endpoints auto-delete nightly and redeploy on demand.
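The first steps of that sweep can be scripted. The sketch below is one possible shape for a single region, not a finished tool: it takes boto3-style `sagemaker` and `cloudwatch` clients as arguments, sums the CloudWatch `Invocations` metric per endpoint variant, and returns the endpoints at or below a threshold. The function names, the 14-day window, and the zero threshold are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def total_invocations(cloudwatch, endpoint_name, variant_name, days=14):
    """Sum the Invocations metric for one endpoint variant over the window."""
    end = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="Invocations",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        StartTime=end - timedelta(days=days),
        EndTime=end,
        Period=86400,  # one datapoint per day
        Statistics=["Sum"],
    )
    return sum(point["Sum"] for point in resp["Datapoints"])


def find_idle_endpoints(sagemaker, cloudwatch, threshold=0, days=14):
    """Return (endpoint_name, invocations) pairs at or below the threshold."""
    idle = []
    for page in sagemaker.get_paginator("list_endpoints").paginate():
        for endpoint in page["Endpoints"]:
            name = endpoint["EndpointName"]
            detail = sagemaker.describe_endpoint(EndpointName=name)
            # Endpoints without variants in the response count as zero traffic.
            variants = detail.get("ProductionVariants", [])
            total = sum(
                total_invocations(cloudwatch, name, v["VariantName"], days)
                for v in variants
            )
            if total <= threshold:
                idle.append((name, total))
    return idle


# Usage sketch (requires boto3 and AWS credentials; run once per region):
#   import boto3
#   for region in boto3.session.Session().get_available_regions("sagemaker"):
#       sm = boto3.client("sagemaker", region_name=region)
#       cw = boto3.client("cloudwatch", region_name=region)
#       print(region, find_idle_endpoints(sm, cw))
```

Treat the output as a review list rather than a delete list: check the owner tag before removing anything, since a quiet endpoint may still be a low-traffic production dependency.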

The reason this ranks first is pure arithmetic. Architecture levers save a percentage of a workload that is actually doing work. Killing an idle GPU endpoint saves 100% of a four-figure monthly line that was doing no work at all. On a typical mid-size ML account with a dozen endpoints, idle cleanup alone routinely removes 20–40% of the inference bill before a single model is re-architected.

the cheapest endpoint is the one that does not exist

A deleted endpoint costs $0. A scaled-to-zero endpoint costs $0 while idle. Before optimizing how an endpoint runs, prove it should be running at all. Audit CloudWatch Invocations across all regions, delete the dead ones, and set a guardrail so the idle fleet does not silently rebuild.

lever #2 — the biggest architectural choice

III. Pick the right endpoint type for the traffic shape

SageMaker offers four inference modes, and choosing the wrong one is the most expensive mistake teams make. The right mode is dictated almost entirely by the shape of your traffic — how often requests arrive, how bursty they are, how latency-sensitive they are, and whether they need an answer now or can be processed offline.

Most teams default to a real-time endpoint because it is the first option in every tutorial. Real-time is the right answer for steady, latency-sensitive, online traffic — and the wrong answer for almost everything else. The other three modes exist specifically to stop you paying for idle instance-hours when your traffic does not justify an always-on fleet.

The decision rule is simple enough to apply in a meeting. Do you need a synchronous answer in milliseconds, on steady traffic? Real-time. Spiky or intermittent and cold-start-tolerant? Serverless. Heavy payloads or long inference that can be queued? Asynchronous (and let it scale to zero). No online answer needed at all? Batch transform. The single most common and most expensive anti-pattern is running a real-time endpoint for work that was actually serverless, async, or batch in disguise.
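The decision rule reads naturally as a short function. This is only a sketch of the rule as stated above; the parameter names are illustrative, and real workloads will need judgment on what counts as "spiky" or "queueable":

```python
def choose_endpoint_type(
    needs_online_answer: bool,
    heavy_and_queueable: bool,
    spiky: bool,
    cold_start_tolerant: bool,
) -> str:
    """Map a traffic shape to one of the four SageMaker inference modes."""
    if not needs_online_answer:
        return "batch transform"
    if heavy_and_queueable:
        return "asynchronous"
    if spiky and cold_start_tolerant:
        return "serverless"
    # Steady traffic, or spiky traffic that cannot absorb cold starts,
    # stays on real-time (and is a candidate for autoscaling).
    return "real-time"


# A nightly scoring job needs no online answer at all.
print(choose_endpoint_type(False, False, False, False))
# An internal tool with long quiet periods that tolerates a slow first call.
print(choose_endpoint_type(True, False, True, True))
```

The order of the checks mirrors the rule: the cheapest mode that still satisfies the latency requirement is tried first.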

Real-time endpoints — steady, low-latency, online

Bills: per provisioned instance-hour, 24/7, regardless of traffic.

Use when: you have consistent traffic and need single-digit-to-low-hundreds-of-milliseconds latency on every request — a live recommendation model, a fraud check in the payment path, an interactive feature.

Avoid when: traffic is spiky, intermittent, or offline. If your endpoint spends most of its hours idle, you are paying full price for nothing. This is the mode that creates the idle trap.

Serverless inference — spiky or intermittent, tolerant of cold starts

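Bills: per millisecond of compute actually used, plus the amount of data processed. You configure memory (and, in 2026, optional provisioned concurrency); when no requests arrive, you pay nothing for on-demand compute. Provisioned concurrency, if you enable it, is billed for as long as it is reserved.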

Use when: traffic is unpredictable or bursty, there are long quiet periods, and you can tolerate occasional cold-start latency (a few hundred ms to a few seconds while a container spins up). Internal tools, low-volume APIs, and "demo that occasionally gets used" workloads are ideal.

Typical savings: 60–90% versus an always-on real-time endpoint that was mostly idle, because you stop paying for empty hours entirely. The tradeoff is cold starts and (historically) no GPU support — serverless targets CPU-class and smaller models. For large models that need a GPU continuously, serverless is not the lever.
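Configuring a serverless variant is mostly a matter of picking a memory size and a concurrency cap. The helper below is a hedged sketch that builds the `ProductionVariants` entry for boto3's `create_endpoint_config`. The helper name and defaults are illustrative, and the allowed memory sizes reflect the documented 1 GB steps as best I recall them, so confirm current limits before relying on them.

```python
# Memory sizes believed to be accepted by ServerlessConfig (1 GB steps);
# this list is an assumption to verify against current AWS documentation.
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def serverless_variant(model_name, memory_mb=2048, max_concurrency=5):
    """Build one ProductionVariants entry for a serverless endpoint config."""
    if memory_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(f"memory_mb must be one of {ALLOWED_MEMORY_MB}")
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,
            "MaxConcurrency": max_concurrency,
        },
    }


# Usage sketch (requires boto3 and AWS credentials; names are hypothetical):
#   import boto3
#   boto3.client("sagemaker").create_endpoint_config(
#       EndpointConfigName="demo-serverless-config",
#       ProductionVariants=[serverless_variant("demo-model")],
#   )
```

Note that there is no instance type or instance count in the variant: that absence is the whole cost argument for this mode.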

Asynchronous inference — large payloads, long inference, queue-tolerant

Bills: per instance-hour while processing — but async endpoints can autoscale to zero instances when the request queue is empty, so an idle async endpoint costs $0.

Use when: requests are large (big documents, images, audio), each inference takes seconds to minutes, and the caller can accept a queued, near-real-time response rather than a synchronous one. The endpoint queues requests in S3, processes them, and writes results back.

Typical savings: substantial for bursty heavy workloads, because the combination of queueing and scale-to-zero means you only pay while genuinely processing. It is the natural home for "real-time-ish" workloads that are too heavy or too bursty for a synchronous real-time endpoint but too latency-sensitive for an overnight batch job.
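Scale-to-zero on an async endpoint is configured through Application Auto Scaling with a minimum capacity of 0. The sketch below only builds the keyword arguments for `register_scalable_target` and `put_scaling_policy`; the function name, policy name, and target backlog value are assumptions for illustration. AWS documentation also describes an additional policy for waking an endpoint from zero instances, which is not shown here.

```python
def async_scale_to_zero_config(
    endpoint_name,
    variant_name="AllTraffic",
    max_instances=4,
    backlog_per_instance=5.0,
):
    """Build Application Auto Scaling arguments for an async endpoint."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": 0,  # allows the fleet to drop to zero when idle
        "MaxCapacity": max_instances,
    }
    policy = {
        "PolicyName": f"{endpoint_name}-backlog-tracking",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": backlog_per_instance,
            "CustomizedMetricSpecification": {
                "MetricName": "ApproximateBacklogSizePerInstance",
                "Namespace": "AWS/SageMaker",
                "Dimensions": [
                    {"Name": "EndpointName", "Value": endpoint_name}
                ],
                "Statistic": "Average",
            },
            "ScaleInCooldown": 300,
            "ScaleOutCooldown": 300,
        },
    }
    return target, policy


# Usage sketch (requires boto3 and AWS credentials):
#   import boto3
#   aas = boto3.client("application-autoscaling")
#   target, policy = async_scale_to_zero_config("docs-async")
#   aas.register_scalable_target(**target)
#   aas.put_scaling_policy(**policy)
```

The target value is the queue depth each instance is expected to carry; a lower value scales out sooner at higher cost.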

Batch transform — offline, scheduled, no endpoint at all

Bills: per instance-hour only for the duration of the job, then the instances are torn down. There is no persistent endpoint and therefore no idle cost — ever.

Use when: you do not need an online response at all. Nightly scoring of a customer table, enriching a data warehouse, generating embeddings for a corpus, periodic re-scoring — anything where the answer is consumed in bulk rather than per-request.

Typical savings: often the cheapest mode by a wide margin for offline work, because it removes the endpoint entirely. A model that was served from an always-on real-time endpoint purely to run a nightly job is the textbook waste case — moving it to batch transform can cut that workload's cost by 80–95%.
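The size of that saving follows directly from how many hours the job actually runs. A minimal sketch, assuming the batch job uses the same instance type and count as the endpoint it replaces (the function name is illustrative):

```python
def batch_savings(job_hours_per_day: float, always_on_hours: float = 24.0) -> float:
    """Fraction of instance-hours saved by running only for the job's duration."""
    return 1 - job_hours_per_day / always_on_hours


# A nightly job that runs for two hours instead of holding an endpoint all day.
print(f"{batch_savings(2):.1%}")
```

Under this assumption, jobs that run between roughly 1.2 and 4.8 hours a day land in the 80–95% range quoted above.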

lever #3

IV. Right-size, autoscale, and scale to zero

Once an endpoint genuinely needs to be real-time, the next question is whether it is the right size and whether it shrinks when traffic drops. Most real-time endpoints are over-provisioned at deploy time and never revisited — the instance type was chosen for a feared peak that rarely arrives, and the count never moves.

Right-sizing starts with measurement, not guesswork. Pull CPU/GPU utilization and ModelLatency from CloudWatch under real production load. If a GPU endpoint sits at 15% GPU utilization, you are paying for a GPU you are barely using — either a smaller/cheaper instance fits, or the model can share a GPU with others (multi-model endpoints, next section). The goal is to run each instance in a healthy band — high enough that you are not paying for idle silicon, with enough headroom that latency stays within target during normal peaks.

Autoscaling then makes the fleet track demand instead of standing at peak size all day. SageMaker supports target-tracking autoscaling on metrics like SageMakerVariantInvocationsPerInstance or a target GPU/CPU utilization. You set a target, a minimum, and a maximum; the endpoint adds instances under load and removes them when traffic falls. For a workload with a clear daily peak and a quiet overnight, scaling from (say) four instances at midday down to one overnight removes the hours you were paying for capacity nobody used. Typical savings from right-sizing plus autoscaling land in the 30–50% range on endpoints that were previously pinned at peak size.
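To see where a figure in that range comes from, compare instance-hours for a fleet pinned at peak against one that follows demand. The hourly profile below is hypothetical, chosen only to match the "four at midday, one overnight" example:

```python
def instance_hours(hourly_counts):
    """Total instance-hours for a day, given the instance count in each hour."""
    return sum(hourly_counts)


# Pinned at the peak size of four instances for all 24 hours.
pinned = [4] * 24

# Hypothetical autoscaled day: 1 overnight, 2 on the ramps, 4 at the peak.
tracked = [1] * 8 + [2] * 4 + [4] * 6 + [2] * 4 + [1] * 2

savings = 1 - instance_hours(tracked) / instance_hours(pinned)
print(f"pinned: {instance_hours(pinned)} instance-hours")
print(f"tracked: {instance_hours(tracked)} instance-hours")
print(f"savings: {savings:.1%}")
```

The flatter your real traffic curve, the smaller this gap; the calculation is worth repeating with your own hourly instance counts before promising a number.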

The most impactful recent capability is scale-to-zero for real-time endpoints. SageMaker can now scale a real-time endpoint's instance count down to zero during sustained idle periods and scale it back out when traffic returns, paying nothing for the idle window. At the time of writing this applies to endpoints deployed with inference components, so confirm your endpoint qualifies before planning around it. This collapses much of the idle-trap problem for workloads that have genuinely quiet stretches but still need synchronous serving when active: staging environments, business-hours-only internal tools, regional endpoints that idle overnight. The tradeoff mirrors serverless: the first request after a scale-down pays a cold-start latency penalty while an instance comes back, so it suits workloads that can absorb an occasional slow request at the edge of an idle period.

A practical pattern combines all three: right-size the instance to real utilization, attach target-tracking autoscaling with a sensible minimum for latency-critical paths, and enable scale-to-zero (or a scheduled scale-down) for environments that do not need to serve overnight. Dev and staging endpoints in particular almost never justify 24/7 capacity — scheduling them to zero outside working hours is close to free money.

minimum capacity is a latency decision, not a cost decision

Setting an autoscaling minimum above zero buys you cold-start protection for latency-critical paths — keep it for production user-facing endpoints. For dev, staging, and business-hours-only tools, a minimum of zero (scale-to-zero) or a nightly scheduled scale-down is almost always correct: those workloads do not need a warm instance at 3am.

lever #4

V. Pack more models onto fewer instances

A surprising amount of SageMaker waste is fragmentation: dozens of models, each on its own under-utilized endpoint, each paying for a whole instance to serve a trickle of traffic. The fix is to consolidate — put many models behind fewer instances using multi-model or multi-container endpoints.

Multi-model endpoints (MMEs) host a large number of models behind a single endpoint and a shared fleet of instances. Models are loaded into memory on demand from S3 and evicted under memory pressure, so you can serve hundreds or thousands of models from a handful of instances instead of standing up an endpoint per model. This is transformative for the classic "model per customer," "model per region," or "model per SKU" pattern, where each individual model sees light, intermittent traffic. Instead of paying for N idle instances, you pay for a small shared fleet sized to aggregate demand.

The savings scale with how fragmented you were. A team serving 200 lightweight models on 200 single-instance endpoints is paying for 200 instances at low utilization; folding them into an MME on a handful of right-sized instances can cut that fleet cost by 50–80%. The tradeoff is a cold-load penalty the first time a given model is invoked after eviction, so MMEs suit catalogs where any individual model's traffic is light and a small first-hit latency is acceptable. Large, latency-critical, constantly-hot models are better on their own dedicated capacity.

Multi-container endpoints (and inference components) address a related case: hosting several different models or framework stacks behind one endpoint, sharing the underlying instances and, increasingly, sharing a GPU. Inference components let you pack multiple models onto the same accelerators and scale each independently, squeezing utilization up on expensive GPU instances rather than dedicating a whole GPU to a model that only needs a slice of it. For GPU fleets specifically, this is one of the strongest levers available: GPUs are the most expensive meter, and most single-model GPU endpoints run them well below capacity.

The unifying idea is utilization. An instance at 70% utilization serving five models is dramatically cheaper per prediction than five instances at 14% each serving one model apiece — same total work, one-fifth the instances. Consolidation does not make any single prediction faster, but it stops you renting five GPUs to do one GPU's worth of work.
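The same comparison in numbers, as a small sketch. The hourly rate is the ml.g5.xlarge reference figure used earlier; the request volume is hypothetical:

```python
def cost_per_1k_predictions(hourly_rate, instances, predictions_per_hour):
    """Fleet cost per 1,000 predictions at a given aggregate request rate."""
    return hourly_rate * instances / predictions_per_hour * 1000


RATE = 1.408  # reference ml.g5.xlarge rate from this guide
TOTAL_PREDICTIONS_PER_HOUR = 50_000  # hypothetical aggregate traffic

fragmented = cost_per_1k_predictions(RATE, 5, TOTAL_PREDICTIONS_PER_HOUR)
consolidated = cost_per_1k_predictions(RATE, 1, TOTAL_PREDICTIONS_PER_HOUR)

print(f"five instances at ~14% each: ${fragmented:.4f} per 1k predictions")
print(f"one instance at ~70%: ${consolidated:.4f} per 1k predictions")
print(f"ratio: {fragmented / consolidated:.1f}x")
```

The ratio depends only on the instance count, which is why consolidation pays off regardless of the absolute traffic level, as long as one instance can carry the combined load within latency targets.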

lever #5

VI. Inferentia and Graviton — a cheaper meter

Every lever so far reduces how many instance-hours you pay for. This one reduces the price of each hour by running the same inference on purpose-built AWS silicon — AWS Inferentia for accelerated model inference, and Graviton (Arm) for CPU inference.

AWS Inferentia is Amazon's custom inference accelerator, exposed through the inf2 family (Inferentia2) and served on SageMaker via the Neuron SDK. For supported models — which in 2026 span most mainstream transformer and CV architectures — Inferentia2 delivers materially better price-performance than comparable GPU instances: commonly 40–60% lower cost per token (or per inference) at similar latency, because you are buying silicon designed for inference economics rather than general-purpose GPU compute. For high-volume, steady inference of a supported model, moving the endpoint from a g5/GPU instance to an inf2 instance is one of the largest single cost reductions available, and it stacks with right-sizing, autoscaling, and Savings Plans.

The cost of admission is portability. Models must be compiled for Neuron before they run on Inferentia, and while the toolchain and model coverage have matured substantially, an exotic custom architecture or an unsupported operator can mean compilation work or, occasionally, that a model is not yet a fit. The pragmatic approach is to benchmark your actual model on inf2 before committing: compile it, measure latency and throughput against your current GPU endpoint, and confirm the price-performance win on your traffic. For mainstream models the win is usually clear; the benchmark exists to catch the exceptions.
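Once you have benchmark numbers, the comparison reduces to cost per token. The sketch below uses the reference hourly rates from this guide, but the throughput figures are invented placeholders; substitute the tokens per second you actually measure on each instance:

```python
def cost_per_million_tokens(hourly_rate: float, tokens_per_second: float) -> float:
    """Cost to generate one million tokens at a sustained throughput."""
    return hourly_rate / (tokens_per_second * 3600) * 1_000_000


# Hourly rates are the reference figures from the text; throughputs are
# hypothetical stand-ins for your own benchmark results.
gpu = cost_per_million_tokens(hourly_rate=1.408, tokens_per_second=900)
inf2 = cost_per_million_tokens(hourly_rate=0.99, tokens_per_second=1200)

savings = 1 - inf2 / gpu
print(f"GPU:  ${gpu:.3f} per 1M tokens")
print(f"inf2: ${inf2:.3f} per 1M tokens")
print(f"savings: {savings:.1%}")
```

Compare at equal latency targets: a higher throughput number obtained by batching more aggressively is not a like-for-like win if it pushes response times past your budget.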

Graviton is the same idea on the CPU side. AWS's Arm-based Graviton processors power CPU instance families that deliver better price-performance than equivalent x86 instances for many CPU-served models — classical ML, smaller transformers, and embedding models that do not need a GPU. For CPU inference workloads, choosing a Graviton-backed instance type over an x86 one is a low-effort win: often double-digit-percent cost-performance improvement with no architecture change beyond ensuring your container is built for Arm.

The decision order is worth stating plainly. First confirm you even need an accelerator — many models served on GPUs run perfectly well, and far more cheaply, on a right-sized CPU (Graviton) instance, and teams routinely over-reach for GPUs out of habit. If you genuinely need acceleration for throughput or latency, benchmark Inferentia2 before defaulting to a GPU. GPUs remain the right choice for very large models, cutting-edge architectures the Neuron stack does not yet cover, and training — but for steady, supported inference, custom AWS silicon is usually the cheaper meter.

benchmark before you commit

Inferentia2 and Graviton win on price-performance for supported models — but support is the catch. Compile your actual model for Neuron (Inferentia) or rebuild your container for Arm (Graviton), measure latency and throughput on your real traffic, and confirm the win before migrating the endpoint. For mainstream models the 40–60% inference savings are real; the benchmark catches the exceptions.

lever #6

VII. Commit the steady-state baseline to a Savings Plan

Every lever above is about running fewer or cheaper instance-hours on demand. The final pricing lever is a commitment discount on the instance-hours you genuinely cannot avoid: the steady-state baseline that runs 24/7 no matter what.

SageMaker Savings Plans give you a discount — up to roughly 64% off on-demand — in exchange for committing to a consistent amount of compute spend (measured in $/hour) for a one- or three-year term. They apply across SageMaker instance usage including real-time inference, and they are the right tool for the portion of your fleet that is provably always on: the production endpoints that must serve traffic around the clock and have a stable floor of demand.

The sequencing matters enormously, and getting it backwards is a classic FinOps error. Optimize first, commit second. A Savings Plan locks in spend; if you commit to a baseline and then discover half of it was idle endpoints, over-provisioned instances, or workloads that should have been serverless, you have purchased a discount on waste — and you are contractually holding that waste for a year or three. The correct order is: kill idle endpoints, move workloads to the right endpoint type, right-size and autoscale, migrate to cheaper silicon where it wins — and only then measure the stable, irreducible baseline that remains and commit that to a Savings Plan.

Sizing the commitment is a confidence decision. Look at the trough of your optimized usage — the floor below which spend essentially never drops — and commit to that floor, not your average and certainly not your peak. Usage above the commitment simply bills at on-demand (and can flex with autoscaling, serverless, and batch). Usage below the commitment is wasted commitment. Committing the conservative floor captures most of the discount while leaving headroom for the variable, already-optimized layer on top. Many teams start with a one-year commitment on a clearly stable baseline and expand it as confidence grows.
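A simplified model makes the floor-versus-peak tradeoff concrete. In this sketch the commitment is expressed as the on-demand-equivalent spend per hour that the plan covers, the discount is a flat rate, and the hourly usage series is hypothetical; real Savings Plans are denominated in discounted dollars and rates vary by instance, so treat this as an illustration of the shape only:

```python
def plan_cost(hourly_usage, covered, discount):
    """Total cost when `covered` on-demand dollars per hour are committed.

    The committed portion is paid every hour at the discounted rate whether
    or not it is used; usage above it bills at on-demand.
    """
    committed_per_hour = covered * (1 - discount)
    return sum(committed_per_hour + max(0.0, u - covered) for u in hourly_usage)


# Hypothetical optimized usage in on-demand dollars per hour.
usage = [10, 10, 12, 20, 30, 12]
DISCOUNT = 0.5  # illustrative; actual plan discounts go up to roughly 64%

on_demand = sum(usage)
at_floor = plan_cost(usage, covered=min(usage), discount=DISCOUNT)
at_peak = plan_cost(usage, covered=max(usage), discount=DISCOUNT)

print(f"no plan: ${on_demand}")
print(f"commit the floor: ${at_floor}")
print(f"commit the peak: ${at_peak}")
```

In this example committing the peak barely beats having no plan at all, because most hours pay for commitment that goes unused, while committing the floor is never wasted.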

the cardinal rule of commitments

Optimize first, commit second. A Savings Plan is a discount on spend you have promised to make — so promise only the steady-state baseline that survives after you have deleted idle endpoints, right-sized, and moved workloads to the cheapest correct mode. Committing before optimizing locks in waste for one to three years.

the master table

VIII. Every lever, ranked by impact

Here is the whole playbook in one view, ordered the way you should actually apply it. Effort is the work to implement; the savings ranges are typical outcomes on workloads where the lever applies, not guarantees — your numbers depend on your starting point. The cardinal sequencing rule holds: free, high-impact cleanup first; pricing commitments last.

SageMaker inference cost levers · ranked by impact · 2026

#	Lever	Mechanism	Typical savings	Effort	Best for
1	Kill idle endpoints	Delete endpoints serving ~0 traffic; remove the meter entirely	100% of the idle line	Low	Every account — do this first
2	Right endpoint type	Serverless / async / batch stop paying for idle hours	60–90%	Med	Spiky, intermittent, or offline traffic
3	Right-size + autoscale	Match instance + count to real utilization; track demand	30–50%	Med	Over-provisioned always-on real-time endpoints
4	Scale-to-zero	Real-time endpoint drops to 0 instances when idle	up to 100% of idle windows	Low–Med	Dev/staging + business-hours-only endpoints
5	Multi-model / multi-container	Pack many models onto a shared, well-utilized fleet	50–80%	Med	Fragmented fleets — model-per-customer/region/SKU
6	Inferentia2	Run supported model inference on purpose-built silicon	40–60% per token	Med–High	High-volume steady inference of supported models
7	Graviton (Arm)	CPU inference on cheaper Arm instances	10–30%	Low–Med	CPU-served models — classical ML, embeddings
8	SageMaker Savings Plan	Commit the steady baseline for up to ~64% off	up to 64% on baseline	Low	The irreducible always-on floor — commit LAST
9	Move to Bedrock managed	Per-token API removes instance-hours + idle entirely	workload-dependent	High	Low/bursty utilization serving foundation models

Apply top-down. Levers 1–5 reduce paid-for instance-hours and are mostly free or low-cost. Levers 6–7 swap in a cheaper meter. Lever 8 discounts the meters you keep — only after 1–7. Lever 9 is the build-vs-buy escape hatch when self-hosting no longer pencils out (next section).

build vs buy

IX. When to move to Bedrock managed instead

The most important cost decision is not which SageMaker lever to pull — it is whether you should be paying for instances at all. If you are self-hosting a foundation model on SageMaker purely to serve inference, a managed per-token API like Amazon Bedrock is frequently cheaper and structurally removes the idle problem. This is the honest build-vs-buy line.

The two pricing models are fundamentally different. SageMaker bills per instance-hour: you rent capacity, and your unit economics depend entirely on keeping that capacity busy. Bedrock bills per token (input and output) on its on-demand tier: you pay for work performed, and an idle Bedrock model costs nothing because there is nothing to provision. That difference is decisive at low utilization. A self-hosted model on a GPU endpoint that runs at 15% utilization is paying ~6.7× the per-prediction cost it would at full load (1 / 0.15 ≈ 6.7), whereas the same traffic on Bedrock costs the same per token whether it is your only request that hour or your ten-thousandth.

The crossover is utilization. There is a break-even where a fully-utilized reserved GPU (especially Inferentia2 under a Savings Plan) beats per-token pricing — sustained, high-volume, predictable inference is where self-hosting wins, because you amortize the instance across enormous throughput. Below that break-even, where traffic is low, bursty, or unpredictable, Bedrock almost always wins on total cost and simultaneously removes the idle trap, the right-sizing problem, the autoscaling tuning, and the operational burden of running the fleet. Many teams discover that an endpoint they spent weeks optimizing should not have been a self-hosted endpoint at all.
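The break-even can be estimated with two small functions. Everything numeric here apart from the ml.g5.12xlarge reference rate is a hypothetical placeholder: the peak throughput should come from your own load test and the per-token price from the current price list of the managed API you are comparing against.

```python
def self_hosted_cost_per_million(hourly_rate, peak_tokens_per_second, utilization):
    """Effective cost per 1M tokens when the instance is only partly busy."""
    tokens_per_hour = peak_tokens_per_second * utilization * 3600
    return hourly_rate / tokens_per_hour * 1_000_000


def break_even_utilization(hourly_rate, peak_tokens_per_second, api_price_per_million):
    """Utilization above which self-hosting is cheaper than per-token pricing."""
    full_load = self_hosted_cost_per_million(hourly_rate, peak_tokens_per_second, 1.0)
    return full_load / api_price_per_million


RATE = 7.09          # reference ml.g5.12xlarge rate from this guide
PEAK_TPS = 2000      # hypothetical measured peak throughput
API_PRICE = 3.00     # hypothetical blended $ per 1M tokens on a managed API

at_15 = self_hosted_cost_per_million(RATE, PEAK_TPS, 0.15)
at_100 = self_hosted_cost_per_million(RATE, PEAK_TPS, 1.0)
threshold = break_even_utilization(RATE, PEAK_TPS, API_PRICE)

print(f"15% utilization: ${at_15:.2f} per 1M tokens ({at_15 / at_100:.1f}x full load)")
print(f"break-even utilization: {threshold:.0%}")
```

A Savings Plan or a move to cheaper silicon lowers the effective hourly rate and therefore the break-even utilization, which is why the comparison should be run on the optimized fleet rather than the one you started with.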

Cost is not the only axis, and the non-cost reasons to stay on SageMaker are real. You keep self-hosting for a custom or proprietary model not available as a managed API; a fine-tune you must fully own and control; strict data-isolation or compliance requirements that demand the model run inside your own VPC on dedicated capacity; or genuinely high, steady utilization where reserved silicon plus a Savings Plan beats per-token economics. Those are legitimate, common, and worth paying for.

The decision framework is therefore two questions. First: is there a managed API (Bedrock, or a Bedrock-hosted version of your model family) that serves your use case? If not, you self-host — optimize with levers 1–8. If yes: is your utilization high and steady enough that a reserved, fully-utilized endpoint beats per-token pricing, or do you have a hard isolation/ownership requirement? If yes, self-host. If no — if you are running low or bursty inference on a foundation model purely because that is how you started — moving to Bedrock managed is usually the single largest cost reduction on the table, because it does not optimize the instance-hours, it deletes them.

the build-vs-buy one-liner

Self-host on SageMaker for custom models, owned fine-tunes, in-VPC isolation, or sustained high utilization — and optimize hard with levers 1–8. Move to Bedrock managed when you are serving a foundation model at low or bursty utilization: per-token pricing deletes the idle problem instead of optimizing around it.

the rollout

X. Putting it together — a 30-day sequence

Levers in the wrong order waste money. Here is the sequence a disciplined team runs, front-loading the free, high-impact work and saving the commitments and migrations for after the picture is clean.

Week 1 — audit and delete (free, highest ROI). List every endpoint in every region, tag owner + environment, pull 7–14 days of CloudWatch Invocations and utilization, and delete everything dead. Schedule dev/staging endpoints to scale to zero or auto-delete nightly. This alone typically removes 20–40% of the bill.

Week 2 — re-home workloads to the right type. For each surviving endpoint, ask the traffic-shape question: offline jobs to batch transform, heavy/bursty payloads to async (scale-to-zero), intermittent low-volume APIs to serverless. Leave only genuinely steady, latency-critical traffic on real-time.

Week 3 — right-size, autoscale, consolidate. Tune instance types to real utilization, attach target-tracking autoscaling with sensible minimums, enable scale-to-zero where idle windows exist, and fold fragmented model fleets into multi-model or multi-container endpoints to lift utilization on the expensive GPU instances.

Week 4 — cheaper silicon, then commit. Benchmark steady high-volume models on Inferentia2; move CPU models to Graviton where the container supports Arm. Only now that the fleet is clean, measure the stable baseline and commit it to a Savings Plan — and run the build-vs-buy test on any low-utilization foundation-model endpoints, migrating to Bedrock where it wins.

Running these in order, rather than reaching first for a Savings Plan or a cheaper instance, means you discount and migrate a fleet that has already shed its waste — so every dollar of commitment lands on something real. The biggest wins are almost always in the first two weeks, cost almost nothing, and require no new architecture.

side by side

The four SageMaker endpoint types — when each wins

Endpoint-type choice is the largest architectural cost lever, and it is dictated by traffic shape. This is the decision table: match the row to how your requests actually arrive, not to which option appeared first in the tutorial you started from.

Variable	Real-time	Serverless	Asynchronous	Batch transform
Billing basis	Per instance-hour, 24/7	Per ms of compute + data processed	Per instance-hour while processing	Per instance-hour for the job only
Cost when idle	Full price (the trap)	$0	$0 (scales to zero)	$0 (no endpoint exists)
Latency	Lowest, consistent	Low + cold starts	Queued / near-real-time	Offline / bulk
Best traffic shape	Steady, online	Spiky, intermittent	Bursty, heavy payloads	Scheduled, offline
Payload / duration	Small, fast	Small–medium, fast	Large, long-running	Whole datasets
GPU support	Yes	Limited (CPU-class)	Yes	Yes
Typical savings vs idle real-time	baseline	60–90% cheaper	large for bursty heavy	80–95% cheaper offline

Decision rule: synchronous + steady → real-time; spiky + cold-start-tolerant → serverless; heavy/long + queue-tolerant → async (scale-to-zero); no online answer needed → batch transform. The most expensive mistake is running real-time for work that was actually one of the other three.

a recent match

A SageMaker inference-bill teardown — anonymized

inquiry · series-b ML-driven SaaS, remote-EU

Series-B B2B SaaS, ML-heavy product, ~22 engineers, SageMaker inference bill ~$31K/month and climbing

Situation: A document-intelligence product running everything on always-on real-time GPU endpoints: a per-customer model fleet (one endpoint per enterprise customer), a nightly re-scoring job served from a live endpoint, and a self-hosted open-weights model behind a g5.12xlarge at ~18% average utilization. Bill had grown with the customer count and nobody owned it. The ML team wanted to cut spend without a multi-month rewrite, and to fund the rebuild rather than pay for it out of runway.

What CloudRoute did: Routed within 20 hours to an AWS Advanced-tier partner with a SageMaker FinOps + Neuron track record. Week 1: audited all regions, deleted 9 dead demo/staging endpoints (immediate ~$6K/month). Week 2: moved the nightly re-scoring job to batch transform (no endpoint), and folded the 40+ per-customer models into a multi-model endpoint on a small right-sized fleet. Week 3: benchmarked the self-hosted model on inf2, confirmed the price-performance win, and migrated it; right-sized + added scale-to-zero on staging. Week 4: committed the now-stable baseline to a 1-year SageMaker Savings Plan. The partner filed AWS Well-Architected / POC funding to cover the engagement.

Outcome: Monthly inference spend fell from ~$31K to ~$9.8K (a ~68% reduction) within the engagement window, with latency targets held. Idle cleanup and batch/MME consolidation drove most of it; Inferentia2 and the Savings Plan compounded the rest. The remediation work was AWS-funded via the partner, and CloudRoute's commission was paid by the partner. The customer paid $0 for the engagement.

engagement window: ~4 weeks · monthly spend: $31K → ~$9.8K (~68%) · idle endpoints removed: 9 · cost to customer: $0

faq

Common questions

What is the single biggest cause of high SageMaker inference bills?

Idle real-time endpoints. A real-time endpoint bills per provisioned instance-hour 24/7 regardless of traffic, so endpoints that serve little or no traffic (abandoned demos, forgotten dev/staging deployments, leftover A/B variants) quietly accumulate charges while doing no work: a single ml.g5.xlarge at $0.736/hr is about $537/month, and a handful of them reaches four figures. Auditing the CloudWatch Invocations metric across every region and deleting dead endpoints is the highest-ROI move in SageMaker FinOps, and the audit itself costs nothing.
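
One way to run that audit is with boto3. The sketch below assumes the standard `AWS/SageMaker` `Invocations` metric with `EndpointName` and `VariantName` dimensions; the function names, the 14-day lookback, and the zero threshold are illustrative choices. It only reports candidates and deletes nothing.

```python
from datetime import datetime, timedelta, timezone


def total_invocations(datapoints):
    """Sum the 'Sum' statistic across CloudWatch datapoints."""
    return sum(dp.get("Sum", 0.0) for dp in datapoints)


def find_idle_endpoints(regions, lookback_days=14, threshold=0):
    """Return (region, endpoint_name) pairs at or below `threshold` invocations."""
    import boto3  # imported lazily so the helper above works without AWS access

    end = datetime.now(timezone.utc)
    start = end - timedelta(days=lookback_days)
    idle = []
    for region in regions:
        sm = boto3.client("sagemaker", region_name=region)
        cw = boto3.client("cloudwatch", region_name=region)
        for page in sm.get_paginator("list_endpoints").paginate():
            for ep in page["Endpoints"]:
                name = ep["EndpointName"]
                desc = sm.describe_endpoint(EndpointName=name)
                count = 0.0
                for variant in desc.get("ProductionVariants") or []:
                    stats = cw.get_metric_statistics(
                        Namespace="AWS/SageMaker",
                        MetricName="Invocations",
                        Dimensions=[
                            {"Name": "EndpointName", "Value": name},
                            {"Name": "VariantName", "Value": variant["VariantName"]},
                        ],
                        StartTime=start,
                        EndTime=end,
                        Period=86400,  # one datapoint per day
                        Statistics=["Sum"],
                    )
                    count += total_invocations(stats["Datapoints"])
                if count <= threshold:
                    idle.append((region, name))
    return idle
```

Calling `find_idle_endpoints(["us-east-1", "eu-west-1"])` would list candidates in those two regions. Confirm each one with its owner before deleting, since some endpoint configurations may report traffic under other dimensions.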

How do I choose between real-time, serverless, asynchronous, and batch transform?

By traffic shape. Use real-time for steady, latency-sensitive online traffic; serverless for spiky or intermittent traffic that can tolerate cold starts (you pay nothing when idle); asynchronous for large payloads or long inferences that can be queued (it scales to zero); and batch transform for offline, scheduled work that needs no live endpoint at all. The most common expensive mistake is running an always-on real-time endpoint for work that was actually serverless, async, or batch.

Does SageMaker support scale-to-zero for real-time endpoints?

Yes. SageMaker can scale a real-time endpoint down to zero instances during sustained idle periods and bring it back when a request arrives, so you pay nothing during idle windows. The tradeoff is a cold-start latency penalty on the first request after a scale-down, which makes it ideal for dev/staging and business-hours-only endpoints, and usable for production paths that can tolerate an occasional slow request at the edge of an idle period. Async endpoints have long scaled to zero based on queue depth.
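
As a sketch of what the configuration can look like, the helper below builds the arguments for Application Auto Scaling's `register_scalable_target` call. It assumes the endpoint is deployed with inference components, which is the setup for which AWS documents scale-to-zero; the helper name and the component name `staging-llm-ic` are hypothetical. The scaling policy and the alarm that scales back out from zero are still required and are not shown.

```python
def scale_to_zero_target(inference_component_name: str, max_copies: int) -> dict:
    """Build kwargs for Application Auto Scaling's register_scalable_target."""
    if max_copies < 1:
        raise ValueError("max_copies must be at least 1")
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"inference-component/{inference_component_name}",
        "ScalableDimension": "sagemaker:inference-component:DesiredCopyCount",
        "MinCapacity": 0,  # lets the copy count drop to zero when idle
        "MaxCapacity": max_copies,
    }


# Applying it (needs AWS credentials and an existing inference component):
#   import boto3
#   aas = boto3.client("application-autoscaling")
#   aas.register_scalable_target(**scale_to_zero_target("staging-llm-ic", 2))
```

Keeping the argument builder separate from the API call makes the configuration easy to review and unit-test before it touches a live endpoint.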

How much can Inferentia2 actually save versus GPU instances?

For supported models, Inferentia2 (the inf2 family) typically delivers 40–60% lower cost per token or per inference than comparable GPU instances at similar latency, because the silicon is purpose-built for inference rather than general-purpose compute. The catch is portability: models must be compiled with the Neuron SDK, and coverage, while broad in 2026, does not include every exotic architecture. Benchmark your specific model on inf2 before migrating to confirm the win on your traffic.
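
A benchmark comparison comes down to cost per token: hourly price divided by measured throughput. The sketch below shows that arithmetic. The prices and throughputs are made-up placeholders, not measured or published figures, so substitute your own benchmark numbers.

```python
def cost_per_million_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    """Dollars per one million tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000


def relative_saving(baseline_cost: float, candidate_cost: float) -> float:
    """Fraction saved by the candidate relative to the baseline."""
    return 1 - candidate_cost / baseline_cost


# Hypothetical benchmark results (placeholders, not real prices):
gpu = cost_per_million_tokens(hourly_price_usd=1.50, tokens_per_second=500)
inf2 = cost_per_million_tokens(hourly_price_usd=1.00, tokens_per_second=600)
print(f"GPU ${gpu:.2f}/M, inf2 ${inf2:.2f}/M, saving {relative_saving(gpu, inf2):.0%}")
```

Measure throughput at the latency target you actually need to hold, since the result can shift with batch size and sequence length.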

What are multi-model endpoints and when do they save money?

Multi-model endpoints (MMEs) host many models behind a single endpoint and a shared instance fleet, loading models from S3 on demand and evicting under memory pressure. They are ideal for fragmented fleets — model-per-customer, model-per-region, model-per-SKU — where each model sees light, intermittent traffic. Instead of paying for N under-utilized single-model endpoints, you pay for a small shared fleet, often cutting that cost 50–80%. The tradeoff is a cold-load penalty the first time a given model is invoked after eviction, so MMEs suit light-traffic catalogs rather than constantly-hot, latency-critical models.
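
The consolidation arithmetic is simple enough to sketch. The example assumes 730 billable hours per month, and the fleet sizes and hourly prices are hypothetical placeholders rather than real instance prices.

```python
HOURS_PER_MONTH = 730  # assumed average month of always-on billing


def monthly_cost(instance_count: int, hourly_price_usd: float) -> float:
    """Monthly cost of an always-on fleet."""
    return instance_count * hourly_price_usd * HOURS_PER_MONTH


def mme_saving(n_models, single_hourly_usd, shared_instances, shared_hourly_usd):
    """Compare one endpoint per model against a shared multi-model fleet."""
    before = monthly_cost(n_models, single_hourly_usd)
    after = monthly_cost(shared_instances, shared_hourly_usd)
    return before, after, 1 - after / before


# Hypothetical: 40 single-model endpoints at $0.25/hr each,
# folded into 4 larger shared instances at $1.00/hr each.
before, after, saving = mme_saving(40, 0.25, 4, 1.00)
print(f"${before:,.0f} -> ${after:,.0f} per month ({saving:.0%} saved)")
```

How small the shared fleet can be depends on how many models must stay loaded in memory at once, which only load testing will reveal.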

Should I buy a SageMaker Savings Plan to cut inference costs?

Yes, but last — not first. Savings Plans give up to roughly 64% off in exchange for a one- or three-year compute-spend commitment, and they are right for the steady baseline you genuinely run 24/7. The cardinal rule is optimize first, commit second: if you commit before deleting idle endpoints, right-sizing, and moving workloads to the cheapest correct mode, you lock in a discount on waste for years. Measure the irreducible floor after optimizing, then commit that conservative floor.
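
One way to estimate that conservative floor is to take a low percentile of hourly spend measured after optimization. This is a minimal sketch: the 10th-percentile default, the nearest-rank method, and the sample data are assumptions, not an AWS recommendation.

```python
import math


def commitment_floor(hourly_spend_usd, percentile=10.0):
    """Nearest-rank low percentile of hourly spend, as a conservative floor."""
    if not hourly_spend_usd:
        raise ValueError("need at least one hourly sample")
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    ordered = sorted(hourly_spend_usd)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical post-optimization hourly spend samples in USD:
samples = [4.0, 4.2, 4.1, 9.5, 12.0, 4.0, 3.9, 4.3, 8.8, 4.1]
print(commitment_floor(samples))  # 3.9
```

In practice you would feed in several weeks of hourly data rather than ten samples. Savings Plan commitments are expressed in dollars per hour, so check how the floor maps onto the commitment amount in the purchase flow before committing.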

When should I move from self-hosting on SageMaker to Bedrock managed?

Move to Bedrock when you are serving a foundation model at low, bursty, or unpredictable utilization. Bedrock bills per token, so an idle model costs nothing — it deletes the instance-hour and idle-endpoint problem rather than optimizing around it. Stay on SageMaker for custom or proprietary models, fine-tunes you must own, strict in-VPC isolation/compliance requirements, or genuinely high, steady utilization where a fully-used reserved endpoint (especially Inferentia2 under a Savings Plan) beats per-token pricing. The crossover is utilization.
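
The crossover can be estimated as a break-even utilization: the fraction of peak throughput at which an always-on instance costs the same as paying per token. The sketch below uses a single blended per-token price, which is a simplification, and all numbers are hypothetical placeholders rather than real Bedrock or SageMaker prices.

```python
def breakeven_utilization(
    instance_hourly_usd: float,
    peak_tokens_per_second: float,
    api_price_per_million_tokens_usd: float,
) -> float:
    """Utilization above which the self-hosted instance is the cheaper option."""
    if peak_tokens_per_second <= 0 or api_price_per_million_tokens_usd <= 0:
        raise ValueError("throughput and API price must be positive")
    peak_tokens_per_hour = peak_tokens_per_second * 3600
    # What the same hour of traffic would cost per token at full load.
    api_cost_at_peak = peak_tokens_per_hour / 1_000_000 * api_price_per_million_tokens_usd
    return instance_hourly_usd / api_cost_at_peak


# Hypothetical: $5.00/hr instance, 1,000 tokens/s peak, $4.00 per million tokens.
u = breakeven_utilization(5.00, 1000, 4.00)
print(f"break-even at about {u:.0%} utilization")
```

With these placeholder inputs, an endpoint running well below the break-even figure would be cheaper on per-token pricing, and one running well above it would be cheaper self-hosted. A result above 100% means self-hosting never wins at that price.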

Can AWS credits or funding cover a SageMaker cost-optimization project?

Often, yes. AWS funds cost-optimization and re-architecture work through programs like Well-Architected reviews and POC/build funding when an AWS partner files the engagement, which can mean the remediation work itself is AWS-funded and the customer pays $0. CloudRoute routes you to a vetted partner who both does the SageMaker FinOps work and files for the applicable funding — so you cut the recurring bill and the project that cuts it is paid for by AWS, not your runway.

Cut your SageMaker inference bill — and have AWS fund the rebuild

CloudRoute routes you to a vetted AWS partner who audits your endpoints, re-architects the waste, and files for AWS funding so the engagement costs you $0. No procurement. No discovery theater.

matched within: < 24h · typical inference savings: 40–70% · cost to you: $0