SageMaker endpoints · the complete 2026 guide

Amazon SageMaker endpoints — the four ways to deploy a model for inference, explained.

Q: What is a SageMaker endpoint?

A SageMaker endpoint is the managed, hosted service that serves predictions from a deployed model — the inference side of the ML lifecycle. You deploy a trained model artifact (from S3) with an inference container, and SageMaker exposes a live HTTPS service your application invokes to get predictions, handling the container, the compute, and (for the persistent modes) the autoscaling and load balancing. There are four endpoint types — real-time, serverless, asynchronous, and batch transform — and choosing among them is the single biggest cost-and-latency decision in serving.

Q: What are the four types of SageMaker endpoints?

Real-time (a persistent, always-on endpoint with millisecond latency, for steady online traffic; bills 24/7 whether or not requests arrive); serverless (scales to zero and bills per inference, for spiky or intermittent traffic, with occasional cold starts); asynchronous (queues requests and writes results to S3, for large payloads or long-running inferences, can scale to zero); and batch transform (a transient job that scores a whole dataset in S3 with no persistent endpoint, for offline bulk scoring). The right one falls out almost entirely from your traffic shape and latency requirement.

Q: What is the difference between real-time and serverless inference on SageMaker?

Both serve synchronous online requests; the difference is what happens when traffic is low. A real-time endpoint keeps instances always-on, so every request is fast (no cold start) but you pay per instance-hour 24/7 whether or not requests arrive. Serverless inference scales to zero when idle and bills only for the compute consumed during inference — far cheaper for spiky or intermittent traffic — but the first request after a quiet period pays a cold-start latency. Use real-time for steady, latency-sensitive traffic; use serverless for bursty or low-traffic workloads (and as a low-risk default that costs nothing when idle).

Q: What is the cheapest way to host a model on SageMaker?

It depends on the traffic. For offline, whole-dataset scoring, batch transform is usually cheapest — a transient job, nothing bills between runs. For spiky or intermittent online traffic, serverless inference is typically cheapest because it scales to zero. For large payloads or long inferences, asynchronous inference can scale to zero between bursts. Real-time endpoints are the most expensive when idle and only make sense for steady, latency-sensitive traffic. If you run many models each at light traffic, a multi-model endpoint consolidates them onto one shared fleet for a further large saving. Matching the endpoint type to the traffic is the single biggest cost lever in SageMaker serving.

Q: What is a multi-model endpoint on SageMaker?

A multi-model endpoint (MME) is a single endpoint that serves many models of the same framework from a shared fleet of instances. The models live in S3; SageMaker loads a requested model into memory on demand (via a TargetModel header), caches recently-used ones, and evicts cold ones to make room — letting you host thousands of models behind one endpoint. It replaces N idle endpoints with one autoscaled shared fleet, which dramatically cuts cost for the one-model-per-customer pattern. The trade-off is a load latency on the first request for a model not currently in memory. For a small number of different frameworks, or a serial inference pipeline, use a multi-container endpoint instead.

Q: How do I deploy a model to a SageMaker endpoint?

The path is: (1) have a trained model artifact in S3 plus an inference container; (2) create a SageMaker model object pointing at the artifact and container with an IAM role; (3) choose the endpoint type and create an endpoint configuration (instance type/count and autoscaling for real-time/async, memory and concurrency for serverless); (4) create the endpoint (or launch a batch transform job); (5) invoke it with InvokeEndpoint or InvokeEndpointAsync; (6) attach autoscaling and Model Monitor; (7) roll out updates behind production variants and delete endpoints you no longer use. In the SageMaker Python SDK most of this collapses into a model.deploy(...) call — just remember that deploying with an instance type creates an always-on real-time endpoint that bills until you delete it.

Q: Why is my SageMaker endpoint bill so high?

The overwhelmingly common cause is an always-on real-time endpoint left running — those bill per instance-hour 24/7 whether or not they serve traffic, so a forgotten test endpoint on an entry GPU is roughly $1,000/month of pure waste. Other frequent causes: defaulting to real-time when serverless or batch transform would fit the traffic, hosting on a larger instance than the model needs, and running one endpoint per model when a multi-model endpoint would consolidate them. Audit for idle endpoints first, then match the endpoint type to the traffic shape — together those two fixes usually account for the largest share of a surprise serving bill.

Q: Should I use a SageMaker endpoint or Amazon Bedrock for inference?

If you want to call an existing foundation model (Claude, Llama, Nova, Mistral, and others) — a chat assistant, summarizer, RAG system, or coding helper — Amazon Bedrock managed inference is usually the better path: you pay per token, there is no endpoint or instance to manage, and there is zero idle cost. A SageMaker endpoint is the right tool when you need to serve your own model, run classical/tabular ML that is not a foundation model, fine-tune weights deeply, or control the serving environment (specific instances, custom containers, AWS silicon). Some teams do both — serve their highest-volume custom model on a right-sized SageMaker endpoint (often on Inferentia) and everything else through Bedrock.

Q: Can AWS credits cover SageMaker endpoint hosting?

Yes. AWS credits apply to SageMaker inference compute (endpoints of every type), storage, and features just like any other AWS service, auto-applying to your monthly bill until exhausted. Eligible programs include Activate Portfolio (up to ~$100K), Bedrock/GenAI PoC funding ($10K–$50K), and the Generative AI Accelerator (up to $1M). Credits also stack on top of Savings Plans and AWS-silicon savings, so disciplined cost management makes them last longer. CloudRoute routes you to a vetted AWS partner who both deploys the model and files the credit application; the customer pays $0 because AWS funds the pool and the partner pays CloudRoute a routing commission.

Once a model is trained, you have to serve it — and on SageMaker that means choosing among four endpoint types: real-time, serverless, asynchronous, and batch transform. Each has a different cost shape, latency profile, cold-start behaviour, and autoscaling model, and picking the wrong one is the most common way to overspend. This guide covers all four in depth, plus multi-model and multi-container endpoints, the step-by-step deploy flow, how SageMaker hosting compares to Bedrock managed inference, and how to cut the bill — including taking it to $0 with AWS credits.


endpoint types: four · billing granularity: per second · biggest cost lever: endpoint mode · credits to fund it: up to $1M

TL;DR

A SageMaker endpoint is the hosted service that serves predictions from a deployed model. There are four types: real-time (persistent, always-on, millisecond latency), serverless (scales to zero, pay per inference, occasional cold starts), asynchronous (queued, for large payloads or long inferences), and batch transform (a transient job that scores a whole dataset with no persistent endpoint). Choosing among them is the single biggest cost-and-latency decision in serving.
The decision turns on two questions: how predictable is your traffic, and how fast must each prediction return? Steady online traffic → real-time; spiky online → serverless; large or slow inferences → asynchronous; whole-dataset offline scoring → batch transform. The same model can differ by 10–20× in monthly cost purely on which mode you pick — real-time endpoints bill 24/7 whether or not requests arrive; serverless, async, and batch scale to zero.
For consolidation, multi-model and multi-container endpoints let many models share one endpoint and its instances. And AWS credits (Activate up to $100K, Bedrock/GenAI PoC $10K–$50K, GenAI Accelerator up to $1M) cover SageMaker hosting compute, storage, and features — CloudRoute routes you to the partner who both deploys the model and files the credit application; you pay $0, AWS funds it.

definition

I · What a SageMaker endpoint actually is

A SageMaker endpoint is the managed, hosted service that takes a trained model and serves predictions from it — the inference side of the ML lifecycle, where a model artifact becomes a thing your application can actually call.

The cleanest one-line definition: a SageMaker endpoint is a managed inference service that loads your trained model, runs it on AWS-managed compute, and returns predictions in response to requests — with SageMaker handling the container, the scaling, and (for the persistent modes) the load balancing. Training produces a model artifact in S3; an endpoint is how that artifact gets served. The two are separate, separately-billed steps: a training job spikes and disappears, while an endpoint persists for as long as you keep it up.

Mechanically, deploying to an endpoint involves three SageMaker objects. A model ties together the artifact in S3 and the inference container image that knows how to load and run it. An endpoint configuration specifies how to host that model — the instance type and count, the endpoint type, autoscaling settings, and (for advanced setups) the traffic split across production variants. An endpoint is the live, named HTTPS service that applications invoke; you call it and SageMaker routes the request to the model behind it. Understanding this model → config → endpoint chain is what makes the rest of deployment make sense.

The single most consequential choice in all of this is the endpoint type, set in the endpoint configuration. SageMaker offers four, and they are genuinely different services under one name — different billing bases, different latency characteristics, different scaling behaviour. The bulk of this guide (sections II–IV) walks each one, because choosing the right type for your traffic is the highest-leverage cost and performance decision you make on the serving side.

A note on what an endpoint is not: it is not a training resource (that is a training job), and it is not where you call Amazon's foundation models (that is Bedrock — covered in section VIII). A SageMaker endpoint serves your model — one you trained, fine-tuned, or pulled from JumpStart — on infrastructure you choose. If you only want to call an existing foundation model through a managed API, you do not need a SageMaker endpoint at all.

model → endpoint config → endpoint

Three objects, in order: a model (artifact in S3 + inference container) → an endpoint configuration (instance type/count, endpoint type, autoscaling, variants) → a live endpoint (the HTTPS service your app invokes). The same model can be hosted by many different endpoint configurations; the config is where the cost-and-latency decisions live.
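The chain above can be sketched as three request payloads built in creation order. This is a minimal sketch, not a definitive implementation: the function name, the "-config" and "-endpoint" naming scheme, the variant name "AllTraffic", and the default instance type are illustrative assumptions, while the dictionary keys follow the boto3 SageMaker client's create_model, create_endpoint_config, and create_endpoint calls.

```python
def build_deployment_requests(name, image_uri, model_data_url, role_arn,
                              instance_type="ml.m5.large", instance_count=1):
    """Return the model, endpoint-config, and endpoint payloads, in creation order.

    Naming scheme and defaults are illustrative assumptions.
    """
    config_name = f"{name}-config"
    # 1) The model ties the S3 artifact to the inference container image.
    model = {
        "ModelName": name,
        "PrimaryContainer": {"Image": image_uri, "ModelDataUrl": model_data_url},
        "ExecutionRoleArn": role_arn,
    }
    # 2) The endpoint configuration says how to host that model.
    endpoint_config = {
        "EndpointConfigName": config_name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": name,
            "InstanceType": instance_type,
            "InitialInstanceCount": instance_count,
        }],
    }
    # 3) The endpoint is the live service created from the configuration.
    endpoint = {"EndpointName": f"{name}-endpoint", "EndpointConfigName": config_name}
    return model, endpoint_config, endpoint
```

With a client from boto3.client("sagemaker"), each payload would be passed as keyword arguments: create_model(**model), then create_endpoint_config(**endpoint_config), then create_endpoint(**endpoint). Because the configuration only references the model by name, several configurations can point at the same model.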

the four options

II · The four endpoint types at a glance

SageMaker serves predictions four ways. Before the deep dive on each, here is the whole landscape on one screen — because the right choice falls out almost entirely from your traffic shape and latency requirement.

Two questions decide it. How predictable is your traffic? — steady, spiky, or none (offline). And how fast must each prediction return? — milliseconds, or is seconds-to-minutes fine? Map your answers onto the four types below and the choice is usually obvious. The cost consequence of getting it wrong is large: the persistent modes bill for uptime, the elastic modes bill for usage.

Real-time inference — A persistent endpoint on always-on instances behind an auto-scaling group and load balancer, returning predictions in milliseconds. For steady, latency-sensitive online traffic. Bills per instance-hour 24/7 whether or not requests arrive — the most expensive idle mode.
Serverless inference — SageMaker provisions and scales compute automatically per request and scales to zero when idle; you pay only for the compute consumed during inference plus request count. For spiky or intermittent online traffic. Trade-off: occasional cold-start latency on the first request after a quiet period.
Asynchronous inference — Requests are queued and processed in the background; results are written to S3 and you are notified when ready. For large payloads (big images/documents) or long-running inferences where a synchronous response is not required. Can scale to zero between bursts.
Batch transform — No persistent endpoint at all — a transient job points at a dataset in S3, spins up compute, scores every record, writes results back to S3, and tears the compute down. For offline, scheduled scoring of whole datasets. Usually the cheapest path for bulk scoring; nothing bills between runs.

SageMaker endpoint types · choosing the right one (2026)

Type	Traffic shape	Latency	Scales to zero?	Billing basis	Cold starts?	Typical use
Real-time	Steady, online	Milliseconds	No (always-on)	Per instance-hour, 24/7	No	Live API, fraud check
Serverless	Spiky / intermittent	Ms (cold-start risk)	Yes	Per inference compute + requests	Yes (occasional)	Bursty internal apps
Asynchronous	Large payloads, bursts	Seconds–minutes	Yes	Per instance-time while busy	Minimal	Big docs/images, long inferences
Batch transform	Offline, scheduled	N/A (not online)	N/A (transient job)	Per job instance-time	N/A	Nightly bulk scoring

Rule of thumb: steady online → real-time; spiky online → serverless; big or slow inferences → asynchronous; whole-dataset offline scoring → batch transform. The next two sections work through each in depth, with the cost mechanics and the gotchas.
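The rule of thumb can be written as a tiny decision function. This is only a sketch: the function and parameter names are hypothetical, and the precedence (offline first, then large or slow work, then traffic steadiness) is an assumption about how to order the questions rather than anything SageMaker enforces.

```python
def choose_endpoint_type(online, steady_traffic=False, large_or_slow=False):
    """Map the two questions (traffic shape, latency need) onto an endpoint type.

    The order of the checks is an illustrative assumption.
    """
    if not online:
        # Whole-dataset offline scoring needs no standing endpoint.
        return "batch transform"
    if large_or_slow:
        # Large payloads or long inferences: queue them, results land in S3.
        return "asynchronous"
    # Synchronous online traffic: steadiness decides between the two online modes.
    return "real-time" if steady_traffic else "serverless"
```

For example, choose_endpoint_type(online=True, steady_traffic=False) returns "serverless", which matches the advice to start serverless for new or low-traffic workloads.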

the online modes

III · Real-time and serverless — the two online modes, in depth

Real-time and serverless both serve synchronous online requests; the difference is what happens when traffic is low. Real-time keeps the lights on (and bills for it); serverless turns them off (and pays a cold-start tax to turn them back on).

These are the two modes you reach for when an application needs an answer back in the same request — a recommendation API, a fraud check at checkout, a classification in a user-facing flow. The choice between them is almost entirely about traffic predictability and your tolerance for occasional latency spikes.

Real-time inference

What it is: a persistent HTTPS endpoint backed by one or more always-on instances behind a load balancer, returning predictions in single- or double-digit milliseconds. The model is loaded into memory once and stays resident, so every request is fast — there is no per-request warm-up.

Cost shape: you pay the per-hour instance rate × instance count for as long as the endpoint exists, 24/7, whether it serves a million requests or zero. This is the defining property: real-time endpoints bill for uptime, not usage. An entry-GPU real-time endpoint left running continuously is roughly $1,000/month (representative 2026) regardless of traffic — which is why a forgotten test endpoint is the classic SageMaker cost mistake.

Autoscaling: attach a scaling policy (target-tracking on invocations-per-instance, or step/scheduled scaling) so SageMaker adds instances under load and removes them when traffic falls — down to a configured minimum instance count. Note the floor: standard real-time autoscaling does not scale to zero, so even the minimum (typically one instance) keeps billing. Scheduled scaling helps for predictable daily patterns (scale up for business hours, down overnight).

When to use it: steady, latency-sensitive online traffic where requests arrive consistently enough that the instances are not sitting idle, and where millisecond latency on every request matters. If traffic is steady and high, real-time is also the most cost-efficient online mode per request — the idle problem only bites when an always-on endpoint serves sporadic traffic.
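The target-tracking autoscaling described above can be sketched as two Application Auto Scaling requests. The function name, the default capacities, and the target of 100 invocations per instance are illustrative assumptions; the keys and the predefined metric name follow the boto3 application-autoscaling client's register_scalable_target and put_scaling_policy calls. The sketch rejects a minimum below one instance to mirror the floor noted above for standard real-time endpoints.

```python
def build_target_tracking_requests(endpoint_name, variant_name="AllTraffic",
                                   min_instances=1, max_instances=4,
                                   invocations_per_instance=100.0):
    """Build the scalable-target and scaling-policy payloads for one variant.

    Defaults are hypothetical; tune them from load tests.
    """
    if min_instances < 1:
        raise ValueError("this sketch keeps at least one always-on instance")
    if max_instances < min_instances:
        raise ValueError("max_instances must be >= min_instances")
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    # The scalable target sets the floor and ceiling for the instance count.
    target = {**common, "MinCapacity": min_instances, "MaxCapacity": max_instances}
    # The policy tracks average invocations per instance toward the target value.
    policy = {
        **common,
        "PolicyName": f"{endpoint_name}-invocations-target",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy
```

With a client from boto3.client("application-autoscaling"), the payloads would be passed to register_scalable_target(**target) and then put_scaling_policy(**policy).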

Serverless inference

What it is: a synchronous endpoint with no instances to manage — you configure a memory size and a max concurrency, and SageMaker provisions compute per request and scales it to zero when idle. You never pick an instance type; AWS handles the capacity.

Cost shape: you pay for the compute consumed during inference (memory-size × duration) plus the number of requests, and nothing while idle because it scales to zero. For an endpoint that is busy only a few hours a day, this is a fraction of what the equivalent always-on real-time endpoint would cost — the bill tracks usage, not uptime.

Cold starts: the trade-off. After a period of no traffic, the first request has to spin compute back up and load the model — adding latency (typically sub-second to a few seconds depending on model size) to that first call before steady-state latency resumes. Provisioned Concurrency mitigates this by keeping a configured number of instances warm and ready (at a cost), giving you serverless economics for the spiky bulk of traffic while removing cold-start latency for the baseline.

When to use it: spiky, intermittent, or unpredictable online traffic — internal tools, low-traffic features, dev/test endpoints, new products without established load — where an always-on endpoint would mostly sit idle and the occasional cold start is acceptable. It is also the lowest-risk default for a first deployment, precisely because a forgotten serverless endpoint costs nothing when no one is calling it.
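A serverless endpoint is configured through a ServerlessConfig block on the production variant instead of an instance type. The sketch below builds that variant; the function name and defaults are hypothetical, and the allowed memory sizes are the 1 GB steps documented at the time of writing, so check current limits before relying on them.

```python
# Memory sizes accepted by this sketch (assumed from the documented 1 GB steps).
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def build_serverless_variant(model_name, memory_mb=2048, max_concurrency=5,
                             provisioned_concurrency=None):
    """Build a production variant that uses serverless inference."""
    if memory_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(f"memory_mb must be one of {ALLOWED_MEMORY_MB}")
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    serverless = {"MemorySizeInMB": memory_mb, "MaxConcurrency": max_concurrency}
    if provisioned_concurrency is not None:
        # Warm capacity for the baseline; it cannot exceed the concurrency cap.
        if not 1 <= provisioned_concurrency <= max_concurrency:
            raise ValueError("provisioned_concurrency must be in 1..max_concurrency")
        serverless["ProvisionedConcurrency"] = provisioned_concurrency
    # Note: no InstanceType or InitialInstanceCount on a serverless variant.
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "ServerlessConfig": serverless,
    }
```

The returned dictionary would go in the ProductionVariants list of a create_endpoint_config request.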

the idle test

Ask: will this endpoint be busy most of the time, or sit idle between bursts? Busy and steady → real-time is efficient and avoids cold starts. Idle much of the time → serverless, because real-time would bill 24/7 for capacity you are not using. When in doubt for a new or low-traffic workload, start serverless: a forgotten serverless endpoint costs nothing; a forgotten real-time endpoint is ~$1,000/month.

the offline / heavy modes

IV · Asynchronous and batch transform — for large, slow, or offline work

Not every prediction needs to come back in the same request. When payloads are large, inferences are slow, or you are scoring a whole dataset on a schedule, asynchronous inference and batch transform are both cheaper and better-fit than forcing the work through an online endpoint.

The shared idea: decouple the request from the response. Asynchronous inference keeps an endpoint but processes a queue in the background; batch transform drops the persistent endpoint entirely and runs a transient scoring job. Both can sit at $0 when there is no work to do.

Asynchronous inference

What it is: you invoke the endpoint with a pointer to an input in S3; SageMaker queues the request, processes it in the background, writes the result to S3, and can notify you (via an SNS topic, if you configure one) when it is done. The caller does not block waiting for the answer.

Why it exists: two reasons. First, large payloads — async supports much larger request sizes than real-time (think large images, audio, or documents) because the data goes via S3 rather than in the request body. Second, long-running inferences — when a single prediction takes seconds or minutes (large generative models, heavy CV pipelines), a synchronous endpoint would time out or tie up a connection; async absorbs it gracefully.

Cost shape and scaling: billed per instance-time while the endpoint is processing the queue. Crucially, async endpoints can scale to zero when the queue is empty and scale back up when requests arrive — so a bursty workload (idle for hours, then a flood of large documents) bills only for the busy time, not the idle time. That makes it both cheaper and more appropriate than a real-time endpoint for spiky heavy work.

When to use it: large payloads, long inference times, or bursty heavy workloads where a synchronous response is not required and near-real-time (seconds to minutes) is acceptable — document processing pipelines, batch image analysis triggered by uploads, long generative jobs.
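The asynchronous flow can be sketched as an endpoint configuration with an AsyncInferenceConfig block plus an invocation that points at an input object in S3. The function names and the single-instance default are hypothetical; the keys follow the boto3 create_endpoint_config and sagemaker-runtime invoke_endpoint_async calls.

```python
def build_async_endpoint_config(config_name, model_name, instance_type, output_s3_uri,
                                success_topic_arn=None, error_topic_arn=None):
    """Build an endpoint configuration for asynchronous inference."""
    output = {"S3OutputPath": output_s3_uri}
    # SNS notification is optional; include only the topics that were supplied.
    notification = {}
    if success_topic_arn:
        notification["SuccessTopic"] = success_topic_arn
    if error_topic_arn:
        notification["ErrorTopic"] = error_topic_arn
    if notification:
        output["NotificationConfig"] = notification
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
        }],
        "AsyncInferenceConfig": {"OutputConfig": output},
    }


def build_async_invocation(endpoint_name, input_s3_uri, content_type="application/json"):
    """Keyword arguments for invoke_endpoint_async: the payload travels via S3."""
    return {
        "EndpointName": endpoint_name,
        "InputLocation": input_s3_uri,
        "ContentType": content_type,
    }
```

With a client from boto3.client("sagemaker-runtime"), invoke_endpoint_async(**kwargs) returns immediately, and the response carries an OutputLocation telling you where the result will be written.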

Batch transform

What it is: not an endpoint at all in the persistent sense. You point a batch transform job at a dataset in S3; SageMaker spins up compute, runs the model over every record, writes the predictions back to S3, and tears the compute down when finished. There is no standing service and nothing to invoke between runs.

Cost shape: billed per instance-second of the transient job — you pay only for the minutes the job runs, and nothing between runs. This makes it the cheapest way to score a large dataset offline by a wide margin: a one-hour nightly scoring job might be ~$40–$50/month (representative 2026) versus ~$1,000/month for an always-on real-time endpoint doing the same scoring continuously.

Scaling and throughput: you can parallelize across multiple instances and tune the records-per-mini-batch and concurrent-transforms settings to push throughput, which is how a very large dataset gets scored in a bounded window.

When to use it: offline, scheduled, whole-dataset scoring where there is no online-latency requirement — nightly churn or propensity scores for the entire user base, periodic re-scoring after a model update, one-off bulk inference over a backlog. If you do not need an answer right now, batch transform is almost always the cheapest correct choice.
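A batch transform job is described by a single request rather than a standing configuration. The sketch below builds one for line-delimited input; the function name, the default instance type, and the CSV content type are illustrative assumptions, while the keys follow the boto3 create_transform_job call.

```python
def build_transform_job(job_name, model_name, input_s3_uri, output_s3_uri,
                        instance_type="ml.m5.xlarge", instance_count=1,
                        max_concurrent_transforms=None, content_type="text/csv"):
    """Build a create_transform_job request for line-delimited records in S3."""
    request = {
        "TransformJobName": job_name,
        "ModelName": model_name,
        # MultiRecord packs several records into each mini-batch sent to the model.
        "BatchStrategy": "MultiRecord",
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "ContentType": content_type,
            "SplitType": "Line",
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri, "AssembleWith": "Line"},
        # More instances parallelize the job across the dataset.
        "TransformResources": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
        },
    }
    if max_concurrent_transforms is not None:
        request["MaxConcurrentTransforms"] = max_concurrent_transforms
    return request
```

With a client from boto3.client("sagemaker"), create_transform_job(**request) starts the job; the compute is released when the job finishes, so nothing needs deleting afterwards.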

async vs batch in one line

Asynchronous keeps an endpoint and processes a queue as requests trickle in (large/slow online-ish work, results in seconds-to-minutes). Batch transform has no endpoint and scores a whole dataset on demand or on a schedule (offline bulk work). Both bill only while working; both are far cheaper than forcing heavy or offline work through an always-on real-time endpoint.
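The gap between paying for uptime and paying for working time is simple arithmetic. The sketch below assumes a 30-day month and a hypothetical rate of $1.40 per instance-hour, chosen only so that the always-on figure lands near the ~$1,000/month quoted in this guide; it is not a published price.

```python
DAYS_PER_MONTH = 30  # simplifying assumption


def monthly_instance_cost(hourly_rate, billed_hours_per_day, instance_count=1,
                          days=DAYS_PER_MONTH):
    """Monthly cost when instances bill for a fixed number of hours each day."""
    if not 0 <= billed_hours_per_day <= 24:
        raise ValueError("billed_hours_per_day must be between 0 and 24")
    return hourly_rate * billed_hours_per_day * instance_count * days


ASSUMED_HOURLY_RATE = 1.40  # hypothetical USD per instance-hour

# Always-on real-time endpoint: bills all 24 hours, traffic or not.
always_on = monthly_instance_cost(ASSUMED_HOURLY_RATE, 24)
# One-hour nightly batch transform job: bills only while it runs.
nightly_job = monthly_instance_cost(ASSUMED_HOURLY_RATE, 1)

print(f"always-on: ${always_on:,.0f}/month, nightly job: ${nightly_job:,.0f}/month")
```

Under these assumptions the always-on endpoint costs 24 times as much as the nightly job for the same instance, which is the ratio behind the ~$1,000 versus ~$40 to $50 comparison above.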

consolidation

V · Multi-model and multi-container endpoints — hosting many models efficiently

If you have dozens or hundreds of models — one per customer, per region, per segment — giving each its own real-time endpoint is ruinously expensive and operationally heavy. SageMaker offers two patterns to pack many models behind a single endpoint and its instances.

The motivating problem is the idle-instance tax multiplied. A hundred models on a hundred real-time endpoints means a hundred sets of always-on instances, most of them mostly idle. Consolidation lets many models share the same compute, so you pay for one (autoscaled) fleet instead of a hundred near-idle ones. Two distinct mechanisms address it.

Multi-model endpoints (MME)

What it is: a single endpoint that can serve many models of the same framework from a shared fleet of instances. The models live in S3; SageMaker loads a given model into memory on demand when a request specifies it (via a TargetModel header), keeps recently-used models cached in memory, and evicts cold ones to make room. You can host thousands of models behind one endpoint this way.

Why it saves money: instead of N endpoints each with their own idle instances, you run one autoscaled fleet shared across all N models. For large numbers of models that are each invoked only occasionally — the classic "one model per customer" SaaS pattern — this collapses the cost dramatically.

The trade-off: a request for a model not currently in memory incurs a load latency (fetch from S3 + load) on that first call, similar in spirit to a cold start. Frequently-used models stay warm; rarely-used ones pay the load cost when they are called. MME fits best when many models share a framework and per-model traffic is light or bursty.
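In request terms, a multi-model endpoint differs from a single-model one in two places: the model object points at an S3 prefix in MultiModel mode, and each invocation names the artifact it wants through TargetModel. The sketch below shows both; the function names, the JSON content type, and the example artifact name are illustrative assumptions.

```python
import json


def build_mme_model(name, image_uri, s3_prefix, role_arn):
    """create_model payload whose container serves every artifact under s3_prefix."""
    return {
        "ModelName": name,
        "PrimaryContainer": {
            "Image": image_uri,
            "ModelDataUrl": s3_prefix,  # a prefix, not a single model.tar.gz
            "Mode": "MultiModel",
        },
        "ExecutionRoleArn": role_arn,
    }


def build_mme_invocation(endpoint_name, target_model, payload):
    """invoke_endpoint kwargs; target_model is the artifact path relative to the prefix."""
    return {
        "EndpointName": endpoint_name,
        "TargetModel": target_model,  # e.g. "customer-42.tar.gz" (hypothetical)
        "ContentType": "application/json",
        "Body": json.dumps(payload).encode("utf-8"),
    }
```

The first request for a given TargetModel is the one that pays the load latency described above; later requests for the same artifact are served from memory while it stays cached.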

Multi-container endpoints (MCE)

What it is: a single endpoint hosting up to fifteen distinct containers (different frameworks or different models) on shared infrastructure. You can invoke a specific container directly, or chain them in a serial inference pipeline where the output of one container feeds the next (e.g., a preprocessing container → a model container → a post-processing container).

When to use which: reach for multi-model endpoints when you have many models of the same framework and want to pack them efficiently behind one fleet. Reach for multi-container endpoints when you have a small number of different containers/frameworks to co-host, or when you need a serial inference pipeline. And for splitting traffic across versions of one model (canary/A-B rollout), use production variants on a standard endpoint, which let you weight traffic between two or more model variants on the same endpoint.
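A multi-container model is declared with a Containers list and an execution mode. The sketch below builds that payload; the function name and the generated hostnames are hypothetical, while the Serial and Direct modes and the fifteen-container cap follow the description above.

```python
MAX_CONTAINERS = 15


def build_multi_container_model(name, role_arn, image_uris, mode="Serial"):
    """create_model payload for a multi-container endpoint.

    mode="Serial" chains the containers as a pipeline; mode="Direct" lets a
    caller target one container per request.
    """
    if mode not in ("Serial", "Direct"):
        raise ValueError("mode must be 'Serial' or 'Direct'")
    if not 1 <= len(image_uris) <= MAX_CONTAINERS:
        raise ValueError(f"expected between 1 and {MAX_CONTAINERS} containers")
    containers = [
        # Hostnames are illustrative; direct invocation addresses a container by hostname.
        {"ContainerHostname": f"container-{index}", "Image": uri}
        for index, uri in enumerate(image_uris, start=1)
    ]
    return {
        "ModelName": name,
        "ExecutionRoleArn": role_arn,
        "Containers": containers,
        "InferenceExecutionConfig": {"Mode": mode},
    }
```

In Direct mode, an invoke_endpoint call would select a container with the TargetContainerHostname parameter; in Serial mode the request simply enters the first container and the response comes from the last.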

when consolidation pays off

The win scales with the number of models and how idle each one is. One model, steady traffic → a plain single-model endpoint. Many same-framework models, light/bursty each → multi-model endpoint (share one fleet, accept occasional load latency). A few different frameworks, or a serial pipeline → multi-container endpoint. Two versions of one model for a safe rollout → production variants with weighted traffic.
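For the production-variant case, each variant's share of traffic is its weight divided by the sum of all weights. The sketch below computes those shares and builds the weighted variants; the function names, variant names, and default instance type are illustrative assumptions.

```python
def traffic_shares(weights):
    """Fraction of requests each variant receives, given its relative weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one variant needs a positive weight")
    return {name: weight / total for name, weight in weights.items()}


def build_weighted_variants(model_by_variant, weights, instance_type="ml.m5.large"):
    """ProductionVariants entries for a canary or A/B rollout on one endpoint."""
    return [
        {
            "VariantName": variant,
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "InitialVariantWeight": float(weights[variant]),
        }
        for variant, model_name in model_by_variant.items()
    ]


# A 90/10 canary: weights are relative, so 9 and 1 work as well as 0.9 and 0.1.
shares = traffic_shares({"current": 9, "canary": 1})
```

Shifting traffic gradually then amounts to raising the canary's weight in steps and watching the variant's metrics before each increase.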

how to deploy

VI · Deploying a model to an endpoint, step by step

The path from a trained artifact to a live endpoint is short and well-trodden. Here is the realistic sequence, framework-agnostic, with the decision points called out.

Whether you do this from the SageMaker Python SDK, the AWS SDK (boto3), the console, or a Pipeline deploy step, the same logical steps happen under the hood. Knowing them makes the SDK one-liners legible rather than magic.

1 · Have a model artifact and an inference container — You need the trained model packaged in S3 (typically a model.tar.gz) and a container image that knows how to load and serve it — a SageMaker-provided framework container (PyTorch, TensorFlow, XGBoost, Hugging Face) or your own custom container implementing the inference handler. JumpStart models come with this packaging done for you.
2 · Create the SageMaker model object — Register a model in SageMaker that points at the artifact in S3 and the container image, with the IAM execution role that can read the artifact. This is the reusable definition that endpoint configurations reference.
3 · Choose the endpoint type and create the endpoint configuration — The pivotal decision (sections II–IV): real-time, serverless, asynchronous, or batch transform. For real-time/async you also pick the instance type and count and the autoscaling policy; for serverless you set memory size and max concurrency; for batch transform you configure the job rather than a standing config. This is where almost all the cost-and-latency trade-offs are made.
4 · Create the endpoint (or launch the batch job) — For the three persistent/queued modes, create the endpoint from the configuration — SageMaker provisions the compute, pulls the container, loads the model, and exposes a live HTTPS endpoint (a few minutes for deployment). For batch transform, you instead launch a transform job that runs and exits.
5 · Invoke it — Call the endpoint with InvokeEndpoint (real-time/serverless) or InvokeEndpointAsync (asynchronous), sending your input payload and receiving predictions (or, for async, a pointer to the S3 result). Your application now has a model behind an API.
6 · Attach autoscaling and monitoring — For real-time/async, register a scaling policy so the fleet tracks load. Turn on Model Monitor to watch for data and quality drift, and CloudWatch metrics/alarms for latency, invocation count, and errors. This is what turns a deployed model into an operated one.
7 · Update safely, and clean up — Roll out new model versions behind production variants (canary/blue-green) so you can shift traffic gradually and roll back if metrics regress. And — the cost-critical step — delete endpoints you are no longer using, because the persistent modes keep billing until you do.
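Steps 5 and 7 above, invoking and cleaning up, can be sketched without tying the code to live credentials by passing the client in. The function names are hypothetical, and the sketch assumes the container accepts and returns JSON; the method names and parameters follow the boto3 sagemaker-runtime and sagemaker clients.

```python
import json


def build_invocation(endpoint_name, payload):
    """Keyword arguments for sagemaker-runtime invoke_endpoint (JSON in, JSON out assumed)."""
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Body": json.dumps(payload).encode("utf-8"),
    }


def parse_prediction(response):
    """invoke_endpoint returns the prediction as a readable stream under 'Body'."""
    return json.loads(response["Body"].read())


def teardown_plan(endpoint_name, config_name, model_name):
    """Ordered delete calls: the endpoint first, because it is what bills."""
    return [
        ("delete_endpoint", {"EndpointName": endpoint_name}),
        ("delete_endpoint_config", {"EndpointConfigName": config_name}),
        ("delete_model", {"ModelName": model_name}),
    ]


def run_plan(client, plan):
    """Apply each (method name, kwargs) step to a SageMaker client."""
    for method_name, kwargs in plan:
        getattr(client, method_name)(**kwargs)
```

In use, the invocation kwargs go to boto3.client("sagemaker-runtime").invoke_endpoint(**kwargs), and run_plan(boto3.client("sagemaker"), teardown_plan(...)) removes all three objects once the endpoint is no longer needed.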

the SDK shortcut

In the SageMaker Python SDK most of this collapses into a model.deploy(...) call, where the arguments you pass (instance type and count, or a serverless/async config) are the endpoint-type decision from step 3. The brevity is convenient — just remember that deploy() with an instance type creates an always-on real-time endpoint that bills until you call delete_endpoint().

optimization

VII · Cost optimization for endpoints, ranked

Serving is where most production inference cost lives, and there is a fairly stable hierarchy of what actually moves the bill. Work down this list in order — the top items dwarf the bottom ones.

In rough order of impact for a typical team running models in production:

1 · Kill idle real-time endpoints — The highest-leverage and most-ignored lever. An always-on endpoint left up after an experiment bills 24/7 — roughly $1,000/month for one entry-GPU instance doing nothing. Delete test endpoints the moment you are done and audit monthly for "zombie" endpoints. This single fix usually accounts for the largest share of a wasteful serving bill.
2 · Match the endpoint type to the traffic — Serverless for spiky online, batch transform for offline scoring, async for large/slow inferences, real-time only for steady latency-sensitive traffic. The same model can differ by 10–20× in monthly cost purely on this choice — it is the biggest structural lever in serving.
3 · Consolidate many models with multi-model endpoints — If you run dozens or hundreds of models each at light traffic, packing them behind a multi-model (or multi-container) endpoint replaces N idle fleets with one shared autoscaled fleet — a large saving for the one-model-per-customer pattern.
4 · Right-size the instance — Do not host on a GPU what runs fine on CPU; do not provision a large multi-GPU box for a model that fits on a small one. Profile the workload and pick the smallest instance that meets the latency and throughput target. SageMaker Inference Recommender can suggest instance types and report the latency/cost trade-off.
5 · Use autoscaling instead of provisioning for peak — Where you must run real-time, configure target-tracking or scheduled autoscaling so you run the minimum instances off-peak and add capacity only under load — rather than paying for peak capacity 24/7.
6 · Consider AWS Inferentia for high-volume inference — For steady high-volume serving, AWS Inferentia (the inf2 instances, via the Neuron SDK) is positioned as cheaper per inference than equivalent NVIDIA GPU instances. The migration effort is real, but at high volume the per-inference savings compound.
7 · Buy a Savings Plan for the steady baseline — Once your always-on inference is predictable, commit that reliable baseline to a 1- or 3-year SageMaker Savings Plan for a discounted rate (it covers real-time inference usage). Size to baseline, keep variable load on serverless/on-demand.
8 · Fund it with AWS credits — The lever that takes the serving bill to $0 for eligible teams. Credits apply to all of the above usage and stack on top of Savings Plans and Inferentia savings, so disciplined optimization makes them last far longer. Covered in the sample and CTA below.
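The first lever, finding idle endpoints, lends itself to a small audit script. The sketch below flags endpoints that are in service, older than a grace period, and have served no requests. It assumes endpoint records shaped like the entries returned by the SageMaker list_endpoints call and a precomputed mapping of endpoint name to recent invocation count (for example, summed from the CloudWatch Invocations metric in the AWS/SageMaker namespace); the function name and the seven-day grace period are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_idle_endpoints(endpoints, invocations_by_name, now, min_age_days=7):
    """Return names of in-service endpoints older than min_age_days with zero invocations."""
    cutoff = now - timedelta(days=min_age_days)
    idle = []
    for endpoint in endpoints:
        in_service = endpoint["EndpointStatus"] == "InService"
        old_enough = endpoint["CreationTime"] <= cutoff
        # Endpoints missing from the mapping are treated as having no traffic.
        unused = invocations_by_name.get(endpoint["EndpointName"], 0) == 0
        if in_service and old_enough and unused:
            idle.append(endpoint["EndpointName"])
    return sorted(idle)


# Illustrative data only.
now = datetime(2026, 1, 31, tzinfo=timezone.utc)
endpoints = [
    {"EndpointName": "prod-api", "EndpointStatus": "InService",
     "CreationTime": datetime(2025, 6, 1, tzinfo=timezone.utc)},
    {"EndpointName": "old-test", "EndpointStatus": "InService",
     "CreationTime": datetime(2025, 12, 1, tzinfo=timezone.utc)},
    {"EndpointName": "new-test", "EndpointStatus": "InService",
     "CreationTime": datetime(2026, 1, 30, tzinfo=timezone.utc)},
]
zombies = find_idle_endpoints(endpoints, {"prod-api": 120000}, now)
```

The grace period keeps freshly created endpoints off the list; anything the audit does flag is a candidate for deletion after a human check, not an automatic delete.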

the two-line summary

Hosting cost is driven by which endpoint type far more than which model, and by how much idle capacity you are paying for. Match the type to the traffic, kill idle endpoints, consolidate where you run many models — do those three and you have optimized most of the serving bill. Credits then cover what remains.

the alternative

VIII · SageMaker endpoints vs Bedrock managed inference

A fair question before you stand up an endpoint at all: should you be hosting a model yourself, or calling one through Amazon Bedrock? The two solve different problems, and for a large share of generative-AI workloads Bedrock is the shorter path.

Bedrock managed inference is "AI as an API call." You call an existing foundation model — Anthropic's Claude, Meta's Llama, Amazon's Nova and Titan, Mistral, Cohere, and others — through one consistent API, paying per token, with no endpoint to create, no instance to choose, and nothing to scale or keep warm. There is no idle cost because there is no standing infrastructure: you pay for the tokens you send and receive, full stop. Bedrock handles capacity, scaling, and availability invisibly.

A SageMaker endpoint is "host the model yourself." You deploy a specific model artifact onto instances you choose, you pick the endpoint type and scaling, and you pay for compute (per instance-hour for real-time, per inference for serverless, and so on). That gives you full control — any model including your own from-scratch or deeply fine-tuned ones, any framework, specific instance types, custom containers, AWS silicon — at the cost of owning the operational and idle-capacity decisions Bedrock hides.

The deciding question is the same one that decides SageMaker-vs-Bedrock generally: does a foundation model that already does what you need exist on Bedrock? If yes — a chat assistant, a summarizer, a RAG system, a coding helper over an existing model — Bedrock is usually cheaper to start, cheaper at moderate volume, and has zero idle cost, so there is little reason to host it yourself. If no — you have a proprietary model, a classical-ML workload (tabular fraud, churn, forecasting, recommendation) that is not a foundation model at all, a need to fine-tune weights deeply, or strict control requirements over the serving environment — a SageMaker endpoint is the right tool.

There is also a genuine middle ground. You can deploy open foundation models from SageMaker JumpStart to your own endpoint when you want full control over an open-weights model rather than calling it through Bedrock, for example to pin a specific version, run on specific hardware, or keep inference inside a tightly controlled boundary. And at very high, steady inference volume, a right-sized SageMaker endpoint (especially on Inferentia) can be cheaper per inference than per-token pricing, so some teams serve their highest-volume model on SageMaker and everything else through Bedrock. The dedicated Bedrock vs SageMaker page works through the full comparison; the one-line summary below condenses the inference-specific decision.
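The "cheaper at very high volume" claim is a break-even calculation, sketched here under loud assumptions: both prices are hypothetical inputs (not quoted AWS rates), input and output tokens are blended into one price, and the endpoint is assumed to have enough throughput to actually serve the volume.

```python
def breakeven_tokens_per_month(endpoint_monthly_cost, price_per_1k_tokens):
    """Monthly token volume at which a fixed-cost endpoint matches per-token pricing."""
    if price_per_1k_tokens <= 0:
        raise ValueError("price_per_1k_tokens must be positive")
    return endpoint_monthly_cost / price_per_1k_tokens * 1000


def cheaper_option(monthly_tokens, endpoint_monthly_cost, price_per_1k_tokens):
    """Compare a flat endpoint bill against a per-token bill for the same volume."""
    per_token_bill = monthly_tokens / 1000 * price_per_1k_tokens
    if endpoint_monthly_cost < per_token_bill:
        return "sagemaker endpoint"
    return "per-token api"


# Hypothetical inputs: a $1,000/month endpoint vs $0.002 per 1K blended tokens.
print(f"{breakeven_tokens_per_month(1000, 0.002):,.0f} tokens/month to break even")
print(cheaper_option(1_000_000_000, 1000, 0.002))
print(cheaper_option(100_000_000, 1000, 0.002))
```

Below the break-even volume the per-token path wins and carries no idle cost; above it, the fixed endpoint starts to pay for itself, provided one instance can really sustain that load.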

the inference decision in one line

Calling an existing foundation model → Bedrock managed inference (per-token, zero idle cost, no infrastructure). Serving your own / classical / deeply-fine-tuned model, or needing control over the serving environment → a SageMaker endpoint (per-compute, full control). Very-high-volume custom serving can favour a right-sized SageMaker endpoint on AWS silicon.
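The same decision can be written as a small rule function. This is only a restatement of the prose above as code: the function name, the flag names, and the return strings are all hypothetical, and a real decision would also weigh volume and pricing.

```python
def choose_inference_path(
    model_available_on_bedrock,
    own_or_deeply_finetuned_model=False,
    classical_ml=False,
    needs_serving_control=False,
):
    """Return "bedrock" or "sagemaker-endpoint" following the rule of thumb above."""
    # Anything Bedrock cannot host, or that needs a controlled serving
    # environment, goes to a self-hosted endpoint.
    if own_or_deeply_finetuned_model or classical_ml or needs_serving_control:
        return "sagemaker-endpoint"
    # Otherwise, an existing foundation model is the shorter path.
    if model_available_on_bedrock:
        return "bedrock"
    return "sagemaker-endpoint"


print(choose_inference_path(True))                              # chat assistant on an existing model
print(choose_inference_path(False, classical_ml=True))          # tabular fraud model
print(choose_inference_path(True, needs_serving_control=True))  # open model, pinned hardware
```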

the decision that matters most

The four endpoint types — same model, four ways to serve it

The endpoint type is the single biggest cost lever in SageMaker serving. Here is how the same model lands on the bill and the latency profile under each type — illustrative monthly figures using a representative 2026 entry-GPU rate. Verify live rates on the AWS pricing page.

Endpoint type	Billing basis	Bills when idle?	Cold starts?	Latency	Illustrative monthly cost*	Best when
Real-time	Per instance-hour, 24/7	Yes — always-on	No	Milliseconds	~$1,000 (1× entry GPU, continuous)	Steady, latency-sensitive online traffic
Serverless	Per inference compute + requests	No — scales to zero	Yes (occasional)	Ms (cold-start risk)	A fraction of real-time if busy part of the day	Spiky / intermittent online traffic
Asynchronous	Per instance-time while busy	No — scales to zero	Minimal	Seconds–minutes	Proportional to busy time + queue	Large payloads, long-running inferences
Batch transform	Per job instance-time	No — transient job	N/A	N/A (offline)	~$40–$50 (1 hr/night offline scoring)	Offline, scheduled, whole-dataset scoring

*Illustrative, representative 2026 figures for a single entry-GPU-class workload, for relative reasoning only, not budgeting. On these figures the same model differs by roughly 20× or more in monthly cost between always-on real-time and nightly batch transform, purely on endpoint type. Confirm current rates on the AWS SageMaker pricing page.
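The relative arithmetic behind the table can be reproduced with one multiplication per row. The hourly rate below is a hypothetical entry-GPU figure chosen to land near the table's numbers, not a quoted AWS price, and the six busy hours for the asynchronous case are likewise an assumption.

```python
HOURLY_RATE = 1.40  # hypothetical entry-GPU $/hour, NOT a quoted AWS price


def monthly_cost(hourly_rate, billed_hours_per_day, days=30):
    """Instance cost for one month, given how many hours per day are billed."""
    return hourly_rate * billed_hours_per_day * days


real_time = monthly_cost(HOURLY_RATE, 24)  # always on, billed around the clock
async_busy = monthly_cost(HOURLY_RATE, 6)  # assumes 6 busy hours/day, scaled to zero otherwise
batch = monthly_cost(HOURLY_RATE, 1)       # one hour of offline scoring per night

for name, cost in [("real-time", real_time), ("asynchronous", async_busy), ("batch transform", batch)]:
    print(f"{name:<16} ${cost:>8,.0f}/month  (real-time costs {real_time / cost:.0f}x this)")
```

The model and the instance never change between rows; only the number of billed hours does, which is why endpoint type dominates the bill.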

endpoints bill 24/7 — credits can cover them

Fund your SageMaker endpoints with AWS credits — pay $0

Get matched in 24h →

a recent match

A serving bill re-architected and taken to $0 — anonymized

inquiry · Series-A personalization SaaS, UK

Series-A personalization SaaS, 20 people, serving a recommendation model per customer on SageMaker

Situation: They had stood up one always-on real-time endpoint per enterprise customer — dozens of them, most serving light, bursty traffic — and the SageMaker hosting line had climbed past ~$12K/month, almost all of it idle-instance cost. They also ran a nightly full-catalogue re-scoring job on yet another real-time endpoint. The serving bill was the fastest-growing AWS line and the runway math did not support it to the next milestone.

What CloudRoute did: Routed within 22 hours to a UK partner with an ML / SageMaker cost-optimization track record. The partner consolidated the per-customer models behind a multi-model endpoint on one autoscaled fleet, moved the nightly catalogue scoring off its real-time endpoint onto batch transform, and shifted the lightest-traffic customers to serverless inference — then, in parallel, filed an Activate Portfolio credit application plus a GenAI PoC application for the inference workload.

Outcome: Consolidation plus the endpoint-type changes cut the serving run-rate from ~$12K to ~$4K/month (multi-model endpoint + batch transform for offline scoring + serverless for the long tail). Credits approved within 16 days then covered the remaining bill — taking effective serving cost to ~$0 through the credit window. CloudRoute's commission was paid by the partner from AWS engagement funding; the customer paid $0.

serving bill cut: ~67% before credits · then to ~$0 on credits · matched in: < 24h · cost to customer: $0

faq

Common questions

What is a SageMaker endpoint?

A SageMaker endpoint is the managed, hosted service that serves predictions from a deployed model — the inference side of the ML lifecycle. You deploy a trained model artifact (from S3) with an inference container, and SageMaker exposes a live HTTPS service your application invokes to get predictions, handling the container, the compute, and (for the persistent modes) the autoscaling and load balancing. There are four endpoint types — real-time, serverless, asynchronous, and batch transform — and choosing among them is the single biggest cost-and-latency decision in serving.

What are the four types of SageMaker endpoints?

Real-time (a persistent, always-on endpoint with millisecond latency, for steady online traffic; bills 24/7 whether or not requests arrive); serverless (scales to zero and bills per inference, for spiky or intermittent traffic, with occasional cold starts); asynchronous (queues requests and writes results to S3, for large payloads or long-running inferences, can scale to zero); and batch transform (a transient job that scores a whole dataset in S3 with no persistent endpoint, for offline bulk scoring). The right one falls out almost entirely from your traffic shape and latency requirement.

What is the difference between real-time and serverless inference on SageMaker?

Both serve synchronous online requests; the difference is what happens when traffic is low. A real-time endpoint keeps instances always-on, so every request is fast (no cold start) but you pay per instance-hour 24/7 whether or not requests arrive. Serverless inference scales to zero when idle and bills only for the compute consumed during inference — far cheaper for spiky or intermittent traffic — but the first request after a quiet period pays a cold-start latency. Use real-time for steady, latency-sensitive traffic; use serverless for bursty or low-traffic workloads (and as a low-risk default that costs nothing when idle).

What is the cheapest way to host a model on SageMaker?

It depends on the traffic. For offline, whole-dataset scoring, batch transform is usually cheapest — a transient job, nothing bills between runs. For spiky or intermittent online traffic, serverless inference is typically cheapest because it scales to zero. For large payloads or long inferences, asynchronous inference can scale to zero between bursts. Real-time endpoints are the most expensive when idle and only make sense for steady, latency-sensitive traffic. If you run many models each at light traffic, a multi-model endpoint consolidates them onto one shared fleet for a further large saving. Matching the endpoint type to the traffic is the single biggest cost lever in SageMaker serving.
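As a rough sketch, the answer above maps onto a short chain of checks. The function and flag names are hypothetical and the ordering is one reasonable reading of the paragraph, not an official decision tree.

```python
def choose_endpoint_type(
    offline=False,
    large_payload_or_long_running=False,
    many_light_models=False,
    steady_latency_sensitive=False,
):
    """Pick the usually-cheapest hosting option for a described traffic shape."""
    if offline:
        return "batch transform"        # transient job, nothing bills between runs
    if large_payload_or_long_running:
        return "asynchronous"           # queued, can scale to zero between bursts
    if many_light_models:
        return "multi-model endpoint"   # one shared fleet instead of N idle endpoints
    if steady_latency_sensitive:
        return "real-time"              # always on, so only worth it when always used
    return "serverless"                 # spiky or intermittent online traffic


print(choose_endpoint_type(offline=True))
print(choose_endpoint_type(steady_latency_sensitive=True))
print(choose_endpoint_type())
```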

What is a multi-model endpoint on SageMaker?

A multi-model endpoint (MME) is a single endpoint that serves many models of the same framework from a shared fleet of instances. The models live in S3; SageMaker loads a requested model into memory on demand (via a TargetModel header), caches recently-used ones, and evicts cold ones to make room — letting you host thousands of models behind one endpoint. It replaces N idle endpoints with one autoscaled shared fleet, which dramatically cuts cost for the one-model-per-customer pattern. The trade-off is a load latency on the first request for a model not currently in memory. For a small number of different frameworks, or a serial inference pipeline, use a multi-container endpoint instead.
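Invoking a specific model behind a multi-model endpoint looks roughly like the sketch below. It uses the `TargetModel` parameter of the SageMaker Runtime `invoke_endpoint` call; the helper name, the endpoint name, and the artifact name are hypothetical, and it assumes the inference container accepts and returns JSON.

```python
import json


def invoke_customer_model(runtime_client, endpoint_name, model_artifact, features):
    """Call one model hosted behind a multi-model endpoint.

    runtime_client: a boto3 "sagemaker-runtime" client (or a compatible stub).
    model_artifact: the artifact's path relative to the endpoint's S3 model
        prefix, for example "customer-a.tar.gz" (hypothetical name).
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        TargetModel=model_artifact,   # selects which model to load and run
        ContentType="application/json",
        Body=json.dumps(features),
    )
    # The response body is a stream; this assumes the container replies with JSON.
    return json.loads(response["Body"].read())


# Usage (assumes boto3 is installed and the endpoint exists):
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   scores = invoke_customer_model(runtime, "recs-mme", "customer-a.tar.gz", {"user_id": 42})
```

The first call for a model that is not in memory pays the load latency described above; later calls for the same model are served from the cache until it is evicted.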

How do I deploy a model to a SageMaker endpoint?

The path is: (1) have a trained model artifact in S3 plus an inference container; (2) create a SageMaker model object pointing at the artifact and container with an IAM role; (3) choose the endpoint type and create an endpoint configuration (instance type/count and autoscaling for real-time/async, memory and concurrency for serverless); (4) create the endpoint (or launch a batch transform job); (5) invoke it with InvokeEndpoint or InvokeEndpointAsync; (6) attach autoscaling and Model Monitor; (7) roll out updates behind production variants and delete endpoints you no longer use. In the SageMaker Python SDK most of this collapses into a model.deploy(...) call — just remember that deploying with an instance type creates an always-on real-time endpoint that bills until you delete it.
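Steps (2) to (4) correspond to three low-level API calls, and the sketch below builds their request payloads without calling AWS. The field names follow the boto3 `create_model`, `create_endpoint_config`, and `create_endpoint` operations; the helper name, the `-config` suffix, the `AllTraffic` variant name, and the default serverless sizes are illustrative assumptions.

```python
def build_deploy_requests(
    name,
    image_uri,
    model_data_url,
    role_arn,
    instance_type=None,
    serverless_memory_mb=2048,
    serverless_max_concurrency=5,
):
    """Return the three request payloads, keyed by boto3 operation name, in call order.

    Pass instance_type for an always-on real-time endpoint; leave it as None
    for a serverless endpoint that scales to zero.
    """
    variant = {"VariantName": "AllTraffic", "ModelName": name}
    if instance_type:
        # Real-time: this variant bills per instance-hour until the endpoint is deleted.
        variant["InstanceType"] = instance_type
        variant["InitialInstanceCount"] = 1
    else:
        variant["ServerlessConfig"] = {
            "MemorySizeInMB": serverless_memory_mb,
            "MaxConcurrency": serverless_max_concurrency,
        }
    config_name = f"{name}-config"
    return {
        "create_model": {
            "ModelName": name,
            "PrimaryContainer": {"Image": image_uri, "ModelDataUrl": model_data_url},
            "ExecutionRoleArn": role_arn,
        },
        "create_endpoint_config": {
            "EndpointConfigName": config_name,
            "ProductionVariants": [variant],
        },
        "create_endpoint": {"EndpointName": name, "EndpointConfigName": config_name},
    }


# Usage (assumes boto3 is installed; all argument values are placeholders):
#   import boto3
#   sm = boto3.client("sagemaker")
#   for operation, kwargs in build_deploy_requests(...).items():
#       getattr(sm, operation)(**kwargs)
```

Making the real-time versus serverless choice an explicit argument, instead of a default, is deliberate: it forces the billing decision to be made at deploy time.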

Why is my SageMaker endpoint bill so high?

The overwhelmingly common cause is an always-on real-time endpoint left running — those bill per instance-hour 24/7 whether or not they serve traffic, so a forgotten test endpoint on an entry GPU is roughly $1,000/month of pure waste. Other frequent causes: defaulting to real-time when serverless or batch transform would fit the traffic, hosting on a larger instance than the model needs, and running one endpoint per model when a multi-model endpoint would consolidate them. Audit for idle endpoints first, then match the endpoint type to the traffic shape — together those two fixes usually account for the largest share of a surprise serving bill.
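The "audit for idle endpoints first" step can be sketched as below. It lists in-service endpoints and sums the CloudWatch `Invocations` metric per variant over a lookback window. Both clients are passed in, the helper name and the seven-day default are assumptions, and the result is a list of candidates to review, not a list to delete blindly (a serverless endpoint with no traffic will also appear, although it is not billing).

```python
from datetime import datetime, timedelta, timezone


def find_idle_endpoints(sm_client, cw_client, lookback_days=7):
    """Return names of InService endpoints with zero invocations in the window.

    sm_client: a boto3 "sagemaker" client; cw_client: a boto3 "cloudwatch" client.
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=lookback_days)
    idle = []
    for page in sm_client.get_paginator("list_endpoints").paginate():
        for endpoint in page["Endpoints"]:
            if endpoint["EndpointStatus"] != "InService":
                continue
            name = endpoint["EndpointName"]
            variants = sm_client.describe_endpoint(EndpointName=name)["ProductionVariants"]
            total = 0.0
            for variant in variants:
                stats = cw_client.get_metric_statistics(
                    Namespace="AWS/SageMaker",
                    MetricName="Invocations",
                    Dimensions=[
                        {"Name": "EndpointName", "Value": name},
                        {"Name": "VariantName", "Value": variant["VariantName"]},
                    ],
                    StartTime=start,
                    EndTime=end,
                    Period=86400,  # one datapoint per day
                    Statistics=["Sum"],
                )
                total += sum(point["Sum"] for point in stats["Datapoints"])
            if total == 0:
                idle.append(name)
    return idle


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   print(find_idle_endpoints(boto3.client("sagemaker"), boto3.client("cloudwatch")))
```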

Should I use a SageMaker endpoint or Amazon Bedrock for inference?

If you want to call an existing foundation model (Claude, Llama, Nova, Mistral, and others) — a chat assistant, summarizer, RAG system, or coding helper — Amazon Bedrock managed inference is usually the better path: you pay per token, there is no endpoint or instance to manage, and there is zero idle cost. A SageMaker endpoint is the right tool when you need to serve your own model, run classical/tabular ML that is not a foundation model, fine-tune weights deeply, or control the serving environment (specific instances, custom containers, AWS silicon). Some teams do both — serve their highest-volume custom model on a right-sized SageMaker endpoint (often on Inferentia) and everything else through Bedrock.

Can AWS credits cover SageMaker endpoint hosting?

Yes. AWS credits apply to SageMaker inference compute (endpoints of every type), storage, and features just like any other AWS service, auto-applying to your monthly bill until exhausted. Eligible programs include Activate Portfolio (up to ~$100K), Bedrock/GenAI PoC funding ($10K–$50K), and the Generative AI Accelerator (up to $1M). Credits also stack on top of Savings Plans and AWS-silicon savings, so disciplined cost management makes them last longer. CloudRoute routes you to a vetted AWS partner who both deploys the model and files the credit application; the customer pays $0 because AWS funds the pool and the partner pays CloudRoute a routing commission.

Deploy it on SageMaker — funded by AWS credits

CloudRoute connects ML and data-science teams with vetted AWS partners who deploy and optimize SageMaker endpoints and file the credit applications that fund hosting. Customer pays $0 — AWS funds it.

Get matched in 24h →

→ see the data-AI persona detail

matched within: < 24h · credit ceiling: up to $1M · cost to you: $0