Once a model is trained, you have to serve it — and on SageMaker that means choosing among four endpoint types: real-time, serverless, asynchronous, and batch transform. Each has a different cost shape, latency profile, cold-start behaviour, and autoscaling model, and picking the wrong one is the most common way to overspend. This guide covers all four in depth, plus multi-model and multi-container endpoints, the step-by-step deploy flow, how SageMaker hosting compares to Bedrock managed inference, and how to cut the bill — including taking it to $0 with AWS credits.
A SageMaker endpoint is the managed, hosted service that takes a trained model and serves predictions from it — the inference side of the ML lifecycle, where a model artifact becomes a thing your application can actually call.
The cleanest one-line definition: a SageMaker endpoint is a managed inference service that loads your trained model, runs it on AWS-managed compute, and returns predictions in response to requests — with SageMaker handling the container, the scaling, and (for the persistent modes) the load balancing. Training produces a model artifact in S3; an endpoint is how that artifact gets served. The two are separate, separately-billed steps: a training job spikes and disappears, while an endpoint persists for as long as you keep it up.
Mechanically, deploying to an endpoint involves three SageMaker objects. A model ties together the artifact in S3 and the inference container image that knows how to load and run it. An endpoint configuration specifies how to host that model: the instance type and initial count, the endpoint type (including any serverless or asynchronous settings), and, for advanced setups, the traffic split across production variants. An endpoint is the live, named HTTPS service that applications invoke; you call it and SageMaker routes the request to the model behind it. Autoscaling policies are attached separately, to the variants of the live endpoint. Every later deployment decision is a setting somewhere on this model → config → endpoint chain.
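The model → config → endpoint chain can be sketched as three request payloads sent in order. This is a minimal sketch, not a definitive deployment script: the resource names, image URI, S3 path, and role ARN are hypothetical placeholders, and `sm_client` is assumed to be a boto3 SageMaker client (`boto3.client("sagemaker")`) created by the caller.

```python
def build_deploy_requests(name, image_uri, model_data_url, role_arn,
                          instance_type="ml.m5.large", instance_count=1):
    """Return the keyword arguments for the three SageMaker API calls, in order."""
    # 1. Model: artifact in S3 plus the inference container that loads it.
    model = {
        "ModelName": f"{name}-model",
        "PrimaryContainer": {"Image": image_uri, "ModelDataUrl": model_data_url},
        "ExecutionRoleArn": role_arn,
    }
    # 2. Endpoint configuration: how the model is hosted (instance type and count).
    endpoint_config = {
        "EndpointConfigName": f"{name}-config",
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": model["ModelName"],
            "InstanceType": instance_type,
            "InitialInstanceCount": instance_count,
        }],
    }
    # 3. Endpoint: the live HTTPS service, pointing at the configuration.
    endpoint = {
        "EndpointName": name,
        "EndpointConfigName": endpoint_config["EndpointConfigName"],
    }
    return model, endpoint_config, endpoint


def deploy(sm_client, requests):
    """Send the three requests with a boto3 SageMaker client supplied by the caller."""
    model, endpoint_config, endpoint = requests
    sm_client.create_model(**model)
    sm_client.create_endpoint_config(**endpoint_config)
    sm_client.create_endpoint(**endpoint)
```

Keeping the payloads separate from the calls makes it easy to see that swapping the endpoint type only changes the second payload; the model and the endpoint requests stay the same.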
The single most consequential choice in all of this is the endpoint type, set in the endpoint configuration. SageMaker offers four, and they are genuinely different services under one name, with different billing bases, latency characteristics, and scaling behaviour. The bulk of this guide (sections II–IV) walks through each one, because choosing the right type for your traffic is the highest-leverage cost and performance decision you make on the serving side.
A note on what an endpoint is not: it is not a training resource (that is a training job), and it is not where you call Amazon's foundation models (that is Bedrock — covered in section VIII). A SageMaker endpoint serves your model — one you trained, fine-tuned, or pulled from JumpStart — on infrastructure you choose. If you only want to call an existing foundation model through a managed API, you do not need a SageMaker endpoint at all.
Three objects, in order: a model (artifact in S3 + inference container) → an endpoint configuration (instance type/count, endpoint type, variants) → a live endpoint (the HTTPS service your app invokes), with autoscaling policies attached to the endpoint's variants once it is live. The same model can be hosted by many different endpoint configurations; the config is where the cost-and-latency decisions live.
SageMaker serves predictions four ways. Before the deep dive on each, here is the whole landscape on one screen — because the right choice falls out almost entirely from your traffic shape and latency requirement.
Two questions decide it. First, how predictable is your traffic: steady, spiky, or none (offline)? Second, how fast must each prediction return: in milliseconds, or is seconds-to-minutes fine? Map your answers onto the four types below and the choice is usually obvious. The cost consequence of getting it wrong is large: real-time bills for uptime, while serverless, asynchronous, and batch transform bill for usage.
| Type | Traffic shape | Latency | Scales to zero? | Billing basis | Cold starts? | Typical use |
|---|---|---|---|---|---|---|
| Real-time | Steady, online | Milliseconds | No (always-on) | Per instance-hour, 24/7 | No | Live API, fraud check |
| Serverless | Spiky / intermittent | Ms (cold-start risk) | Yes | Per inference compute time + data processed | Yes (occasional) | Bursty internal apps |
| Asynchronous | Large payloads, bursts | Seconds–minutes | Yes | Per instance-time while busy | Minimal | Big docs/images, long inferences |
| Batch transform | Offline, scheduled | N/A (not online) | N/A (transient job) | Per job instance-time | N/A | Nightly bulk scoring |
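
The two questions and the table above can be folded into a small decision helper. This is only an illustrative sketch of the mapping described here: the function name, its arguments, and the category strings are hypothetical, and real workloads will have edge cases the rule does not cover.

```python
def choose_endpoint_type(traffic, needs_sync_response=True, large_or_slow=False):
    """Map traffic shape and latency need onto one of the four endpoint types.

    traffic: "steady", "spiky", or "offline" (no online traffic at all).
    needs_sync_response: the caller needs the answer in the same request.
    large_or_slow: payloads are large or a single inference takes seconds or more.
    """
    if traffic not in ("steady", "spiky", "offline"):
        raise ValueError(f"unknown traffic shape: {traffic!r}")
    if traffic == "offline":
        return "batch transform"   # whole-dataset scoring, no standing service
    if large_or_slow or not needs_sync_response:
        return "asynchronous"      # queue-backed, results land in S3
    # Synchronous online traffic: the split is about idle time.
    return "real-time" if traffic == "steady" else "serverless"
```

For example, `choose_endpoint_type("spiky")` returns `"serverless"`, while the same traffic with `large_or_slow=True` returns `"asynchronous"`.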
Real-time and serverless both serve synchronous online requests; the difference is what happens when traffic is low. Real-time keeps the lights on (and bills for it); serverless turns them off (and pays a cold-start tax to turn them back on).
These are the two modes you reach for when an application needs an answer back in the same request — a recommendation API, a fraud check at checkout, a classification in a user-facing flow. The choice between them is almost entirely about traffic predictability and your tolerance for occasional latency spikes.
What it is: a persistent HTTPS endpoint backed by one or more always-on instances behind a load balancer, returning predictions in single- or double-digit milliseconds. The model is loaded into memory once and stays resident, so every request is fast — there is no per-request warm-up.
Cost shape: you pay the per-hour instance rate × instance count for as long as the endpoint exists, 24/7, whether it serves a million requests or zero. This is the defining property: real-time endpoints bill for uptime, not usage. An entry-GPU real-time endpoint left running continuously is roughly $1,000/month (representative 2026) regardless of traffic — which is why a forgotten test endpoint is the classic SageMaker cost mistake.
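The uptime-based bill is simple arithmetic. The sketch below assumes an hourly rate of $1.40 for an entry-GPU instance, which is a hypothetical figure chosen only to land near the ~$1,000/month quoted above; check the live pricing page for real rates.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8,760 / 12)

def realtime_monthly_cost(hourly_rate, instance_count=1, hours=HOURS_PER_MONTH):
    """Always-on cost: rate x instances x hours up. Traffic does not appear in the formula."""
    return hourly_rate * instance_count * hours

ASSUMED_ENTRY_GPU_RATE = 1.40  # USD per instance-hour, hypothetical
monthly = realtime_monthly_cost(ASSUMED_ENTRY_GPU_RATE)
```

With the assumed rate, one instance comes to roughly $1,022 for the month, and two instances to roughly double that. The request count never enters the calculation, which is the point of the paragraph above.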
Autoscaling: attach a scaling policy (target-tracking on invocations-per-instance, or step/scheduled scaling) so SageMaker adds instances under load and removes them when traffic falls — down to a configured minimum instance count. Note the floor: standard real-time autoscaling does not scale to zero, so even the minimum (typically one instance) keeps billing. Scheduled scaling helps for predictable daily patterns (scale up for business hours, down overnight).
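A target-tracking policy on invocations-per-instance can be sketched as two Application Auto Scaling requests: one registering the variant as a scalable target, one attaching the policy. The endpoint name, capacity limits, target value, and cooldowns below are illustrative assumptions, not recommendations; the payloads would be passed to `register_scalable_target` and `put_scaling_policy` on a `boto3.client("application-autoscaling")` client.

```python
def build_autoscaling_requests(endpoint_name, variant_name="AllTraffic",
                               min_instances=1, max_instances=4,
                               target_invocations_per_instance=100.0):
    """Return (register_scalable_target kwargs, put_scaling_policy kwargs)."""
    if min_instances < 1:
        # Mirrors the floor described above for standard real-time endpoints.
        raise ValueError("standard real-time autoscaling needs at least one instance")
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = {**common, "MinCapacity": min_instances, "MaxCapacity": max_instances}
    policy = {
        **common,
        "PolicyName": f"{endpoint_name}-invocations-target",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            "ScaleInCooldown": 300,   # seconds to wait before removing instances
            "ScaleOutCooldown": 60,   # seconds to wait before adding more
        },
    }
    return target, policy
```

The `MinCapacity` value is the billing floor: whatever the policy does, that many instances keep running and keep billing.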
When to use it: steady, latency-sensitive online traffic where requests arrive consistently enough that the instances are not sitting idle, and where millisecond latency on every request matters. If traffic is steady and high, real-time is also the most cost-efficient online mode per request — the idle problem only bites when an always-on endpoint serves sporadic traffic.
What it is: a synchronous endpoint with no instances to manage — you configure a memory size and a max concurrency, and SageMaker provisions compute per request and scales it to zero when idle. You never pick an instance type; AWS handles the capacity.
Cost shape: you pay for the compute consumed during inference (billed by duration, at a rate set by the configured memory size) plus the data processed, and nothing while idle because it scales to zero. For an endpoint that is busy only a few hours a day, this is a fraction of what the equivalent always-on real-time endpoint would cost: the bill tracks usage, not uptime.
Cold starts: the trade-off. After a period of no traffic, the first request has to spin compute back up and load the model — adding latency (typically sub-second to a few seconds depending on model size) to that first call before steady-state latency resumes. Provisioned Concurrency mitigates this by keeping a configured number of instances warm and ready (at a cost), giving you serverless economics for the spiky bulk of traffic while removing cold-start latency for the baseline.
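In API terms, a serverless endpoint is a production variant that carries a serverless block instead of an instance type. The sketch below builds that variant, with optional provisioned concurrency; the default values are arbitrary examples, and the list of allowed memory sizes reflects the documented 1 GB to 6 GB steps as I understand them, so verify it against the current API reference.

```python
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)

def build_serverless_variant(model_name, memory_mb=2048, max_concurrency=5,
                             provisioned_concurrency=None):
    """Return a ProductionVariants entry for a serverless endpoint configuration."""
    if memory_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(f"memory_mb must be one of {ALLOWED_MEMORY_MB}")
    serverless = {"MemorySizeInMB": memory_mb, "MaxConcurrency": max_concurrency}
    if provisioned_concurrency is not None:
        if provisioned_concurrency > max_concurrency:
            raise ValueError("provisioned concurrency cannot exceed max concurrency")
        # Warm capacity: removes cold starts for this baseline, at a cost.
        serverless["ProvisionedConcurrency"] = provisioned_concurrency
    # Note: no InstanceType or InitialInstanceCount on a serverless variant.
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "ServerlessConfig": serverless,
    }
```

The returned dictionary would go in the `ProductionVariants` list of a `create_endpoint_config` call. Leaving `provisioned_concurrency` unset gives the pure scale-to-zero behaviour, cold starts included.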
When to use it: spiky, intermittent, or unpredictable online traffic — internal tools, low-traffic features, dev/test endpoints, new products without established load — where an always-on endpoint would mostly sit idle and the occasional cold start is acceptable. It is also the lowest-risk default for a first deployment, precisely because a forgotten serverless endpoint costs nothing when no one is calling it.
Ask: will this endpoint be busy most of the time, or sit idle between bursts? Busy and steady → real-time is efficient and avoids cold starts. Idle much of the time → serverless, because real-time would bill 24/7 for capacity you are not using. When in doubt for a new or low-traffic workload, start serverless: a forgotten serverless endpoint costs nothing; a forgotten real-time endpoint is ~$1,000/month.
Not every prediction needs to come back in the same request. When payloads are large, inferences are slow, or you are scoring a whole dataset on a schedule, asynchronous inference and batch transform are both cheaper and better-fit than forcing the work through an online endpoint.
The shared idea: decouple the request from the response. Asynchronous inference keeps an endpoint but processes a queue in the background; batch transform drops the persistent endpoint entirely and runs a transient scoring job. Both can sit at $0 when there is no work to do.
What it is: you invoke the endpoint with a pointer to an input in S3; SageMaker queues the request, processes it in the background, writes the result to S3, and, if you have configured a notification topic, notifies you via SNS when it is done. The caller does not block waiting for the answer.
Why it exists: two reasons. First, large payloads — async supports much larger request sizes than real-time (think large images, audio, or documents) because the data goes via S3 rather than in the request body. Second, long-running inferences — when a single prediction takes seconds or minutes (large generative models, heavy CV pipelines), a synchronous endpoint would time out or tie up a connection; async absorbs it gracefully.
Cost shape and scaling: billed per instance-time while the endpoint is processing the queue. Crucially, async endpoints can scale to zero when the queue is empty and scale back up when requests arrive — so a bursty workload (idle for hours, then a flood of large documents) bills only for the busy time, not the idle time. That makes it both cheaper and more appropriate than a real-time endpoint for spiky heavy work.
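The asynchronous flow shows up in two places: an extra block on the endpoint configuration saying where results and notifications go, and an invocation that passes an S3 location rather than a request body. The sketch below builds both payloads; the S3 URIs, topic ARNs, and concurrency value are hypothetical placeholders.

```python
def build_async_inference_config(output_s3_uri, success_topic_arn=None,
                                 error_topic_arn=None, max_concurrent_per_instance=4):
    """Return the AsyncInferenceConfig block for create_endpoint_config."""
    output = {"S3OutputPath": output_s3_uri}
    notification = {}
    if success_topic_arn:
        notification["SuccessTopic"] = success_topic_arn
    if error_topic_arn:
        notification["ErrorTopic"] = error_topic_arn
    if notification:
        output["NotificationConfig"] = notification  # SNS topics are optional
    return {
        "OutputConfig": output,
        "ClientConfig": {"MaxConcurrentInvocationsPerInstance": max_concurrent_per_instance},
    }


def build_async_invocation(endpoint_name, input_s3_uri, content_type="application/json"):
    """Return kwargs for invoke_endpoint_async: the payload is an S3 pointer, not a body."""
    return {
        "EndpointName": endpoint_name,
        "InputLocation": input_s3_uri,
        "ContentType": content_type,
    }
```

The invocation payload would be passed to `invoke_endpoint_async` on a `boto3.client("sagemaker-runtime")` client, which returns immediately with the location where the result will be written. Scale-to-zero is typically set up separately, by registering the variant with Application Auto Scaling with a minimum capacity of 0.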
When to use it: large payloads, long inference times, or bursty heavy workloads where a synchronous response is not required and near-real-time (seconds to minutes) is acceptable — document processing pipelines, batch image analysis triggered by uploads, long generative jobs.
What it is: not an endpoint at all in the persistent sense. You point a batch transform job at a dataset in S3; SageMaker spins up compute, runs the model over every record, writes the predictions back to S3, and tears the compute down when finished. There is no standing service and nothing to invoke between runs.
Cost shape: billed per instance-second of the transient job, so you pay only for the minutes the job runs and nothing between runs. This makes it the cheapest way to score a large dataset offline by a wide margin: a one-hour nightly scoring job might be ~$40–$50/month (representative 2026) versus ~$1,000/month for a real-time endpoint kept running around the clock to handle the same nightly scoring.
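The gap between the two figures is just hours billed. As a worked sketch, assume the same hypothetical $1.40 per instance-hour rate for both options (an illustrative number, not a quoted price):

```python
def batch_monthly_cost(hourly_rate, hours_per_run, runs_per_month=30, instance_count=1):
    """Transient-job cost: only the hours the job actually runs are billed."""
    return hourly_rate * hours_per_run * runs_per_month * instance_count

ASSUMED_RATE = 1.40                                     # USD per instance-hour, hypothetical
nightly_batch = batch_monthly_cost(ASSUMED_RATE, hours_per_run=1)   # 30 billed hours
always_on = ASSUMED_RATE * 730                          # 730 billed hours in an average month
ratio = always_on / nightly_batch
```

Under these assumptions the nightly job costs about $42 a month against about $1,022 for the always-on endpoint, a ratio of roughly 24 to 1, which is simply 730 billed hours versus 30.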
Scaling and throughput: you can parallelize across multiple instances and tune the records-per-mini-batch and concurrent-transforms settings to push throughput, which is how a very large dataset gets scored in a bounded window.
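Those throughput knobs map onto fields of the transform job request. The sketch below builds a `create_transform_job` payload for line-delimited CSV input; the job name, S3 URIs, instance type, and tuning values are hypothetical and would need adjusting to the model container and data format.

```python
def build_transform_job(job_name, model_name, input_s3_uri, output_s3_uri,
                        instance_type="ml.m5.xlarge", instance_count=2,
                        max_concurrent_transforms=4, max_payload_mb=6):
    """Return kwargs for create_transform_job on a boto3 SageMaker client."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        # MultiRecord packs as many records as fit under MaxPayloadInMB into each mini-batch.
        "BatchStrategy": "MultiRecord",
        "MaxPayloadInMB": max_payload_mb,
        # Parallel requests sent to each instance's container.
        "MaxConcurrentTransforms": max_concurrent_transforms,
        "TransformInput": {
            "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}},
            "ContentType": "text/csv",
            "SplitType": "Line",       # one record per line
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri, "AssembleWith": "Line"},
        # Instances are created for the job and torn down when it finishes.
        "TransformResources": {"InstanceType": instance_type, "InstanceCount": instance_count},
    }
```

Raising `instance_count` shortens the wall-clock window at the same total instance-time, provided the input is split across enough S3 objects to keep every instance busy.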
When to use it: offline, scheduled, whole-dataset scoring where there is no online-latency requirement — nightly churn or propensity scores for the entire user base, periodic re-scoring after a model update, one-off bulk inference over a backlog. If you do not need an answer right now, batch transform is almost always the cheapest correct choice.
Asynchronous keeps an endpoint and processes a queue as requests trickle in (large/slow online-ish work, results in seconds-to-minutes). Batch transform has no endpoint and scores a whole dataset on demand or on a schedule (offline bulk work). Both bill only while working; both are far cheaper than forcing heavy or offline work through an always-on real-time endpoint.
If you have dozens or hundreds of models — one per customer, per region, per segment — giving each its own real-time endpoint is ruinously expensive and operationally heavy. SageMaker offers two patterns to pack many models behind a single endpoint and its instances.
The motivating problem is the idle-instance tax multiplied. A hundred models on a hundred real-time endpoints means a hundred sets of always-on instances, most of them mostly idle. Consolidation lets many models share the same compute, so you pay for one (autoscaled) fleet instead of a hundred near-idle ones. Two distinct mechanisms address it.
What it is: a single endpoint that can serve many models of the same framework from a shared fleet of instances. The models live in S3; SageMaker loads a given model into memory on demand when a request specifies it (via a TargetModel header), keeps recently-used models cached in memory, and evicts cold ones to make room. You can host thousands of models behind one endpoint this way.
Why it saves money: instead of N endpoints each with their own idle instances, you run one autoscaled fleet shared across all N models. For large numbers of models that are each invoked only occasionally — the classic "one model per customer" SaaS pattern — this collapses the cost dramatically.
The trade-off: a request for a model not currently in memory incurs a load latency (fetch from S3 + load) on that first call, similar in spirit to a cold start. Frequently-used models stay warm; rarely-used ones pay the load cost when they are called. MME fits best when many models share a framework and per-model traffic is light or bursty.
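In API terms, a multi-model endpoint differs from a single-model one in two places: the container definition points at an S3 prefix in multi-model mode, and each invocation names the artifact to use. A minimal sketch follows, with a hypothetical image URI, S3 prefix, and per-customer artifact naming scheme:

```python
def build_multi_model_container(image_uri, models_s3_prefix):
    """Container definition for create_model: ModelDataUrl is a prefix holding many artifacts."""
    return {"Image": image_uri, "ModelDataUrl": models_s3_prefix, "Mode": "MultiModel"}


def build_mme_invocation(endpoint_name, target_model, body,
                         content_type="application/json"):
    """Kwargs for invoke_endpoint; TargetModel is the artifact path relative to the prefix."""
    return {
        "EndpointName": endpoint_name,
        "TargetModel": target_model,
        "ContentType": content_type,
        "Body": body,
    }


# Hypothetical one-model-per-customer layout under s3://example-bucket/models/
request = build_mme_invocation("shared-endpoint", "customer-42.tar.gz", b'{"x": [1, 2, 3]}')
```

The first call for `customer-42.tar.gz` would pay the load latency described above; later calls are served from memory until the model is evicted.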
What it is: a single endpoint hosting up to fifteen distinct containers (different frameworks or different models) on shared infrastructure. You can invoke a specific container directly, or chain them in a serial inference pipeline where the output of one container feeds the next (e.g., a preprocessing container → a model container → a post-processing container).
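A multi-container model is declared by listing the containers and choosing an execution mode. The sketch below builds that `create_model` payload; the hostnames, image URIs, and artifact paths are hypothetical, and the fifteen-container limit is taken from the description above.

```python
MAX_CONTAINERS = 15  # limit stated in the text above

def build_multi_container_model(model_name, role_arn, containers, mode="Direct"):
    """Return kwargs for create_model.

    containers: list of (hostname, image_uri, model_data_url) tuples.
    mode: "Direct" to invoke one container by name, "Serial" to chain them in order.
    """
    if mode not in ("Direct", "Serial"):
        raise ValueError("mode must be 'Direct' or 'Serial'")
    if not 1 <= len(containers) <= MAX_CONTAINERS:
        raise ValueError(f"expected between 1 and {MAX_CONTAINERS} containers")
    return {
        "ModelName": model_name,
        "ExecutionRoleArn": role_arn,
        "Containers": [
            {"ContainerHostname": host, "Image": image, "ModelDataUrl": data}
            for host, image, data in containers
        ],
        "InferenceExecutionConfig": {"Mode": mode},
    }
```

In direct mode, a request picks its container by passing `TargetContainerHostname` to `invoke_endpoint`; in serial mode the containers run in list order and the caller sees only the final output.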
When to use which: reach for multi-model endpoints when you have many models of the same framework and want to pack them efficiently behind one fleet. Reach for multi-container endpoints when you have a small number of different containers/frameworks to co-host, or when you need a serial inference pipeline. And for splitting traffic across versions of one model (canary/A-B rollout), use production variants on a standard endpoint, which let you weight traffic between two or more model variants on the same endpoint.
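A weighted rollout with production variants can be sketched as two payloads: the initial variant list with a small share for the candidate, and a later weight update that shifts traffic without redeploying. Variant names, the 10% starting share, and the instance settings are illustrative assumptions.

```python
def build_canary_variants(current_model, candidate_model, candidate_share=0.1,
                          instance_type="ml.m5.large", instance_count=1):
    """ProductionVariants list for create_endpoint_config; weights are relative shares."""
    if not 0.0 <= candidate_share <= 1.0:
        raise ValueError("candidate_share must be between 0 and 1")

    def variant(name, model, weight):
        return {
            "VariantName": name,
            "ModelName": model,
            "InstanceType": instance_type,
            "InitialInstanceCount": instance_count,
            "InitialVariantWeight": weight,
        }

    return [
        variant("current", current_model, 1.0 - candidate_share),
        variant("candidate", candidate_model, candidate_share),
    ]


def build_weight_update(endpoint_name, weights):
    """Kwargs for update_endpoint_weights_and_capacities; weights maps variant name to weight."""
    return {
        "EndpointName": endpoint_name,
        "DesiredWeightsAndCapacities": [
            {"VariantName": name, "DesiredWeight": weight}
            for name, weight in weights.items()
        ],
    }
```

A rollout would typically start at something like 90/10, watch the candidate's metrics, then send `build_weight_update("demo", {"current": 0.0, "candidate": 1.0})` once the candidate is trusted. Note that each variant runs on its own instances, so both bill while the rollout is in progress.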
The win scales with the number of models and how idle each one is. One model, steady traffic → a plain single-model endpoint. Many same-framework models, light/bursty each → multi-model endpoint (share one fleet, accept occasional load latency). A few different frameworks, or a serial pipeline → multi-container endpoint. Two versions of one model for a safe rollout → production variants with weighted traffic.
The path from a trained artifact to a live endpoint is short and well-trodden. Here is the realistic sequence, framework-agnostic, with the decision points called out.
Whether you do this from the SageMaker Python SDK, the AWS SDK (boto3), the console, or a Pipeline deploy step, the same logical steps happen under the hood. Knowing them makes the SDK one-liners legible rather than magic.
In the SageMaker Python SDK most of this collapses into a model.deploy(...) call, where the arguments you pass (instance type and count, or a serverless/async config) are the endpoint-type decision from step 3. The brevity is convenient — just remember that deploy() with an instance type creates an always-on real-time endpoint that bills until you call delete_endpoint().
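One way to guard against the forgotten-endpoint problem in experiments is to tie the endpoint's lifetime to a code block. The sketch below is a hypothetical helper, not part of the SDK: it assumes `model` is a SageMaker Python SDK model whose `deploy()` returns a predictor (as the framework model classes do, or any model with a predictor class set), and that the predictor exposes `delete_endpoint()`.

```python
from contextlib import contextmanager

@contextmanager
def temporary_endpoint(model, instance_type="ml.m5.large", instance_count=1):
    """Deploy a real-time endpoint and always delete it on exit, even if the body raises."""
    # Passing an instance type is what makes this an always-on, billed-by-uptime endpoint.
    predictor = model.deploy(
        initial_instance_count=instance_count,
        instance_type=instance_type,
    )
    try:
        yield predictor
    finally:
        # Billing stops only once the endpoint is deleted.
        predictor.delete_endpoint()
```

Used as `with temporary_endpoint(model) as predictor: ...`, the endpoint exists only for the duration of the block. This suits smoke tests and notebooks; a production endpoint is of course meant to outlive the script that created it.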
Serving is where most production inference cost lives, and there is a fairly stable hierarchy of what actually moves the bill. Work down this list in order — the top items dwarf the bottom ones.
In rough order of impact for a typical team running models in production:
Hosting cost is driven by which endpoint type far more than which model, and by how much idle capacity you are paying for. Match the type to the traffic, kill idle endpoints, consolidate where you run many models — do those three and you have optimized most of the serving bill. Credits then cover what remains.
A fair question before you stand up an endpoint at all: should you be hosting a model yourself, or calling one through Amazon Bedrock? The two solve different problems, and for a large share of generative-AI workloads Bedrock is the shorter path.
Bedrock managed inference is "AI as an API call." You call an existing foundation model — Anthropic's Claude, Meta's Llama, Amazon's Nova and Titan, Mistral, Cohere, and others — through one consistent API, paying per token, with no endpoint to create, no instance to choose, and nothing to scale or keep warm. There is no idle cost because there is no standing infrastructure: you pay for the tokens you send and receive, full stop. Bedrock handles capacity, scaling, and availability invisibly.
A SageMaker endpoint is "host the model yourself." You deploy a specific model artifact onto instances you choose, you pick the endpoint type and scaling, and you pay for compute (per instance-hour for real-time, per inference for serverless, and so on). That gives you full control — any model including your own from-scratch or deeply fine-tuned ones, any framework, specific instance types, custom containers, AWS silicon — at the cost of owning the operational and idle-capacity decisions Bedrock hides.
The deciding question is the same one that decides SageMaker-vs-Bedrock generally: does a foundation model that already does what you need exist on Bedrock? If yes — a chat assistant, a summarizer, a RAG system, a coding helper over an existing model — Bedrock is usually cheaper to start, cheaper at moderate volume, and has zero idle cost, so there is little reason to host it yourself. If no — you have a proprietary model, a classical-ML workload (tabular fraud, churn, forecasting, recommendation) that is not a foundation model at all, a need to fine-tune weights deeply, or strict control requirements over the serving environment — a SageMaker endpoint is the right tool.
There is also a genuine middle ground. You can deploy open foundation models from SageMaker JumpStart to your own endpoint when you want full control over an open-weights model rather than calling it through Bedrock, for example to pin a specific version, run on specific hardware, or keep inference inside a tightly-controlled boundary. And at very high, steady inference volume, a right-sized SageMaker endpoint (especially on Inferentia) can be cheaper per inference than per-token pricing, so some teams serve their highest-volume model on SageMaker and everything else through Bedrock. The dedicated Bedrock vs SageMaker page works through the full comparison; the table below summarizes the inference-specific decision.
Calling an existing foundation model → Bedrock managed inference (per-token, zero idle cost, no infrastructure). Serving your own / classical / deeply-fine-tuned model, or needing control over the serving environment → a SageMaker endpoint (per-compute, full control). Very-high-volume custom serving can favour a right-sized SageMaker endpoint on AWS silicon.
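The high-volume crossover is a break-even between a fixed monthly cost and a per-token price. The sketch below computes it; both input figures are hypothetical placeholders rather than quoted prices, and the calculation assumes the endpoint can actually sustain the resulting throughput on the instances paid for.

```python
def breakeven_tokens_per_month(endpoint_monthly_cost, price_per_1k_tokens):
    """Monthly token volume above which a fixed-cost endpoint beats per-token pricing."""
    if price_per_1k_tokens <= 0:
        raise ValueError("price_per_1k_tokens must be positive")
    return endpoint_monthly_cost / price_per_1k_tokens * 1000

# Hypothetical inputs: a ~$1,000/month endpoint versus $0.002 per 1,000 tokens.
threshold = breakeven_tokens_per_month(1000.0, 0.002)
```

With these made-up numbers the threshold is about 500 million tokens a month. Below it, per-token pricing is cheaper because there is no idle capacity to pay for; above it, the fixed-cost endpoint starts to win, provided one endpoint can keep up.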
The endpoint type is the single biggest cost lever in SageMaker serving. Here is how the same model lands on the bill and the latency profile under each type — illustrative monthly figures using a representative 2026 entry-GPU rate. Verify live rates on the AWS pricing page.
| Endpoint type | Billing basis | Bills when idle? | Cold starts? | Latency | Illustrative monthly cost* | Best when |
|---|---|---|---|---|---|---|
| Real-time | Per instance-hour, 24/7 | Yes — always-on | No | Milliseconds | ~$1,000 (1× entry GPU, continuous) | Steady, latency-sensitive online traffic |
| Serverless | Per inference compute time + data processed | No — scales to zero | Yes (occasional) | Ms (cold-start risk) | A fraction of real-time if busy part of the day | Spiky / intermittent online traffic |
| Asynchronous | Per instance-time while busy | No — scales to zero | Minimal | Seconds–minutes | Proportional to busy time + queue | Large payloads, long-running inferences |
| Batch transform | Per job instance-time | No — transient job | N/A | N/A (offline) | ~$40–$50 (1 hr/night offline scoring) | Offline, scheduled, whole-dataset scoring |
Situation: They had stood up one always-on real-time endpoint per enterprise customer — dozens of them, most serving light, bursty traffic — and the SageMaker hosting line had climbed past ~$12K/month, almost all of it idle-instance cost. They also ran a nightly full-catalogue re-scoring job on yet another real-time endpoint. The serving bill was the fastest-growing AWS line and the runway math did not support it to the next milestone.
What CloudRoute did: Routed within 22 hours to a UK partner with an ML / SageMaker cost-optimization track record. The partner consolidated the per-customer models behind a multi-model endpoint on one autoscaled fleet, moved the nightly catalogue scoring off its real-time endpoint onto batch transform, and shifted the lightest-traffic customers to serverless inference — then, in parallel, filed an Activate Portfolio credit application plus a GenAI PoC application for the inference workload.
Outcome: Consolidation plus the endpoint-type changes cut the serving run-rate from ~$12K to ~$4K/month (multi-model endpoint + batch transform for offline scoring + serverless for the long tail). Credits approved within 16 days then covered the remaining bill — taking effective serving cost to ~$0 through the credit window. CloudRoute's commission was paid by the partner from AWS engagement funding; the customer paid $0.
serving bill cut: ~65% before credits · then to ~$0 on credits · matched in: < 24h · cost to customer: $0
CloudRoute connects ML and data-science teams with vetted AWS partners who deploy and optimize SageMaker endpoints and file the credit applications that fund hosting. Customer pays $0 — AWS funds it.