SageMaker inference bills are dominated by one number: instance-hours that ran whether or not a request arrived. The biggest savings almost never come from a cheaper instance — they come from choosing the right endpoint type, sizing it honestly, letting it scale to zero, and packing more models onto fewer GPUs. This guide walks every lever — real-time vs serverless vs async vs batch transform, right-sizing, autoscaling and scale-to-zero, multi-model and multi-container endpoints, Inferentia and Graviton, Savings Plans, the idle-endpoint trap — each with the underlying mechanism, the typical savings, and where it ranks. We close with the honest build-vs-buy line: when to stay on SageMaker and when to move to Bedrock managed instead.
Before any optimization, you have to understand what SageMaker charges for. Almost every surprising inference bill traces back to a single fact: real-time endpoints bill per provisioned instance-hour, not per request. The instance is the meter, and it runs whether traffic is zero or saturated.
A SageMaker real-time endpoint is a managed, always-on fleet of EC2 instances behind a load balancer and an HTTPS invoke API. You choose an instance type and an initial instance count, and from the moment the endpoint reaches InService until the moment you delete it, you pay the per-hour rate multiplied by the instance count — continuously, by the second, with no free tier and no automatic shutdown. There is no "pause." An endpoint that served ten requests last week and zero this week costs exactly the same as one under load.
The dollar figures are not small. As a 2026 on-demand reference in us-east-1: a CPU ml.m5.xlarge runs about $0.23/hr (~$168/month if left on), a single-GPU ml.g5.xlarge about $1.408/hr (~$1,028/month), the inference-optimized ml.inf2.xlarge (Inferentia2) about $0.99/hr (~$725/month), and a multi-GPU ml.g5.12xlarge about $7.09/hr (~$5,175/month). Rates vary by region and move over time — always confirm against the live pricing page — but the shape is what matters: a single idle GPU endpoint is a four-figure monthly line item, and most fleets have several.
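The monthly figures above follow from a simple multiplication, sketched here in Python. It assumes an average month of 730 hours and uses the reference rates quoted in this section; the helper name `monthly_cost` is illustrative, and the rates should be replaced with current values from the pricing page.

```python
HOURS_PER_MONTH = 730  # average month: 8,760 hours / 12

# On-demand reference rates quoted in the text (us-east-1); confirm before relying on them.
HOURLY_RATES = {
    "ml.m5.xlarge": 0.23,
    "ml.g5.xlarge": 1.408,
    "ml.inf2.xlarge": 0.99,
    "ml.g5.12xlarge": 7.09,
}


def monthly_cost(hourly_rate: float, instance_count: int = 1) -> float:
    """Cost of leaving an endpoint InService for a full average month."""
    return hourly_rate * instance_count * HOURS_PER_MONTH


if __name__ == "__main__":
    for instance_type, rate in HOURLY_RATES.items():
        print(f"{instance_type:>16}: ${monthly_cost(rate):,.0f}/month")
```

Note that traffic never appears in the formula: instance count and hours are the only inputs, which is the point of this section.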
On top of instance-hours, SageMaker bills a few smaller things that occasionally matter: provisioned storage attached to the endpoint (per GB-month), data processing, and inter-AZ transfer. But there is no separate per-invocation charge on real-time endpoints — which is exactly why a busy endpoint and an idle one cost the same. The corollary is the whole game: your bill is set by how many instance-hours you provision, not by how many predictions you serve.
This reframes the entire optimization problem. The question is never "how do I make each prediction cheaper" — it is "how do I stop paying for instance-hours that did not serve traffic." Every lever in this guide is a different answer to that question: a cheaper meter (Inferentia, Graviton), a meter that turns off when idle (serverless, scale-to-zero, batch), fewer meters running the same work (multi-model endpoints), or a discounted rate on the meters you genuinely keep busy (Savings Plans).
A SageMaker real-time endpoint bills per provisioned instance-hour, 24/7, regardless of traffic. Every optimization is a variation on the same move: reduce paid-for instance-hours that did not serve a request — or get a discount on the ones that did.
The highest-ROI action in SageMaker FinOps is not an architecture change. It is finding the endpoints that are running and serving nobody, and deleting them. This is free, takes an afternoon, and on most accounts removes the largest single source of waste.
Idle endpoints accumulate for entirely human reasons. A data scientist deploys a model to test a notebook and never deletes it. A demo endpoint goes up for a customer call and stays up for a year. A staging endpoint mirrors production but receives a request a day. A "temporary" A/B variant is left at one instance long after the test ended. None of these throw an error. None of them page anyone. They simply bill, quietly, every hour, forever.
The diagnostic is straightforward. For each endpoint, pull the CloudWatch Invocations metric over the last 7–14 days. Any endpoint with near-zero invocations is either dead (delete it) or genuinely low-traffic (a candidate for serverless or scale-to-zero, covered next). Cross-reference with the InvocationsPerInstance and CPU/GPU utilization metrics: an endpoint sitting at single-digit percent utilization is paying for capacity it is not using even when it does get traffic.
A practical sweep, in order: (1) list every endpoint in every region — waste hides in regions nobody checks; (2) tag each with an owner and an environment; (3) delete anything with zero invocations and no owner; (4) move low-but-nonzero traffic endpoints to serverless or scale-to-zero; (5) put a guardrail in place so the trap does not refill — a scheduled Lambda that flags any endpoint below an invocations threshold, or simply a policy that dev/staging endpoints auto-delete nightly and redeploy on demand.
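A sweep like this could be scripted roughly as follows. This is a minimal sketch, not a finished tool: it assumes boto3 credentials with read access to SageMaker and CloudWatch, and the thresholds (`idle_threshold`, `low_traffic_threshold`) and bucket names are hypothetical values to tune for your own fleet.

```python
from datetime import datetime, timedelta, timezone


def classify_endpoint(total_invocations: float,
                      idle_threshold: float = 0,
                      low_traffic_threshold: float = 1000) -> str:
    """Bucket an endpoint by invocations over the lookback window (thresholds are assumptions)."""
    if total_invocations <= idle_threshold:
        return "delete-candidate"
    if total_invocations < low_traffic_threshold:
        return "serverless-or-scale-to-zero-candidate"
    return "keep"


def total_invocations(cloudwatch, endpoint_name: str, variant_name: str, days: int = 14) -> float:
    """Sum the CloudWatch Invocations metric for one endpoint variant."""
    end = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="Invocations",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        StartTime=end - timedelta(days=days),
        EndTime=end,
        Period=86400,  # one datapoint per day
        Statistics=["Sum"],
    )
    return sum(dp["Sum"] for dp in resp["Datapoints"])


def sweep_region(region: str, days: int = 14) -> dict:
    """Return {endpoint_name: (invocations, bucket)} for every endpoint in one region."""
    import boto3  # imported lazily so the helpers above work without AWS access

    sm = boto3.client("sagemaker", region_name=region)
    cw = boto3.client("cloudwatch", region_name=region)
    report = {}
    for page in sm.get_paginator("list_endpoints").paginate():
        for ep in page["Endpoints"]:
            name = ep["EndpointName"]
            variants = sm.describe_endpoint(EndpointName=name).get("ProductionVariants", [])
            total = sum(total_invocations(cw, name, v["VariantName"], days) for v in variants)
            report[name] = (total, classify_endpoint(total))
    return report
```

Running `sweep_region` once per enabled region covers step (1); the buckets only flag candidates, and deletion should stay a human decision tied to the owner tags from step (2).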
The reason this ranks first is pure arithmetic. Architecture levers save a percentage of a workload that is actually doing work. Killing an idle GPU endpoint saves 100% of a four-figure monthly line that was doing no work at all. On a typical mid-size ML account with a dozen endpoints, idle cleanup alone routinely removes 20–40% of the inference bill before a single model is re-architected.
A deleted endpoint costs $0. A scaled-to-zero endpoint costs $0 while idle. Before optimizing how an endpoint runs, prove it should be running at all. Audit CloudWatch Invocations across all regions, delete the dead ones, and set a guardrail so the idle fleet does not silently rebuild.
SageMaker offers four inference modes, and choosing the wrong one is the most expensive mistake teams make. The right mode is dictated almost entirely by the shape of your traffic — how often requests arrive, how bursty they are, how latency-sensitive they are, and whether they need an answer now or can be processed offline.
Most teams default to a real-time endpoint because it is the first option in every tutorial. Real-time is the right answer for steady, latency-sensitive, online traffic — and the wrong answer for almost everything else. The other three modes exist specifically to stop you paying for idle instance-hours when your traffic does not justify an always-on fleet.
The decision rule is simple enough to apply in a meeting. Do you need a synchronous answer in milliseconds, on steady traffic? Real-time. Spiky or intermittent and cold-start-tolerant? Serverless. Heavy payloads or long inference that can be queued? Asynchronous (and let it scale to zero). No online answer needed at all? Batch transform. The single most common and most expensive anti-pattern is running a real-time endpoint for work that was actually serverless, async, or batch in disguise.
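The same decision rule can be written down as a small function, which is handy for tagging endpoints during an audit. This is an illustrative sketch: the `TrafficShape` fields and the returned labels are hypothetical names, and the ordering of the checks is one reasonable reading of the rule above.

```python
from dataclasses import dataclass


@dataclass
class TrafficShape:
    needs_online_answer: bool          # does a caller wait on the result at all?
    steady_and_latency_critical: bool  # steady traffic needing a synchronous answer in milliseconds
    heavy_and_queueable: bool          # large payloads or long inference that can be queued
    cold_start_tolerant: bool          # spiky or intermittent traffic that can absorb cold starts


def choose_inference_mode(shape: TrafficShape) -> str:
    """Map a traffic shape to one of the four SageMaker inference modes."""
    if not shape.needs_online_answer:
        return "batch-transform"
    if shape.steady_and_latency_critical:
        return "real-time"
    if shape.heavy_and_queueable:
        return "asynchronous"
    if shape.cold_start_tolerant:
        return "serverless"
    # Nothing cheaper fits: stay on real-time, then right-size and autoscale it.
    return "real-time"
```

For example, a nightly scoring job has `needs_online_answer=False` and lands on batch transform regardless of the other fields.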
Real-time endpoint. Bills: per provisioned instance-hour, 24/7, regardless of traffic.
Use when: you have consistent traffic and need single-digit-to-low-hundreds-of-milliseconds latency on every request — a live recommendation model, a fraud check in the payment path, an interactive feature.
Avoid when: traffic is spiky, intermittent, or offline. If your endpoint spends most of its hours idle, you are paying full price for nothing. This is the mode that creates the idle trap.
Serverless inference. Bills: per millisecond of compute actually used, plus data processed. You configure memory (and, in 2026, optional provisioned concurrency); when no requests arrive, you pay nothing for compute.
Use when: traffic is unpredictable or bursty, there are long quiet periods, and you can tolerate occasional cold-start latency (a few hundred ms to a few seconds while a container spins up). Internal tools, low-volume APIs, and "demo that occasionally gets used" workloads are ideal.
Typical savings: 60–90% versus an always-on real-time endpoint that was mostly idle, because you stop paying for empty hours entirely. The tradeoff is cold starts and (historically) no GPU support — serverless targets CPU-class and smaller models. For large models that need a GPU continuously, serverless is not the lever.
Asynchronous inference. Bills: per instance-hour while processing, but async endpoints can autoscale to zero instances when the request queue is empty, so an idle async endpoint costs $0.
Use when: requests are large (big documents, images, audio), each inference takes seconds to minutes, and the caller can accept a queued, near-real-time response rather than a synchronous one. The caller places the payload in S3, the endpoint queues the request, processes it, and writes the result back to S3.
Typical savings: substantial for bursty heavy workloads, because the combination of queueing and scale-to-zero means you only pay while genuinely processing. It is the natural home for "real-time-ish" workloads that are too heavy or too bursty for a synchronous real-time endpoint but too latency-sensitive for an overnight batch job.
Batch transform. Bills: per instance-hour only for the duration of the job, then the instances are torn down. There is no persistent endpoint and therefore never any idle cost.
Use when: you do not need an online response at all. Nightly scoring of a customer table, enriching a data warehouse, generating embeddings for a corpus, periodic re-scoring — anything where the answer is consumed in bulk rather than per-request.
Typical savings: often the cheapest mode by a wide margin for offline work, because it removes the endpoint entirely. A model that was served from an always-on real-time endpoint purely to run a nightly job is the textbook waste case — moving it to batch transform can cut that workload's cost by 80–95%.
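The 80–95% range for the nightly-job case falls out of the ratio of job hours to 24. The sketch below assumes the batch job uses the same instance type and count as the endpoint it replaces and ignores instance start-up time; the function names are illustrative.

```python
HOURS_PER_DAY = 24


def always_on_daily_cost(hourly_rate: float, instance_count: int = 1) -> float:
    """Daily cost of a real-time endpoint kept up only to run a nightly job."""
    return hourly_rate * instance_count * HOURS_PER_DAY


def batch_daily_cost(hourly_rate: float, job_hours: float, instance_count: int = 1) -> float:
    """Daily cost when the same instances exist only while the job runs."""
    return hourly_rate * instance_count * job_hours


def batch_savings_fraction(hourly_rate: float, job_hours: float, instance_count: int = 1) -> float:
    """Fraction of the always-on cost removed by moving the job to batch transform."""
    always_on = always_on_daily_cost(hourly_rate, instance_count)
    return 1 - batch_daily_cost(hourly_rate, job_hours, instance_count) / always_on


if __name__ == "__main__":
    for hours in (1.2, 2, 4):
        print(f"{hours}h nightly job: {batch_savings_fraction(1.408, hours):.0%} cheaper")
```

Under these assumptions the hourly rate cancels out: the saving is `1 - job_hours / 24`, so shorter jobs sit near the top of the quoted range.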
Once an endpoint genuinely needs to be real-time, the next question is whether it is the right size and whether it shrinks when traffic drops. Most real-time endpoints are over-provisioned at deploy time and never revisited — the instance type was chosen for a feared peak that rarely arrives, and the count never moves.
Right-sizing starts with measurement, not guesswork. Pull CPU/GPU utilization and ModelLatency from CloudWatch under real production load. If a GPU endpoint sits at 15% GPU utilization, you are paying for a GPU you are barely using — either a smaller/cheaper instance fits, or the model can share a GPU with others (multi-model endpoints, next section). The goal is to run each instance in a healthy band — high enough that you are not paying for idle silicon, with enough headroom that latency stays within target during normal peaks.
Autoscaling then makes the fleet track demand instead of standing at peak size all day. SageMaker supports target-tracking autoscaling on metrics like SageMakerVariantInvocationsPerInstance or a target GPU/CPU utilization. You set a target, a minimum, and a maximum; the endpoint adds instances under load and removes them when traffic falls. For a workload with a clear daily peak and a quiet overnight, scaling from (say) four instances at midday down to one overnight removes the hours you were paying for capacity nobody used. Typical savings from right-sizing plus autoscaling land in the 30–50% range on endpoints that were previously pinned at peak size.
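Target tracking is configured through Application Auto Scaling rather than on the endpoint itself. The sketch below builds the two requests (a scalable target and a target-tracking policy) and applies them with boto3. The endpoint name, variant name, target value, and cooldowns are placeholder assumptions; the right target value comes from load-testing your own model.

```python
def scalable_target(endpoint_name: str, variant_name: str,
                    min_instances: int, max_instances: int) -> dict:
    """Request body registering an endpoint variant's instance count as scalable."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }


def target_tracking_policy(endpoint_name: str, variant_name: str,
                           invocations_per_instance: float) -> dict:
    """Request body for a target-tracking policy on invocations per instance."""
    return {
        "PolicyName": f"{endpoint_name}-{variant_name}-target-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,  # seconds; assumed values, tune per workload
            "ScaleOutCooldown": 60,
        },
    }


def apply_autoscaling(endpoint_name: str, variant_name: str, min_instances: int,
                      max_instances: int, invocations_per_instance: float) -> None:
    """Register the target and attach the policy (requires AWS credentials)."""
    import boto3  # imported lazily so the builders above can be inspected offline

    client = boto3.client("application-autoscaling")
    client.register_scalable_target(
        **scalable_target(endpoint_name, variant_name, min_instances, max_instances)
    )
    client.put_scaling_policy(
        **target_tracking_policy(endpoint_name, variant_name, invocations_per_instance)
    )
```

A call such as `apply_autoscaling("my-endpoint", "AllTraffic", 1, 4, 200)` would correspond to the one-to-four-instance example above, with the names and the target of 200 being hypothetical.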
The most impactful recent capability is scale-to-zero for real-time endpoints. SageMaker can now scale a real-time endpoint's instance count down to zero during sustained idle periods and scale it back out when traffic returns, so you pay nothing for the idle window. At the time of writing this applies to endpoints built on inference components, so check the current documentation before planning around it. This collapses much of the idle-trap problem for workloads that have genuinely quiet stretches but still need synchronous serving when active: staging environments, business-hours-only internal tools, regional endpoints that idle overnight. The tradeoff mirrors serverless: requests that arrive while the endpoint is at zero must wait, or be retried, until an instance comes back, so it suits workloads that can absorb an occasional slow or retried request at the edge of an idle period.
A practical pattern combines all three: right-size the instance to real utilization, attach target-tracking autoscaling with a sensible minimum for latency-critical paths, and enable scale-to-zero (or a scheduled scale-down) for environments that do not need to serve overnight. Dev and staging endpoints in particular almost never justify 24/7 capacity — scheduling them to zero outside working hours is close to free money.
Setting an autoscaling minimum above zero buys you cold-start protection for latency-critical paths — keep it for production user-facing endpoints. For dev, staging, and business-hours-only tools, a minimum of zero (scale-to-zero) or a nightly scheduled scale-down is almost always correct: those workloads do not need a warm instance at 3am.
A surprising amount of SageMaker waste is fragmentation: dozens of models, each on its own under-utilized endpoint, each paying for a whole instance to serve a trickle of traffic. The fix is to consolidate — put many models behind fewer instances using multi-model or multi-container endpoints.
Multi-model endpoints (MMEs) host a large number of models behind a single endpoint and a shared fleet of instances. Models are loaded into memory on demand from S3 and evicted under memory pressure, so you can serve hundreds or thousands of models from a handful of instances instead of standing up an endpoint per model. This is transformative for the classic "model per customer," "model per region," or "model per SKU" pattern, where each individual model sees light, intermittent traffic. Instead of paying for N idle instances, you pay for a small shared fleet sized to aggregate demand.
The savings scale with how fragmented you were. A team serving 200 lightweight models on 200 single-instance endpoints is paying for 200 instances at low utilization; folding them into an MME on a handful of right-sized instances can cut that fleet cost by 50–80%. The tradeoff is a cold-load penalty the first time a given model is invoked after eviction, so MMEs suit catalogs where any individual model's traffic is light and a small first-hit latency is acceptable. Large, latency-critical, constantly-hot models are better on their own dedicated capacity.
Multi-container endpoints (and inference components) address a related case: hosting several different models or framework stacks behind one endpoint, sharing the underlying instances and, increasingly, sharing a GPU. Inference components let you pack multiple models onto the same accelerators and scale each independently, squeezing utilization up on expensive GPU instances rather than dedicating a whole GPU to a model that only needs a slice of it. For GPU fleets specifically, this is one of the strongest levers available: GPUs are the most expensive meter, and most single-model GPU endpoints run them well below capacity.
The unifying idea is utilization. An instance at 70% utilization serving five models is dramatically cheaper per prediction than five instances at 14% each serving one model apiece — same total work, one-fifth the instances. Consolidation does not make any single prediction faster, but it stops you renting five GPUs to do one GPU's worth of work.
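The five-models example can be checked with a few lines of arithmetic. This sketch makes two simplifying assumptions that real fleets will not satisfy exactly: per-model load adds linearly, and all models fit in memory on the shared instances. The function names and the 70% target are illustrative.

```python
import math


def instances_needed(per_model_utilization: list, target_utilization: float = 0.70) -> int:
    """Shared instances needed if every model's load is packed up to the target utilization."""
    total_load = sum(per_model_utilization)
    # round() guards against float noise such as 1.0000000000000002 before taking the ceiling
    return max(1, math.ceil(round(total_load / target_utilization, 6)))


def consolidation_savings(per_model_utilization: list, target_utilization: float = 0.70) -> float:
    """Fraction of instance-hours removed versus one dedicated instance per model."""
    before = len(per_model_utilization)
    after = instances_needed(per_model_utilization, target_utilization)
    return 1 - after / before


if __name__ == "__main__":
    five_models = [0.14] * 5  # five endpoints at 14% utilization each
    print(instances_needed(five_models), consolidation_savings(five_models))
```

With five models at 14% each, one instance at 70% carries the same work, which is the one-fifth figure quoted above.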
Every lever so far reduces how many instance-hours you pay for. This one reduces the price of each hour by running the same inference on purpose-built AWS silicon — AWS Inferentia for accelerated model inference, and Graviton (Arm) for CPU inference.
AWS Inferentia is Amazon's custom inference accelerator, exposed through the inf2 family (Inferentia2) and served on SageMaker via the Neuron SDK. For supported models — which in 2026 span most mainstream transformer and CV architectures — Inferentia2 delivers materially better price-performance than comparable GPU instances: commonly 40–60% lower cost per token (or per inference) at similar latency, because you are buying silicon designed for inference economics rather than general-purpose GPU compute. For high-volume, steady inference of a supported model, moving the endpoint from a g5/GPU instance to an inf2 instance is one of the largest single cost reductions available, and it stacks with right-sizing, autoscaling, and Savings Plans.
The cost of admission is portability. Models must be compiled for Neuron before they run on Inferentia, and while the toolchain and model coverage have matured substantially, an exotic custom architecture or an unsupported operator can mean compilation work or, occasionally, that a model is not yet a fit. The pragmatic approach is to benchmark your actual model on inf2 before committing: compile it, measure latency and throughput against your current GPU endpoint, and confirm the price-performance win on your traffic. For mainstream models the win is usually clear; the benchmark exists to catch the exceptions.
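Once both endpoints have been load-tested, the comparison reduces to cost per unit of work at an acceptable latency. The sketch below shows one way to frame it; the `BenchmarkResult` type is illustrative, and the throughput and latency numbers in the example are placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    instance_type: str
    hourly_rate: float            # on-demand $/hour
    throughput_per_second: float  # sustained inferences/second measured under load
    p99_latency_ms: float


def cost_per_million(result: BenchmarkResult) -> float:
    """Dollars per one million inferences at the measured sustained throughput."""
    inferences_per_hour = result.throughput_per_second * 3600
    return result.hourly_rate / inferences_per_hour * 1_000_000


def migration_wins(baseline: BenchmarkResult, candidate: BenchmarkResult,
                   latency_target_ms: float) -> bool:
    """Candidate wins only if it meets the latency target and is cheaper per inference."""
    meets_latency = candidate.p99_latency_ms <= latency_target_ms
    return meets_latency and cost_per_million(candidate) < cost_per_million(baseline)


if __name__ == "__main__":
    # Placeholder measurements for illustration only.
    gpu = BenchmarkResult("ml.g5.xlarge", 1.408, 100.0, 45.0)
    inf2 = BenchmarkResult("ml.inf2.xlarge", 0.99, 120.0, 40.0)
    print(cost_per_million(gpu), cost_per_million(inf2), migration_wins(gpu, inf2, 50.0))
```

Gating on latency first matters: a cheaper instance that misses the latency target is not a valid comparison point.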
Graviton is the same idea on the CPU side. AWS's Arm-based Graviton processors power CPU instance families that deliver better price-performance than equivalent x86 instances for many CPU-served models — classical ML, smaller transformers, and embedding models that do not need a GPU. For CPU inference workloads, choosing a Graviton-backed instance type over an x86 one is a low-effort win: often double-digit-percent cost-performance improvement with no architecture change beyond ensuring your container is built for Arm.
The decision order is worth stating plainly. First confirm you even need an accelerator — many models served on GPUs run perfectly well, and far more cheaply, on a right-sized CPU (Graviton) instance, and teams routinely over-reach for GPUs out of habit. If you genuinely need acceleration for throughput or latency, benchmark Inferentia2 before defaulting to a GPU. GPUs remain the right choice for very large models, cutting-edge architectures the Neuron stack does not yet cover, and training — but for steady, supported inference, custom AWS silicon is usually the cheaper meter.
Inferentia2 and Graviton win on price-performance for supported models — but support is the catch. Compile your actual model for Neuron (Inferentia) or rebuild your container for Arm (Graviton), measure latency and throughput on your real traffic, and confirm the win before migrating the endpoint. For mainstream models the 40–60% inference savings are real; the benchmark catches the exceptions.
Every lever above is about running fewer or cheaper instance-hours on demand. The final pricing lever is a commitment discount on the instance-hours you genuinely cannot avoid: the steady-state baseline that runs 24/7 no matter what.
SageMaker Savings Plans give you a discount — up to roughly 64% off on-demand — in exchange for committing to a consistent amount of compute spend (measured in $/hour) for a one- or three-year term. They apply across SageMaker instance usage including real-time inference, and they are the right tool for the portion of your fleet that is provably always on: the production endpoints that must serve traffic around the clock and have a stable floor of demand.
The sequencing matters enormously, and getting it backwards is a classic FinOps error. Optimize first, commit second. A Savings Plan locks in spend; if you commit to a baseline and then discover half of it was idle endpoints, over-provisioned instances, or workloads that should have been serverless, you have purchased a discount on waste — and you are contractually holding that waste for a year or three. The correct order is: kill idle endpoints, move workloads to the right endpoint type, right-size and autoscale, migrate to cheaper silicon where it wins — and only then measure the stable, irreducible baseline that remains and commit that to a Savings Plan.
Sizing the commitment is a confidence decision. Look at the trough of your optimized usage — the floor below which spend essentially never drops — and commit to that floor, not your average and certainly not your peak. Usage above the commitment simply bills at on-demand (and can flex with autoscaling, serverless, and batch). Usage below the commitment is wasted commitment. Committing the conservative floor captures most of the discount while leaving headroom for the variable, already-optimized layer on top. Many teams start with a one-year commitment on a clearly stable baseline and expand it as confidence grows.
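The floor-versus-average tradeoff can be made concrete by replaying a candidate commitment against hourly spend history. This is a simplified sketch: it treats spend and commitment in the same $/hour units and ignores the discount conversion between on-demand and Savings Plan rates. The function names are illustrative.

```python
def floor_commitment(hourly_spend: list) -> float:
    """The trough of optimized usage: the level spend never dropped below."""
    return min(hourly_spend)


def evaluate_commitment(hourly_spend: list, commitment: float) -> dict:
    """Split usage into hours covered by the commitment, on-demand overflow, and unused commitment."""
    covered = sum(min(spend, commitment) for spend in hourly_spend)
    overflow = sum(max(spend - commitment, 0) for spend in hourly_spend)
    unused = sum(max(commitment - spend, 0) for spend in hourly_spend)
    return {"covered": covered, "on_demand_overflow": overflow, "unused_commitment": unused}


if __name__ == "__main__":
    spend = [4.0, 6.0, 10.0]  # hypothetical hourly spend samples
    print(evaluate_commitment(spend, floor_commitment(spend)))
    print(evaluate_commitment(spend, 6.0))
```

Committing to the floor leaves `unused_commitment` at zero by construction; any commitment above the floor trades a larger discounted share for some hours of paid-for, unused commitment.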
Optimize first, commit second. A Savings Plan is a discount on spend you have promised to make — so promise only the steady-state baseline that survives after you have deleted idle endpoints, right-sized, and moved workloads to the cheapest correct mode. Committing before optimizing locks in waste for one to three years.
Here is the whole playbook in one view, ordered the way you should actually apply it. Effort is the work to implement; the savings ranges are typical outcomes on workloads where the lever applies, not guarantees — your numbers depend on your starting point. The cardinal sequencing rule holds: free, high-impact cleanup first; pricing commitments last.
| # | Lever | Mechanism | Typical savings | Effort | Best for |
|---|---|---|---|---|---|
| 1 | Kill idle endpoints | Delete endpoints serving ~0 traffic; remove the meter entirely | 100% of the idle line | Low | Every account — do this first |
| 2 | Right endpoint type | Serverless / async / batch stop paying for idle hours | 60–90% | Med | Spiky, intermittent, or offline traffic |
| 3 | Right-size + autoscale | Match instance + count to real utilization; track demand | 30–50% | Med | Over-provisioned always-on real-time endpoints |
| 4 | Scale-to-zero | Real-time endpoint drops to 0 instances when idle | up to 100% of idle windows | Low–Med | Dev/staging + business-hours-only endpoints |
| 5 | Multi-model / multi-container | Pack many models onto a shared, well-utilized fleet | 50–80% | Med | Fragmented fleets — model-per-customer/region/SKU |
| 6 | Inferentia2 | Run supported model inference on purpose-built silicon | 40–60% per token | Med–High | High-volume steady inference of supported models |
| 7 | Graviton (Arm) | CPU inference on cheaper Arm instances | 10–30% | Low–Med | CPU-served models — classical ML, embeddings |
| 8 | SageMaker Savings Plan | Commit the steady baseline for up to ~64% off | up to 64% on baseline | Low | The irreducible always-on floor — commit LAST |
| 9 | Move to Bedrock managed | Per-token API removes instance-hours + idle entirely | workload-dependent | High | Low/bursty utilization serving foundation models |
The most important cost decision is not which SageMaker lever to pull — it is whether you should be paying for instances at all. If you are self-hosting a foundation model on SageMaker purely to serve inference, a managed per-token API like Amazon Bedrock is frequently cheaper and structurally removes the idle problem. This is the honest build-vs-buy line.
The two pricing models are fundamentally different. SageMaker bills per instance-hour: you rent capacity, and your unit economics depend entirely on keeping that capacity busy. Bedrock bills per token (input and output) on its on-demand tier: you pay for work performed, and an idle Bedrock model costs nothing because there is nothing to provision. That difference is decisive at low utilization. A self-hosted model on a GPU endpoint that runs at 15% utilization is paying roughly 6.7× (1 ÷ 0.15) the per-prediction cost it would at full load, whereas the same traffic on Bedrock costs the same per token whether it is your only request that hour or your ten-thousandth.
The crossover is utilization. There is a break-even where a fully utilized, committed accelerator instance (especially Inferentia2 under a Savings Plan) beats per-token pricing: sustained, high-volume, predictable inference is where self-hosting wins, because you amortize the instance across enormous throughput. Below that break-even, where traffic is low, bursty, or unpredictable, Bedrock almost always wins on total cost and simultaneously removes the idle trap, the right-sizing problem, the autoscaling tuning, and the operational burden of running the fleet. Many teams discover that an endpoint they spent weeks optimizing should not have been a self-hosted endpoint at all.
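The break-even can be estimated with a short calculation. The sketch below assumes a single blended per-token price for the managed API and a measured peak throughput for the self-hosted instance; all example numbers are hypothetical, and real comparisons should separate input and output token prices.

```python
def self_hosted_cost_per_1k_tokens(hourly_rate: float, peak_tokens_per_second: float,
                                   utilization: float) -> float:
    """Effective $ per 1,000 tokens when the instance is busy only a fraction of the time."""
    tokens_per_hour = peak_tokens_per_second * 3600 * utilization
    return hourly_rate / tokens_per_hour * 1000


def break_even_utilization(hourly_rate: float, peak_tokens_per_second: float,
                           managed_price_per_1k_tokens: float) -> float:
    """Utilization above which self-hosting is cheaper per token than the managed API.

    A result above 1.0 means self-hosting cannot win at this rate and throughput.
    """
    full_load_cost = self_hosted_cost_per_1k_tokens(hourly_rate, peak_tokens_per_second, 1.0)
    return full_load_cost / managed_price_per_1k_tokens


if __name__ == "__main__":
    # Hypothetical inputs: $3.60/hour instance, 1,000 tokens/s at full load, $0.004 per 1K tokens managed.
    print(break_even_utilization(3.6, 1000.0, 0.004))
    print(self_hosted_cost_per_1k_tokens(3.6, 1000.0, 0.15))
```

Because effective cost scales with `1 / utilization`, an endpoint at 15% utilization pays about 6.7 times its full-load cost per token, which is why low-utilization endpoints are the first candidates for the build-vs-buy test.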
Cost is not the only axis, and the non-cost reasons to stay on SageMaker are real. You keep self-hosting for a custom or proprietary model not available as a managed API; a fine-tune you must fully own and control; strict data-isolation or compliance requirements that demand the model run inside your own VPC on dedicated capacity; or genuinely high, steady utilization where reserved silicon plus a Savings Plan beats per-token economics. Those are legitimate, common, and worth paying for.
The decision framework is therefore two questions. First: is there a managed API (Bedrock, or a Bedrock-hosted version of your model family) that serves your use case? If not, you self-host — optimize with levers 1–8. If yes: is your utilization high and steady enough that a reserved, fully-utilized endpoint beats per-token pricing, or do you have a hard isolation/ownership requirement? If yes, self-host. If no — if you are running low or bursty inference on a foundation model purely because that is how you started — moving to Bedrock managed is usually the single largest cost reduction on the table, because it does not optimize the instance-hours, it deletes them.
Self-host on SageMaker for custom models, owned fine-tunes, in-VPC isolation, or sustained high utilization — and optimize hard with levers 1–8. Move to Bedrock managed when you are serving a foundation model at low or bursty utilization: per-token pricing deletes the idle problem instead of optimizing around it.
Levers in the wrong order waste money. Here is the sequence a disciplined team runs, front-loading the free, high-impact work and saving the commitments and migrations for after the picture is clean.
Week 1 — audit and delete (free, highest ROI). List every endpoint in every region, tag owner + environment, pull 7–14 days of CloudWatch Invocations and utilization, and delete everything dead. Schedule dev/staging endpoints to scale to zero or auto-delete nightly. This alone typically removes 20–40% of the bill.
Week 2 — re-home workloads to the right type. For each surviving endpoint, ask the traffic-shape question: offline jobs to batch transform, heavy/bursty payloads to async (scale-to-zero), intermittent low-volume APIs to serverless. Leave only genuinely steady, latency-critical traffic on real-time.
Week 3 — right-size, autoscale, consolidate. Tune instance types to real utilization, attach target-tracking autoscaling with sensible minimums, enable scale-to-zero where idle windows exist, and fold fragmented model fleets into multi-model or multi-container endpoints to lift utilization on the expensive GPU instances.
Week 4 — cheaper silicon, then commit. Benchmark steady high-volume models on Inferentia2; move CPU models to Graviton where the container supports Arm. Only now that the fleet is clean, measure the stable baseline and commit it to a Savings Plan — and run the build-vs-buy test on any low-utilization foundation-model endpoints, migrating to Bedrock where it wins.
Running these in order, rather than reaching first for a Savings Plan or a cheaper instance, means you discount and migrate a fleet that has already shed its waste — so every dollar of commitment lands on something real. The biggest wins are almost always in the first two weeks, cost almost nothing, and require no new architecture.
Endpoint-type choice is the largest architectural cost lever, and it is dictated by traffic shape. This is the decision table: match the row to how your requests actually arrive, not to which option appeared first in the tutorial you started from.
| Variable | Real-time | Serverless | Asynchronous | Batch transform |
|---|---|---|---|---|
| Billing basis | Per instance-hour, 24/7 | Per ms of compute + data processed | Per instance-hour while processing | Per instance-hour for the job only |
| Cost when idle | Full price (the trap) | $0 | $0 (scales to zero) | $0 (no endpoint exists) |
| Latency | Lowest, consistent | Low + cold starts | Queued / near-real-time | Offline / bulk |
| Best traffic shape | Steady, online | Spiky, intermittent | Bursty, heavy payloads | Scheduled, offline |
| Payload / duration | Small, fast | Small–medium, fast | Large, long-running | Whole datasets |
| GPU support | Yes | Limited (CPU-class) | Yes | Yes |
| Typical savings vs mostly idle real-time | Baseline | 60–90% | Substantial for bursty, heavy workloads | 80–95% for offline work |
Situation: A document-intelligence product running everything on always-on real-time GPU endpoints: a per-customer model fleet (one endpoint per enterprise customer), a nightly re-scoring job served from a live endpoint, and a self-hosted open-weights model behind a g5.12xlarge at ~18% average utilization. Bill had grown with the customer count and nobody owned it. The ML team wanted to cut spend without a multi-month rewrite, and to fund the rebuild rather than pay for it out of runway.
What CloudRoute did: Routed within 20 hours to an AWS Advanced-tier partner with a SageMaker FinOps + Neuron track record. Week 1: audited all regions, deleted 9 dead demo/staging endpoints (immediate ~$6K/month). Week 2: moved the nightly re-scoring job to batch transform (no endpoint), and folded the 40+ per-customer models into a multi-model endpoint on a small right-sized fleet. Week 3: benchmarked the self-hosted model on inf2, confirmed the price-performance win, and migrated it; right-sized + added scale-to-zero on staging. Week 4: committed the now-stable baseline to a 1-year SageMaker Savings Plan. The partner filed AWS Well-Architected / POC funding to cover the engagement.
Outcome: Monthly inference spend fell from ~$31K to ~$9.8K — a ~68% reduction — within the engagement window, with latency targets held. Idle cleanup and batch/MME consolidation drove most of it; Inferentia2 and the Savings Plan compounded the rest. The remediation work was AWS-funded via the partner; CloudRoute's commission was paid by the partner. Customer paid $0 for the engagement.
engagement window: ~4 weeks · monthly spend: $31K → ~$9.8K (~68%) · idle endpoints removed: 9 · cost to customer: $0