Spot is AWS's spare compute capacity sold at a steep discount — typically 70–90% below On-Demand — in exchange for a 2-minute interruption notice. Used well, Spot is the single largest line-item cut available on EC2. Used carelessly, it takes your production traffic down at the worst possible moment. This guide covers what belongs on Spot, what doesn't, how capacity-optimized allocation and pool diversification keep interruptions rare, and how to wire Spot into Auto Scaling Groups, EKS (Karpenter), and Fargate without praying.
Spot Instances are not a different class of hardware. They are ordinary EC2 instances — the same M, C, R, and accelerated families you already use — drawn from whatever capacity a given Availability Zone has sitting idle at that moment. You pay the Spot price, which floats with supply and demand but settles far below On-Demand. The trade you make for that discount is reclaimability.
At any moment, every AWS Availability Zone has some unused EC2 capacity: servers provisioned for peak On-Demand usage that nobody is renting right now. Rather than let that capacity sit idle, AWS sells it as Spot at a discount. When On-Demand usage rises and AWS needs the capacity back, it reclaims your Spot instances after sending a 2-minute interruption notice. That is the entire bargain: a deep discount in exchange for the possibility of a 2-minute eviction.
How deep is the discount? It varies by instance family, region, and Availability Zone, but the steady-state range is roughly 70–90% off the equivalent On-Demand rate. A general-purpose instance that lists at $0.10/hour On-Demand commonly runs $0.02–$0.03/hour on Spot. Always confirm the live number in the Spot price history (EC2 console → Spot Requests → Pricing history, or the `describe-spot-price-history` CLI command), and check the Spot placement score for current capacity; both move.
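To make the discount arithmetic concrete, here is a minimal sketch that computes the percentage discount and keeps the most recent price per pool. The sample records are hypothetical; they only imitate the shape of Spot price history entries (`InstanceType`, `AvailabilityZone`, `SpotPrice` as a string, `Timestamp`), and the timestamps are ISO strings here for simplicity.

```python
from decimal import Decimal


def spot_discount_pct(on_demand_price: str, spot_price: str) -> Decimal:
    """Percent discount of a Spot price relative to the On-Demand price."""
    od = Decimal(on_demand_price)
    spot = Decimal(spot_price)
    return (od - spot) / od * 100


def latest_price_per_pool(records):
    """Keep the most recent Spot price for each (instance type, AZ) pool."""
    latest = {}
    # Sorting by timestamp means later records overwrite earlier ones.
    for rec in sorted(records, key=lambda r: r["Timestamp"]):
        latest[(rec["InstanceType"], rec["AvailabilityZone"])] = rec["SpotPrice"]
    return latest


# Hypothetical sample data, not real prices.
sample = [
    {"InstanceType": "m6i.large", "AvailabilityZone": "us-east-1a",
     "SpotPrice": "0.0310", "Timestamp": "2024-01-01T00:00:00Z"},
    {"InstanceType": "m6i.large", "AvailabilityZone": "us-east-1a",
     "SpotPrice": "0.0250", "Timestamp": "2024-01-02T00:00:00Z"},
    {"InstanceType": "m6i.large", "AvailabilityZone": "us-east-1b",
     "SpotPrice": "0.0280", "Timestamp": "2024-01-02T00:00:00Z"},
]

latest = latest_price_per_pool(sample)
```

With the $0.10/hour On-Demand rate from the example above, `spot_discount_pct("0.10", "0.025")` works out to a 75% discount.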
Two clarifications that trip people up. First, pricing is no longer an auction. Years ago you set a "max bid" and competed; today the Spot price changes gradually based on long-term supply and demand, and you simply pay the current price. You can still set a maximum price, but the right default for most workloads is to leave it at the On-Demand price (the cap) so you are never interrupted for price reasons — only for genuine capacity reclamation. Second, a "Spot pool" is one specific combination of instance type + size + Availability Zone (e.g., `c7g.xlarge` in `us-east-1a`). Reclamations happen per pool. The whole reliability game is about not depending on any single pool.
The capacity signal worth knowing is the Spot placement score: AWS scores how likely a given region or AZ is to fulfill a Spot request of a given size and shape right now. Before you commit a large Spot fleet to a region, check the placement score — it tells you whether the capacity you need actually exists where you want it.
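As a rough illustration, the check can be scripted. The sketch below filters a response shaped like the one returned by the EC2 `get_spot_placement_scores` call in boto3 (a `SpotPlacementScores` list with `Region` and `Score` fields). The threshold of 7 and the sample response are assumptions for illustration, not AWS guidance.

```python
def regions_meeting_score(response: dict, minimum: int = 7) -> list[str]:
    """Return regions whose best placement score is at least `minimum`."""
    best = {}
    for entry in response["SpotPlacementScores"]:
        region = entry["Region"]
        # A region can appear more than once; keep its highest score.
        best[region] = max(best.get(region, 0), entry["Score"])
    return sorted(region for region, score in best.items() if score >= minimum)


# Hypothetical response; real scores come from the API at request time.
sample_response = {
    "SpotPlacementScores": [
        {"Region": "us-east-1", "Score": 9},
        {"Region": "eu-west-1", "Score": 4},
        {"Region": "us-west-2", "Score": 7},
    ]
}

candidates = regions_meeting_score(sample_response)
```

Scores are a point-in-time signal, so a check like this belongs right before a large request rather than in a one-off planning document.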
Spot = the exact same EC2 instances you already run, at 70–90% off, with the condition that AWS can take any individual instance back on 2 minutes’ notice. Everything else in this guide is about arranging your workload so that a 2-minute eviction of one instance is a non-event.
The single decision that determines whether Spot saves you money or causes an incident is workload fit. The question is never "is this important?" — plenty of important workloads run beautifully on Spot. The question is: "if one instance vanishes in 2 minutes, does the work survive?" If yes, it is a Spot candidate. If a lost instance means lost data or a hard outage with no graceful recovery, keep it off Spot.
Below is the practitioner's split. The "safe" column is not theoretical — these are the workloads that companies run on Spot at 60–90% Spot mix in production every day. The "not safe" column is the set where the 2-minute notice is genuinely not enough, or where the state lost on eviction is irreplaceable.
Stateless web / API tiers behind a load balancer. If a request can be retried against another healthy instance and no session state lives only in local memory, a fleet of Spot instances behind an ALB or NLB is ideal. On eviction, the target is drained and traffic shifts to surviving instances. Keep session state in ElastiCache, DynamoDB, or sticky-but-replicated storage — not on the box.
Batch, ETL, and data processing. AWS Batch, queue-driven workers (SQS consumers), nightly ETL, video/image transcoding (MediaConvert-style pipelines), report generation. The job is idempotent or checkpointable; if a worker dies, the message returns to the queue and another worker picks it up. This is the canonical Spot use case.
CI/CD runners. GitHub Actions self-hosted runners, GitLab runners, Jenkins agents, build farms. Builds are short, retryable, and stateless. Running CI on Spot routinely cuts build-infrastructure cost by 70%+ with no developer-visible downside beyond the rare re-queued job.
Big-data clusters. Amazon EMR (Spark, Hive, Presto/Trino), and self-managed Spark/Hadoop. EMR has first-class Spot support and lets you put task nodes (which hold no HDFS data) entirely on Spot while keeping core nodes more stable. Spark's own resilience re-computes lost partitions.
Kubernetes / EKS worker nodes. Stateless pods, horizontally scaled deployments, and most workloads scheduled by the cluster autoscaler or Karpenter. Kubernetes already assumes nodes are cattle; combined with proper Pod Disruption Budgets and graceful node draining, EKS on Spot is one of the highest-leverage Spot wins available.
ML training and inference experimentation. Training jobs that checkpoint to S3 every N steps resume from the last checkpoint after an interruption. SageMaker managed Spot training does this for you and can cut training cost by up to ~90%. Batch / asynchronous inference is also a strong fit; keep latency-critical real-time inference endpoints on stable capacity.
Single, non-replicated stateful databases. A primary database holding the only copy of live data, with state in memory and no synchronous replica, has no business on Spot — a 2-minute notice is not enough to fail over cleanly, and you risk data loss. Use managed services (RDS/Aurora) on stable capacity, or a properly replicated cluster, and cover it with a Savings Plan or Reserved Instance instead.
Workloads that cannot checkpoint and must run to completion. A 9-hour simulation that holds everything in RAM and produces nothing useful until the final minute will lose all 9 hours on an eviction at hour 8. If you cannot add checkpointing, keep it on On-Demand (or make checkpointing the prerequisite for moving it to Spot).
License-locked or single-pinned servers. Software pinned to one host ID, legacy license managers, and anything that cannot tolerate its instance being replaced under it. The replacement churn of Spot fights this model.
Latency-critical singletons with no warm standby. A real-time inference endpoint or stateful game server with no warm replacement, where a 2-minute drain still means a user-visible drop, should stay on stable capacity — or get a warm standby first, then move the elastic surplus to Spot.
Before you put anything on Spot, answer one question: "If AWS evicts this instance in 2 minutes, does the work survive — automatically?" Survives via retry, re-queue, checkpoint, or a healthy peer → Spot. Loses data or causes a hard outage with no graceful recovery → not Spot.
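The one-question test is simple enough to encode as a checklist. This is a sketch only: the `Workload` fields are hypothetical labels for the four survival paths named above, not an AWS concept.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    retryable: bool = False         # a failed request or job can simply be retried
    requeues: bool = False          # unfinished work returns to a queue
    checkpoints: bool = False       # progress is saved and resumable
    has_healthy_peer: bool = False  # another instance can take over immediately


def spot_candidate(w: Workload) -> bool:
    """True if the work survives a 2-minute eviction automatically."""
    return any([w.retryable, w.requeues, w.checkpoints, w.has_healthy_peer])


ci_runner = Workload("ci-runner", retryable=True)
primary_db = Workload("single-primary-db")
```

Anything for which `spot_candidate` returns `False` stays on stable capacity until one of the four properties is added.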
The difference between Spot that interrupts constantly and Spot that almost never interrupts is two settings: which allocation strategy you use, and how many pools you spread across. Get these right and real-world interruption rates for well-diversified fleets routinely sit in the low single-digit percent per month. Get them wrong — chase the absolute cheapest pool, depend on one instance type in one AZ — and you will feel every reclamation.
Allocation strategy. When your fleet needs more Spot capacity, AWS has to decide which pools to draw from. The strategy you choose controls that decision:
Diversification. Allocation strategy only has good options to choose from if you give it many pools. Diversification means telling your fleet it can use many instance types across many Availability Zones, rather than pinning to one. Practical guidance: select a broad set of instance types that are interchangeable for your workload — multiple sizes (e.g., `xlarge` and `2xlarge`), multiple families that meet your vCPU/memory shape (e.g., `m6i`, `m6a`, `m7i`, `m7g`), and every AZ in the region. Ten or more pools is a healthy target; with that many, the reclamation of any single pool is absorbed by the rest, and AWS almost always has a deep pool to pull from. The attribute-based instance type selection feature lets you specify requirements ("≥ 4 vCPU, ≥ 8 GiB, current-gen") and have AWS expand that to the full matching set automatically, so you do not hand-maintain instance lists as new families launch.
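To see how quickly a modest instance list reaches the ten-pool target, here is a small sketch that enumerates pools as (instance type, Availability Zone) pairs. The zone names are illustrative, and it assumes every type is offered in every zone, which is not always true. Note that `m7g` is ARM, so including it assumes multi-arch images.

```python
from itertools import product


def spot_pools(instance_types, availability_zones):
    """Each (instance type, AZ) pair is one Spot pool."""
    return list(product(instance_types, availability_zones))


families = ["m6i", "m6a", "m7i", "m7g"]
sizes = ["xlarge", "2xlarge"]
instance_types = [f"{family}.{size}" for family in families for size in sizes]
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]  # illustrative AZ names

pools = spot_pools(instance_types, zones)
enough_pools = len(pools) >= 10  # 8 types x 3 zones = 24 pools
```

Four families in two sizes across three zones already gives 24 pools, which is why widening the family list usually matters more than hand-tuning it.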
A note on mixing architectures: Graviton (ARM) Spot pools are frequently the deepest and cheapest of all, because Graviton supply is plentiful and demand is still catching up. If your workload runs on ARM (or you can build multi-arch images), adding Graviton instance types to your diversification set both lowers cost and improves Spot availability. That is a direct overlap with the Graviton migration lever covered elsewhere in this cluster.
You almost never request raw Spot instances by hand in production. You let an orchestrator manage the fleet, request Spot capacity, react to interruptions, and replace evicted instances. The two orchestrators that matter for most teams are Auto Scaling Groups (for general EC2 fleets) and Karpenter (for EKS/Kubernetes). Both implement capacity-optimized allocation and diversification natively.
Auto Scaling Groups with a mixed instances policy. A modern ASG does not have to be single-instance-type or single-purchase-option. A mixed instances policy lets one ASG span many instance types and split capacity between On-Demand and Spot with explicit knobs:
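A minimal sketch of what those knobs look like, written as the `MixedInstancesPolicy` structure that boto3's Auto Scaling `create_auto_scaling_group` call accepts. The launch template name, instance list, and numeric defaults are hypothetical; the code only builds the dictionary and does not call AWS.

```python
def mixed_instances_policy(launch_template_name, instance_types,
                           on_demand_base=2, on_demand_pct_above_base=20):
    """Build a MixedInstancesPolicy dict for an Auto Scaling Group."""
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # One override per instance type the group is allowed to launch.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            # Always-On-Demand floor, in instances.
            "OnDemandBaseCapacity": on_demand_base,
            # Above the floor: this percent On-Demand, the rest Spot.
            "OnDemandPercentageAboveBaseCapacity": on_demand_pct_above_base,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }


policy = mixed_instances_policy(
    "web-tier",  # hypothetical launch template name
    ["m6i.xlarge", "m6a.xlarge", "m7i.xlarge", "m6i.2xlarge"],
)
```

With these example values the group keeps 2 instances On-Demand and runs 80% of everything above that on Spot. `price-capacity-optimized` is another accepted value for `SpotAllocationStrategy` if you want price weighed alongside pool depth.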
On EKS, you can run Spot through managed node groups (which use ASGs under the hood) — but Karpenter has become the preferred approach for Spot-heavy clusters. Karpenter is a Kubernetes-native autoscaler that watches for unschedulable pods and provisions exactly the right nodes for them in seconds, choosing instance types and purchase options directly against the EC2 fleet API.
Why Karpenter and Spot fit so well: in a NodePool you can declare that Karpenter may use Spot (and On-Demand as fallback), allow a wide, diversified set of instance families and sizes, and let it pick pools automatically. Karpenter requests Spot through EC2 Fleet with the price-capacity-optimized strategy, which weighs pool depth alongside price. When a pod cannot be placed on Spot, Karpenter can fall back to On-Demand, then consolidate back to Spot when capacity returns. It also does workload consolidation (bin-packing pods onto fewer nodes and terminating the emptied ones), which compounds the Spot discount with higher utilization.
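A NodePool along those lines might look like the sketch below, built as a Python dictionary and printed as JSON (which `kubectl apply -f` accepts). It assumes the Karpenter `karpenter.sh/v1` API; field names differ in older versions. The pool name, the `default` EC2NodeClass, and the chosen instance categories are all hypothetical.

```python
import json


def spot_nodepool(name="spot-general", node_class="default"):
    """Build a Karpenter NodePool manifest that allows Spot with On-Demand fallback."""
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": node_class,
                    },
                    "requirements": [
                        # Allow both purchase options.
                        {"key": "karpenter.sh/capacity-type", "operator": "In",
                         "values": ["spot", "on-demand"]},
                        # Allow x86 and Graviton; assumes multi-arch images.
                        {"key": "kubernetes.io/arch", "operator": "In",
                         "values": ["amd64", "arm64"]},
                        # Broad families so many pools are eligible.
                        {"key": "karpenter.k8s.aws/instance-category", "operator": "In",
                         "values": ["c", "m", "r"]},
                    ],
                }
            },
            "disruption": {
                "consolidationPolicy": "WhenEmptyOrUnderutilized",
                "consolidateAfter": "1m",
            },
        },
    }


manifest = spot_nodepool()
print(json.dumps(manifest, indent=2))
```

Keeping the requirements broad is the point: the fewer constraints in the NodePool, the more pools Karpenter can choose from.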
Interruption handling in Karpenter: Karpenter watches the EC2 interruption and rebalance signals (via an SQS queue), cordons and drains the node on the 2-minute notice, and provisions a replacement so pods reschedule cleanly. Pair this with Pod Disruption Budgets (so you never drain too many replicas of one service at once) and sane `terminationGracePeriodSeconds`, and Spot interruptions become routine background events rather than incidents.
The standard EKS pattern: a small On-Demand node pool for the control-plane-adjacent and stateful pods you cannot lose (ingress controllers, stateful sets, single-replica system services), and a large Spot node pool for the stateless application workloads that make up the bulk of the cluster. That single split frequently takes 50–70% off an EKS compute bill.
Reliable Spot is not about avoiding every interruption — it is about making each interruption boring. When AWS decides to reclaim an instance, it gives you three signals and a 2-minute window. A workload that listens for those signals and drains cleanly within the window experiences interruptions as a non-event. A workload that ignores them experiences them as dropped requests and lost jobs.
There are three mechanisms to listen for, in increasing order of lead time:
What "draining cleanly" looks like in practice depends on the workload. For a web tier: catch the notice, deregister the target from the ALB/NLB target group so the load balancer stops routing to it, let in-flight requests complete (connection draining), then terminate. For a queue worker: stop pulling new messages, finish (or return-to-queue) the current message, then exit — the message's visibility timeout means anything unfinished simply reappears for another worker. For a batch / training job: write a checkpoint to S3 on the notice so the job resumes from the last checkpoint elsewhere. For Kubernetes: the node is cordoned and drained, pods get their graceful termination period, and the scheduler places them on surviving or new nodes.
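For instances that handle the notice themselves, one common approach is to poll the instance metadata service for `spot/instance-action`. The sketch below assumes IMDSv2 and the documented notice shape (`{"action": ..., "time": ...}`). The function names are illustrative, and wiring the result into your own drain logic is left out.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout=1.0):
    """Return the raw instance-action JSON, or None if no interruption is scheduled."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token", method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # the endpoint is absent until a notice is issued
            return None
        raise


def seconds_until_action(body, now=None):
    """Parse a notice and return (action, seconds remaining until it happens)."""
    notice = json.loads(body)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return notice["action"], (when - now).total_seconds()
```

A worker would call `fetch_instance_action()` every few seconds and start draining as soon as it returns something other than `None`. Only the parsing half runs off-instance, which also makes it the easy part to unit test.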
A practical robustness tip: keep your graceful-shutdown work comfortably under 120 seconds. If draining a web instance takes 90 seconds of connection draining, you have only 30 seconds of slack — keep request timeouts and drain windows tuned so you never run past the deadline. And always design for the case where the notice does not arrive at all (rare, but possible under sudden hardware failure): idempotent jobs, replicated state, and health-check-driven replacement mean even a no-notice loss is survivable.
All-Spot is a beginner's framing. The mature compute-cost strategy layers three purchase options so each covers the part of the workload it suits best: Savings Plans for the predictable baseline you commit to, On-Demand for the unpredictable spikes you cannot pre-commit and cannot risk on Spot, and Spot for the elastic, interruption-tolerant bulk. Done right, the blended bill is dramatically lower than any single option while staying as reliable as your most stable layer.
Think of your compute demand as three layers stacked on top of each other:
Layer 1: the committed baseline (Savings Plans / Reserved capacity). The floor of compute you run 24/7 and are confident you will keep running for 1–3 years. Cover it with a Compute Savings Plan (flexible across EC2, Fargate, and Lambda) or, for a deeper discount on a stable shape, an EC2 Instance Savings Plan. Savings Plans give up to ~70%+ off On-Demand (AWS quotes up to 66% for Compute Savings Plans and up to 72% for EC2 Instance Savings Plans) in exchange for an hourly spend commitment. Critically, they are a billing construct, not a capacity guarantee, and the On-Demand capacity they discount is never reclaimed the way Spot is. This is the layer you never want on Spot. (See the Savings Plans page in this cluster for the full commitment math.)
Layer 2: the interruption-tolerant elastic layer (Spot). The large, variable portion of demand that scales with traffic, jobs, or batch volume and tolerates a 2-minute eviction. This is where Spot earns its 70–90% discount. The bigger and more stateless this layer is, the more Spot saves you.
Layer 3: the safety valve (On-Demand). The thin top slice for sudden spikes you did not commit to and cannot safely place on Spot, plus the On-Demand fallback when a Spot pool momentarily cannot fulfill capacity. You pay full price here, but it is a small fraction of total hours, and it means your ability to scale never depends on Spot availability.
The reason commitments and Spot are complementary rather than competing: a Savings Plan discount applies only to eligible On-Demand usage (Spot usage is never covered), and AWS applies it first to the usage where the discount percentage is largest. Spot usually already prices below the Savings-Plan-discounted rate, so you do not "waste" a Savings Plan by also running Spot: the commitment soaks up the baseline, Spot handles the elastic surplus more cheaply still, and On-Demand catches the rest. The honest tradeoff to keep in view: commitments reduce flexibility (you owe the hourly spend for the full term whether or not you use it), so size the Savings Plan to the baseline you are genuinely confident about and let Spot + On-Demand flex above it.
A common mature split for a scaled, stateless-heavy workload: ~30–40% Savings-Plan-covered baseline · ~50–60% Spot · ~10% On-Demand spillover. That mix routinely lands a blended compute rate 50–65% below straight On-Demand while the reliability ceiling stays at the level of the Savings-Plan/On-Demand baseline — Spot interruptions only ever touch the elastic layer.
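The blended figure is a weighted average, which a few lines make explicit. The per-layer discounts below (40% for the Savings Plan layer, 75% for Spot) are assumptions chosen for illustration; substitute your own rates.

```python
def blended_discount(layers):
    """layers: list of (share_of_compute_hours, discount_vs_on_demand) tuples."""
    total_share = sum(share for share, _ in layers)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("layer shares must sum to 1.0")
    # Cost relative to running everything On-Demand.
    relative_cost = sum(share * (1 - discount) for share, discount in layers)
    return 1 - relative_cost


mix = [
    (0.35, 0.40),  # Savings-Plan-covered baseline, assumed 40% off
    (0.55, 0.75),  # Spot, assumed 75% off
    (0.10, 0.00),  # On-Demand spillover at full price
]

saving = blended_discount(mix)  # 1 - (0.21 + 0.1375 + 0.10) = 0.5525
```

Under those assumptions the blend lands about 55% below straight On-Demand, inside the 50–65% range quoted above. The Spot share dominates the result, so a shift of ten points between the Spot and On-Demand layers moves the blended rate far more than a better Savings Plan rate does.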
You do not need to run EC2 yourself to get Spot pricing. Fargate Spot brings the same spare-capacity discount to serverless containers — you specify a task and AWS runs it on spare capacity at a steep discount, with the same 2-minute interruption notice, and you never touch an instance, an AMI, or a node group.
On Amazon ECS, a service or task can run on the FARGATE_SPOT capacity provider instead of (or blended with) the regular FARGATE provider. Fargate Spot tasks cost substantially less than standard Fargate (commonly around 50–70% off) and behave like Spot in every other respect: when AWS needs the capacity back, it sends a 2-minute warning as a task-state-change event and a SIGTERM to the task. The task then has until its `stopTimeout` expires to shut down gracefully before SIGKILL; that timeout defaults to 30 seconds and can be raised to 120, so set it explicitly if your drain needs the full window. The fit rules are identical to EC2 Spot: stateless services behind a load balancer, queue workers, and batch and scheduled tasks are a yes; the single stateful task you cannot lose is a no.
The clean pattern on ECS mirrors the EC2 blend: define a capacity provider strategy that places a base number of tasks on regular FARGATE (your always-on floor) and the remainder on FARGATE_SPOT (the elastic, interruptible bulk), with a weight ratio controlling the split. You get most of the savings of Spot with the operational simplicity of Fargate: no instance fleet, no patching, no Karpenter to operate. For teams without the appetite to run and tune EC2 Spot fleets, Fargate Spot is frequently the highest return-on-effort Spot win available, because the orchestration is entirely AWS-managed.
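As a sketch of that strategy, the list below uses the `capacityProviderStrategy` shape that ECS accepts (`capacityProvider`, `base`, `weight`). The numbers are hypothetical, and `expected_split` is only an approximation of the placement; ECS's own rounding for uneven counts may differ.

```python
def capacity_provider_strategy(base_on_fargate=2, fargate_weight=1, spot_weight=3):
    """Base tasks on FARGATE, the remainder split by weight with FARGATE_SPOT."""
    return [
        {"capacityProvider": "FARGATE", "base": base_on_fargate, "weight": fargate_weight},
        {"capacityProvider": "FARGATE_SPOT", "weight": spot_weight},
    ]


def expected_split(desired_count, strategy):
    """Approximate how many tasks land on each provider (fractions not rounded)."""
    # Fill the base first, then divide what is left by weight.
    placed = {p["capacityProvider"]: min(p.get("base", 0), desired_count)
              for p in strategy}
    remaining = desired_count - sum(placed.values())
    total_weight = sum(p["weight"] for p in strategy)
    for p in strategy:
        placed[p["capacityProvider"]] += remaining * p["weight"] / total_weight
    return placed


strategy = capacity_provider_strategy()
split = expected_split(10, strategy)
```

For a service of 10 tasks this gives 4 on FARGATE (2 base plus a quarter of the remaining 8) and 6 on FARGATE_SPOT. Raising `spot_weight` shifts more of the elastic portion to Spot without touching the floor.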
EKS users can also schedule pods onto Fargate via Fargate profiles, though Fargate-on-EKS does not currently offer a Spot price tier the way ECS does — for Spot economics on EKS, the EC2-backed Karpenter/managed-node-group path covered above is the route. The decision tree is simple: containers on ECS → Fargate Spot for the easy win; containers on EKS that need Spot → EC2 node pools via Karpenter; raw VM fleets → ASG mixed instances policy.
These three are not competitors to choose between — they are layers to combine. But to combine them well you need to see clearly what each one trades. The table below is the at-a-glance version; the section above explains how they stack.
| Dimension | Spot | On-Demand | Savings Plans |
|---|---|---|---|
| Discount vs On-Demand | Up to ~90% (typ. 70–90%) | Baseline (0%) | Up to ~70%+ (commit-based) |
| Commitment | None | None | 1 or 3 yr hourly $ commit |
| Can be interrupted? | Yes — 2-min notice | No | No (it is a billing discount, not capacity) |
| Flexibility | High (no lock-in) | Highest (pay-as-you-go) | Lower (locked for the term) |
| Capacity guaranteed? | No | Yes | No (billing discount on On-Demand usage, not a capacity reservation) |
| Best for | Stateless / batch / CI / k8s nodes / ML training | Spiky, uncommitted, can't-lose-and-can't-Spot | The predictable 24/7 baseline |
| Reliability ceiling | Elastic layer only | Full | Full |
| Right layer in the stack | Elastic surplus | Spillover / safety valve | Committed floor |
Spot is high-leverage but it is not set-and-forget. Choosing the diversification set, wiring capacity-optimized allocation, building interruption handling into every workload, configuring Karpenter or the ASG mixed instances policy correctly, and sizing the Savings Plan baseline so commitments and Spot complement rather than collide — that is a real engineering project. It is also exactly the kind of work AWS will frequently fund.
AWS funds partner-led cost-optimization and Well-Architected engagements through its partner programs — the partner is paid through AWS, and a Well-Architected Review (the Cost Optimization pillar in particular) can unlock remediation credits that offset the rework. For qualifying, credit-eligible engagements, that means a vetted partner can architect your Spot adoption, implement the interruption handling and orchestration, and set up the Savings-Plan baseline, and you cut your bill for $0. The honest framing: AWS-funding applies to qualifying engagements; where it does not, it is still a vetted-partner referral that pays for itself many times over in the savings — a 50–65% cut on compute spend dwarfs the cost of the work.
CloudRoute's role is the routing layer: you tell us your stack and your bill, and we match you to an AWS partner with a real Spot / FinOps track record — not a generalist. They run the audit (Compute Optimizer, Cost Explorer, the Spot placement scores, your interruption tolerance per workload), produce the blended-strategy plan, and do the rework. CloudRoute is paid a commission by the partner; you are not in that payment loop. The same engagement typically folds in the rest of this cluster's levers — right-sizing, Graviton migration, Savings Plans — because a competent cost review never optimizes compute in isolation.
Spot is one lever among several in a cost-optimization program. It is the biggest single cut available on the compute line for interruption-tolerant workloads, but it does not touch storage, data transfer, or the stable baseline. Here is how it slots against the neighboring levers in this cluster.
| Situation | Reach for | Why | Typical saving |
|---|---|---|---|
| Large stateless / batch / CI / k8s compute layer | Spot Instances | Interruption-tolerant work gets the deepest discount available | 70–90% on that layer |
| Predictable 24/7 baseline you'll keep for 1–3 yrs | Savings Plans | Commitment beats Spot for must-stay-up capacity | Up to ~70%+ |
| Over-provisioned instances (low CPU/mem utilization) | Right-sizing (Compute Optimizer) | Stop paying for capacity you never use, before discounting it | 20–50% on right-sized fleet |
| Workload runs (or could run) on ARM | Graviton migration | Better price-performance; Graviton Spot pools are deep + cheap | ~20–40% price-perf |
| Whole AWS bill is a mystery / no governance | A partner-led cost audit | Find every lever (incl. storage + data transfer) at once | Compounding |
Situation: Fast-growing product on a single large EKS cluster, all worker nodes On-Demand. Compute was the dominant line item and growing linearly with usage. The team knew Spot existed but had been burned once by a naive all-Spot experiment that took a stateless service down during a capacity crunch, so they'd sworn off it. No interruption handling, no Pod Disruption Budgets, no diversification — and no Savings Plan covering the baseline. Internal platform engineer was fully allocated to product work.
What CloudRoute did: Routed within 20 hours to an EU-Central AWS partner with EKS + FinOps + Karpenter track record. The partner ran a Well-Architected Cost Optimization review, then: (1) split the cluster into a small On-Demand node pool for ingress + stateful/system pods and a large Spot pool via Karpenter with capacity-optimized allocation across ~14 diversified instance types (including Graviton) and all AZs; (2) added interruption handling — SQS-driven drain, Pod Disruption Budgets on every service, tuned grace periods; (3) put a Compute Savings Plan under the ~35% always-on baseline. The Well-Architected review unlocked remediation credits that covered the rework.
Outcome: EKS compute moved to ~60% Spot / 10% On-Demand-elastic / ~30% Savings-Plan baseline. Monthly EKS compute dropped from ~$19K to ~$7.4K — a 61% cut — taking the total bill from ~$31K to ~$19.4K. Measured Spot interruption rate after diversification: ~2% of nodes/month, every one handled with zero user-visible impact. Because the engagement was AWS-funded via the Well-Architected remediation credits, the customer paid $0 for the work; CloudRoute's commission was paid by the partner.
engagement window: ~5 weeks · founder/engineer time: ~7 hours · monthly compute saved: ~$11.6K (61%) · interruption rate: ~2%/mo · cost to customer: $0
CloudRoute routes you to a vetted AWS partner with real Spot + FinOps + Karpenter track record. They architect the diversification, allocation, interruption handling, and Savings Plan baseline — often AWS-funded, so you pay $0 on qualifying work.