ec2 spot instances · 2026 field guide

EC2 Spot Instances — up to 90% off, if you run the right workloads on them.

Spot is AWS's spare compute capacity sold at a steep discount, typically 70–90% below On-Demand, in exchange for AWS's right to reclaim it on 2 minutes' notice. Used well, Spot is the single largest line-item cut available on EC2. Used carelessly, it takes your production traffic down at the worst possible moment. This guide covers what belongs on Spot, what doesn't, how capacity-optimized allocation and pool diversification keep interruptions rare, and how to wire Spot into Auto Scaling Groups, EKS (Karpenter), and Fargate without praying.

discount vs On-Demand: 70–90%
interruption notice: 2 min
best-fit workloads: stateless / batch
audit cost to you: $0*
TL;DR
  • Spot Instances are spare EC2 capacity at 70–90% off On-Demand. The catch: AWS can reclaim them with a 2-minute warning when it needs the capacity back. That makes Spot perfect for interruption-tolerant work — stateless web tiers behind a load balancer, batch and ETL jobs, CI/CD runners, big-data (EMR/Spark), Kubernetes worker nodes, and ML training — and wrong for anything that holds critical state in memory with no checkpoint (single stateful databases, license servers, long jobs that can't resume).
  • Two practices make Spot reliable at scale: capacity-optimized allocation (let AWS draw from the pools with the deepest spare capacity, not the cheapest) and diversification (spread across many instance types and Availability Zones so no single pool reclamation hurts). Modern tooling — Auto Scaling Groups with mixed instances policies, and Karpenter on EKS — implements both for you.
  • The mature pattern is blended, not all-Spot: a Savings-Plan-or-On-Demand baseline covers the floor you can't lose, and Spot scales the elastic, interruption-tolerant layer on top. A CloudRoute-matched AWS partner architects that split — interruption handling, ASG/Karpenter config, and the Savings Plan baseline — as part of a cost-optimization engagement that is often AWS-funded, so you cut the bill for $0 on qualifying work.
the mechanism

I. What EC2 Spot actually is: spare capacity, not a separate product

Spot Instances are not a different class of hardware. They are ordinary EC2 instances — the same M, C, R, and accelerated families you already use — drawn from whatever capacity a given Availability Zone has sitting idle at that moment. You pay the Spot price, which floats with supply and demand but settles far below On-Demand. The trade you make for that discount is reclaimability.

At any moment, every AWS Availability Zone has some unused EC2 capacity — servers that were provisioned for peak On-Demand demand but aren't being rented right now. Rather than let that capacity sit idle, AWS sells it as Spot at a discount. When On-Demand demand rises and AWS needs that capacity back, it reclaims your Spot instances after sending a 2-minute interruption notice. That is the entire bargain: deep discount, in exchange for the possibility of a 2-minute eviction.

How deep is the discount? It varies by instance family, region, and Availability Zone, but the steady-state range is roughly 70–90% off the equivalent On-Demand rate. A general-purpose instance that lists at $0.10/hour On-Demand commonly runs $0.02–$0.03/hour on Spot. Always confirm the live number in the Spot price history (EC2 console → Spot Requests → Pricing history, or the `describe-spot-price-history` API) and the Spot placement score for current rates and capacity — they move.
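The discount arithmetic is worth making explicit. Here is a minimal Python sketch, assuming a response shaped like the one `describe-spot-price-history` returns (a `SpotPriceHistory` list whose `SpotPrice` values are strings). The sample prices, the zone names, and the $0.10 On-Demand rate are illustrative assumptions, not live data.

```python
def discount_pct(on_demand_hourly: float, spot_hourly: float) -> float:
    """Percent saved versus On-Demand for one instance-hour."""
    return round((1 - spot_hourly / on_demand_hourly) * 100, 1)


def cheapest_zone(history: dict, instance_type: str) -> tuple[str, float]:
    """Return (availability_zone, price) of the lowest Spot price listed for a type."""
    rows = [
        (row["AvailabilityZone"], float(row["SpotPrice"]))  # SpotPrice arrives as a string
        for row in history["SpotPriceHistory"]
        if row["InstanceType"] == instance_type
    ]
    return min(rows, key=lambda r: r[1])


# Hypothetical sample, not real prices.
sample = {
    "SpotPriceHistory": [
        {"AvailabilityZone": "us-east-1a", "InstanceType": "m6i.large", "SpotPrice": "0.0300"},
        {"AvailabilityZone": "us-east-1b", "InstanceType": "m6i.large", "SpotPrice": "0.0200"},
    ]
}

zone, price = cheapest_zone(sample, "m6i.large")
print(zone, price, discount_pct(0.10, price))
```

With these sample numbers the cheaper zone works out to 80% off, and the $0.03 zone to 70% off, which is the span the $0.02–$0.03 example describes.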

Two clarifications that trip people up. First, pricing is no longer an auction. Years ago you set a "max bid" and competed; today the Spot price changes gradually based on long-term supply and demand, and you simply pay the current price. You can still set a maximum price, but the right default for most workloads is to leave it at the On-Demand price (the cap) so you are never interrupted for price reasons, only for genuine capacity reclamation. Second, a "Spot pool" is one specific combination of instance type (family and size) and Availability Zone (e.g., `c7g.xlarge` in `us-east-1a`). Reclamations happen per pool. The whole reliability game is about not depending on any single pool.

The capacity signal worth knowing is the Spot placement score: AWS scores how likely a given region or AZ is to fulfill a Spot request of a given size and shape right now. Before you commit a large Spot fleet to a region, check the placement score — it tells you whether the capacity you need actually exists where you want it.
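A short sketch of how a placement-score check might be consumed. It assumes a response shaped like the one the EC2 `GetSpotPlacementScores` call returns (a `SpotPlacementScores` list with `Region` and a `Score` from 1 to 10). The helper name, the threshold, and the sample scores are hypothetical.

```python
def regions_meeting_score(response: dict, minimum: int) -> list[str]:
    """Regions whose best placement score is at least `minimum`, best first."""
    best: dict[str, int] = {}
    for entry in response["SpotPlacementScores"]:
        region = entry["Region"]
        best[region] = max(best.get(region, 0), entry["Score"])
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return [region for region, score in ranked if score >= minimum]


# Hypothetical scores for illustration only.
sample = {
    "SpotPlacementScores": [
        {"Region": "us-east-1", "Score": 9},
        {"Region": "eu-central-1", "Score": 4},
        {"Region": "us-west-2", "Score": 7},
    ]
}
print(regions_meeting_score(sample, minimum=7))
```

A score is a point-in-time estimate for the request shape you described, so it is worth re-checking shortly before a large launch rather than caching it.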

the one-line mental model

Spot = the exact same EC2 instances you already run, at 70–90% off, with the condition that AWS can take any individual instance back on 2 minutes’ notice. Everything else in this guide is about arranging your workload so that a 2-minute eviction of one instance is a non-event.

fit

II. What is safe to run on Spot, and what absolutely is not

The single decision that determines whether Spot saves you money or causes an incident is workload fit. The question is never "is this important?" — plenty of important workloads run beautifully on Spot. The question is: "if one instance vanishes in 2 minutes, does the work survive?" If yes, it is a Spot candidate. If a lost instance means lost data or a hard outage with no graceful recovery, keep it off Spot.

Below is the practitioner's split. The "safe" column is not theoretical: these are the workloads that companies run in production at a 60–90% Spot mix every day. The "not safe" column is the set where the 2-minute notice is genuinely not enough, or where the state lost on eviction is irreplaceable.

Safe for Spot (interruption-tolerant)

Stateless web / API tiers behind a load balancer. If a request can be retried against another healthy instance and no session state lives only in local memory, a fleet of Spot instances behind an ALB or NLB is ideal. On eviction, the target is drained and traffic shifts to surviving instances. Keep session state in ElastiCache, DynamoDB, or sticky-but-replicated storage — not on the box.

Batch, ETL, and data processing. AWS Batch, queue-driven workers (SQS consumers), nightly ETL, video/image transcoding (MediaConvert-style pipelines), report generation. The job is idempotent or checkpointable; if a worker dies, the message returns to the queue and another worker picks it up. This is the canonical Spot use case.

CI/CD runners. GitHub Actions self-hosted runners, GitLab runners, Jenkins agents, build farms. Builds are short, retryable, and stateless. Running CI on Spot routinely cuts build-infrastructure cost by 70%+ with no developer-visible downside beyond the rare re-queued job.

Big-data clusters. Amazon EMR (Spark, Hive, Presto/Trino), and self-managed Spark/Hadoop. EMR has first-class Spot support and lets you put task nodes (which hold no HDFS data) entirely on Spot while keeping core nodes more stable. Spark's own resilience re-computes lost partitions.

Kubernetes / EKS worker nodes. Stateless pods, horizontally scaled deployments, and most workloads scheduled by the cluster autoscaler or Karpenter. Kubernetes already assumes nodes are cattle; combined with proper Pod Disruption Budgets and graceful node draining, EKS on Spot is one of the highest-leverage Spot wins available.

ML training and inference experimentation. Training jobs that checkpoint to S3 every N steps resume from the last checkpoint after an interruption. SageMaker managed Spot training does this for you and can cut training cost by up to ~90%. Batch / asynchronous inference is also a strong fit; keep latency-critical real-time inference endpoints on stable capacity.

Not safe for Spot (keep on On-Demand / Savings Plans)

Single, non-replicated stateful databases. A primary database holding the only copy of live data, with state in memory and no synchronous replica, has no business on Spot — a 2-minute notice is not enough to fail over cleanly, and you risk data loss. Use managed services (RDS/Aurora) on stable capacity, or a properly replicated cluster, and cover it with a Savings Plan or Reserved Instance instead.

Workloads that cannot checkpoint and must run to completion. A 9-hour simulation that holds everything in RAM and produces nothing useful until the final minute will lose all 9 hours on an eviction at hour 8. If you cannot add checkpointing, keep it on On-Demand (or make checkpointing the prerequisite for moving it to Spot).

License-locked or single-pinned servers. Software pinned to one host ID, legacy license managers, and anything that cannot tolerate its instance being replaced under it. The replacement churn of Spot fights this model.

Latency-critical singletons with no warm standby. A real-time inference endpoint or stateful game server with no warm replacement, where a 2-minute drain still means a user-visible drop, should stay on stable capacity — or get a warm standby first, then move the elastic surplus to Spot.

the litmus test

Before you put anything on Spot, answer one question: "If AWS evicts this instance in 2 minutes, does the work survive — automatically?" Survives via retry, re-queue, checkpoint, or a healthy peer → Spot. Loses data or causes a hard outage with no graceful recovery → not Spot.

the reliability engine

III. Capacity-optimized allocation + diversification: how Spot stays reliable

The difference between Spot that interrupts constantly and Spot that almost never interrupts is two settings: which allocation strategy you use, and how many pools you spread across. Get these right and real-world interruption rates for well-diversified fleets routinely sit in the low single-digit percent per month. Get them wrong — chase the absolute cheapest pool, depend on one instance type in one AZ — and you will feel every reclamation.

Allocation strategy. When your fleet needs more Spot capacity, AWS has to decide which pools to draw from. The strategy you choose controls that decision:

  • capacity-optimized (the default you want) — AWS provisions from the pools with the most spare capacity right now — the pools least likely to be reclaimed soon. This minimizes interruptions and is the right default for the overwhelming majority of workloads. There is also "capacity-optimized-prioritized," which honors your instance-type priority list while still favoring deep-capacity pools — useful when some instance types suit your workload better than others.
  • price-capacity-optimized — A balance: AWS weighs both the availability of capacity and the price, picking pools that are both deep and cheap. For many fleets this is now the best general-purpose choice — you keep most of the interruption resilience of capacity-optimized while shaving a bit more off the bill.
  • lowest-price (usually a trap) — AWS picks the cheapest pools regardless of how thin their capacity is. The cheapest pool is often the one about to be reclaimed, so this strategy maximizes interruptions. Avoid it unless your workload is so trivially interruption-tolerant that you genuinely do not care.
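To make the difference between the three strategies concrete, here is a toy model in Python. AWS's real scoring is not public, so the selection rules below are deliberate simplifications. The `Pool` type, the `spare` counts, and the "drop the shallowest half" rule for price-capacity-optimized are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    name: str     # instance type + Availability Zone
    price: float  # current Spot price, $/hour (hypothetical)
    spare: int    # hypothetical count of idle instances in the pool


def pick(pools: list[Pool], strategy: str) -> Pool:
    """Toy selection rule; a simplification, not AWS's actual algorithm."""
    if strategy == "lowest-price":
        return min(pools, key=lambda p: p.price)
    if strategy == "capacity-optimized":
        return max(pools, key=lambda p: p.spare)
    if strategy == "price-capacity-optimized":
        # Assumption: keep the deepest half of the pools, then take the cheapest survivor.
        deep = sorted(pools, key=lambda p: p.spare, reverse=True)[: max(1, len(pools) // 2)]
        return min(deep, key=lambda p: p.price)
    raise ValueError(f"unknown strategy: {strategy}")


pools = [
    Pool("c6i.xlarge/us-east-1a", price=0.040, spare=15),
    Pool("c7g.xlarge/us-east-1b", price=0.050, spare=400),
    Pool("c6a.xlarge/us-east-1c", price=0.045, spare=250),
    Pool("c5.xlarge/us-east-1a", price=0.060, spare=90),
]
for strategy in ("lowest-price", "capacity-optimized", "price-capacity-optimized"):
    print(strategy, "->", pick(pools, strategy).name)
```

In this sample, lowest-price lands on the thinnest pool (15 spare instances), which is exactly the failure mode described above, while the other two strategies land on pools with hundreds of spare instances.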

Diversification. Allocation strategy only has good options to choose from if you give it many pools. Diversification means telling your fleet it can use many instance types across many Availability Zones, rather than pinning to one. Practical guidance: select a broad set of instance types that are interchangeable for your workload — multiple sizes (e.g., `xlarge` and `2xlarge`), multiple families that meet your vCPU/memory shape (e.g., `m6i`, `m6a`, `m7i`, `m7g`), and every AZ in the region. Ten or more pools is a healthy target; with that many, the reclamation of any single pool is absorbed by the rest, and AWS almost always has a deep pool to pull from. The attribute-based instance type selection feature lets you specify requirements ("≥ 4 vCPU, ≥ 8 GiB, current-gen") and have AWS expand that to the full matching set automatically, so you do not hand-maintain instance lists as new families launch.
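Counting pools is simple multiplication, since each instance type in each Availability Zone is its own pool. A small sketch, with a hypothetical instance list and zone layout:

```python
from itertools import product

# Hypothetical diversification set: interchangeable x86 types in two sizes.
instance_types = ["m6i.xlarge", "m6i.2xlarge", "m6a.xlarge", "m6a.2xlarge", "m7i.xlarge"]
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

# One pool per (instance type, Availability Zone) pair.
pools = list(product(instance_types, zones))


def is_diversified(pool_count: int, target: int = 10) -> bool:
    """True when the fleet can draw from at least `target` pools."""
    return pool_count >= target


print(len(pools), is_diversified(len(pools)))
```

Five types across three zones already gives 15 pools. Pinning to one type in one zone gives exactly one.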

A note on mixing architectures: Graviton (ARM) Spot pools are frequently the deepest and cheapest of all, because Graviton supply is plentiful and demand is still catching up. If your workload runs on ARM (or you can build multi-arch images), adding Graviton instance types to your diversification set both lowers cost and improves Spot availability. That is a direct overlap with the Graviton migration lever covered elsewhere in this cluster.

how to deploy it

IV. Spot in Auto Scaling Groups and EKS (Karpenter)

You almost never request raw Spot instances by hand in production. You let an orchestrator manage the fleet, request Spot capacity, react to interruptions, and replace evicted instances. The two orchestrators that matter for most teams are Auto Scaling Groups (for general EC2 fleets) and Karpenter (for EKS/Kubernetes). Both implement capacity-optimized allocation and diversification natively.

Auto Scaling Groups with a mixed instances policy. A modern ASG does not have to be single-instance-type or single-purchase-option. A mixed instances policy lets one ASG span many instance types and split capacity between On-Demand and Spot with explicit knobs:

  • On-Demand base capacity — a fixed floor of On-Demand instances that always exist regardless of Spot availability. This is your "never lose this" baseline (e.g., enough to serve minimum traffic).
  • On-Demand percentage above base — for capacity beyond the base, what fraction is On-Demand vs Spot. Set, say, 20% On-Demand / 80% Spot so the elastic layer is mostly cheap Spot but every scale-out still adds a little stable capacity.
  • Spot allocation strategy — set to capacity-optimized or price-capacity-optimized (per the previous section).
  • Instance type list (or attribute-based selection) — the diversified set of pools the ASG may draw from across all AZs.
  • Capacity Rebalancing — when enabled, the ASG proactively launches a replacement when AWS signals an instance is at elevated risk of interruption, before the 2-minute notice even fires, so the new instance is warm by the time the old one is reclaimed.
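The knobs above map onto the `MixedInstancesPolicy` argument of the Auto Scaling `CreateAutoScalingGroup` call. The sketch below builds that structure as a plain dict and estimates the resulting On-Demand/Spot split. The template name `web-tier` and the instance list are hypothetical, and the round-up-toward-On-Demand behavior in `split` is an assumption to verify against the Auto Scaling documentation.

```python
import math


def mixed_instances_policy(template_name: str, instance_types: list[str],
                           od_base: int, od_pct_above_base: int) -> dict:
    """Build a MixedInstancesPolicy dict for an Auto Scaling group."""
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": template_name,
                "Version": "$Latest",
            },
            # Each override adds one instance type the group may launch.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": od_base,
            "OnDemandPercentageAboveBaseCapacity": od_pct_above_base,
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    }


def split(desired: int, od_base: int, od_pct_above_base: int) -> tuple[int, int]:
    """Return (on_demand, spot) instance counts for a desired capacity."""
    above = max(0, desired - od_base)
    # Assumption: fractional results round up in favor of On-Demand.
    od_above = math.ceil(above * od_pct_above_base / 100)
    on_demand = min(desired, od_base) + od_above
    return on_demand, desired - on_demand


policy = mixed_instances_policy("web-tier", ["m6i.xlarge", "m6a.xlarge", "m7i.xlarge"], 2, 20)
print(split(desired=12, od_base=2, od_pct_above_base=20))
```

Capacity Rebalancing is not part of this structure; it is a separate group-level setting (`CapacityRebalance`) passed alongside the policy.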

EKS: Karpenter is the modern answer

On EKS, you can run Spot through managed node groups (which use ASGs under the hood) — but Karpenter has become the preferred approach for Spot-heavy clusters. Karpenter is a Kubernetes-native autoscaler that watches for unschedulable pods and provisions exactly the right nodes for them in seconds, choosing instance types and purchase options directly against the EC2 fleet API.

Why Karpenter and Spot fit so well: in a NodePool you can declare that Karpenter may use Spot (and On-Demand as fallback), allow a wide, diversified set of instance families and sizes, and let it pick deep-capacity pools automatically (it requests Spot with the price-capacity-optimized strategy). When a pod cannot be placed on Spot, Karpenter can fall back to On-Demand, then consolidate back to Spot when capacity returns. It also does workload consolidation (bin-packing pods onto fewer nodes and terminating the emptied ones), which compounds the Spot discount with higher utilization.
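For orientation, a NodePool that allows both capacity types might look like the manifest built below. It is generated from Python only to keep this guide's examples in one language. The field names reflect the Karpenter `karpenter.sh/v1` NodePool schema as best understood here and should be checked against the version you have installed. The pool name `stateless-spot`, the node class `default`, and the chosen instance categories are hypothetical.

```python
import json


def spot_nodepool(name: str, node_class: str) -> dict:
    """A NodePool manifest allowing Spot and On-Demand across a wide instance set."""
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": node_class,
                    },
                    "requirements": [
                        # Both capacity types allowed, so On-Demand is available as fallback.
                        {"key": "karpenter.sh/capacity-type", "operator": "In",
                         "values": ["spot", "on-demand"]},
                        # Wide family and architecture set = many Spot pools.
                        {"key": "karpenter.k8s.aws/instance-category", "operator": "In",
                         "values": ["c", "m", "r"]},
                        {"key": "kubernetes.io/arch", "operator": "In",
                         "values": ["amd64", "arm64"]},
                    ],
                }
            },
            "disruption": {
                "consolidationPolicy": "WhenEmptyOrUnderutilized",
                "consolidateAfter": "1m",
            },
        },
    }


manifest = spot_nodepool("stateless-spot", "default")
print(json.dumps(manifest, indent=2))  # kubectl accepts JSON manifests as well as YAML
```

Allowing `arm64` only makes sense if your images are multi-arch; drop it from the list otherwise.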

Interruption handling in Karpenter: Karpenter watches the EC2 interruption and rebalance signals (via an SQS queue), cordons and drains the node on the 2-minute notice, and provisions a replacement so pods reschedule cleanly. Pair this with Pod Disruption Budgets (so you never drain too many replicas of one service at once) and sane `terminationGracePeriodSeconds`, and Spot interruptions become routine background events rather than incidents.

The standard EKS pattern: a small On-Demand node pool for the control-plane-adjacent and stateful pods you cannot lose (ingress controllers, stateful sets, single-replica system services), and a large Spot node pool for the stateless application workloads that make up the bulk of the cluster. That single split frequently takes 50–70% off an EKS compute bill.

graceful failure

V. Handling interruptions gracefully: the 2-minute drill

Reliable Spot is not about avoiding every interruption — it is about making each interruption boring. When AWS decides to reclaim an instance, it gives you three signals and a 2-minute window. A workload that listens for those signals and drains cleanly within the window experiences interruptions as a non-event. A workload that ignores them experiences them as dropped requests and lost jobs.

There are three mechanisms to listen for, ordered from the earliest warning to the final deadline:

  • Rebalance Recommendation (earliest warning) — An EventBridge / instance-metadata signal that fires when AWS judges an instance is at elevated risk of interruption — often well before the 2-minute notice. It is your cue to proactively start draining and bring up a replacement while you still have plenty of time. Auto Scaling Capacity Rebalancing and Karpenter both act on this automatically.
  • Spot Instance Interruption Notice (the 2-minute warning) — Delivered via instance metadata (the `instance-action` endpoint) and EventBridge. This is the hard 2-minute countdown to termination. Your shutdown logic must fit inside it: stop taking new work, finish or checkpoint in-flight work, deregister from the load balancer, flush logs, and exit.
  • EC2 instance termination (the deadline) — After the 2 minutes, the instance is reclaimed (stopped or terminated depending on your request type). Anything not drained or checkpointed by then is lost — which is exactly why the previous two signals exist.
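A minimal sketch of listening for the 2-minute notice from inside the instance, using only the Python standard library. It assumes IMDSv2 (a session token fetched with a PUT, then sent as a header) and the `spot/instance-action` metadata path, which returns 404 until a notice exists. The sample document and timestamps at the bottom are hypothetical, and `fetch_instance_action` only works when run on an EC2 instance.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Optional

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout: float = 1.0) -> Optional[str]:
    """Return the raw instance-action document, or None if no interruption is scheduled."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token", method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"})
    token = urllib.request.urlopen(token_req, timeout=timeout).read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token})
    try:
        return urllib.request.urlopen(req, timeout=timeout).read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # path is absent until a notice is issued
            return None
        raise


def seconds_until(document: str, now: datetime) -> float:
    """Seconds left before the action time in an instance-action document."""
    action = json.loads(document)
    deadline = datetime.strptime(action["time"], "%Y-%m-%dT%H:%M:%SZ")
    return (deadline.replace(tzinfo=timezone.utc) - now).total_seconds()


sample = '{"action": "terminate", "time": "2026-01-15T08:22:00Z"}'  # hypothetical notice
now = datetime(2026, 1, 15, 8, 20, 0, tzinfo=timezone.utc)
print(seconds_until(sample, now))
```

In practice you would call `fetch_instance_action` every few seconds from a sidecar or background thread and trigger your drain routine the first time it returns a document. Fleets managed by an ASG or Karpenter usually consume the same signal through EventBridge instead.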

What "draining cleanly" looks like in practice depends on the workload. For a web tier: catch the notice, deregister the target from the ALB/NLB target group so the load balancer stops routing to it, let in-flight requests complete (connection draining), then terminate. For a queue worker: stop pulling new messages, finish (or return-to-queue) the current message, then exit — the message's visibility timeout means anything unfinished simply reappears for another worker. For a batch / training job: write a checkpoint to S3 on the notice so the job resumes from the last checkpoint elsewhere. For Kubernetes: the node is cordoned and drained, pods get their graceful termination period, and the scheduler places them on surviving or new nodes.
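The queue-worker drain can be sketched in a few lines. This example uses an in-process `queue.Queue` as a stand-in for SQS and a `threading.Event` as a stand-in for the interruption signal; both are simplifications. The worker checks the flag only between jobs, so the in-flight job always finishes and unpulled jobs stay queued for another worker.

```python
import queue
import threading


def run_worker(jobs: queue.Queue, stop: threading.Event, handle) -> int:
    """Process jobs until the queue is empty or `stop` is set; return jobs completed."""
    done = 0
    while not stop.is_set():      # checked before each pull, never mid-job
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            break
        handle(job)               # the in-flight job runs to completion
        done += 1
    return done


jobs: queue.Queue = queue.Queue()
for n in range(5):
    jobs.put(n)

stop = threading.Event()
processed = []


def handle(job: int) -> None:
    processed.append(job)
    if job == 1:                  # simulate the interruption notice arriving mid-run
        stop.set()


completed = run_worker(jobs, stop, handle)
print(completed, jobs.qsize())    # unpulled jobs stay queued for another worker
```

With a real SQS consumer the same shape applies: stop calling receive once the flag is set, and rely on the visibility timeout to return anything you could not finish.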

A practical robustness tip: keep your graceful-shutdown work comfortably under 120 seconds. If connection draining on a web instance takes 90 seconds, you have only 30 seconds of slack for everything else, so tune request timeouts and drain windows so you never run past the deadline. And always design for the case where the notice does not arrive at all (rare, but possible under sudden hardware failure): idempotent jobs, replicated state, and health-check-driven replacement mean even a no-notice loss is survivable.
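The slack calculation is easy to encode as a check in a deploy pipeline or unit test. The step names and durations below are hypothetical; the only fixed number is the 120-second notice.

```python
NOTICE_SECONDS = 120


def drain_slack(steps: dict[str, int], notice: int = NOTICE_SECONDS) -> int:
    """Seconds of headroom left after running every shutdown step back to back."""
    return notice - sum(steps.values())


# Hypothetical step durations for a web instance.
steps = {"deregister_from_target_group": 5, "connection_draining": 90, "flush_logs": 10}
slack = drain_slack(steps)
print(slack)
assert slack > 0, "shutdown sequence would overrun the 2-minute notice"
```

Here the three steps total 105 seconds, leaving 15 seconds of headroom. If the slack ever goes negative, shorten the drain window before shipping.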

the mature pattern

VI. Spot + On-Demand + Savings Plans: the blended strategy

All-Spot is a beginner's framing. The mature compute-cost strategy layers three purchase options so each covers the part of the workload it suits best: Savings Plans for the predictable baseline you commit to, On-Demand for the unpredictable spikes you cannot pre-commit and cannot risk on Spot, and Spot for the elastic, interruption-tolerant bulk. Done right, the blended bill is dramatically lower than any single option while staying as reliable as your most stable layer.

Think of your compute demand as three layers stacked on top of each other:

Layer 1: the committed baseline (Savings Plans / Reserved capacity). The floor of compute you run 24/7 and are confident you will keep running for 1–3 years. Cover it with a Compute Savings Plan (flexible across EC2, Fargate, and Lambda) or, for a deeper discount on a stable shape, an EC2 Instance Savings Plan. Savings Plans give up to ~70%+ off On-Demand in exchange for an hourly spend commitment. Critically, they are a billing construct, not a capacity guarantee: the discount lands on On-Demand usage, and On-Demand instances are not reclaimed the way Spot instances are. This is the layer you never want on Spot. (See the Savings Plans page in this cluster for the full commitment math.)

Layer 2: the interruption-tolerant elastic layer (Spot). The large, variable portion of demand that scales with traffic, jobs, or batch volume and tolerates a 2-minute eviction. This is where Spot earns its 70–90% discount. The bigger and more stateless this layer is, the more Spot saves you.

Layer 3: the safety valve (On-Demand). The thin top slice for sudden spikes you did not commit to and cannot safely place on Spot, plus the On-Demand fallback when a Spot pool momentarily cannot fulfill capacity. You pay full price here, but it is a small fraction of total hours, and it keeps you scaling when Spot cannot.

The reason commitments and Spot are complementary rather than competing: a Savings Plan discounts eligible On-Demand usage only, so it lands on your baseline and never on Spot, which is billed at the Spot price regardless. Spot typically prices below the Savings-Plan-discounted rate, so you do not "waste" a Savings Plan by also running Spot: the commitment soaks up the baseline, Spot handles the elastic surplus more cheaply still, and On-Demand catches the rest. The honest tradeoff to keep in view: commitments reduce flexibility (you owe the hourly spend for the full term whether or not you use it), so you size the Savings Plan to the baseline you are genuinely confident about and let Spot + On-Demand flex above it.

a representative blend

A common mature split for a scaled, stateless-heavy workload: ~30–40% Savings-Plan-covered baseline · ~50–60% Spot · ~10% On-Demand spillover. That mix routinely lands a blended compute rate 50–65% below straight On-Demand while the reliability ceiling stays at the level of the Savings-Plan/On-Demand baseline — Spot interruptions only ever touch the elastic layer.
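The blended rate is a weighted average, and it helps to run the numbers for your own mix. In the sketch below the layer shares come from the split above, while the per-layer discounts (40% for the Savings Plan, 80% for Spot) are illustrative assumptions, not quoted rates.

```python
def blended_discount(mix: dict[str, tuple[float, float]]) -> float:
    """Percent below straight On-Demand for a mix of {layer: (share, discount)}."""
    total_share = sum(share for share, _ in mix.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("layer shares must sum to 1.0")
    # Each layer costs its share of hours times the fraction of On-Demand it pays.
    cost = sum(share * (1 - discount) for share, discount in mix.values())
    return round((1 - cost) * 100, 1)


# Shares follow the representative blend; discounts are assumptions.
mix = {
    "savings_plan": (0.35, 0.40),
    "spot": (0.55, 0.80),
    "on_demand": (0.10, 0.00),
}
print(blended_discount(mix))
```

Under these assumptions the blend comes out 58% below On-Demand, inside the 50–65% range. Swapping in your actual Savings Plan rate and observed Spot discount shows quickly how sensitive the result is to the Spot share.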

serverless containers

VII. Spot for Fargate: Spot economics without managing instances

You do not need to run EC2 yourself to get Spot pricing. Fargate Spot brings the same spare-capacity discount to serverless containers — you specify a task and AWS runs it on spare capacity at a steep discount, with the same 2-minute interruption notice, and you never touch an instance, an AMI, or a node group.

On Amazon ECS, a service or task can run on the FARGATE_SPOT capacity provider instead of (or blended with) the regular FARGATE provider. Fargate Spot tasks cost substantially less than standard Fargate, commonly around 50–70% off, and behave like Spot in every other respect: when AWS needs the capacity back, it sends a SIGTERM plus a task-state change event, with roughly 2 minutes for the task to shut down gracefully before SIGKILL. To actually get that window, raise the container's `stopTimeout` toward its 120-second maximum; the default is 30 seconds. The fit rules are identical to EC2 Spot: stateless services behind a load balancer, queue workers, and batch and scheduled tasks, yes; the single stateful task you cannot lose, no.

The clean pattern on ECS mirrors the EC2 blend: define a capacity provider strategy that places a base number of tasks on regular FARGATE (your always-on floor) and the remainder on FARGATE_SPOT (the elastic, interruptible bulk), with a weight ratio controlling the split. You get most of the cost savings of Spot with the operational simplicity of Fargate: no instance fleet, no patching, no Karpenter to operate. For teams without the appetite to run and tune EC2 Spot fleets, Fargate Spot is frequently the highest return-on-effort Spot win available, because the orchestration is entirely AWS-managed.
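A sketch of that base-plus-weight strategy, using the `capacityProvider`, `base`, and `weight` keys that an ECS service's capacity provider strategy takes. The `expected_split` helper is a hypothetical approximation for planning purposes; it does not model how ECS rounds uneven splits.

```python
def capacity_provider_strategy(base_on_demand: int, spot_weight: int,
                               od_weight: int = 1) -> list[dict]:
    """Strategy list: a FARGATE floor plus a weighted FARGATE / FARGATE_SPOT split."""
    return [
        {"capacityProvider": "FARGATE", "base": base_on_demand, "weight": od_weight},
        {"capacityProvider": "FARGATE_SPOT", "weight": spot_weight},
    ]


def expected_split(strategy: list[dict], desired: int) -> dict[str, int]:
    """Approximate task placement: fill `base` first, then divide the rest by weight."""
    counts = {s["capacityProvider"]: min(desired, s.get("base", 0)) for s in strategy}
    remaining = desired - sum(counts.values())
    total_weight = sum(s["weight"] for s in strategy)
    for s in strategy:
        # Integer division; ECS's exact rounding of uneven splits is not modeled here.
        counts[s["capacityProvider"]] += remaining * s["weight"] // total_weight
    return counts


strategy = capacity_provider_strategy(base_on_demand=2, spot_weight=3)
print(expected_split(strategy, desired=10))
```

With a base of 2 and a 1:3 weight ratio, ten tasks land as 4 on FARGATE and 6 on FARGATE_SPOT: the floor survives any Spot reclamation, and the majority of tasks still run at the Spot rate.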

EKS users can also schedule pods onto Fargate via Fargate profiles, though Fargate-on-EKS does not currently offer a Spot price tier the way ECS does — for Spot economics on EKS, the EC2-backed Karpenter/managed-node-group path covered above is the route. The decision tree is simple: containers on ECS → Fargate Spot for the easy win; containers on EKS that need Spot → EC2 node pools via Karpenter; raw VM fleets → ASG mixed instances policy.

purchase options compared

VIII. Spot vs On-Demand vs Savings Plans, side by side

These three are not competitors to choose between — they are layers to combine. But to combine them well you need to see clearly what each one trades. The table below is the at-a-glance version; the section above explains how they stack.

EC2 Spot vs On-Demand vs Savings Plans · 2026 (representative — verify live rates in Cost Explorer / Spot price history)
Dimension | Spot | On-Demand | Savings Plans
Discount vs On-Demand | Up to ~90% (typ. 70–90%) | Baseline (0%) | Up to ~70%+ (commit-based)
Commitment | None | None | 1 or 3 yr hourly $ commit
Can be interrupted? | Yes (2-min notice) | No | No (it is a billing discount, not capacity)
Flexibility | High (no lock-in) | Highest (pay-as-you-go) | Lower (locked for the term)
Capacity guaranteed? | No | Yes, once launched | No (a discount on On-Demand usage; does not apply to Spot)
Best for | Stateless / batch / CI / k8s nodes / ML training | Spiky, uncommitted, can't-lose-and-can't-Spot | The predictable 24/7 baseline
Reliability ceiling | Elastic layer only | Full | Full
Right layer in the stack | Elastic surplus | Spillover / safety valve | Committed floor
The blended strategy uses all three: Savings Plans cover the baseline you commit to, Spot handles the interruption-tolerant elastic bulk at the deepest discount, and On-Demand catches spikes and Spot-capacity shortfalls. A blended rate 50–65% below straight On-Demand is a realistic target for a stateless-heavy, scaled workload.
getting it done

IX. Why a partner architects Spot adoption (often AWS-funded)

Spot is high-leverage but it is not set-and-forget. Choosing the diversification set, wiring capacity-optimized allocation, building interruption handling into every workload, configuring Karpenter or the ASG mixed instances policy correctly, and sizing the Savings Plan baseline so commitments and Spot complement rather than collide — that is a real engineering project. It is also exactly the kind of work AWS will frequently fund.

AWS funds partner-led cost-optimization and Well-Architected engagements through its partner programs: the partner is paid through AWS, and a Well-Architected Review (the Cost Optimization pillar in particular) can unlock remediation credits that offset the rework. For qualifying, credit-eligible engagements, that means a vetted partner can architect your Spot adoption, implement the interruption handling and orchestration, and set up the Savings-Plan baseline, and you cut your bill for $0. The honest framing: AWS funding applies to qualifying engagements; where it does not, it is still a vetted-partner referral that pays for itself many times over in the savings, since a 50–65% cut on compute spend dwarfs the cost of the work.

CloudRoute's role is the routing layer: you tell us your stack and your bill, and we match you to an AWS partner with a real Spot / FinOps track record — not a generalist. They run the audit (Compute Optimizer, Cost Explorer, the Spot placement scores, your interruption tolerance per workload), produce the blended-strategy plan, and do the rework. CloudRoute is paid a commission by the partner; you are not in that payment loop. The same engagement typically folds in the rest of this cluster's levers — right-sizing, Graviton migration, Savings Plans — because a competent cost review never optimizes compute in isolation.

where Spot fits

When Spot is the right tool, and when to reach for another lever

Spot is one lever among several in a cost-optimization program. It is the biggest single cut available on the compute line for interruption-tolerant workloads, but it does not touch storage, data transfer, or the stable baseline. Here is how it slots against the neighboring levers in this cluster.

Situation | Reach for | Why | Typical saving
Large stateless / batch / CI / k8s compute layer | Spot Instances | Interruption-tolerant work gets the deepest discount available | 70–90% on that layer
Predictable 24/7 baseline you'll keep for 1–3 yrs | Savings Plans | Commitment beats Spot for must-stay-up capacity | Up to ~70%+
Over-provisioned instances (low CPU/mem utilization) | Right-sizing (Compute Optimizer) | Stop paying for capacity you never use, before discounting it | 20–50% on right-sized fleet
Workload runs (or could run) on ARM | Graviton migration | Better price-performance; Graviton Spot pools are deep + cheap | ~20–40% price-perf
Whole AWS bill is a mystery / no governance | A partner-led cost audit | Find every lever (incl. storage + data transfer) at once | Compounding
These stack. The standard sequence: right-size first (don't discount waste), then commit the baseline with Savings Plans, then move the elastic surplus to Spot, and layer Graviton across both. A partner-led audit — often AWS-funded — sequences them for your specific bill.
a recent match

A Spot-led compute cut — anonymized

inquiry · seed-stage B2B SaaS, EKS-heavy, EU-Central
Seed-stage B2B SaaS, ~22 engineers, ~$31K/month AWS bill (≈$19K of it EKS compute), running everything On-Demand on EKS

Situation: Fast-growing product on a single large EKS cluster, all worker nodes On-Demand. Compute was the dominant line item and growing linearly with usage. The team knew Spot existed but had been burned once by a naive all-Spot experiment that took a stateless service down during a capacity crunch, so they'd sworn off it. No interruption handling, no Pod Disruption Budgets, no diversification — and no Savings Plan covering the baseline. Internal platform engineer was fully allocated to product work.

What CloudRoute did: Routed within 20 hours to an EU-Central AWS partner with EKS + FinOps + Karpenter track record. The partner ran a Well-Architected Cost Optimization review, then: (1) split the cluster into a small On-Demand node pool for ingress + stateful/system pods and a large Spot pool via Karpenter with capacity-optimized allocation across ~14 diversified instance types (including Graviton) and all AZs; (2) added interruption handling — SQS-driven drain, Pod Disruption Budgets on every service, tuned grace periods; (3) put a Compute Savings Plan under the ~35% always-on baseline. The Well-Architected review unlocked remediation credits that covered the rework.

Outcome: EKS compute moved to ~60% Spot / 10% On-Demand-elastic / ~30% Savings-Plan baseline. Monthly EKS compute dropped from ~$19K to ~$7.4K — a 61% cut — taking the total bill from ~$31K to ~$19.4K. Measured Spot interruption rate after diversification: ~2% of nodes/month, every one handled with zero user-visible impact. Because the engagement was AWS-funded via the Well-Architected remediation credits, the customer paid $0 for the work; CloudRoute's commission was paid by the partner.

engagement window: ~5 weeks · founder/engineer time: ~7 hours · monthly compute saved: ~$11.6K (61%) · interruption rate: ~2%/mo · cost to customer: $0

faq

Common questions

How much cheaper are EC2 Spot Instances, really?
The steady-state range is roughly 70–90% off the equivalent On-Demand rate, varying by instance family, region, and Availability Zone. Newer or less-contended pools — Graviton in particular — often sit at the deep end. Spot pricing is no longer an auction; it floats gradually with supply and demand and you pay the current price (you can set a max, but the right default is to cap it at On-Demand so you're only ever interrupted for capacity, not price). Always confirm the live number in the EC2 Spot price history or the describe-spot-price-history API — rates move.
What actually happens when AWS interrupts a Spot instance?
AWS sends a Spot interruption notice — delivered via instance metadata (the instance-action endpoint) and EventBridge — giving you a 2-minute countdown before the instance is reclaimed. Often you also get an earlier Rebalance Recommendation signal flagging elevated interruption risk before the 2 minutes even start. A well-built workload uses that window to stop taking new work, finish or checkpoint in-flight work, deregister from the load balancer, and exit gracefully — so the interruption is a non-event. If you ignore the signals, you drop whatever was in flight when the instance terminates.
Can I run production on Spot, or is it only for dev/test?
You can absolutely run production on Spot — companies run 60–90% Spot mixes in production every day. The requirement is workload fit, not environment: stateless web tiers behind a load balancer, batch/ETL, CI runners, big-data clusters, EKS worker nodes, and checkpointed ML training all run on Spot in production safely. The rule is the 2-minute litmus test: if an evicted instance's work survives automatically (via retry, re-queue, checkpoint, or a healthy peer), it belongs on Spot regardless of whether it's "production."
How do I keep Spot interruptions rare?
Two settings. First, use a capacity-optimized (or price-capacity-optimized) allocation strategy so AWS draws from the pools with the deepest spare capacity — not the cheapest, thinnest ones (avoid the lowest-price strategy). Second, diversify across many pools: a broad set of interchangeable instance types and sizes across every Availability Zone, ten or more pools as a target. With good diversification, the reclamation of any single pool is absorbed by the rest, and real-world interruption rates for well-built fleets commonly sit in the low single-digit percent per month. Auto Scaling mixed instances policies and Karpenter implement both for you.
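As a rough sketch of those two settings together, the helper below builds the `MixedInstancesPolicy` structure that the Auto Scaling `CreateAutoScalingGroup` API accepts and counts the resulting pools (instance type × Availability Zone). The launch template name, the instance-type list, and the On-Demand base of 2 are placeholder assumptions, not recommendations for your fleet.

```python
def mixed_instances_policy(launch_template_name, instance_types,
                           on_demand_base=2, on_demand_pct_above_base=0):
    """Build a MixedInstancesPolicy dict for an Auto Scaling group."""
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # One override per interchangeable instance type.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": on_demand_base,
            "OnDemandPercentageAboveBaseCapacity": on_demand_pct_above_base,
            # Favors deep pools while still weighing price; not lowest-price.
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    }


def pool_count(instance_types, availability_zones):
    """A Spot pool is one instance type in one Availability Zone."""
    return len(set(instance_types)) * len(set(availability_zones))


# Illustrative inputs: four similar-sized types across three zones.
types = ["m5.large", "m5a.large", "m6i.large", "m6a.large"]
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
policy = mixed_instances_policy("web-tier-template", types)
pools = pool_count(types, zones)  # 12 pools, above the ten-pool target
```

With boto3, the resulting dict would be passed as `MixedInstancesPolicy=policy` to the autoscaling client's `create_auto_scaling_group` call, alongside the subnets that define the zones.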
Should I use Spot or Savings Plans?
Both. They cover different layers and are complementary, not competing. Cover your predictable 24/7 baseline (the capacity you'll keep running for 1–3 years and cannot lose) with a Savings Plan for up to roughly 72% off, depending on plan type and term. That layer must never be on Spot: a Savings Plan is a billing discount applied to On-Demand usage, which AWS does not reclaim, whereas Spot capacity can be interrupted. Run the interruption-tolerant elastic surplus on Spot for the deepest discount. Catch spikes and Spot-capacity shortfalls with On-Demand. You don't waste a Savings Plan by also running Spot: the commitment soaks up the baseline, and Spot handles the elastic bulk more cheaply still.
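The layering is easy to sanity-check with arithmetic. The sketch below computes the blended cost of a fleet as a fraction of what it would cost entirely On-Demand. The shares mirror the 30% baseline / 60% Spot / 10% On-Demand split from the case study; the discount rates (50% for the Savings Plan layer, 75% for Spot) are assumptions chosen for illustration, so substitute your own.

```python
def blended_cost_fraction(layers):
    """Return blended cost as a fraction of the all-On-Demand cost.

    layers maps a layer name to (share_of_compute, discount_vs_on_demand),
    with both values expressed as fractions between 0 and 1.
    """
    total_share = sum(share for share, _ in layers.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("layer shares must sum to 1.0")
    return sum(share * (1 - discount) for share, discount in layers.values())


# Assumed discount rates; only the shares come from the case study above.
layers = {
    "savings_plan_baseline": (0.30, 0.50),
    "spot_elastic": (0.60, 0.75),
    "on_demand_burst": (0.10, 0.00),
}
fraction = blended_cost_fraction(layers)  # about 0.40 of the On-Demand cost
```

Under these assumed rates the fleet costs about 40% of its all-On-Demand price. Note that Spot contributes the same dollars as the baseline layer while carrying twice the capacity.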
Does Spot work with EKS / Kubernetes?
Yes — EKS on Spot is one of the highest-leverage Spot wins available, because Kubernetes already treats nodes as disposable. The modern approach is Karpenter: it provisions diversified, capacity-optimized Spot nodes in seconds for unschedulable pods, falls back to On-Demand when needed, consolidates pods onto fewer nodes, and handles interruptions (cordon, drain, replace) via an SQS queue. Pair it with Pod Disruption Budgets and a small On-Demand node pool for stateful/system pods. The standard stateless-on-Spot / stateful-on-On-Demand split frequently takes 50–70% off an EKS compute bill.
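For orientation, here is roughly what a Spot-first Karpenter NodePool looks like, built as a Python dict and printed as JSON (which `kubectl apply -f` accepts as well as YAML). It assumes the `karpenter.sh/v1` API and an existing EC2NodeClass named `default`; the pool name, instance categories, and consolidation timing are illustrative, so check the fields against the Karpenter version you run.

```python
import json


def spot_node_pool(name="spot-general", node_class="default"):
    """Build a NodePool manifest that allows Spot with On-Demand fallback."""
    return {
        "apiVersion": "karpenter.sh/v1",  # assumed API version
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": node_class,  # assumed to exist already
                    },
                    "requirements": [
                        # Allowing both lets Karpenter fall back to On-Demand.
                        {"key": "karpenter.sh/capacity-type",
                         "operator": "In", "values": ["spot", "on-demand"]},
                        # Broad families and both architectures widen the pool set.
                        {"key": "karpenter.k8s.aws/instance-category",
                         "operator": "In", "values": ["c", "m", "r"]},
                        {"key": "kubernetes.io/arch",
                         "operator": "In", "values": ["amd64", "arm64"]},
                    ],
                }
            },
            "disruption": {
                "consolidationPolicy": "WhenEmptyOrUnderutilized",
                "consolidateAfter": "1m",
            },
        },
    }


manifest = spot_node_pool()
print(json.dumps(manifest, indent=2))
```

Keeping the requirements loose is the point: the wider the set of acceptable instance types, the more pools Karpenter can choose from when one is reclaimed. Stateful and system pods would go to a separate NodePool restricted to `on-demand`.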
Is there a Spot option for Fargate / serverless containers?
Yes: Fargate Spot. On ECS you run tasks on the FARGATE_SPOT capacity provider for roughly 50–70% off standard Fargate, with the same 2-minute interruption warning and the same fit rules: stateless services, queue workers, and batch belong there; the one stateful task you can't lose does not. The warning arrives as a task state change event and a SIGTERM to the task, and SIGKILL follows once the container's stopTimeout expires (30 seconds by default, configurable up to 120), so raise stopTimeout if you need the full window. A capacity provider strategy lets you place a base of tasks on regular Fargate and the rest on Fargate Spot. It's often the highest return-on-effort Spot win because the orchestration is fully AWS-managed, with no instance fleet to run. Note: Fargate-on-EKS doesn't currently have a Spot price tier; for Spot on EKS, use EC2 node pools via Karpenter.
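To see how a capacity provider strategy splits tasks, the sketch below approximates the base-then-weight behavior: the `base` count is placed first, and the remainder is divided in proportion to `weight`. The strategy items use the same field names as the ECS API (`capacityProvider`, `base`, `weight`), but the rounding rule here is an assumption for illustration and ECS's actual placement may differ by a task or so.

```python
def place_tasks(desired_count, strategy):
    """Approximate how ECS distributes tasks across capacity providers."""
    placed = {item["capacityProvider"]: 0 for item in strategy}
    remaining = desired_count

    # Step 1: satisfy the base (ECS allows a base on only one provider).
    for item in strategy:
        take = min(item.get("base", 0), remaining)
        placed[item["capacityProvider"]] += take
        remaining -= take

    # Step 2: split what is left in proportion to the weights.
    total_weight = sum(item.get("weight", 0) for item in strategy)
    if remaining and not total_weight:
        raise ValueError("tasks remain but no provider has a weight")
    if remaining:
        leftover = remaining
        for item in strategy:
            share = remaining * item.get("weight", 0) // total_weight
            placed[item["capacityProvider"]] += share
            leftover -= share
        # Assumed tie-break: rounding leftovers go to the heaviest provider.
        heaviest = max(strategy, key=lambda i: i.get("weight", 0))
        placed[heaviest["capacityProvider"]] += leftover
    return placed


strategy = [
    {"capacityProvider": "FARGATE", "base": 2, "weight": 1},
    {"capacityProvider": "FARGATE_SPOT", "weight": 3},
]
placement = place_tasks(10, strategy)
```

For 10 tasks this puts 4 on regular Fargate (2 base plus a quarter of the remaining 8) and 6 on Fargate Spot. The base is the floor you keep if Spot capacity disappears entirely.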
What does it cost to have a partner set Spot up, and is it really AWS-funded?
For qualifying, credit-eligible engagements, the answer is $0 to you. AWS funds partner-led cost-optimization and Well-Architected work through its partner programs, and a Well-Architected Cost Optimization review can unlock remediation credits that offset the rework — so a vetted partner architects your Spot adoption (diversification, allocation, interruption handling, Karpenter/ASG config, and the Savings Plan baseline) and you cut your bill for free. The honest caveat: AWS-funding applies to qualifying engagements; where it doesn't, it's still a vetted-partner referral that pays for itself many times over — a 50–65% compute cut dwarfs the cost of the work. CloudRoute matches you to the partner and is paid a commission by them, not by you.

Cut your compute bill 50–65% with a Spot strategy that doesn't break production.

CloudRoute routes you to a vetted AWS partner with real Spot + FinOps + Karpenter track record. They architect the diversification, allocation, interruption handling, and Savings Plan baseline — often AWS-funded, so you pay $0 on qualifying work.

matched within: < 24h
blended compute cut: 50–65%
cost to you: $0*