eks production readiness · 2026 checklist

The Amazon EKS production-readiness checklist — everything to verify before you go live (2026).

A cluster that passes `kubectl get nodes` is not a production cluster. This is the reference checklist EKS teams run before going live: account and cluster topology, VPC CNI and IP exhaustion, IRSA vs Pod Identity, the AWS Load Balancer Controller, Karpenter autoscaling, observability, security hardening, cost controls, Velero backups, and a repeatable upgrade strategy — each as a concrete, verifiable check.

checklist domains: 11
individual checks: 60+
top go-live failure: IP exhaustion
standard support per EKS version: 14 months
TL;DR
  • Production-readiness for EKS is not one decision — it is eleven domains: account/cluster design, networking, identity (IRSA/Pod Identity), ingress, autoscaling, observability, security, cost, DR/backups, upgrades, and a documented runbook. A cluster is "ready" only when every domain has an owner and a verifiable check, not when the control plane reports healthy.
  • The three failures that take real clusters down in their first quarter are almost always the same: VPC CNI IP exhaustion (the subnet runs out of addresses and pods stop scheduling), an unmanaged upgrade that strands a deprecated API, and missing pod-level IAM (long-lived keys in Secrets instead of IRSA or EKS Pod Identity). Each is preventable with a pre-launch check.
  • In 2026 the defaults have moved: Karpenter is the mainstream autoscaler (not Cluster Autoscaler), EKS Pod Identity is the simpler path to pod-level IAM (IRSA still valid and often required for cross-account), EKS Auto Mode removes node-management toil for teams that want it, and each Kubernetes version gets ~14 months of standard support followed by up to 12 months of extended support. Extended support bills the control plane at roughly 6x the standard hourly rate, so an upgrade cadence is a cost control, not just hygiene.
domain 1 — foundations

I. What "production-ready" means, and getting the blast radius right first

EKS gives you a managed Kubernetes control plane with a 99.95% SLA. That is the floor, not the finish line. AWS runs the API server, etcd, and the scheduler; everything that determines whether your workloads survive a bad deploy, a node failure, an AZ event, or a Tuesday-afternoon traffic spike is yours under the shared-responsibility model. A useful working definition: a cluster is production-ready when, for every domain below, there is a deliberate decision on record, a configuration that implements it, and a way to verify it is still true next month.

The reason to be strict is that EKS failures are rarely loud at install time. The cluster comes up, the demo works, traffic is low, and everything looks fine for weeks. Then the subnet runs out of IPs, or a node terminates and nothing reschedules, or a Kubernetes version reaches end of support and the bill jumps. The defects were present on day one; they only surfaced under load or over time. So the most expensive EKS mistakes are architectural — decided before the first node launches and painful to reverse. Account boundaries, cluster count, and node strategy set the blast radius for everything that follows.

Start with account isolation. Production should live in its own AWS account, separate from staging and dev, under AWS Organizations with SCPs (service control policies) constraining what can happen there. This is the cleanest blast-radius boundary AWS offers — a compromised dev credential cannot touch production if they are in different accounts. A landing zone (AWS Control Tower or a Terraform equivalent) gives you this structure repeatably.

Then decide cluster count. The tradeoff: one large multi-tenant cluster is cheaper (the control plane is $0.10/hour — about $73/month — so ten clusters is ~$730/month before nodes) and simpler to operate, but blast radius is wider and namespace isolation is softer than account-level. Cluster-per-environment (prod/staging/dev) is the near-universal baseline; per-team or per-tenant clusters are a deliberate choice when compliance or noisy-neighbor risk justifies the extra operational surface.

For nodes, 2026 gives three models. EKS Auto Mode hands provisioning, patching, and scaling to AWS (Karpenter under the hood): lowest toil, ideal for teams without a platform engineer. Managed node groups give you EC2 nodes whose lifecycle AWS helps manage, with more control over instance types and AMIs. Fargate runs pods with no nodes to manage, billed per second for the vCPU and memory each pod requests: great for spiky or isolated workloads, pricier for steady compute, and with real constraints (no DaemonSets, no privileged pods, no GPUs).

Control-plane config matters too. Enable control-plane logging (API server, audit, authenticator, controller manager, scheduler) to CloudWatch on day one — it is off by default and you cannot retroactively get logs for an incident that already happened. And decide endpoint access: public, private, or both, with sensitive clusters using a private endpoint restricted to your VPC and CI/CD network.
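One way the logging and endpoint checks could be verified is sketched below. It assumes a dict shaped like the `cluster` object returned by the EKS DescribeCluster API (for example via boto3's `describe_cluster`); the function name and the finding messages are illustrative, not part of any AWS tooling.

```python
# Sketch: verify control-plane logging and endpoint exposure from a
# DescribeCluster-shaped dict. Function name and messages are hypothetical.
REQUIRED_LOG_TYPES = {"api", "audit", "authenticator", "controllerManager", "scheduler"}


def control_plane_findings(cluster: dict) -> list[str]:
    """Return human-readable findings; an empty list means both checks pass."""
    findings = []

    # Collect every log type that is switched on.
    enabled = set()
    for entry in cluster.get("logging", {}).get("clusterLogging", []):
        if entry.get("enabled"):
            enabled.update(entry.get("types", []))
    missing = sorted(REQUIRED_LOG_TYPES - enabled)
    if missing:
        findings.append("control-plane logging disabled for: " + ", ".join(missing))

    # A public endpoint with no CIDR allow-list is reachable from anywhere.
    vpc = cluster.get("resourcesVpcConfig", {})
    if vpc.get("endpointPublicAccess") and "0.0.0.0/0" in vpc.get("publicAccessCidrs", []):
        findings.append("public API endpoint is open to 0.0.0.0/0")
    return findings
```

Run against every cluster on a schedule, a non-empty result becomes the "verify it is still true next month" half of the check.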

  • Production is its own AWS account — separated from staging/dev under Organizations, with SCPs constraining the blast radius. A landing zone makes this repeatable.
  • Cluster-per-environment is the baseline — one control plane each for prod/staging/dev (~$73/mo each). Per-team or per-tenant clusters only when isolation or compliance demands it.
  • Node strategy chosen deliberately: Auto Mode (lowest toil), managed node groups (more control), or Fargate (no nodes, per-pod billing), with the reasoning documented.
  • Control-plane logging enabled — API, audit, authenticator, controllerManager, scheduler → CloudWatch. Off by default; you cannot recover logs after the fact.
  • API endpoint access decided — public, private, or mixed — private endpoint for sensitive production, with allow-listed access from VPC + CI/CD.
the shared-responsibility line

AWS owns the EKS control plane (API server, etcd, scheduler) and its 99.95% SLA. You own node configuration, networking, identity, workload resilience, security posture, cost, and upgrades. Most "EKS outages" are customer-side configuration gaps, not control-plane failures — which is exactly why a pre-launch checklist pays for itself.

domain 2 — networking

II. Networking and the IP exhaustion trap

This is the single most common way a healthy-looking EKS cluster falls over in production. The Amazon VPC CNI gives every pod a real, routable VPC IP address — which is wonderful for native AWS networking and catastrophic if you size your subnets like a traditional application tier.

The mechanic: with the default VPC CNI, every pod consumes one IP from your subnet, and each EC2 node pre-allocates a warm pool of IPs (governed by the instance type's ENI and IP-per-ENI limits) so pods start instantly. A handful of medium nodes running a few hundred pods can silently consume hundreds of addresses. A /24 has 251 usable IPs. Put a production cluster in a /24 and you hit the wall — `FailedScheduling`, pods stuck `Pending`, an incident that looks like a scheduler bug but is really an address shortage.

The fix is to plan IP space up front: size cluster subnets generously — a dedicated /16, or large /19–/20 subnets per AZ for pods. When you genuinely cannot get enough contiguous RFC 1918 space (common in IP-constrained enterprise VPCs), the CNI's custom-networking mode places pods in a secondary CIDR (including the 100.64.0.0/10 range), keeping pod IPs off your scarce primary space. Decide the model before launch, not during the incident.
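The subnet arithmetic can be made concrete with a rough sketch. It assumes the worst case where every ENI on a node is fully populated with addresses, and it uses an instance shape of 3 ENIs with 10 addresses each purely as an example; real limits vary by instance type and the warm-pool settings change how quickly nodes reach that ceiling.

```python
import ipaddress

# AWS reserves five addresses in every subnet (the first four and the last).
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Addresses actually assignable to nodes and pods in an AWS subnet."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def worst_case_ips_per_node(enis: int, ips_per_eni: int) -> int:
    """Assumption: all ENIs attached and fully populated (node at max pod density)."""
    return enis * ips_per_eni


def max_nodes(cidr: str, enis: int, ips_per_eni: int) -> int:
    """How many such nodes fit before the subnet runs out of addresses."""
    return usable_ips(cidr) // worst_case_ips_per_node(enis, ips_per_eni)


if __name__ == "__main__":
    for cidr in ("10.0.0.0/24", "10.0.0.0/19"):
        print(cidr, usable_ips(cidr), "usable,", max_nodes(cidr, enis=3, ips_per_eni=10), "nodes")
```

Under these assumptions a /24 holds about eight busy nodes, while a /19 holds a few hundred, which is the gap the sizing guidance above is pointing at.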

Then lock down pod-to-pod traffic. Kubernetes is open by default: every pod can reach every other pod across namespaces. Production clusters should run network policies (VPC CNI network policy, or Calico/Cilium) that default-deny and allow only intended flows. Confirm CoreDNS is sized and has a PodDisruptionBudget; DNS is the dependency that, when it wobbles, makes everything look broken at once. And run nodes across at least three Availability Zones, with a NAT gateway per AZ so egress stays in-zone and avoids cross-AZ data-transfer charges.
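A default-deny policy is small enough to generate per namespace. The sketch below builds the manifest as a plain dict and prints it as JSON, which `kubectl apply -f -` accepts; the policy name and the `payments` namespace are placeholders.

```python
import json


def default_deny_policy(namespace: str) -> dict:
    """NetworkPolicy that selects every pod in the namespace and allows nothing."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},                      # empty selector = all pods
            "policyTypes": ["Ingress", "Egress"],   # no rules listed = deny both
        },
    }


if __name__ == "__main__":
    # "payments" is a hypothetical namespace name.
    print(json.dumps(default_deny_policy("payments"), indent=2))
```

Denying egress also blocks DNS lookups, so a policy like this is normally paired with an explicit allow rule for traffic to CoreDNS before any application-specific allows are added.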

  • Subnets sized for pod density — never a /24 for a real cluster. Plan a dedicated /16 or large per-AZ subnets; one node can consume dozens of IPs via warm-pool pre-allocation.
  • IP exhaustion plan documented — know your headroom, and use CNI custom networking (secondary CIDR / 100.64.0.0/10) if primary RFC 1918 space is scarce.
  • Network policies default-deny — VPC CNI network policy, Calico, or Cilium. Without them every pod can reach every pod across all namespaces.
  • CoreDNS sized + PDB — replicas scaled to cluster size with a PodDisruptionBudget so DNS survives node churn and upgrades.
  • Three AZs, NAT planned: nodes across ≥3 AZs; a NAT gateway per AZ to avoid cross-AZ data-transfer charges.
the number that bites

A /24 subnet = 251 usable IPs. With the VPC CNI, a modest fleet of nodes plus their warm IP pools can exhaust that before you have scheduled your real workload. IP exhaustion presents as scheduler failures, not network errors — which is why teams burn hours chasing the wrong layer. Size for pods, not for servers.

domain 3 — identity

III. Pod-level IAM: IRSA, EKS Pod Identity, and no long-lived keys

If your pods reach AWS services (S3, DynamoDB, SQS, Secrets Manager, Bedrock) using static access keys stored in a Kubernetes Secret, you have a finding waiting to happen. Production EKS gives every workload a scoped IAM role with short-lived credentials — and in 2026 there are two supported ways to do it.

IRSA (IAM Roles for Service Accounts) is the established mechanism: associate an OIDC provider with the cluster, annotate a ServiceAccount with an IAM role ARN, and the pod gets temporary credentials via the STS web-identity flow. Battle-tested, works cross-account, required by some tools — but it carries ceremony: an OIDC provider per cluster, a trust policy on each role, annotations to keep in sync. EKS Pod Identity is the newer, simpler path: install the Pod Identity Agent add-on once, then map a ServiceAccount to an IAM role via an API call — no per-cluster OIDC provider, no trust-policy editing, and roles reusable across clusters. For most net-new clusters in 2026, Pod Identity is the lower-friction default; IRSA stays right for cross-account assumption or OIDC-dependent tooling.

Whichever you pick, the principle is non-negotiable: one role per workload, least privilege, with no wildcard `Action: "*"` / `Resource: "*"` on application pods. The worker-node role itself should be minimal (`AmazonEKSWorkerNodePolicy`, ECR pull, CNI) — application permissions belong on the pod's role, never the node, because everything on a node can otherwise assume the node role via the instance metadata service.
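The no-wildcards rule lends itself to an automated review pass. The following sketch walks a standard IAM policy document and reports Allow statements with wildcard actions or resources; the function names and message format are made up for illustration.

```python
def _as_list(value):
    """IAM allows a single string or a list for Statement, Action, and Resource."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def wildcard_findings(policy: dict) -> list[str]:
    """List wildcard grants in Allow statements of an IAM policy document."""
    findings = []
    for index, stmt in enumerate(_as_list(policy.get("Statement"))):
        if stmt.get("Effect") != "Allow":
            continue
        label = stmt.get("Sid", f"statement[{index}]")
        for action in _as_list(stmt.get("Action")):
            # Flags both "*" and service-wide grants such as "s3:*".
            if action == "*" or action.endswith(":*"):
                findings.append(f"{label}: wildcard action {action}")
        if "*" in _as_list(stmt.get("Resource")):
            findings.append(f"{label}: wildcard resource *")
    return findings
```

Some read-only actions only accept `Resource: "*"`, so the output is better treated as a review queue than as an automatic failure.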

Two checks close the loop. Block pod access to IMDS (or enforce IMDSv2 with hop limit 1) so a compromised pod cannot steal the node role's credentials. And ensure no AWS access keys live in Kubernetes Secrets at all — if you find `AWS_ACCESS_KEY_ID` in a Secret during your pre-launch audit, that workload is not ready.
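A pre-launch audit for keys in Secrets might look like the sketch below. It assumes input shaped like the output of `kubectl get secrets -A -o json` and flags entries by key name or by the `AKIA` prefix that long-term access key IDs carry; the helper name is hypothetical and the pattern is a heuristic, not an exhaustive detector.

```python
import base64
import re

# Long-term IAM access key IDs are 20 characters and start with "AKIA".
ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")
SUSPECT_NAMES = {"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"}


def secrets_with_aws_keys(secret_list: dict) -> list[str]:
    """Return "namespace/name:key" for every Secret entry that looks like an AWS key."""
    hits = []
    for item in secret_list.get("items", []):
        meta = item.get("metadata", {})
        ref = f"{meta.get('namespace', 'default')}/{meta.get('name', '?')}"
        for key, encoded in (item.get("data") or {}).items():
            try:
                # Secret values are base64-encoded in the API representation.
                value = base64.b64decode(encoded).decode("utf-8", errors="ignore")
            except (ValueError, TypeError):
                value = ""
            if key.upper() in SUSPECT_NAMES or ACCESS_KEY_ID.search(value):
                hits.append(f"{ref}:{key}")
    return hits
```

Any hit means that workload still depends on a static credential and should move to Pod Identity or IRSA before go-live.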

  • Every AWS-calling pod has its own role — via EKS Pod Identity (simpler, 2026 default) or IRSA (cross-account / OIDC-dependent tooling). Never shared, never the node role.
  • Least privilege, no wildcards — scoped actions and resources per workload; no `*:*` policies on application pods.
  • Node role is minimal — worker-node, ECR-pull, and CNI permissions only — application access lives on pod roles.
  • IMDS locked down — IMDSv2 enforced with hop limit 1, or metadata access blocked for pods, so a compromised pod cannot assume the node role.
  • Zero long-lived keys in Secrets — no `AWS_ACCESS_KEY_ID` in any Kubernetes Secret — temporary STS credentials only.
domain 4 — ingress

IV. Ingress and load balancing with the AWS Load Balancer Controller

Getting traffic into the cluster reliably means the AWS Load Balancer Controller, correctly installed and correctly scoped — not a hand-rolled LoadBalancer Service per app, and not an ingress controller that quietly creates one expensive load balancer per Ingress object.

The AWS Load Balancer Controller watches Ingress and Service resources and provisions AWS load balancers to match: an Application Load Balancer (ALB) for HTTP/HTTPS Ingress, or a Network Load Balancer (NLB) for L4 Service-type LoadBalancer. The production pattern is IP target mode, where the ALB routes directly to pod IPs (enabled by the VPC CNI giving pods real addresses), bypassing the extra kube-proxy hop. Install the controller with its own IRSA/Pod Identity role — it needs permission to create and modify ELB resources, and that belongs to the controller, not a human.

Consolidate with IngressGroups. Without them, every Ingress can spin up its own ALB, and ALBs bill hourly plus per-LCU — ten services becoming ten ALBs is real money for no benefit. Grouping Ingresses onto a shared ALB via the `group.name` annotation collapses them onto one load balancer with host/path routing, which is cheaper and simpler to manage TLS on.
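To see how much consolidation is on the table, the following sketch estimates how many ALBs a set of Ingress objects would produce. It assumes every Ingress is handled by the AWS Load Balancer Controller and that grouping is done with the `alb.ingress.kubernetes.io/group.name` annotation; the function name is illustrative.

```python
GROUP_ANNOTATION = "alb.ingress.kubernetes.io/group.name"


def estimated_alb_count(ingresses: list[dict]) -> int:
    """One ALB per distinct group, plus one per Ingress that has no group."""
    groups = set()
    ungrouped = 0
    for ingress in ingresses:
        annotations = ingress.get("metadata", {}).get("annotations") or {}
        group = annotations.get(GROUP_ANNOTATION)
        if group:
            groups.add(group)
        else:
            ungrouped += 1
    return len(groups) + ungrouped
```

Comparing this number with the count of Ingress objects gives a quick read on how many load balancers could be collapsed.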

Terminate TLS at the load balancer with AWS Certificate Manager (ACM) certificates — ACM handles issuance and renewal, so you are not rotating certs by hand. Attach AWS WAF for L7 protection on anything internet-facing, tune health checks to real readiness (an aggressive check flaps healthy pods out during a slow start), and use internal-scheme load balancers for service-to-service traffic so east-west flows are not exposed to the internet.

  • AWS Load Balancer Controller installed + scoped — with its own IRSA/Pod Identity role; IP target mode so the ALB routes straight to pod IPs.
  • IngressGroups consolidate ALBs — shared `group.name` so you run one ALB with host/path routing, not one ALB per Ingress.
  • TLS via ACM — certificates issued and auto-renewed by ACM, terminated at the load balancer.
  • WAF on public ALBs — L7 protection (rate limiting, managed rule groups) for anything internet-facing.
  • Health checks tuned + internal LBs for east-west — readiness-accurate health checks; internal-scheme balancers for service-to-service traffic.
domain 5 — autoscaling

V. Autoscaling: Karpenter for nodes, HPA for pods

Production load is not constant, so capacity should not be either. In 2026 the node-autoscaling default has shifted decisively to Karpenter, and pod-level scaling still belongs to the Horizontal Pod Autoscaler — the two operate at different layers and you want both.

At the node layer, Karpenter has largely displaced the older Cluster Autoscaler for new clusters. Rather than scaling fixed node groups, it reads pending pods and provisions right-sized EC2 capacity directly — choosing instance types from a flexible set, bin-packing densely, and consolidating underutilized nodes back down. You define NodePools (instance families, on-demand vs Spot, AZ constraints, limits) and Karpenter does the rest. The wins are faster scale-up and lower cost, which is why it is mainstream. If you run EKS Auto Mode, Karpenter is already operating under the hood.

Spot capacity is where Karpenter earns its keep — but only for interruptible workloads. Spot is 60–90% cheaper than on-demand and can be reclaimed with a two-minute warning, so stateless, replicated services are ideal candidates and databases/stateful singletons are not. The common pattern is a NodePool that prefers Spot with on-demand fallback, plus PodDisruptionBudgets so consolidation and Spot reclamation never drop more replicas than you can afford at once.

At the pod layer, the Horizontal Pod Autoscaler (HPA) scales replica counts on CPU, memory, or custom/external metrics (requests-per-second via KEDA is common). It needs metrics-server and — critically — accurate resource requests, because it scales relative to requests. This is where most clusters are quietly broken: pods without requests and limits cannot be scheduled predictably, cannot autoscale correctly, and turn capacity math into a guess. Requests/limits on every workload is a production gate, not an optimization.
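Why requests matter to the HPA becomes clearer from its scaling rule. The sketch below follows the formula in the Kubernetes documentation, desired = ceil(current replicas × current metric / target metric), with the default 10% tolerance; it is a simplification that ignores stabilization windows, min/max bounds, and unready pods.

```python
import math


def cpu_utilization_pct(usage_millicores: float, request_millicores: float) -> float:
    """Utilization is measured against the pod's request, not node capacity."""
    if request_millicores <= 0:
        raise ValueError("without a CPU request, utilization is undefined")
    return 100.0 * usage_millicores / request_millicores


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Simplified HPA rule: scale by the ratio of observed to target metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    # 4 replicas averaging 90% of their CPU request against a 60% target.
    print(desired_replicas(4, 90, 60))
```

A pod with no CPU request has no denominator for utilization, which is the mechanical reason an HPA on such a workload does nothing useful.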

  • Karpenter provisions nodes — NodePools with flexible instance types, Spot + on-demand, and consolidation enabled. The 2026 default over Cluster Autoscaler.
  • Spot for interruptible workloads only — 60–90% savings; pair with PodDisruptionBudgets and on-demand fallback. Keep stateful singletons on on-demand.
  • HPA scales pods on real metrics — CPU/memory or custom/external (KEDA). Requires metrics-server and accurate requests.
  • Every pod has requests + limits — no unbounded pods. Scheduling, autoscaling, and capacity planning all depend on this — it is a gate, not a nicety.
  • PodDisruptionBudgets defined — so consolidation, Spot reclamation, and node upgrades never drop below minimum availability.
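The requests-and-limits gate can be checked mechanically. This sketch inspects a Pod manifest (or the pod template of a Deployment, passed in the same shape) and lists containers with missing CPU or memory requests or limits; the function name and output format are hypothetical.

```python
def containers_missing_resources(pod: dict) -> list[str]:
    """Return one line per container that lacks any of the four resource fields."""
    missing = []
    for container in pod.get("spec", {}).get("containers", []):
        resources = container.get("resources") or {}
        gaps = [
            f"{section}.{resource}"
            for section in ("requests", "limits")
            for resource in ("cpu", "memory")
            if resource not in (resources.get(section) or {})
        ]
        if gaps:
            missing.append(f"{container.get('name', '?')}: " + ", ".join(gaps))
    return missing
```

Wiring a check like this into CI, or enforcing the same rule with an admission policy, keeps unbounded pods from reaching the cluster at all.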
domain 6 — observability

VI. Observability: you cannot operate what you cannot see

When something breaks at 3 a.m., the question is not "is the cluster down" — it is "which of forty microservices, on which node, is the cause." That is only answerable if metrics, logs, and traces were wired in before launch.

Metrics are the baseline. The standard 2026 stack is Prometheus-compatible — self-managed Prometheus + Grafana, or Amazon Managed Service for Prometheus + Amazon Managed Grafana to offload storage and scaling. Scrape node metrics (node-exporter), control-plane and kubelet metrics, and your app's own instrumentation, and confirm you collect the fundamentals: node CPU/memory/disk pressure, pod restarts and OOMKills, pending-pod counts (your early warning for IP exhaustion and capacity gaps), and per-namespace usage.

Logs need a destination and a retention policy. Container stdout/stderr should ship off the node — via Fluent Bit to CloudWatch Logs, OpenSearch, or a third party — because node-local logs vanish when Karpenter consolidates the node. Set retention deliberately; "keep everything forever in CloudWatch" is a line item that grows silently. And keep the control-plane audit log from Domain 1 — it is your forensic record for security incidents.

Tracing closes the gap for microservices: adopt OpenTelemetry (the vendor-neutral standard) so a slow request can be followed across service boundaries to the real bottleneck — ADOT plus X-Ray, or any OTLP backend. Finally, wire alerts to a human: pages for symptoms users feel (error rate, latency, saturation), and at minimum alerts on node pressure, persistent pending pods, certificate expiry, and approaching Kubernetes end-of-support. An unmonitored cluster is not production-ready no matter how well it is configured.
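The persistent-pending-pods alert can be prototyped before a full alerting stack exists. The sketch below takes input shaped like `kubectl get pods -A -o json` and returns pods that have been Pending longer than a threshold; the five-minute default and the function name are assumptions, and in production this would normally be a Prometheus alert rule instead.

```python
from datetime import datetime, timedelta, timezone


def long_pending_pods(pod_list: dict, now: datetime,
                      threshold: timedelta = timedelta(minutes=5)) -> list[str]:
    """Return "namespace/name" for pods still Pending `threshold` after creation."""
    stuck = []
    for pod in pod_list.get("items", []):
        if pod.get("status", {}).get("phase") != "Pending":
            continue
        meta = pod["metadata"]
        # Kubernetes timestamps are RFC 3339 in UTC, e.g. 2026-01-01T11:50:00Z.
        created = datetime.strptime(
            meta["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if now - created >= threshold:
            stuck.append(f"{meta.get('namespace', 'default')}/{meta['name']}")
    return stuck
```

A non-empty result that persists across runs is the early-warning signal for IP exhaustion or a capacity gap described above.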

  • Prometheus-compatible metrics — self-managed or Amazon Managed Prometheus + Grafana. Scrape nodes, control plane, kubelet, and app metrics.
  • Cluster fundamentals dashboarded — node pressure, pod restarts/OOMKills, pending-pod counts, per-namespace usage.
  • Logs shipped off-node with retention set — Fluent Bit → CloudWatch/OpenSearch/third party; deliberate retention so log cost does not balloon.
  • Distributed tracing via OpenTelemetry — ADOT + X-Ray or any OTLP backend so a slow request is traceable across services.
  • Alerts route to a human — error rate, latency, saturation, node pressure, pending pods, cert expiry, and Kubernetes end-of-support.
domain 7 — security

VII. Security hardening: RBAC, network policies, and image scanning

A default EKS cluster is permissive by design — to get you running. Production needs the opposite posture: least privilege at every layer, default-deny networking, and a guarantee that the images you run are the images you scanned.

Start with cluster access. EKS in 2026 uses access entries (the successor to the `aws-auth` ConfigMap) to map IAM principals to Kubernetes permissions — use them, and grant RBAC least-privilege: humans get the narrowest role that does their job, CI/CD gets a scoped role, and nobody gets `cluster-admin` for convenience. Bind roles to groups, not individuals, so offboarding is one change. Audit who can do what before launch, because RBAC sprawl is invisible until it is exploited.

Harden workloads with Pod Security Standards: enforce the `restricted` (or at least `baseline`) profile via Pod Security Admission so pods cannot run as root, escalate privileges, or mount the host filesystem unless justified. Pair this with the default-deny network policies from Domain 2 and with secrets encryption: enable EKS envelope encryption of Secrets with a KMS key so they are not merely base64 in etcd, and prefer pulling application secrets from Secrets Manager or Parameter Store via the Secrets Store CSI driver over baking them into manifests.
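Pod Security Admission is driven by namespace labels, so enforcing a profile amounts to labeling each namespace. The sketch below generates such a Namespace manifest as a dict; setting `audit` and `warn` to the same level as `enforce` is one reasonable choice, not the only one, and the function name is illustrative.

```python
PSS_LEVELS = {"privileged", "baseline", "restricted"}


def namespace_with_pod_security(name: str, level: str = "restricted") -> dict:
    """Namespace manifest carrying the Pod Security Admission labels."""
    if level not in PSS_LEVELS:
        raise ValueError(f"unknown Pod Security Standard level: {level}")
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {
                "pod-security.kubernetes.io/enforce": level,  # reject violating pods
                "pod-security.kubernetes.io/audit": level,    # record in audit log
                "pod-security.kubernetes.io/warn": level,     # warn the client
            },
        },
    }
```

When migrating an existing namespace, a common rollout is to start with `warn` and `audit` at `restricted` while `enforce` stays at `baseline`, then tighten once the warnings stop.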

Then secure the supply chain. Scan every image before it runs — Amazon ECR enhanced scanning (Amazon Inspector) does this continuously on push and as new CVEs are disclosed — and gate the pipeline so critical CVEs do not promote to production. Pin images by digest rather than mutable tags like `latest`, so the artifact you tested is provably the one you deploy, and consider an admission policy (Kyverno, OPA Gatekeeper, or the built-in Validating Admission Policy) that rejects unsigned or unscanned images. Layer in GuardDuty EKS Protection for runtime threat detection on audit logs and node activity.
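Digest pinning is easy to verify from manifests alone. This sketch treats an image reference as pinned when it ends in `@sha256:` followed by 64 hex characters, and lists the unpinned images in a pod spec; it is a string check only and says nothing about whether the image was scanned or signed.

```python
import re

_DIGEST_SUFFIX = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_pinned_by_digest(image: str) -> bool:
    """True when the reference names an immutable sha256 digest."""
    return bool(_DIGEST_SUFFIX.search(image))


def unpinned_images(pod: dict) -> list[str]:
    """Images in init and regular containers that rely on a mutable tag."""
    spec = pod.get("spec", {})
    containers = spec.get("initContainers", []) + spec.get("containers", [])
    return [c["image"] for c in containers if not is_pinned_by_digest(c["image"])]
```

The same rule can be expressed as an admission policy so that unpinned images are rejected at deploy time instead of found in review.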

  • Access entries + least-privilege RBAC — IAM-to-Kubernetes mapping via access entries; no casual `cluster-admin`; bind to groups for clean offboarding.
  • Pod Security Standards enforced — `restricted`/`baseline` via Pod Security Admission — no root, no privilege escalation, no host mounts by default.
  • Secrets encrypted with KMS — EKS envelope encryption for Kubernetes Secrets; app secrets pulled from Secrets Manager/Parameter Store via the CSI driver.
  • Images scanned + gated — ECR enhanced scanning (Inspector); block critical CVEs from promoting; pin by digest, not `latest`.
  • Runtime threat detection — GuardDuty EKS Protection on audit logs and runtime activity; admission policy to reject unsigned/unscanned images.
domains 8 & 9 — cost + resilience

VIII. Cost controls and disaster recovery: stay cheap, stay recoverable

Two domains that decide how the cluster behaves over time, not just at launch. EKS clusters drift expensive quietly — idle nodes, over-provisioned requests, orphaned load balancers, forgotten control planes — and an unbounded bill is its own kind of outage. And Kubernetes manifests in Git get you most of the way back from a disaster, but "most of the way" is not a recovery plan.

Cost first. Make spend visible: turn on cost allocation by namespace, team, and workload. Kubecost/OpenCost or AWS Split Cost Allocation Data for EKS attributes shared cluster cost down to the pod, so "the EKS bill went up" becomes actionable. Then attack the big levers. Right-sizing is usually the largest win: most clusters request far more CPU and memory than they use, forcing Karpenter to provision nodes you do not need, so use the Vertical Pod Autoscaler in recommendation mode to align requests with real usage. Spot via Karpenter is the second lever for interruptible workloads (60–90% off), and Savings Plans discount the steady on-demand baseline for a one- or three-year commitment (up to ~66% with Compute Savings Plans, up to ~72% with the less flexible EC2 Instance Savings Plans). Finally, close the leaks: keep consolidation on, cap log retention, clean up orphaned ELBs/EBS, and shut down non-production clusters. Each control plane is ~$73/month before a single node, so a forgotten cluster is the most embarrassing line on the bill.

Then resilience. Protect two distinct things: cluster state (Kubernetes resources — Deployments, ConfigMaps, CRDs, namespaces) and persistent data (EBS volumes, databases, object storage). If the whole cluster is provisioned by IaC and deployed by GitOps, you can rebuild it and re-apply workloads from source — the single best DR investment, because it makes the cluster reproducible. But GitOps captures neither data nor drift outside Git, so add Velero for cluster-state and volume backups: it snapshots Kubernetes resources to S3 and persistent volumes (EBS/CSI snapshots) on a schedule. For the data layer, prefer managed services — run databases on Amazon RDS/Aurora with point-in-time recovery rather than self-hosting Postgres on a single EBS volume — and define RPO and RTO explicitly, then verify your cadence and restore time meet them.
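Verifying that backup cadence and restore time meet RPO and RTO reduces to two comparisons. The sketch below assumes the worst-case data loss equals the backup interval and that the restore duration comes from an actual drill; the function name and the numbers in the example are illustrative.

```python
from datetime import timedelta


def meets_objectives(backup_interval: timedelta, measured_restore: timedelta,
                     rpo: timedelta, rto: timedelta) -> dict:
    """Compare backup cadence to RPO and a rehearsed restore time to RTO."""
    return {
        # A failure just before the next backup loses one full interval of data.
        "rpo_met": backup_interval <= rpo,
        # Only a measured restore counts; an estimate is not evidence.
        "rto_met": measured_restore <= rto,
    }


if __name__ == "__main__":
    # Hypothetical numbers: 6-hourly backups, 90-minute drill, 4h RPO, 2h RTO.
    print(meets_objectives(timedelta(hours=6), timedelta(minutes=90),
                           rpo=timedelta(hours=4), rto=timedelta(hours=2)))
```

In that example the restore is fast enough but the backup schedule is too sparse for the stated RPO, so the schedule, not the tooling, is what needs to change.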

  • Cost attributed per namespace/team — Kubecost/OpenCost or Split Cost Allocation Data; tag cluster/env/owner so Cost Explorer can slice it.
  • Workloads right-sized — VPA in recommendation mode to align requests with real usage — the largest cost lever in most clusters.
  • Spot + Savings Plans layered: Spot (60–90% off) for interruptible load via Karpenter; Savings Plans (up to ~66% off for Compute, ~72% for EC2 Instance) for the steady baseline.
  • Consolidation on, leaks closed, non-prod scheduled down — Karpenter reclaims idle nodes; clean up orphaned ELBs/EBS; cap log retention; stop/delete idle clusters (~$73/mo each before nodes).
  • Cluster reproducible from IaC + GitOps — the cluster and its workloads rebuildable from source — the best DR investment you can make.
  • Velero scheduled backups, stored off-cluster — Kubernetes resources + persistent volumes on a schedule, in a different region or account.
  • Restores tested; RPO/RTO defined and met — a rehearsed restore (an untested backup is a hypothesis); managed data services (RDS/Aurora PITR) for the data tier.
the check everyone skips

Test the restore. The most common DR finding is not "no backups" — it is backups that have never been restored, stored in the same region as the cluster, missing the persistent volumes, or lacking the CRDs needed to bring workloads back. Schedule a restore drill before go-live and put it on a recurring calendar.

domain 10 — upgrades

IX. Upgrade strategy: the cadence that keeps you supported and cheap

Kubernetes ships a new minor version roughly every four months, and EKS supports each version for a limited window. A cluster without an upgrade plan does not stay still — it ages into deprecated APIs, lost patches, and a 6x control-plane bill. Upgrades are a recurring, planned operation, not a one-time event.

Understand the support lifecycle, because it drives both risk and cost. Each Kubernetes version gets roughly 14 months of standard support on EKS, then up to 12 more months of extended support — patched, but at a much higher price (the control plane runs at roughly 6x the standard rate during extended support, about $0.60/hour vs $0.10/hour). Beyond that, AWS auto-upgrades you. The takeaway: skating into extended support is a quiet cost-control failure, so a regular cadence (at least once a year, ideally tracking N-1) is the goal.
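The cost gap is easy to quantify. This sketch uses the hourly rates quoted above ($0.10 standard, $0.60 extended) and assumes a 730-hour month; rates can change, so treat the constants as inputs to verify against current pricing.

```python
HOURS_PER_MONTH = 730          # assumption: average hours in a month
STANDARD_RATE = 0.10           # USD per cluster-hour, standard support
EXTENDED_RATE = 0.60           # USD per cluster-hour, extended support


def monthly_control_plane_cost(clusters: int, extended: bool = False) -> float:
    """Control-plane cost only; nodes, load balancers, and data are extra."""
    rate = EXTENDED_RATE if extended else STANDARD_RATE
    return round(clusters * rate * HOURS_PER_MONTH, 2)


if __name__ == "__main__":
    for n in (1, 3, 10):
        std = monthly_control_plane_cost(n)
        ext = monthly_control_plane_cost(n, extended=True)
        print(f"{n} cluster(s): ${std}/mo standard, ${ext}/mo extended")
```

For a three-cluster prod/staging/dev setup, drifting into extended support moves the control-plane line from about $219 to about $1,314 a month with no change in capability.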

Upgrade in the right order and environment. The sequence is control plane first (one minor version at a time — no skipping), then the data plane (node groups / Karpenter NodePools to a matching or newer AMI), then add-ons (VPC CNI, CoreDNS, kube-proxy, plus controllers like the Load Balancer Controller and Karpenter, each with its own compatibility matrix). Always upgrade staging first, against the same manifests production runs, so you catch breakage before users do.
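The no-skipping rule and the component order can be turned into an explicit plan. The sketch below expands a current and target version into one hop per minor version, each with the three steps in order; the function names are hypothetical and it assumes versions of the form `1.N`.

```python
UPGRADE_ORDER = ("control plane", "data plane (nodes)", "add-ons")


def upgrade_path(current: str, target: str) -> list[str]:
    """Every minor version that must be stepped through, in order."""
    cur_major, cur_minor = (int(p) for p in current.split(".")[:2])
    tgt_major, tgt_minor = (int(p) for p in target.split(".")[:2])
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError(f"cannot plan an upgrade from {current} to {target}")
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


def upgrade_plan(current: str, target: str) -> list[tuple[str, str]]:
    """(version, step) pairs: finish all three steps before the next hop."""
    return [(version, step)
            for version in upgrade_path(current, target)
            for step in UPGRADE_ORDER]
```

Running the whole plan in staging first, then repeating it in production, gives the same sequence two rehearsals per hop.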

The landmine is deprecated APIs: each Kubernetes release removes APIs deprecated earlier, and a manifest referencing a removed API fails to apply after the upgrade, silently stranding a workload. Before every upgrade, scan for deprecated usage (Pluto, kube-no-trouble, or the EKS upgrade insights in the console) and fix the manifests first. Combine the upgrade with PodDisruptionBudgets (Domain 5) and a node-rotation strategy so the data-plane roll drains gracefully, and the upgrade becomes boring, which is exactly what an upgrade should be.
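What a deprecated-API scan does can be shown in a few lines. The sketch below checks manifests against a small, deliberately incomplete table of API removals taken from the upstream Kubernetes deprecation guide; it illustrates the idea and is not a substitute for Pluto, kube-no-trouble, or EKS upgrade insights.

```python
# Illustrative subset only: (apiVersion, kind) -> Kubernetes version that removed it.
REMOVED_APIS = {
    ("extensions/v1beta1", "Ingress"): "1.22",
    ("networking.k8s.io/v1beta1", "Ingress"): "1.22",
    ("batch/v1beta1", "CronJob"): "1.25",
    ("policy/v1beta1", "PodDisruptionBudget"): "1.25",
    ("autoscaling/v2beta1", "HorizontalPodAutoscaler"): "1.25",
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): "1.26",
}


def _parse(version: str) -> tuple[int, ...]:
    return tuple(int(part) for part in version.split("."))


def removed_in_target(manifests: list[dict], target_version: str) -> list[str]:
    """Manifests whose apiVersion no longer exists at the target version."""
    target = _parse(target_version)
    findings = []
    for manifest in manifests:
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        removed = REMOVED_APIS.get(key)
        if removed and _parse(removed) <= target:
            name = manifest.get("metadata", {}).get("name", "?")
            findings.append(f"{key[1]}/{name} uses {key[0]} (removed in {removed})")
    return findings
```

Scanning the manifests in Git catches most cases, but objects created by Helm charts or operators live only in the cluster, which is where the dedicated tools earn their place.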

  • Documented upgrade cadence — at least annually, ideally tracking N-1. Each version gets ~14 months standard support before extended support begins.
  • Avoid extended-support cost — extended support runs the control plane at ~6x ($0.60 vs $0.10/hour) — staying current is a cost control, not just hygiene.
  • Right order: control plane → nodes → add-ons — one minor version at a time, no skipping; match add-on (CNI/CoreDNS/kube-proxy/controllers) versions to the new Kubernetes version.
  • Staging first — upgrade staging against production manifests before touching production.
  • Scan for deprecated APIs first — Pluto / kube-no-trouble / EKS upgrade insights; fix manifests before the upgrade strands a workload.
the full checklist

X. The master EKS production-readiness checklist

Every domain above, collapsed into a single pre-launch list. Copy it into a ticket, assign an owner per line, and treat "ready" as "every box is checked and verifiable" — not "the cluster came up."

No single line here is hard. The failure mode is never one impossible task — it is the line nobody owned. Assign each to a person, verify it in the actual cluster (not in a doc), and re-verify the high-drift ones (IP headroom, version support, RBAC, backups) on a recurring basis.

eks production-readiness checklist · 2026 · 11 domains
Domain | Must-pass checks | Most common failure
1. Account & cluster | Prod in own account; cluster-per-env; node strategy chosen; control-plane logging on; endpoint access decided | Logging left off: no forensics after an incident
2. Networking | Subnets sized for pods (no /24); IP-exhaustion plan; network policies default-deny; CoreDNS PDB; ≥3 AZs | VPC CNI IP exhaustion → pods stuck Pending
3. Identity | Per-workload IAM (Pod Identity/IRSA); least privilege; minimal node role; IMDS locked; no keys in Secrets | Long-lived access keys in Kubernetes Secrets
4. Ingress | AWS LB Controller scoped; IP target mode; IngressGroups; ACM TLS; WAF on public ALBs | One ALB per Ingress: silent cost sprawl
5. Autoscaling | Karpenter NodePools; Spot for interruptible; HPA + metrics-server; requests/limits on every pod; PDBs | Pods with no requests → unschedulable, unscalable
6. Observability | Prometheus metrics; logs shipped + retention; OpenTelemetry traces; alerts to a human | No pending-pod / node-pressure alerting
7. Security | Access entries + least-priv RBAC; Pod Security Standards; KMS secret encryption; image scanning + gating; GuardDuty | Default-permissive cluster shipped as-is
8. Cost | Per-namespace attribution; right-sizing (VPA); Spot + Savings Plans; consolidation; non-prod scheduled down | Over-provisioned requests + forgotten clusters
9. DR / backups | IaC+GitOps reproducible; Velero scheduled; backups off-region; restore tested; RPO/RTO met | Backups that were never restore-tested
10. Upgrades | Documented cadence; avoid extended support; correct order; staging first; deprecated-API scan | Drift into extended support (~6x) + stranded APIs
11. Runbook | On-call + escalation; incident runbook; documented owners per domain; drill schedule | No owner on the line that fails
Domain 11 is the meta-check: production readiness is owned, documented, and rehearsed — not just configured once. The high-drift rows (2, 7, 9, 10) deserve a recurring re-verification, because they decay silently between launch and the incident.
choosing your model

EKS Auto Mode vs managed node groups vs Fargate — the node decision

Several checklist domains (nodes, autoscaling, patching, cost) hinge on one upstream choice: how you run compute. The three 2026 models trade operational toil against control and cost. Most production clusters pick one as the default and use Fargate selectively for isolated or spiky workloads.

Dimension | EKS Auto Mode | Managed node groups | Fargate
Who manages nodes | AWS (Karpenter under the hood) | You, with AWS lifecycle help | No nodes (pod-level)
Operational toil | Lowest | Medium | Low (but constrained)
Control over instances/AMIs | Lower (AWS-optimized) | High | None
Autoscaling | Built in (Karpenter) | Karpenter or Cluster Autoscaler | Per-pod, automatic
Patching | Automatic | You trigger node refresh | Automatic
Cost shape | EC2 + Auto Mode fee; bin-packed | Raw EC2; cheapest steady-state | Per pod-second; pricey at scale
DaemonSets / privileged / GPU | Supported | Supported | Limited / not supported
Best for | Teams without a platform engineer | Teams wanting full control + lowest steady cost | Spiky, isolated, or bursty workloads
A common production shape: managed node groups or Auto Mode as the default fleet (with Karpenter + Spot for elasticity and Savings Plans for the baseline), plus Fargate for isolated or unpredictable workloads where not managing nodes is worth the premium.
a recent match

A pre-launch EKS readiness review — anonymized

inquiry · seed-stage b2b saas, remote (US + EU)
Seed-stage B2B SaaS, 9 engineers, first production EKS cluster two weeks from a customer go-live

Situation: A small team had stood up EKS from a tutorial and it "worked", but no one owned production readiness. The cluster sat in a /24 subnet; every workload used a single shared IAM role via static keys in a Secret; no workload set resource requests (so HPA did nothing); Cluster Autoscaler was misconfigured; there were no network policies and no backups; and the Kubernetes version was already one release from end of standard support. They had a contractual go-live date and no platform engineer.

What CloudRoute did: CloudRoute routed them within 20 hours to a vetted AWS partner with EKS and SOC 2 experience. The partner ran this exact eleven-domain checklist as a readiness review: re-architected networking onto a properly sized secondary CIDR to kill the IP-exhaustion risk, moved every workload to EKS Pod Identity with per-service least-privilege roles, set requests/limits and switched node autoscaling to Karpenter with a Spot+on-demand NodePool, added default-deny network policies and Pod Security Standards, wired Prometheus + alerts, installed Velero with cross-region backups and ran a restore drill, and scheduled the pending Kubernetes upgrade in staging first. Scoped as Well-Architected remediation, the engagement qualified for AWS funding.

Outcome: All eleven domains green before the go-live date. IP-exhaustion and long-lived-key findings eliminated; node cost down ~40% after right-sizing + Spot; a tested restore path on record. Because the work was filed as AWS-funded Well-Architected remediation, the customer paid $0 — AWS funded the partner engagement and CloudRoute was paid a commission by the partner.

engagement window: ~3 weeks · domains hardened: 11/11 · node cost: −40% · cost to customer: $0

faq

Common questions

What is the single most common reason a new EKS cluster fails in production?
VPC CNI IP exhaustion. Because every pod gets a real VPC IP and each node pre-allocates a warm pool of addresses, an undersized subnet (a /24 has only 251 usable IPs) runs out far faster than teams expect. It presents as `FailedScheduling` / pods stuck `Pending`, which looks like a scheduler problem, so teams lose hours at the wrong layer. Size cluster subnets for pod density — a dedicated /16 or large per-AZ subnets — and use CNI custom networking with a secondary CIDR when primary IP space is scarce.
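The arithmetic is worth running before you pick a CIDR. Below is a minimal Python sketch, assuming each node holds a fixed number of VPC addresses (in use plus warm pool). The figure of 30 per node is a placeholder for illustration, not a measured value; the real number depends on instance type and CNI settings. The five reserved addresses per subnet are standard AWS VPC behavior.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Addresses a VPC subnet can actually hand out to nodes and pods."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def nodes_until_exhaustion(cidr: str, ips_held_per_node: int) -> int:
    """Whole nodes that fit before the subnet is empty.

    ips_held_per_node is an assumption you supply: pod IPs in use plus the
    warm pool the CNI keeps attached to each node.
    """
    return usable_ips(cidr) // ips_held_per_node


print(usable_ips("10.0.0.0/24"))                  # 251
print(nodes_until_exhaustion("10.0.0.0/24", 30))  # 8
print(usable_ips("10.0.0.0/16"))                  # 65531
```

Under that assumption a /24 is spent after eight nodes, which is why the check belongs before launch rather than after the first scale-up.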
IRSA or EKS Pod Identity — which should I use in 2026?
For most net-new clusters, EKS Pod Identity is the lower-friction default: install the agent add-on once, then map ServiceAccounts to IAM roles with an API call. There is no per-cluster OIDC provider to create and no per-cluster trust-policy editing, so the same role is reusable across clusters. IRSA is still fully supported and is the right choice when you need cross-account role assumption or are using tooling that expects the OIDC web-identity flow. Either way, the rule is the same: one least-privilege role per workload, and never static access keys in a Secret.
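The practical difference shows up in the role's trust policy. The Python sketch below builds both documents side by side: the Pod Identity policy trusts a fixed service principal and never names a cluster, while the IRSA policy embeds one cluster's OIDC provider and one ServiceAccount. The account ID, OIDC provider path, namespace, and ServiceAccount name are placeholders, not values from any real cluster.

```python
import json


def pod_identity_trust_policy() -> dict:
    """Trust policy for a role used through EKS Pod Identity.

    Nothing here names a cluster, so the same role works in any cluster.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "pods.eks.amazonaws.com"},
            "Action": ["sts:AssumeRole", "sts:TagSession"],
        }],
    }


def irsa_trust_policy(account_id: str, oidc_provider: str,
                      namespace: str, service_account: str) -> dict:
    """Trust policy for IRSA, bound to one cluster's OIDC provider."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                f"{oidc_provider}:sub":
                    f"system:serviceaccount:{namespace}:{service_account}",
                f"{oidc_provider}:aud": "sts.amazonaws.com",
            }},
        }],
    }


# Placeholder identifiers for illustration only.
print(json.dumps(pod_identity_trust_policy(), indent=2))
print(json.dumps(irsa_trust_policy(
    "111122223333",
    "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE",
    "payments",
    "api",
), indent=2))
```

With IRSA, moving a workload to a new cluster means editing this document for the new OIDC provider. With Pod Identity the role stays untouched and only the ServiceAccount association changes.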
Is Karpenter really better than Cluster Autoscaler?
For most clusters in 2026, yes — Karpenter has become the mainstream node autoscaler. Instead of scaling fixed node groups, it provisions right-sized EC2 capacity directly from pending pods, bin-packs densely, and consolidates idle nodes, which means faster scale-up and lower cost. Cluster Autoscaler still works and is fine on existing setups, but new clusters generally start with Karpenter (and EKS Auto Mode runs Karpenter for you under the hood). Whichever you use, accurate pod resource requests and PodDisruptionBudgets are prerequisites.
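To see why accurate requests are a prerequisite, here is a toy Python illustration of sizing capacity from pending pods. It is not Karpenter's actual algorithm: it ignores system-reserved resources, DaemonSets, zones, and Spot availability, and the instance catalog and prices are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpu: float
    mem_gib: float
    hourly_usd: float  # hypothetical price, for ranking only


# Hypothetical catalog; real clusters choose from actual EC2 instance types.
CATALOG = [
    InstanceType("small", 2, 8, 0.10),
    InstanceType("medium", 4, 16, 0.20),
    InstanceType("large", 8, 32, 0.40),
]

Pod = Tuple[float, float]  # (cpu request in vCPU, memory request in GiB)


def right_size(pending: List[Pod], catalog=CATALOG) -> Optional[InstanceType]:
    """Cheapest single instance type that covers the summed pod requests."""
    need_cpu = sum(cpu for cpu, _ in pending)
    need_mem = sum(mem for _, mem in pending)
    fits = [t for t in catalog if t.vcpu >= need_cpu and t.mem_gib >= need_mem]
    return min(fits, key=lambda t: t.hourly_usd) if fits else None


# Three pods with honest requests: 3.0 vCPU and 7.0 GiB in total.
print(right_size([(1.0, 2.0), (1.5, 4.0), (0.5, 1.0)]).name)  # medium

# The same three pods with no requests set look free to the sizing logic.
print(right_size([(0.0, 0.0)] * 3).name)  # small
```

The second call is the failure mode: with no requests, any request-driven autoscaler sees nothing to size against and under-provisions.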
How often do I actually have to upgrade EKS?
Each Kubernetes version gets roughly 14 months of standard support on EKS, then up to 12 more months of extended support at a much higher price — the control plane runs at about 6x the standard hourly rate ($0.60 vs $0.10/hour) during extended support. So practically, plan at least one upgrade a year (ideally tracking N-1). Upgrade order is control plane → data plane → add-ons, one minor version at a time, staging first, and always scan for deprecated/removed APIs (Pluto, kube-no-trouble, EKS upgrade insights) before you start so a removed API does not strand a workload.
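Two small calculations make the cadence concrete: the extra control-plane spend from sitting on extended support, and the sequence of hops a one-minor-at-a-time upgrade requires. This Python sketch uses the hourly rates quoted above and assumes a 730-hour month; the version numbers in the usage lines are placeholders.

```python
from typing import List

STANDARD_USD_PER_HOUR = 0.10
EXTENDED_USD_PER_HOUR = 0.60
HOURS_PER_MONTH = 730  # assumption: average month length in hours


def extended_support_premium(months: int, clusters: int = 1) -> float:
    """Extra control-plane cost of staying on extended support."""
    delta = EXTENDED_USD_PER_HOUR - STANDARD_USD_PER_HOUR
    return round(delta * HOURS_PER_MONTH * months * clusters, 2)


def upgrade_hops(current: str, target: str) -> List[str]:
    """Minor versions to step through, one at a time, from current to target."""
    cur_major, cur_minor = (int(p) for p in current.split("."))
    tgt_major, tgt_minor = (int(p) for p in target.split("."))
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError("target must be the same major and not older")
    return [f"{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


print(extended_support_premium(12))    # 4380.0
print(upgrade_hops("1.30", "1.33"))    # ['1.31', '1.32', '1.33']
```

Each hop in the list is a full control plane, data plane, and add-on cycle, so a cluster that has drifted three minors behind has three upgrades to run, not one.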
Do I need network policies if I am already inside a private VPC?
Yes. A private VPC controls north-south access into the cluster, but inside Kubernetes every pod can talk to every other pod across all namespaces by default — that is east-west traffic the VPC does not segment. Production clusters should run default-deny network policies (via the VPC CNI network policy feature, Calico, or Cilium) and explicitly allow only the flows you intend, so a single compromised pod cannot pivot freely across the cluster.
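A default-deny policy is short enough to show in full. The Python sketch below generates two standard `networking.k8s.io/v1` NetworkPolicy manifests as JSON (which `kubectl apply -f` accepts): one that denies all traffic in a namespace, and one that re-opens a single intended flow. The namespace, `app` labels, and port are placeholders.

```python
import json


def default_deny(namespace: str) -> dict:
    """Deny all ingress and egress for every pod in the namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},  # empty selector matches every pod
            # Both types listed with no rules, so both directions are denied.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


def allow_ingress(namespace: str, to_app: str, from_app: str, port: int) -> dict:
    """Allow one flow: pods labeled app=from_app may reach app=to_app on port."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-{from_app}-to-{to_app}", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": to_app}},
            "policyTypes": ["Ingress"],
            "ingress": [{
                "from": [{"podSelector": {"matchLabels": {"app": from_app}}}],
                "ports": [{"protocol": "TCP", "port": port}],
            }],
        },
    }


# Placeholder namespace, labels, and port for illustration.
print(json.dumps(default_deny("payments"), indent=2))
print(json.dumps(allow_ingress("payments", "api", "web", 8080), indent=2))
```

Denying egress also blocks DNS lookups, so a real rollout needs an explicit egress allowance for the cluster DNS service before the default-deny policy is applied.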
How much does an EKS cluster cost before any workloads?
The EKS control plane is $0.10/hour — about $73/month — per cluster, before a single node or load balancer. That makes idle and forgotten clusters a real cost problem: non-production environments left running over weekends, and clusters nobody deleted after a project ended. On top of the control plane you pay for nodes (EC2/Fargate), load balancers (ALB/NLB hourly + LCU), data transfer (cross-AZ and NAT), and storage/logs. Right-sizing requests, Spot via Karpenter, Compute Savings Plans for the baseline, and shutting down idle clusters are the main levers.
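A rough Python sketch of the two idle-cost levers mentioned above: what forgotten control planes accumulate, and what share of node-hours a non-production schedule avoids. It assumes a 730-hour month and that "scheduled down" means scaling nodes to zero outside the stated hours; the control-plane fee keeps accruing for as long as the cluster exists.

```python
CONTROL_PLANE_USD_PER_HOUR = 0.10
HOURS_PER_MONTH = 730  # assumption: average month length in hours
HOURS_PER_WEEK = 168


def idle_control_plane_cost(clusters: int, months: int) -> float:
    """Control-plane spend for clusters that exist but do no useful work."""
    return round(CONTROL_PLANE_USD_PER_HOUR * HOURS_PER_MONTH * clusters * months, 2)


def node_hours_saved(hours_up_per_day: float, days_up_per_week: int) -> float:
    """Fraction of weekly node-hours avoided by running only on a schedule."""
    return 1 - (hours_up_per_day * days_up_per_week) / HOURS_PER_WEEK


print(idle_control_plane_cost(clusters=3, months=6))  # 1314.0
print(round(node_hours_saved(24, 5), 3))              # 0.286
print(round(node_hours_saved(12, 5), 3))              # 0.643
```

Turning non-production nodes off for weekends alone removes about 29% of their hours; a 12-hour weekday schedule removes about 64%.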
What is the one backup mistake teams make most?
Never testing the restore. The common DR finding is not the absence of backups — it is backups that have never been restored, are stored in the same region as the cluster they protect, are missing the persistent volumes, or lack the CRDs needed to bring workloads back. Use Velero on a schedule, store backups in a different region or account, protect the data tier with managed services (RDS/Aurora point-in-time recovery), define explicit RPO/RTO targets, and run a restore drill before go-live and on a recurring calendar after.
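Those failure modes translate directly into a check you can run on a calendar. The Python sketch below assumes you can gather a handful of facts about your backups into a record; the `BackupStatus` type and its fields are hypothetical and would be populated from your own backup tooling and drill log.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class BackupStatus:
    """Hypothetical summary of one cluster's backup posture."""
    cluster_region: str
    backup_region: str
    last_backup: datetime
    last_restore_drill: Optional[datetime]  # None means never restored
    includes_volumes: bool
    includes_crds: bool


def backup_findings(s: BackupStatus, now: datetime,
                    rpo: timedelta, drill_interval: timedelta) -> List[str]:
    """Return the DR findings that apply; an empty list means the check passes."""
    out = []
    if s.last_restore_drill is None:
        out.append("restore never tested")
    elif now - s.last_restore_drill > drill_interval:
        out.append("restore drill overdue")
    if s.backup_region == s.cluster_region:
        out.append("backups stored in the cluster's own region")
    if not s.includes_volumes:
        out.append("persistent volumes missing from backup")
    if not s.includes_crds:
        out.append("CRDs missing from backup")
    if now - s.last_backup > rpo:
        out.append("latest backup older than RPO")
    return out


now = datetime(2026, 1, 15, 12, 0)
status = BackupStatus("us-east-1", "us-east-1", now - timedelta(hours=30),
                      None, True, False)
for finding in backup_findings(status, now, rpo=timedelta(hours=24),
                               drill_interval=timedelta(days=90)):
    print(finding)
```

The example record trips four findings at once, which is typical of a cluster where backups were installed but never exercised.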
Can a small team without a platform engineer run production EKS safely?
Yes, but be honest about the operational surface. EKS Auto Mode removes most node-management toil (AWS handles provisioning, patching, and scaling via Karpenter), and Fargate removes nodes entirely for isolated workloads — both shrink what a small team has to operate. The eleven-domain checklist still applies; it just gets cheaper to satisfy. Many small teams have a partner run the initial readiness review and hardening (often AWS-funded as Well-Architected remediation) and then operate the simplified result themselves.

Want this checklist run against your cluster before go-live?

CloudRoute routes you to a vetted AWS partner who runs the full EKS readiness review and hardens every domain. Scoped as Well-Architected remediation, it is often AWS-funded — customer pays $0. No procurement, no discovery theater.

matched within: < 24h
domains reviewed: 11
cost to you: $0 (often AWS-funded)