Spinning up an Amazon EKS cluster takes one command. Making it production-ready — networking that won't exhaust IPs, autoscaling that's actually cheap, pod-level IAM with no static keys, ingress that doesn't spawn a load balancer per service, observability your on-call can use, and a security posture that passes an audit — is the real work. This guide walks the whole path: how to create the cluster (eksctl, Terraform/OpenTofu, or EKS Auto Mode), the data-plane choice (managed node groups vs Fargate vs Karpenter), networking, identity, add-ons, and a checklist to call it done. Then how CloudRoute matches you to a vetted partner who builds it — often AWS-funded, so you pay $0.
There are three sane ways to create an Amazon EKS cluster in 2026, and the right one depends on whether you want speed, reproducible infrastructure-as-code, or to hand the whole data plane to AWS. The mistake to avoid is clicking through the console once and never being able to recreate what you built.
Whatever you choose, the goal is the same: a cluster definition that lives in a repository, is reviewable, and can be rebuilt from scratch in a different account or region without tribal knowledge. The console is fine for learning EKS; it is not how you run it. Below are the three reproducible paths and when each fits.
eksctl is the official CLI for EKS. A single `eksctl create cluster` (or, better, a checked-in `ClusterConfig` YAML) provisions the control plane, a VPC, subnets across multiple Availability Zones, and a starter node group in roughly 15–20 minutes. It is the quickest way to a working cluster and ideal for proofs of concept, learning, and teams that don't yet have a Terraform practice. The trade-off: eksctl owns its own CloudFormation stacks, so if the rest of your infrastructure is in Terraform, you end up with two state systems and drift between them. Use eksctl to learn and prototype; graduate to Terraform when EKS becomes load-bearing.
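To make the "checked-in `ClusterConfig`" idea concrete, here is a minimal sketch that builds such a config as a plain dictionary and dumps it. The cluster name, region, Kubernetes version, instance type, and node counts are placeholders chosen for illustration, not recommendations; the field names follow eksctl's `eksctl.io/v1alpha5` schema as commonly documented, so check them against the eksctl version you run.

```python
import json


def build_cluster_config(name, region, k8s_version, instance_type="m5.large"):
    """Return an eksctl ClusterConfig as a dict. All values are illustrative."""
    return {
        "apiVersion": "eksctl.io/v1alpha5",
        "kind": "ClusterConfig",
        # Pin the Kubernetes version explicitly (as a string) so rebuilds are repeatable.
        "metadata": {"name": name, "region": region, "version": k8s_version},
        "managedNodeGroups": [
            {
                "name": "starter",            # hypothetical node group name
                "instanceType": instance_type,
                "minSize": 2,
                "maxSize": 4,
                "desiredCapacity": 2,
                "privateNetworking": True,     # keep nodes in private subnets
            }
        ],
    }


# Placeholder inputs: substitute your own name, region, and a supported version.
config = build_cluster_config("demo", "us-east-1", "1.31")
print(json.dumps(config, indent=2))
```

JSON is a subset of YAML, so the dump should load as a config file for `eksctl create cluster -f <file>`, but treat that as an assumption and convert to YAML if your tooling prefers it. The point is that the definition lives in the repository and goes through review.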
For anything you intend to run in production, define the cluster as code with Terraform (HashiCorp, now BSL-licensed) or OpenTofu (the open-source fork — a drop-in for most modules and the safer license choice for many teams). The community `terraform-aws-modules/eks` module is the de-facto standard: it wires the control plane, node groups or Fargate profiles, the VPC CNI and other add-ons, access entries, and IRSA in one reviewable configuration that lives beside the rest of your AWS infrastructure in one state. AWS CDK, Pulumi, and raw CloudFormation are all viable alternatives if your team already standardizes on them. The point is one IaC tool, one state, one source of truth — so the cluster is reproducible and every change goes through review.
EKS Auto Mode (GA since late 2024) is the newest option and changes the calculus for smaller teams. With Auto Mode, AWS manages compute provisioning (Karpenter is built in and operated by AWS), core networking, load balancing, and add-on lifecycle for you — you declare workloads and AWS handles the nodes, scaling, patching, and much of the operational glue underneath. You still define the cluster in eksctl or Terraform, but you operate far less of it. For a team without a dedicated platform engineer, Auto Mode trades a modest cost premium for dramatically less to run, and is often the right call. Teams that need fine-grained control over node configuration, custom AMIs, or specialized scheduling may prefer to run the data plane themselves.
For most teams in 2026: prototype on eksctl, run production on Terraform/OpenTofu with the standard EKS module, and seriously consider EKS Auto Mode if you don't have a platform engineer to operate the data plane. Whatever you pick, pin the Kubernetes version explicitly and decide your upgrade cadence on day one — EKS supports each version for a defined window, and clusters that drift onto unsupported versions are the most painful to rescue later.
This is the decision that most shapes your cluster's cost, scaling behaviour, and operational burden. EKS gives you a managed control plane; how the worker capacity behaves underneath is your choice — and in 2026 the choice is essentially node groups, Fargate, Karpenter, or a deliberate mix.
These are not mutually exclusive — a real cluster often runs Karpenter for general workloads, a small managed node group for cluster-critical add-ons that need stable capacity, and Fargate for a handful of isolation-sensitive jobs. But you should choose a primary model on purpose rather than ending up with one by accident. The three options, and the full comparison table further down, lay out the trade-offs.
A managed node group is a set of EC2 instances of a chosen type that EKS provisions and lifecycle-manages (drains and replaces them on upgrades). They are predictable and simple to reason about, which is why they remain a sensible home for cluster-critical components that should always have capacity. The downside is that they are relatively static: you pick instance types and min/max sizes up front, capacity arrives slowly when you scale, and right-sizing across diverse workloads is a manual, ongoing chore. On their own, node groups are where clusters quietly waste money.
Fargate runs each pod on its own right-sized, AWS-managed micro-VM — there are no EC2 instances for you to patch, scale, or secure. You assign pods to Fargate via a Fargate profile (namespace + label selectors). It shines for spiky or unpredictable workloads, for strong workload isolation (each pod is its own VM), and for teams that want to eliminate node operations entirely. The trade-offs are real: a higher per-vCPU/GB price than EC2 at steady state, no DaemonSets, no GPU or privileged workloads, and slightly slower pod start. Use Fargate selectively — batch jobs, bursty services, isolation-sensitive workloads — rather than as the whole cluster.
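The Fargate profile rule (namespace plus label selectors) is easy to misread, so here is a small sketch of the matching logic as described above. It is an illustration, not the EKS scheduler's implementation: it assumes exact namespace matches and ignores any wildcard support, and the profile contents are hypothetical.

```python
def selector_matches(selector, namespace, labels):
    """A selector matches when the namespace is equal and every selector label is on the pod."""
    if selector["namespace"] != namespace:
        return False
    wanted = selector.get("labels", {})
    return all(labels.get(key) == value for key, value in wanted.items())


def runs_on_fargate(profile_selectors, namespace, labels):
    """A pod lands on Fargate if any selector in the profile matches it."""
    return any(selector_matches(s, namespace, labels) for s in profile_selectors)


# Hypothetical profile: everything in "batch", plus labelled pods in "web".
selectors = [
    {"namespace": "batch"},
    {"namespace": "web", "labels": {"compute": "fargate"}},
]

print(runs_on_fargate(selectors, "batch", {"app": "report"}))
print(runs_on_fargate(selectors, "web", {"app": "api"}))
```

Under these assumptions the first pod is selected and the second is not, because the `web` selector requires the `compute: fargate` label. Keeping selectors narrow like this is what makes selective use of Fargate practical.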
Karpenter is an open-source node autoscaler, originally built by AWS, and is now generally recommended over the older Cluster Autoscaler for most clusters. Instead of scaling fixed node groups, Karpenter watches for unschedulable pods and launches an EC2 instance type and size that fits them, reacting within seconds rather than waiting minutes on a node-group scale-up, then consolidates underused nodes by repacking pods onto fewer instances. You configure NodePools and EC2NodeClasses to express what capacity is allowed (instance families, Spot vs On-Demand, AZs), and Karpenter does the bin-packing. This is where most of the cost savings from a good build come from: with consolidation, a sensible Spot/On-Demand split for fault-tolerant workloads, and interruption handling in place, compute spend often drops by roughly 30–50% versus static node groups, depending on the workload mix.
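As a rough illustration of what "express what capacity is allowed" looks like, the sketch below builds a NodePool manifest as a dictionary. It assumes the `karpenter.sh/v1` API and an existing EC2NodeClass named `default`; the pool name, instance categories, CPU limit, and consolidation delay are hypothetical values, so verify the schema against the Karpenter version you install.

```python
import json


def build_nodepool(name, capacity_types, instance_categories, cpu_limit,
                   node_class="default"):
    """Return a Karpenter NodePool manifest as a dict. Values are illustrative."""
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    # Assumes an EC2NodeClass with this name already exists.
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": node_class,
                    },
                    "requirements": [
                        {"key": "karpenter.sh/capacity-type",
                         "operator": "In", "values": capacity_types},
                        {"key": "karpenter.k8s.aws/instance-category",
                         "operator": "In", "values": instance_categories},
                    ],
                }
            },
            # Consolidation repacks pods onto fewer nodes when capacity is underused.
            "disruption": {
                "consolidationPolicy": "WhenEmptyOrUnderutilized",
                "consolidateAfter": "1m",
            },
            # A hard ceiling so a runaway deployment cannot scale spend without bound.
            "limits": {"cpu": str(cpu_limit)},
        },
    }


pool = build_nodepool("general", ["spot", "on-demand"], ["c", "m", "r"], 200)
print(json.dumps(pool, indent=2))
```

Allowing both `spot` and `on-demand` in one pool is one possible way to express a Spot split; workloads that cannot tolerate interruption would typically pin themselves to On-Demand with a node selector on `karpenter.sh/capacity-type`.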
EKS networking is where the most common production failure hides. The default Amazon VPC CNI gives every pod a real VPC IP address — powerful, because pods are first-class on your network and reach RDS, ElastiCache, and other resources natively — but it makes IP exhaustion a genuine outage mode when subnets are sized for a handful of nodes and then the cluster grows.
Plan the network before you create the cluster, not after the first "insufficient IP addresses" page. The core decisions: a VPC with both private and public subnets spread across at least two (ideally three) Availability Zones; worker nodes and pods in the private subnets; load balancers in the public subnets; and CIDR ranges sized generously for the pod density you expect — because with the VPC CNI, every pod consumes a VPC IP.
The two techniques that guard against IP exhaustion are prefix delegation (the VPC CNI assigns /28 prefixes, 16 addresses each, to a node's network interfaces instead of individual IPs, so each node can host far more pods) and allocating a large enough secondary CIDR for pod subnets. Prefix delegation raises the per-node pod ceiling; the secondary CIDR supplies the address space behind it. For clusters that will grow, both are effectively mandatory. When you need Kubernetes NetworkPolicy for east-west segmentation, you either enable the VPC CNI's built-in network-policy support or swap in a CNI like Cilium; security-group-per-pod is available when specific workloads need their own security groups. Getting subnet sizing and prefix delegation right on day one is the difference between a cluster that scales quietly and one that falls over under load.
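The arithmetic behind subnet sizing is worth running before you pick CIDRs. The sketch below uses the commonly cited VPC CNI formula for pods per node and the fact that AWS reserves five addresses in every subnet. The ENI and per-ENI IP counts for `m5.large` (3 and 10) and the cap of 110 pods are assumptions to check against current AWS documentation for your instance types.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves the first four and the last address


def usable_ips(cidr):
    """Addresses in a subnet that pods and nodes can actually use."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def max_pods(enis, ips_per_eni, prefix_delegation=False, cap=110):
    """Pods per node under the VPC CNI.

    Each ENI keeps one address for itself; the +2 covers host-network pods.
    With prefix delegation every remaining slot holds a /28 (16 addresses),
    and the result is capped at a recommended per-node maximum (assumed 110).
    """
    slots = enis * (ips_per_eni - 1)
    if prefix_delegation:
        return min(slots * 16 + 2, cap)
    return slots + 2


print(usable_ips("10.0.0.0/24"))                   # 251
print(max_pods(3, 10))                             # 29
print(max_pods(3, 10, prefix_delegation=True))     # 110
```

With these inputs, a /24 private subnet holds 251 usable addresses, which is fewer than nine fully packed nodes at 29 pods each. That is the failure mode in miniature: the subnet, not the instance count, becomes the ceiling.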
The most common EKS rescue call is "pods stopped scheduling with insufficient IP addresses." It is almost always undersized subnets plus the VPC CNI handing every pod a real IP. The fix — CIDR planning and prefix delegation — is cheap and fast before it pages you, and disruptive after. Size the network for where the cluster is going, not where it starts.
Kubernetes runs its own permission system (RBAC) on top of AWS IAM, and the two have to be reconciled. The single most important security decision at setup is how pods obtain AWS permissions — and the answer in 2026 is never "the node's IAM role" and never "static keys in an env var."
There are two correct mechanisms, and you should use one of them from day one. IAM Roles for Service Accounts (IRSA) maps a Kubernetes service account to an IAM role via an OIDC provider, so each workload assumes exactly the AWS permissions it needs — no long-lived credentials, scoped per service account. EKS Pod Identity is the newer, simpler alternative: instead of per-cluster OIDC setup, you install the Pod Identity Agent add-on once and create associations between service accounts and IAM roles through the EKS API, which is easier to manage across many clusters. Both deliver the same outcome — least-privilege, keyless AWS access per pod.
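The practical difference between the two mechanisms shows up in the IAM role's trust policy. The sketch below generates both shapes. The account ID, OIDC issuer path, namespace, and service account name are placeholders, and the issuer is passed without its `https://` prefix; treat the output as a starting point to compare against AWS documentation rather than a finished policy.

```python
import json


def irsa_trust_policy(account_id, oidc_issuer, namespace, service_account):
    """Trust policy scoping an IAM role to one Kubernetes service account (IRSA)."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_issuer}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                # Only this namespace/service account may assume the role.
                f"{oidc_issuer}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                f"{oidc_issuer}:aud": "sts.amazonaws.com",
            }},
        }],
    }


def pod_identity_trust_policy():
    """Trust policy for EKS Pod Identity: no per-cluster OIDC details needed."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "pods.eks.amazonaws.com"},
            "Action": ["sts:AssumeRole", "sts:TagSession"],
        }],
    }


# Placeholder identifiers for illustration only.
irsa = irsa_trust_policy("111122223333", "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE",
                         "payments", "api")
print(json.dumps(irsa, indent=2))
print(json.dumps(pod_identity_trust_policy(), indent=2))
```

The IRSA policy embeds a cluster-specific issuer, which is why it has to be repeated per cluster. The Pod Identity policy is identical everywhere, with the service-account binding held in the EKS association instead.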
The anti-patterns this replaces are exactly the ones that fail audits: pods inheriting the node instance role (so every pod gets whatever the node can do — wildly over-privileged), or AWS access keys baked into a Secret or environment variable (leakable, long-lived, and a finding waiting to happen). For SOC 2, ISO 27001, HIPAA, or PCI, IRSA or Pod Identity plus namespace-scoped RBAC is table stakes. Pair it with secrets pulled at runtime from AWS Secrets Manager via the External Secrets Operator rather than committed into the cluster.
For a new cluster, EKS Pod Identity is usually the simpler choice — less per-cluster setup and easier to operate at scale, especially across multiple clusters. IRSA remains the right call when you need its broader compatibility or already have it wired across your estate. Either is correct; what matters is that you use one of them and never the node role or static keys.
Getting traffic into the cluster, and choosing the right set of cluster add-ons, is the step where clusters either stay lean or accrete cost and complexity. The AWS-native ingress pattern is well-trodden; the add-on list is where discipline matters.
For ingress, install the AWS Load Balancer Controller. It provisions an Application Load Balancer (ALB) for HTTP/HTTPS Ingress resources or a Network Load Balancer (NLB) for raw TCP/UDP Services. This is where TLS termination with ACM certificates, AWS WAF, and host/path-based routing get configured. The expensive mistake — and a frequent line item on inflated bills — is letting every service create its own load balancer; instead, consolidate many services behind a shared ALB using Ingress grouping. Done right, ingress is one or a few load balancers, not dozens.
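Ingress grouping comes down to giving several Ingress resources the same group annotation. The sketch below builds two such manifests and counts how many load balancers they would imply. The hosts, service names, ports, and group name are hypothetical, and the annotation keys are those documented for the AWS Load Balancer Controller, so confirm them for your controller version.

```python
def grouped_ingress(name, namespace, host, service, port,
                    group="shared-public", order=10):
    """Return an Ingress manifest that joins a shared ALB via its group name."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {
                # Ingresses with the same group.name share one ALB.
                "alb.ingress.kubernetes.io/group.name": group,
                "alb.ingress.kubernetes.io/group.order": str(order),
                "alb.ingress.kubernetes.io/scheme": "internet-facing",
                "alb.ingress.kubernetes.io/target-type": "ip",
            },
        },
        "spec": {
            "ingressClassName": "alb",
            "rules": [{
                "host": host,
                "http": {"paths": [{
                    "path": "/",
                    "pathType": "Prefix",
                    "backend": {"service": {"name": service,
                                            "port": {"number": port}}},
                }]},
            }],
        },
    }


def implied_load_balancers(ingresses):
    """Count distinct groups, assuming one ALB per group name."""
    key = "alb.ingress.kubernetes.io/group.name"
    return len({i["metadata"]["annotations"][key] for i in ingresses})


ingresses = [
    grouped_ingress("api", "payments", "api.example.com", "api", 8080, order=10),
    grouped_ingress("web", "storefront", "www.example.com", "web", 80, order=20),
]
print(implied_load_balancers(ingresses))  # 1
```

Two services in two namespaces resolve to a single ALB here. Without the shared group name, each Ingress would get its own.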
On add-ons, EKS manages the core ones (the VPC CNI, CoreDNS, kube-proxy, and the EBS/EFS CSI drivers) as managed add-ons you can version and upgrade cleanly — prefer those over hand-installed equivalents that drift. Beyond the core, the typically-needed set is: the AWS Load Balancer Controller (ingress), Karpenter (autoscaling, unless on Auto Mode), the External Secrets Operator (secrets from Secrets Manager), cert-manager or ACM for certificates, a metrics pipeline, and a GitOps controller. Resist installing more than you will operate — every add-on is something to keep current and secure. EKS Auto Mode folds much of this list into the managed plane, which is precisely its appeal.
A cluster you cannot see and cannot defend is not production-ready, however well it scales. These two workstreams turn a running cluster into one you can operate on-call and put in front of an auditor.
Observability means metrics, logs, and traces wired before you take traffic — not bolted on after the first incident. The common stacks: Amazon Managed Service for Prometheus with Amazon Managed Grafana, CloudWatch Container Insights, or Datadog; a node-level log pipeline (Fluent Bit) shipping to CloudWatch Logs or your log store; and OpenTelemetry for distributed traces. The deliverable that matters is not dashboards for their own sake but alerts that fire on symptoms users actually feel — latency, error rate, saturation tied to your SLOs — rather than on every CPU spike, so the on-call rotation isn't buried in noise while real incidents slip through.
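One common way to turn "alert on symptoms tied to SLOs" into a rule is error-budget burn rate. The sketch below is an assumption about how you might implement it, not something this guide prescribes. The 14.4 threshold is the fast-burn value popularised by Google's SRE workbook for a 30-day window, and requiring both a long and a short window to agree is a noise-reduction convention.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are being spent."""
    budget = 1.0 - slo_target          # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget


def should_page(long_window_ratio, short_window_ratio, slo_target, threshold=14.4):
    """Page only when both windows burn fast: sustained and still happening."""
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


# Hypothetical readings against a 99.9% availability SLO.
print(should_page(0.02, 0.02, 0.999))    # sustained 2% errors
print(should_page(0.02, 0.001, 0.999))   # spike that has already recovered
```

In this sketch the first reading pages and the second does not, because the short window shows the problem has stopped. A raw CPU spike never enters the calculation, which is the point.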
The baseline hardening every production EKS cluster should have: pod permissions via IRSA or Pod Identity (covered above), namespace-scoped RBAC with no blanket cluster-admin, a private API server endpoint (or tightly restricted public access), encrypted secrets and EBS volumes with KMS, control-plane audit logging enabled and shipped to CloudWatch, Pod Security Standards (restricted profile) enforced, NetworkPolicies for east-west segmentation, image scanning in the pipeline with provenance you trust, and a managed upgrade cadence so the cluster never runs an unsupported Kubernetes version. None of these are exotic; together they are the gap between a demo cluster and one that passes SOC 2 or HIPAA.
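Two of these baseline items are small enough to show directly: enforcing the restricted Pod Security Standard on a namespace, and a default-deny ingress NetworkPolicy as a starting point for east-west segmentation. The sketch builds both manifests as dictionaries. The namespace name is a placeholder, and the default-deny policy is one possible baseline rather than a complete segmentation design.

```python
def restricted_namespace(name, version="latest"):
    """Namespace labelled so Pod Security Admission enforces the restricted profile."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {
                "pod-security.kubernetes.io/enforce": "restricted",
                "pod-security.kubernetes.io/enforce-version": version,
                # Audit and warn surface violations without changing enforcement.
                "pod-security.kubernetes.io/audit": "restricted",
                "pod-security.kubernetes.io/warn": "restricted",
            },
        },
    }


def default_deny_ingress(namespace):
    """NetworkPolicy that blocks all inbound pod traffic until other policies allow it."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {
            "podSelector": {},           # empty selector means every pod in the namespace
            "policyTypes": ["Ingress"],  # no ingress rules listed, so nothing is allowed
        },
    }


ns = restricted_namespace("payments")
policy = default_deny_ingress("payments")
print(ns["metadata"]["labels"])
print(policy["spec"])
```

The NetworkPolicy only takes effect if the cluster's CNI enforces policies, which on EKS means enabling the VPC CNI's network-policy support or running a CNI such as Cilium.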
Resilience comes from small, unglamorous additions: liveness and readiness probes on every workload, PodDisruptionBudgets so a routine node recycle or upgrade doesn't take a service down, topology spread across Availability Zones, sensible resource requests and limits so the scheduler can bin-pack without OOM kills, and Multi-AZ everything. With these in place, a node failure is a non-event; without them, it's an incident.
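A PodDisruptionBudget and a zone spread constraint are each only a few lines of manifest. The sketch below builds both and adds a small helper showing how a `maxUnavailable` budget decides whether a node drain may evict a pod. The `app` label convention and the replica counts are assumptions for illustration, and the helper is a simplification of what the Kubernetes disruption controller computes.

```python
def pod_disruption_budget(name, namespace, app, max_unavailable=1):
    """PDB limiting voluntary evictions (drains, upgrades) for one app."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "maxUnavailable": max_unavailable,
            "selector": {"matchLabels": {"app": app}},
        },
    }


def zone_spread(app, max_skew=1):
    """Entry for a pod spec's topologySpreadConstraints: spread replicas across AZs."""
    return {
        "maxSkew": max_skew,
        "topologyKey": "topology.kubernetes.io/zone",
        "whenUnsatisfiable": "DoNotSchedule",
        "labelSelector": {"matchLabels": {"app": app}},
    }


def evictions_allowed(desired, healthy, max_unavailable):
    """Voluntary evictions permitted right now under a maxUnavailable budget."""
    return max(0, healthy - (desired - max_unavailable))


pdb = pod_disruption_budget("api-pdb", "payments", "api")
print(evictions_allowed(desired=3, healthy=3, max_unavailable=1))  # 1
print(evictions_allowed(desired=3, healthy=2, max_unavailable=1))  # 0
```

With three healthy replicas a drain may evict one pod; once one is already down, further evictions wait. That waiting is what turns a node recycle into a non-event.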
The fastest way to make an on-call rotation quit is to page them on raw CPU and memory. Alert on what users feel — latency, error rate, saturation — tied to SLOs, and let dashboards carry the rest. A cluster with clean symptom-based alerting and disruption budgets survives node failures without anyone waking up; that is the bar for "production-ready," not "it's running."
EKS billing is the control-plane fee plus everything the data plane and surrounding services consume, and an untuned cluster wastes a large share of the latter. Knowing where the leaks are is half of controlling them.
The fixed part is small and predictable: AWS charges a per-hour fee for each EKS cluster's control plane (about $0.10/hour, or roughly $73/month per cluster at 730 hours), plus a higher hourly fee for clusters running a Kubernetes version that has moved into extended support. Everything else is variable: the EC2 or Fargate compute your workloads run on, the load balancers, NAT gateway data processing, EBS volumes, and observability. The variable side is where bills balloon, and almost always for the same reasons.
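The fixed-cost arithmetic is simple enough to check by hand, and scripting it makes the per-cluster multiplier visible when teams start adding clusters per environment. The sketch assumes the $0.10/hour figure quoted above and a 730-hour month; the cluster counts are illustrative, and current pricing should be confirmed on the AWS pricing page.

```python
HOURS_PER_MONTH = 730  # common billing approximation: 8,760 hours / 12


def control_plane_monthly(clusters, hourly_fee=0.10):
    """Fixed monthly control-plane cost in USD for a number of EKS clusters."""
    return round(clusters * hourly_fee * HOURS_PER_MONTH, 2)


print(control_plane_monthly(1))   # 73.0
print(control_plane_monthly(4))   # 292.0
```

Four clusters (for example dev, staging, production, and a sandbox) cost about $292/month before a single workload runs. That is still small next to the variable side, which is why tuning effort belongs there.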
The most useful thing to settle before standing up EKS is whether you need Kubernetes at all. A large share of teams reaching for EKS would ship faster and operate more cheaply on Amazon ECS with Fargate or on App Runner — and the right time to learn that is before the cluster exists, not after.
EKS earns its complexity when you have many services across multiple teams and want a shared internal platform, when you need portability or a real multi-cloud story, when you run workloads Kubernetes handles distinctly better (GPU scheduling, complex stateful systems, service mesh, rich operators), or when you already have Kubernetes expertise on the team. If none of those are true — a handful of containerized services, a small team, AWS-only is fine, and you'd rather spend engineering hours on product than on operating a platform — ECS with Fargate gives you serverless containers with a fraction of the operational surface, and App Runner is simpler still for straightforward web services.
This isn't an argument against EKS; it's an argument for choosing it on purpose. The images and CI you build on ECS transfer to EKS later, so starting simple is rarely a dead end. If your honest answer to "why Kubernetes?" is "it's the standard" or "we might need it later," that's a signal to start on ECS + Fargate and graduate when a concrete need appears. If your answer is "we have N teams and M services and need a platform with these specific capabilities," then read on and build EKS well. The companion ECS-vs-EKS and Amazon ECS setup guides go deeper on the decision and the simpler path.
A cluster is "production-ready" when it can take real traffic, survive a node failure without paging anyone, deploy reversibly, and pass an audit. Run this checklist before you cut over — if you can't tick an item, that's the remaining work, not a nice-to-have.
Every item above is achievable in-house with time and a platform engineer who has run EKS before. If you don't have one — or want it done right the first time without a hiring search — CloudRoute matches you to a vetted AWS partner who delivers exactly this list as infrastructure-as-code you own, and for credit-eligible companies the build is often AWS-funded so you pay $0. See the Kubernetes consulting path, the $100K AWS credits route, and the startup engagement detail.
The compute model is the choice that most shapes cost, scaling, and operational burden. These aren't mutually exclusive — many clusters run Karpenter for general workloads, a small node group for critical add-ons, and Fargate selectively — but pick a primary on purpose.
| Variable | Managed node groups | AWS Fargate | Karpenter |
|---|---|---|---|
| What it is | EKS-managed sets of EC2 instances of chosen types | Serverless pods — each on its own AWS-managed micro-VM, no nodes | Autoscaler that launches right-sized EC2 per pending pod, then consolidates |
| Best for | Stable, cluster-critical capacity (core add-ons); predictable baseload | Spiky/unpredictable load, strong per-pod isolation, eliminating node ops | General production workloads wanting cost-efficient, fast, dynamic capacity |
| Scaling | Relatively static: fixed instance types and min/max sizes, slow to add capacity | Per-pod, handled entirely by AWS; pod start is somewhat slower than on a warm node | Seconds to provision; bin-packs and consolidates continuously |
| Ops burden | Medium — you pick types, patch via managed updates, right-size manually | Lowest — no nodes to patch, scale, or secure | Low–medium — configure NodePools; Karpenter does the rest |
| Cost shape | Pay for running instances; wastes money half-empty if untuned | Premium per-vCPU/GB at steady state; great for bursty/idle-heavy | Lowest for variable load — Spot + consolidation often cut compute 30–50% |
| Watch out for | Quiet waste; slow capacity; manual right-sizing across workloads | No DaemonSets, no GPU/privileged, slower start, steady-state premium | Needs interruption handling for Spot; NodePool/limits must be set sensibly |
Situation: The product had outgrown a hand-managed EC2 host running containers via docker-compose — no autoscaling, no clean deploys, a looming SOC 2 audit, and a growing services count (12 and rising) across two product teams. They'd decided they genuinely needed Kubernetes for the multi-team platform story, but had no one in-house who had stood up production EKS, and a contractor's earlier proof-of-concept cluster had IP-exhaustion issues and pods running on the node IAM role. They wanted a cluster they could own, not a black box.
What CloudRoute did: Routed within 16 hours to a US-East partner with EKS production references and a containers specialization. Discovery confirmed EKS was the right call for their service count and team structure. Over ~5 weeks the partner built it in Terraform/OpenTofu: a multi-AZ VPC with prefix delegation sized against IP exhaustion, Karpenter for general workloads with a Spot split plus a small managed node group for core add-ons, EKS Pod Identity with namespace-scoped RBAC (no static keys), the AWS Load Balancer Controller with consolidated ALBs and ACM TLS, Managed Prometheus + Grafana with symptom-based alerts, Argo CD GitOps with automated rollback, Pod Security Standards and control-plane audit logging for the audit, and Kubecost for per-team visibility — all handed over as IaC with runbooks.
Outcome: Production cutover on schedule against the full readiness checklist. Deploys went from manual SSH to reviewable GitOps with one-click rollback; a node failure during week 4 was a non-event thanks to PDBs and topology spread. The SOC 2 IAM and logging gaps were closed by design. Because the company was credit-eligible, the engagement was AWS-funded and the AWS usage during the build was credit-covered — the customer paid $0 to the partner, and CloudRoute's commission came from the partner's AWS engagement funding.
build window: ~5 weeks · founder/eng time: ~14 hours · deploys: manual → GitOps · audit gaps: closed · cost to customer: $0
CloudRoute matches you to a vetted AWS partner who stands up EKS to the full readiness checklist — networking, Karpenter, IRSA/Pod Identity, ingress, observability, security, and cost — handed over as infrastructure-as-code. Credit-eligible companies often pay $0. No hiring search, no black box.