A production Amazon ECS service is a task definition behind an Application Load Balancer, running on Fargate (or EC2), in private subnets with awsvpc networking, autoscaling on real signals, secrets pulled from Secrets Manager, logs in CloudWatch, and a deploy strategy — rolling or blue-green via CodeDeploy — that can roll back. This page walks the real decisions (Fargate vs EC2, task sizing, ALB, autoscaling, networking, secrets, deploys, observability, cost), gives you a production checklist, makes the honest ECS-vs-EKS call, and shows how a vetted AWS partner builds it for you — often AWS-funded if you qualify for credits.
ECS (Elastic Container Service) is AWS's own container orchestrator. It is simpler than Kubernetes by design: there is no control plane for you to run, no cluster to upgrade, and a small set of objects to understand. Before you set anything up, it pays to know the handful of nouns that everything else hangs off.
A container image is your app packaged once (a Dockerfile build) and stored in a registry — almost always Amazon ECR. A task definition is the blueprint for running that image: which image and tag, how much CPU and memory, environment variables and secret references, the port it listens on, the log configuration, the IAM roles it assumes, and (on Fargate) the platform settings. A task is one running instance of that blueprint — one or more containers scheduled together. A service keeps a desired number of tasks running, replaces them when they die, registers them with a load balancer, and orchestrates deployments when you ship a new version. A cluster is the logical boundary the services run in.
The launch type is the one decision that colours everything else. With Fargate, AWS runs the compute for you — you specify CPU and memory per task and AWS schedules it on capacity you never see, patch, or scale. With the EC2 launch type, you run a fleet of EC2 instances (an Auto Scaling group, registered to the cluster via a capacity provider) and ECS bin-packs tasks onto them; you own the host OS, the AMI, and the instance scaling. Most teams start on Fargate because it removes an entire category of operational work; some move specific workloads to EC2 for cost or hardware reasons. Section IV is the full comparison.
The honest framing for this page: ECS itself is genuinely easy to get running — a single task behind a load balancer is an afternoon. What is not trivial is the production envelope around it: private networking, least-privilege IAM, secrets that are never baked into images, autoscaling that reacts to the right signal, a deploy strategy that can roll back, health checks that actually catch a bad release, observability you can debug an incident with, and all of it defined as code rather than clicked together once. That envelope is the work — and it is exactly what a good AWS partner hands you in a week or two.
The task definition and the service are where most of the real configuration lives. Get the sizing, the two IAM roles, and the health checks right here and the rest of the setup is comparatively mechanical.
A task definition is versioned — every change creates a new revision, and the service points at a revision. This is a feature: a deploy is really "move the service from task-def revision N to N+1," and a rollback is "point it back at N." Treat task definitions as immutable artifacts the same way you treat the image: define them in code (Terraform/OpenTofu/CDK/CloudFormation), not by hand-editing in the console.
On Fargate you pick a CPU/memory combination from a fixed matrix (for example 0.25 vCPU / 0.5 GB at the small end, up to 16 vCPU / 120 GB, with newer larger configurations available) — task CPU and memory must be a valid pair. On EC2 you set CPU and memory reservations per container and ECS bin-packs them onto instances, so you can oversubscribe more flexibly. Start conservative, then size from real data: ECS publishes per-task CPU and memory utilization to CloudWatch, and Container Insights gives you the percentiles. Most teams over-provision early; right-sizing after a week of real traffic is one of the cheapest wins available.
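The "valid pair" rule is easy to check mechanically. The sketch below encodes the Fargate CPU/memory matrix as it is commonly documented (verify it against the current AWS documentation, since AWS extends it over time) and picks the smallest size that covers an observed peak. The function names are hypothetical helpers, not part of any AWS SDK.

```python
# Fargate task sizes: CPU units -> allowed memory values in MiB.
# 1024 CPU units = 1 vCPU. Assumption: this mirrors the published matrix;
# check the AWS docs before relying on it.
GIB = 1024

FARGATE_SIZES = {
    256: [512, 1 * GIB, 2 * GIB],
    512: list(range(1 * GIB, 4 * GIB + 1, GIB)),
    1024: list(range(2 * GIB, 8 * GIB + 1, GIB)),
    2048: list(range(4 * GIB, 16 * GIB + 1, GIB)),
    4096: list(range(8 * GIB, 30 * GIB + 1, GIB)),
    8192: list(range(16 * GIB, 60 * GIB + 1, 4 * GIB)),
    16384: list(range(32 * GIB, 120 * GIB + 1, 8 * GIB)),
}


def is_valid_fargate_size(cpu_units: int, memory_mib: int) -> bool:
    """True if (cpu, memory) is a pair Fargate accepts at the task level."""
    return memory_mib in FARGATE_SIZES.get(cpu_units, [])


def smallest_fitting_size(cpu_needed: int, memory_needed_mib: int):
    """Return the smallest (cpu, memory) pair covering the observed peak, or None."""
    for cpu in sorted(FARGATE_SIZES):
        if cpu < cpu_needed:
            continue
        for mem in FARGATE_SIZES[cpu]:
            if mem >= memory_needed_mib:
                return cpu, mem
    return None
```

Feed it the peak CPU and memory you see in CloudWatch after a week of real traffic, plus whatever headroom you are comfortable with, and compare the result with what the task definition currently requests.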
Every ECS task has up to two distinct roles, and conflating them is the single most common setup mistake. The task execution role is used by the ECS agent / Fargate to start the task — pulling the image from ECR, fetching secret values from Secrets Manager/SSM at launch, and writing logs to CloudWatch. The task role is the identity your application code assumes at runtime to call other AWS services (S3, DynamoDB, SQS, etc.). Scope both to least privilege: the execution role needs only ECR pull, the specific secret ARNs, and the log group; the task role needs only the application's actual AWS calls. Never give either an admin policy.
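To make the split concrete, here is a minimal sketch of the two policy documents as plain Python dicts. Every ARN, the bucket, and the function names are hypothetical; the IAM action names are real, but treat the exact set as a starting point to adapt, not a complete policy for your workload.

```python
# Hypothetical ARNs, for illustration only.
ECR_REPO_ARN = "arn:aws:ecr:us-east-1:111122223333:repository/api"
SECRET_ARN = "arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/api/db-AbCdEf"
LOG_GROUP_ARN = "arn:aws:logs:us-east-1:111122223333:log-group:/ecs/api"
BUCKET_ARN = "arn:aws:s3:::example-uploads"

# Both roles trust the ECS tasks service principal.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ecs-tasks.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}


def execution_role_policy(ecr_repo_arn, secret_arns, log_group_arn):
    """What ECS needs to START the task: pull image, read secrets, write logs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            # GetAuthorizationToken does not support resource-level scoping.
            {"Effect": "Allow", "Action": "ecr:GetAuthorizationToken", "Resource": "*"},
            {"Effect": "Allow",
             "Action": ["ecr:BatchCheckLayerAvailability",
                        "ecr:GetDownloadUrlForLayer",
                        "ecr:BatchGetImage"],
             "Resource": ecr_repo_arn},
            {"Effect": "Allow", "Action": "secretsmanager:GetSecretValue",
             "Resource": list(secret_arns)},
            {"Effect": "Allow",
             "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
             "Resource": f"{log_group_arn}:*"},
        ],
    }


def task_role_policy(bucket_arn):
    """What the APPLICATION may call at runtime (example: one S3 bucket)."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": f"{bucket_arn}/*",
        }],
    }


def wildcard_actions(policy):
    """List any actions containing '*', a cheap guard against admin-style grants."""
    found = []
    for statement in policy["Statement"]:
        actions = statement["Action"]
        if isinstance(actions, str):
            actions = [actions]
        found.extend(a for a in actions if "*" in a)
    return found
```

A check like `wildcard_actions` is a cheap thing to run in CI against every policy document you generate; it will not prove least privilege, but it catches the admin-policy shortcut.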
The service is what makes ECS production-grade: it maintains the desired task count, replaces unhealthy tasks, and during a deploy spins up new-revision tasks before draining old ones. Define a container health check (or rely on the ALB target-group health check) so ECS and the load balancer agree on what "healthy" means, and set a sensible healthCheckGracePeriodSeconds so a slow-starting app is not killed before it is ready. Set minimumHealthyPercent and maximumPercent deliberately — they govern how many tasks must stay up and how many extra can spin up during a rolling deploy. Getting these wrong is how a deploy briefly takes the service below capacity.
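The effect of the two percentages is easier to see with numbers. The sketch below assumes the rounding described in the ECS documentation (the minimum healthy count rounds up, the maximum count rounds down); the function names are hypothetical.

```python
def rolling_deploy_bounds(desired_count: int,
                          minimum_healthy_percent: int,
                          maximum_percent: int):
    """Return (min tasks that must stay running, max tasks allowed) during a roll."""
    # Integer arithmetic avoids float rounding surprises.
    min_running = -(-desired_count * minimum_healthy_percent // 100)  # ceiling
    max_running = desired_count * maximum_percent // 100              # floor
    return min_running, max_running


def can_make_progress(desired_count: int,
                      minimum_healthy_percent: int,
                      maximum_percent: int) -> bool:
    """A roll needs room to start a new task or permission to stop an old one."""
    min_running, max_running = rolling_deploy_bounds(
        desired_count, minimum_healthy_percent, maximum_percent)
    return max_running > desired_count or min_running < desired_count
```

With 4 desired tasks, 100/200 keeps all 4 serving and lets up to 8 run briefly, so capacity never dips. 50/100 allows the service to drop to 2 tasks mid-roll, which is how a deploy takes you below capacity. 100/100 leaves no room to move at all.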
How tasks get their network identity and how traffic reaches them is the part that, done wrong, either exposes you or makes everything mysteriously unreachable. On modern ECS there is really one networking mode that matters, and one front door that matters.
In awsvpc mode, which is the only option on Fargate and the recommended mode on EC2, each task gets its own elastic network interface and its own private IP inside your VPC, with its own security group. This is the model you want: a task is a first-class citizen on the network, you control its inbound and outbound rules at the task level, and there is none of the port-juggling that older bridge/host modes required. Run tasks in private subnets with no public IP; give them outbound internet (for pulling images, calling external APIs) via a NAT gateway, or, better for cost and security, via VPC endpoints (PrivateLink) to ECR, S3, Secrets Manager, and CloudWatch Logs so image pulls and secret fetches never leave the AWS network.
The front door for an HTTP/HTTPS service is an Application Load Balancer. The ALB lives in public subnets; your tasks live in private subnets; the ALB's security group is the only thing allowed to reach the task security group on the app port. ECS integrates natively: the service registers and deregisters task IPs with an ALB target group automatically as tasks come and go, and the target group's health check is what gates whether a task receives traffic. Terminate TLS at the ALB with an ACM certificate, route by host or path with listener rules, and put AWS WAF in front if you need request filtering. For non-HTTP workloads over TCP or UDP (game servers, gRPC at L4, databases), use a Network Load Balancer instead; for most web and API services the ALB is the right choice.
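The "only the ALB may reach the tasks" rule comes down to one ingress rule that references a security group instead of a CIDR range. A minimal sketch follows, using the `IpPermissions` shape that the EC2 `AuthorizeSecurityGroupIngress` API accepts; the group ID, port, and helper names are hypothetical.

```python
def task_ingress_from_alb(alb_security_group_id: str, app_port: int) -> list:
    """Ingress for the TASK security group: app port, from the ALB's group only."""
    return [{
        "IpProtocol": "tcp",
        "FromPort": app_port,
        "ToPort": app_port,
        # Referencing a group (not a CIDR) means only ENIs in the ALB's
        # security group can connect.
        "UserIdGroupPairs": [{
            "GroupId": alb_security_group_id,
            "Description": "App port from the ALB only",
        }],
    }]


def open_to_cidr(ip_permissions: list) -> bool:
    """True if any rule admits traffic by IP range rather than by security group."""
    return any(rule.get("IpRanges") or rule.get("Ipv6Ranges")
               for rule in ip_permissions)


# Hypothetical values for illustration.
rules = task_ingress_from_alb("sg-0123456789abcdef0", 8080)
```

If `open_to_cidr` ever returns true for a task security group, something other than the load balancer can reach the app port, which is usually the thing you set out to prevent.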
A few specifics that save incidents: enable connection draining (deregistration delay) on the target group so in-flight requests finish before a task is killed during a deploy or scale-in; right-size the health-check thresholds so a brief blip does not flap tasks out of service; and if you run many services, consider ECS Service Connect (or AWS Cloud Map service discovery) for clean service-to-service communication inside the cluster instead of routing everything through the ALB.
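As a rough illustration of how those knobs interact, the sketch below lays out target-group settings in the shape the Elastic Load Balancing v2 API uses and estimates how long a failing task keeps receiving traffic. The path and the numbers are assumptions to tune, not recommendations.

```python
# Hypothetical settings; parameter names follow the ELBv2 target group API.
target_group_settings = {
    "HealthCheckPath": "/healthz",        # assumed endpoint in your app
    "HealthCheckIntervalSeconds": 15,
    "HealthCheckTimeoutSeconds": 5,
    "HealthyThresholdCount": 2,
    "UnhealthyThresholdCount": 3,
}

# Connection draining: how long in-flight requests get to finish.
target_group_attributes = [
    {"Key": "deregistration_delay.timeout_seconds", "Value": "30"},
]


def seconds_until_unhealthy(settings: dict) -> int:
    """Approximate time before consecutive failures take a target out of rotation."""
    return (settings["HealthCheckIntervalSeconds"]
            * settings["UnhealthyThresholdCount"])


def seconds_until_healthy(settings: dict) -> int:
    """Approximate time before a new task starts receiving traffic."""
    return (settings["HealthCheckIntervalSeconds"]
            * settings["HealthyThresholdCount"])
```

With these values a bad task is pulled after roughly 45 seconds and a new one joins after roughly 30. Tighter thresholds catch failures sooner but flap more on brief blips, which is the trade-off to tune.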
This is the decision people agonize over and usually overthink. The short version: default to Fargate, and move specific workloads to EC2 only when a concrete reason appears. Here is the reasoning, and the full comparison is in the table below.
Choose Fargate when you want to stop thinking about servers. There are no instances to patch, no AMIs to maintain, no node Auto Scaling group to tune, and no bin-packing to reason about — you pay per task for the vCPU and memory you request, billed per second. For the overwhelming majority of web services, APIs, workers, and scheduled jobs at startup and scaleup scale, this is the right answer; the slightly higher per-unit compute price buys back a large amount of engineering time and removes a whole class of operational risk. Fargate also supports Spot for fault-tolerant workloads, which claws back much of the cost gap.
Choose the EC2 launch type when you have a concrete reason: you need GPUs or specialized hardware (ML inference/training) that Fargate does not offer; you need very large instance types or particular CPU architectures and want full control of the host; you run at high, steady utilization where reserved or Spot EC2 capacity that you bin-pack densely is meaningfully cheaper than per-task Fargate; you need daemon workloads (a log shipper or agent on every host); or you have host-level requirements (custom kernel parameters, specific networking, privileged access) that Fargate's managed model does not permit. With EC2 you trade operational simplicity for control and, at scale and high utilization, lower cost — but you now own instance patching, scaling, and capacity planning.
The pragmatic pattern many teams land on is both: Fargate as the default for stateless services and bursty/spiky workloads, and an EC2 capacity provider for the specific workloads that justify it (GPU jobs, a few always-on high-utilization services). ECS supports mixing launch types and capacity providers in the same cluster, so this is not an all-or-nothing decision. Start on Fargate; let real cost and hardware needs — not theory — pull individual workloads onto EC2 later.
Autoscaling on ECS happens at two layers, and the distinction matters. Service autoscaling adjusts how many tasks run; cluster/capacity scaling (EC2 only) adjusts how much underlying compute exists for those tasks to land on. On Fargate you only deal with the first.
Service Auto Scaling changes the desired task count of a service via Application Auto Scaling, on a policy you choose. Target tracking is the default and the right starting point: pick a metric and a target (for example "keep average CPU at 60%," "keep average memory at 70%," or — often the best for request-driven services — "keep ALB requests-per-target at N"), and ECS adds or removes tasks to hold that target. Step scaling reacts to CloudWatch alarm thresholds in defined increments for more bespoke behaviour, and scheduled scaling pre-warms capacity for known traffic patterns (business-hours ramps, a daily batch window, a marketing event). Always set sensible minimum and maximum task counts, and tune cooldowns so the service does not thrash.
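A request-per-target policy can be sketched as the payload Application Auto Scaling's `PutScalingPolicy` call expects, alongside a back-of-envelope estimate of where the task count settles. The cluster, service, and resource label are hypothetical, and the estimate is a simplification: the real target-tracking algorithm adds its own damping and cooldown behaviour.

```python
def request_count_policy(cluster: str, service: str,
                         alb_resource_label: str, target_per_task: float) -> dict:
    """Keyword arguments in the shape of Application Auto Scaling PutScalingPolicy."""
    return {
        "PolicyName": f"{service}-requests-per-target",
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_per_task),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ALBRequestCountPerTarget",
                # Format: app/<lb-name>/<lb-id>/targetgroup/<tg-name>/<tg-id>
                "ResourceLabel": alb_resource_label,
            },
            "ScaleOutCooldown": 60,   # react quickly to load
            "ScaleInCooldown": 300,   # shed capacity slowly to avoid thrash
        },
    }


def estimated_task_count(total_requests: int, target_per_task: int,
                         min_tasks: int, max_tasks: int) -> int:
    """Rough steady state: enough tasks to hold the per-task target, clamped."""
    needed = -(-total_requests // target_per_task)  # ceiling division
    return max(min_tasks, min(max_tasks, needed))


# Hypothetical names for illustration.
policy = request_count_policy(
    "prod", "api",
    "app/prod-alb/0123456789abcdef/targetgroup/api/fedcba9876543210",
    1000,
)
```

The clamp is the important part: the minimum is your static headroom and the maximum is your cost ceiling, and both should be deliberate numbers.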
On the EC2 launch type you also need the underlying instances to scale, or tasks will sit in PENDING with nowhere to run. The modern mechanism is a capacity provider with managed scaling: ECS watches how much capacity your desired tasks need and scales the EC2 Auto Scaling group up and down to match a target utilization you set, including scaling in to zero spare capacity when idle. (The older approach of scaling the Auto Scaling group from CloudWatch alarms on cluster reservation metrics is superseded by capacity providers for new builds.) On Fargate this entire layer disappears: there is no cluster to scale, so service autoscaling is all you configure. That deletion of an entire scaling concern is a big part of why Fargate is the default recommendation for teams without dedicated infra staff.
A practical note on cold starts and headroom: target-tracking reacts to load, it does not predict it, so for spiky traffic give yourself a little static headroom (a higher minimum task count, or scheduled scaling ahead of known spikes) rather than relying on the autoscaler to catch a sudden surge in time. The combination of request-per-target target tracking plus a sane minimum is what keeps latency flat during traffic bursts.
Secrets are where homegrown ECS setups are most often wrong: API keys and database passwords end up in plaintext environment variables baked into the image or the task definition. ECS gives you a clean, native way to avoid that entirely.
Store application secrets in AWS Secrets Manager (or in SSM Parameter Store as SecureString for simpler, non-rotating config), encrypted with KMS. In the task definition, reference them under secrets by ARN rather than putting values in environment. At task launch, ECS uses the task execution role to fetch the values and inject them as environment variables into the container, so the plaintext never lives in your image, your repo, your IaC state, or the task-definition JSON. Scope the execution role to exactly the secret ARNs that task needs and nothing broader. Secrets Manager also handles rotation (for example database credentials) so you are not redeploying to change a password.
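In the task definition the difference is between the `environment` list (plain values) and the `secrets` list (`valueFrom` ARNs). A minimal sketch of a container definition built that way is below, plus a naive lint for secret-looking names that slipped into `environment`. The image URI, ARNs, and the name heuristic are all assumptions for illustration.

```python
def container_definition(name: str, image: str, port: int,
                         env: dict, secret_arns: dict) -> dict:
    """One entry for the task definition's containerDefinitions list."""
    return {
        "name": name,
        "image": image,
        "essential": True,
        "portMappings": [{"containerPort": port, "protocol": "tcp"}],
        # Non-secret config only.
        "environment": [{"name": k, "value": v} for k, v in sorted(env.items())],
        # Secrets are referenced by ARN and resolved at task launch.
        "secrets": [{"name": k, "valueFrom": arn}
                    for k, arn in sorted(secret_arns.items())],
    }


# Heuristic only: names that usually indicate a secret.
SUSPICIOUS_FRAGMENTS = ("PASSWORD", "SECRET", "TOKEN", "API_KEY")


def plaintext_secret_names(container_def: dict) -> list:
    """Names in `environment` that look like secrets and should move to `secrets`."""
    return [
        entry["name"]
        for entry in container_def.get("environment", [])
        if any(fragment in entry["name"].upper() for fragment in SUSPICIOUS_FRAGMENTS)
    ]


# Hypothetical image and ARN.
api = container_definition(
    "api",
    "111122223333.dkr.ecr.us-east-1.amazonaws.com/api:3f2a9c1",
    8080,
    env={"APP_ENV": "production"},
    secret_arns={
        "DB_PASSWORD": "arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/api/db-AbCdEf",
    },
)
```

Running `plaintext_secret_names` over every container definition in CI is a blunt check, but it catches the most common regression: someone adding `DB_PASSWORD` to `environment` under deadline pressure.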
For non-secret configuration (feature flags, environment names, tunables) plain environment entries or Parameter Store String values are fine. The discipline that matters: never echo secret values in logs, never bake them into the container image, and keep the boundary clear — the execution role pulls secrets to start the task, and the application's own task role governs what AWS APIs the running code may call. For image pulls and secret fetches, prefer VPC endpoints (PrivateLink) to ECR, Secrets Manager, and SSM so those calls stay on the AWS network even from private subnets.
How a new task-definition revision reaches production is where outages are prevented or caused. ECS gives you a built-in rolling deploy and, via CodeDeploy, a true blue-green with instant rollback. Choosing between them is mostly about how much risk a single release carries.
The rolling update is ECS's native, default deployment. The service starts tasks on the new revision and drains tasks on the old one a few at a time, governed by minimumHealthyPercent and maximumPercent, registering new tasks with the ALB target group and deregistering old ones as it goes. It needs no extra services, costs nothing extra, and is the right default for lower environments and lower-risk services. Its limit: during the roll, both versions serve live traffic, so a bad release reaches some users before health checks catch it, which is why good health checks and connection draining matter so much here. ECS also supports a deployment circuit breaker, which automatically rolls a failed rolling deploy back to the last known-good revision if the new tasks never reach a healthy steady state. Turn it on.
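For reference, the rolling settings discussed here map to a handful of fields on the service. The sketch below shows them in the shape the ECS `CreateService` / `UpdateService` APIs accept; the percentages and grace period are illustrative values, not recommendations for your service.

```python
# Service-level deployment settings for a native rolling update.
rolling_service_settings = {
    "deploymentController": {"type": "ECS"},  # ECS-native rolling deploys
    "deploymentConfiguration": {
        # Stop a failing deploy and return to the last steady-state revision.
        "deploymentCircuitBreaker": {"enable": True, "rollback": True},
        "minimumHealthyPercent": 100,  # never dip below desired capacity
        "maximumPercent": 200,         # allow a full extra set during the roll
    },
    # Give slow-starting apps time before load balancer health checks count.
    "healthCheckGracePeriodSeconds": 60,
}


def circuit_breaker_armed(service_settings: dict) -> bool:
    """True only if the breaker is enabled AND set to roll back automatically."""
    breaker = (service_settings
               .get("deploymentConfiguration", {})
               .get("deploymentCircuitBreaker", {}))
    return bool(breaker.get("enable")) and bool(breaker.get("rollback"))
```

Note that enabling the breaker without `rollback` only stops the failed deployment; it takes both flags to get the automatic return to the previous revision.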
Blue-green via AWS CodeDeploy stands up the new revision as a separate ("green") task set alongside the live ("blue") one, lets you validate green against a test listener, then shifts production traffic at the ALB — all at once, or as a canary (a small percentage first, then the rest) or linear ramp. Rollback is effectively instant because blue is still running: if a CloudWatch alarm trips during or just after the shift, CodeDeploy reverts traffic to blue automatically. This is the pattern for production services where a bad release is expensive: you get pre-shift validation, a controlled traffic shift, automated alarm-based rollback, and a clean previous version held warm. The cost is briefly running two task sets and the extra moving part of CodeDeploy. A common, sensible split is rolling (with the circuit breaker on) for dev/staging and CodeDeploy blue-green or canary for production.
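The three shift styles differ only in their traffic schedule. The sketch below models canary and linear shifts as lists of (minute, percent on green) steps. It is a simplified model for reasoning about exposure, not CodeDeploy's implementation; CodeDeploy ships predefined configurations along these lines, such as `CodeDeployDefault.ECSCanary10Percent5Minutes` and `CodeDeployDefault.ECSLinear10PercentEvery1Minutes`.

```python
def canary_schedule(first_percent: int, wait_minutes: int) -> list:
    """Shift a small slice first, wait, then shift everything else."""
    return [(0, first_percent), (wait_minutes, 100)]


def linear_schedule(step_percent: int, interval_minutes: int) -> list:
    """Shift equal increments at a fixed interval until green has 100%."""
    schedule = []
    shifted, minute = 0, 0
    while shifted < 100:
        shifted = min(100, shifted + step_percent)
        schedule.append((minute, shifted))
        minute += interval_minutes
    return schedule


def exposure_at(schedule: list, minute: int) -> int:
    """Percent of traffic on green at a given minute (0 before the first step)."""
    percent = 0
    for step_minute, step_percent in schedule:
        if step_minute <= minute:
            percent = step_percent
    return percent
```

The useful question to ask of any schedule is `exposure_at(schedule, minute_your_alarm_fires)`: that is the share of users who see a bad release before the automatic rollback kicks in.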
Because every deploy is just a pointer to an immutable, SHA-tagged image and a task-definition revision, rollback is unambiguous: redeploy the previous revision (or, with blue-green, flip back to blue). The asterisk, as always, is database migrations — make them backward-compatible with expand-then-contract (add the new column, deploy code that writes both, backfill, remove the old later) so the previous task revision still runs against the new schema. Test the rollback path before you need it; a rollback you have never run is a hope, not a plan.
| Strategy | How traffic shifts | Rollback | Extra cost | How it runs |
|---|---|---|---|---|
| Rolling (ECS native) | Replace a few tasks at a time behind the ALB | Auto via deployment circuit breaker → last good revision | None | Built into the ECS service |
| Blue-green (CodeDeploy) | Validate green, shift at the ALB (all-at-once / canary / linear) | Instant — revert to blue, auto on CloudWatch alarm | Two task sets briefly | ECS + CodeDeploy deployment controller |
| Canary (CodeDeploy) | Small % first, watch alarms, then the rest | Auto on alarm; only a slice exposed | Small (extra task set) | CodeDeploy traffic-shifting config |
A service you cannot observe is a service you cannot operate, and a service whose cost you do not understand is one that surprises you on the invoice. ECS gives you native answers for both; the trick is turning them on deliberately rather than discovering the gaps during an incident.
Logging: configure the awslogs log driver (or FireLens/Fluent Bit for routing to a third party) so container stdout/stderr lands in CloudWatch Logs, one log group per service, with a retention policy set (unbounded retention is a silent cost leak). Metrics and tracing: turn on CloudWatch Container Insights for per-task and per-service CPU, memory, network, and task-count metrics, and instrument the application with AWS X-Ray or OpenTelemetry for distributed traces. Teams already on Datadog, Grafana, or Prometheus typically ship ECS metrics/logs there via the OpenTelemetry collector or a sidecar — fine, as long as something is watching. The non-negotiable: alarms on the signals that matter (error rate, p99 latency, unhealthy host/target count, task restarts) wired to the deploy rollback and to on-call, so the system tells you it is unhealthy instead of a customer doing it.
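The logging half of this is a small block in each container definition plus a retention setting on the log group. A minimal sketch follows; the group name, region, and helper names are hypothetical, and the set of accepted retention values is what CloudWatch Logs documents at the time of writing, so check it before relying on it.

```python
def awslogs_configuration(service: str, region: str) -> dict:
    """logConfiguration block for a container definition (awslogs driver)."""
    return {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": f"/ecs/{service}",   # one log group per service
            "awslogs-region": region,
            "awslogs-stream-prefix": service,     # yields prefix/container/task-id
        },
    }


# Retention periods (days) CloudWatch Logs accepts; anything else is rejected.
VALID_RETENTION_DAYS = {
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545,
    731, 1096, 1827, 2192, 2557, 2922, 3288, 3653,
}


def retention_request(log_group: str, days: int) -> dict:
    """Arguments in the shape of CloudWatch Logs PutRetentionPolicy."""
    if days not in VALID_RETENTION_DAYS:
        raise ValueError(f"{days} is not an accepted retention period")
    return {"logGroupName": log_group, "retentionInDays": days}


# Hypothetical service and region.
log_config = awslogs_configuration("api", "us-east-1")
retention = retention_request(log_config["options"]["awslogs-group"], 30)
```

Setting retention in the same code that creates the log group is what closes the "unbounded retention" leak; a log group created implicitly never gets one.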
Cost: on Fargate you pay per task for requested vCPU and memory, per second — so cost scales directly with how many tasks run and how big each is, which makes right-sizing task CPU/memory and tuning autoscaling the two biggest levers. Fargate Spot can cut compute cost substantially for interruption-tolerant workloads (workers, batch, stateless services with graceful draining). On EC2, cost is driven by your instances, so the levers are dense bin-packing, Reserved Instances / Savings Plans for steady-state capacity, and Spot for the fault-tolerant portion — at high, steady utilization this can undercut Fargate, which is one of the main reasons to choose EC2. Across both, the recurring savings come from right-sized tasks, autoscaling that scales in as well as out, log retention limits, NAT-vs-PrivateLink choices for egress, and killing the over-provisioned defaults nobody revisited. A good build sets these correctly from day one; Compute Savings Plans cover Fargate too, so even the serverless path has a committed-use discount.
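Because Fargate pricing is linear in requested vCPU, memory, and running time, the right-sizing lever is simple arithmetic. The sketch below uses placeholder hourly rates that are assumptions for illustration only; substitute the current rates for your region and architecture from the AWS pricing page.

```python
# ASSUMED example rates (USD per hour). Look up real rates before using this.
ASSUMED_RATE_PER_VCPU_HOUR = 0.04048
ASSUMED_RATE_PER_GB_HOUR = 0.004445
HOURS_PER_MONTH = 730


def fargate_monthly_cost(task_count: float, vcpu: float, memory_gb: float,
                         hours: float = HOURS_PER_MONTH,
                         vcpu_rate: float = ASSUMED_RATE_PER_VCPU_HOUR,
                         gb_rate: float = ASSUMED_RATE_PER_GB_HOUR) -> float:
    """Compute cost for `task_count` identical tasks running for `hours`."""
    per_task_hour = vcpu * vcpu_rate + memory_gb * gb_rate
    return task_count * hours * per_task_hour


def right_sizing_saving(task_count: float, before: tuple, after: tuple) -> float:
    """Monthly saving from moving every task from `before` to `after` (vcpu, gb)."""
    return (fargate_monthly_cost(task_count, *before)
            - fargate_monthly_cost(task_count, *after))


# Example: 4 always-on tasks, over-provisioned at 1 vCPU / 2 GB,
# right-sized to 0.5 vCPU / 1 GB after looking at real utilization.
before_cost = fargate_monthly_cost(4, 1, 2)
after_cost = fargate_monthly_cost(4, 0.5, 1)
```

Halving both dimensions halves the bill, and the same function shows the autoscaling lever: pass the average task count over the month instead of the peak to see what scaling in is worth.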
A task running behind a load balancer is a demo. A production ECS service is the list below. None of it is exotic; all of it is skipped under deadline pressure, and all of it is cheaper to do up front than to retrofit after the first incident.
Run this before you put real traffic on an ECS service:
Promote an immutable, SHA-tagged image through every environment and define every ECS object as code. A deploy then becomes "point the service at a new task-definition revision," a rollback becomes "point it back," and there is no console drift to debug at 3 a.m. If you take one thing from this page, take this.
The most common question right after "how do I set up ECS" is "should it be ECS or EKS (managed Kubernetes)?" The honest answer for most teams is ECS — and it is worth being clear about why, and about the cases where EKS genuinely wins.
Default to ECS when your goal is to run containers on AWS with the least operational overhead. There is no control plane to manage or upgrade, no Kubernetes version treadmill, far fewer moving parts, deep native integration with ALB/IAM/CloudWatch/Secrets Manager/CodeDeploy, and — on Fargate — no nodes at all. For a startup or scaleup running web services, APIs, workers, and jobs, ECS gets you to production faster and keeps you there with a smaller surface area to secure and operate. It is the pragmatic choice precisely because it does less.
Choose EKS when you have a concrete Kubernetes-shaped reason: you need the Kubernetes ecosystem (Helm, operators, CRDs, a specific controller or service mesh), you want portability across clouds or a consistent platform with on-prem, you are building an internal developer platform on Kubernetes primitives, or your team already has deep Kubernetes expertise and tooling. EKS is excellent and fully managed at the control-plane level — but you still own node groups (or use Fargate/Karpenter), cluster upgrades, add-ons, and a much larger configuration and security surface. That power is worth it when you will use it, and pure overhead when you will not.
The blunt version: do not adopt Kubernetes to run a handful of containers. If you cannot name the Kubernetes feature you need, ECS is the right answer, and you can revisit EKS the day a real requirement (a specific operator, multi-cloud, a platform play) actually appears. We have a dedicated ECS-vs-EKS breakdown and an EKS setup guide linked below if you want to go deeper on the decision before committing.
You can build everything above yourself; this page is the map. But most teams searching "amazon ecs setup" do not actually want to spend three weeks becoming ECS experts — they want a production-grade container platform shipped correctly so the team can get back to the product. That is what CloudRoute routes you to.
CloudRoute matches you to a vetted AWS partner who builds the ECS platform end to end: the VPC and awsvpc networking, the cluster and capacity providers (Fargate, EC2, or both), task definitions and services sized from real data, the ALB and target groups, service autoscaling on the right signal, least-privilege IAM, secrets via Secrets Manager, the rolling or CodeDeploy blue-green/canary deploy with tested rollback, CloudWatch logging/Container Insights/X-Ray observability, and all of it as infrastructure-as-code in your repo. You get the work done by people who do this for a living, without running a hiring loop or vetting agencies yourself.
The commercial part, stated honestly: for credit-eligible companies, the partner engagement is frequently AWS-funded. The partner is paid through AWS partner-funding programs and your AWS usage during the build is covered by credits, so the customer pays $0 or very little. If you are not credit-eligible, it is a straightforward vetted-partner referral: you still skip the hiring-and-vetting slog, you just pay the partner for the engagement directly. CloudRoute is paid a commission by the partner, not by you. We tell you which bucket you are in up front; we do not pretend everything is free.
If you also want the AWS credits themselves — which is what funds the engagement — that runs in parallel. See the AWS credits routes (the $100K Activate Portfolio tier is the common one for funded startups) and the startup persona page below; the ECS build and the credit application are typically filed by the same partner in the same week.
Repo access + which AWS account(s) + your container(s) and what they need (ports, secrets, dependencies) + Fargate or EC2 preference + how hands-on you want to stay. The partner returns a production ECS service — networking, ALB, autoscaling, secrets, safe deploys, observability, and the IaC in your repo — with a rollback you have watched work. For credit-eligible companies, often at $0.
The launch-type decision compared on the axes that actually drive it. The honest default for most teams is Fargate; the table makes clear exactly when an EC2 capacity provider earns its keep.
| Dimension | Fargate | EC2 launch type |
|---|---|---|
| Who runs the compute | AWS — no instances you see | You — an EC2 Auto Scaling group you own |
| Ops burden on you | None (no patching, no AMIs, no node scaling) | You patch the OS/AMI, scale and bin-pack instances |
| Pricing model | Per task: requested vCPU + memory, per second | Per EC2 instance (On-Demand / RI / Savings Plan / Spot) |
| Cost sweet spot | Variable, spiky, or low-utilization workloads | High, steady utilization with dense bin-packing |
| Cheaper-at-scale option | Fargate Spot for interruption-tolerant tasks | Spot + Reserved/Savings Plans on the fleet |
| GPUs / specialized hardware | Not available | Yes — GPU and specialized instance types |
| Daemon workloads (per-host agents) | No (no hosts to run them on) | Yes (DAEMON scheduling on every instance) |
| Host-level control | None (managed runtime) | Full (kernel params, custom AMI, privileged needs) |
| Networking mode | awsvpc only (task-level ENI + SG) | awsvpc (recommended); bridge/host also possible |
| Scaling layers to manage | One — service autoscaling only | Two — service autoscaling + capacity-provider scaling |
| Best fit | Most teams; the default to start on | GPU/ML, very large instances, steady high-utilization, daemons |
Situation: The company was already on AWS, with a containerized API limping along on a single ECS service the founding engineer had clicked together: tasks in public subnets with broad security groups, a database password and a third-party API key sitting in plaintext environment variables in the task definition, no autoscaling, no real health checks, deploys done by editing the service in the console, and no rollback when a release went bad. A recent deploy had dropped the service below capacity mid-roll and caused a 25-minute partial outage. They had no in-house DevOps hire and could not justify one yet. They were also raising and qualified for AWS credits.
What CloudRoute did: CloudRoute routed them within a day to a US-based AWS partner with an ECS/Fargate track record. The partner rebuilt the platform as code in Terraform: tasks moved to private subnets with awsvpc and tight per-service security groups behind an ALB, the database and API secrets moved into Secrets Manager (referenced by ARN, injected at launch via a least-privilege execution role), a separate least-privilege task role for the app's S3/DynamoDB calls, service autoscaling on ALB requests-per-target across two AZs, CodeDeploy blue-green deploys gated on CloudWatch alarms with automatic rollback, and CloudWatch Container Insights + X-Ray with alarms wired to on-call. They left it on Fargate (no GPU or steady-high-utilization reason to go EC2) and filed the AWS Activate Portfolio credit application in the same week.
Outcome: Production-grade ECS service live in under three weeks. Deploys went from a console edit and crossed fingers to a gated blue-green shift with automatic rollback on a failed health check; zero plaintext secrets remained in task definitions; the service now survives an AZ loss and scales on real request load. Because the company was credit-eligible, the engagement was AWS-funded and the customer paid $0; CloudRoute was paid by the partner.
build window: < 3 weeks · plaintext secrets removed: 100% · prod deploys: CodeDeploy blue-green, auto-rollback · cost to customer: $0 (credit-eligible)
CloudRoute routes you to a vetted AWS partner who builds your ECS platform end to end — networking, ALB, autoscaling, secrets, safe deploys, observability, and IaC. For credit-eligible companies it is often AWS-funded — customer pays $0. Otherwise, a clean vetted-partner referral.