infrastructure as code · AWS best practices · 2026

Infrastructure-as-code best practices on AWS — the senior-engineer reference (2026).

Most IaC failures are not tooling failures. They are state failures, blast-radius failures, and review-gate failures. This is the definitive 2026 guide: how to choose between Terraform, OpenTofu, CDK, CloudFormation, and Pulumi; how to design modules that survive reuse; how to run remote state with locking; how to structure multi-account environments; and how to put plan/apply gates, policy-as-code, and drift detection in front of every change — with the honest tradeoffs at each fork.

tools compared: 5 · failure modes covered: 9 · blast-radius unit: 1 state file · ClickOps drift goal: 0
TL;DR
  • Tool choice matters less than the disciplines around it. Terraform and OpenTofu dominate multi-cloud and AWS-heavy estates; AWS CDK wins when your team prefers real programming languages and an AWS-native object model; CloudFormation is the zero-dependency AWS baseline; Pulumi is the managed-state, real-code alternative. Pick one primary tool per estate and standardize — running three in parallel is the most common avoidable mess.
  • The three disciplines that separate a healthy IaC estate from a fragile one are: (1) remote state with locking and per-environment isolation so two applies cannot corrupt each other, (2) small modules with explicit inputs/outputs and pinned versions so reuse does not propagate breakage, and (3) a CI/CD pipeline where every change produces a reviewable plan and apply is gated behind that plan plus policy-as-code. Skip any one of these and the estate degrades within a quarter.
  • Drift and ClickOps are the slow killers. Every manual console change makes your code a lie; the fix is a one-way door — detect drift on a schedule, import or codify what exists, then lock the console down to read-only for humans and let the pipeline be the only thing that writes. Migrating off ClickOps is incremental: import the highest-blast-radius resources first, accept a messy interim state, and converge.
the bar

I. What "good IaC on AWS" actually looks like in 2026

Before the tool debate, agree on the target. A healthy infrastructure-as-code estate has a small number of observable properties. If you can answer yes to all of them, the tool you picked is almost irrelevant. If you cannot, no tool will save you.

A mature estate is reproducible: any environment can be rebuilt from code in a fresh account with no undocumented manual steps. It is reviewable: every change to production infrastructure arrives as a pull request with a machine-generated plan attached, so a second engineer can see exactly which resources will be created, changed, or destroyed before anyone clicks merge. It is isolated: a mistake in the staging networking module cannot take down production, because the blast radius of any single apply is bounded by design.

It is governed: policy runs automatically against every plan — no public S3 buckets, no unencrypted volumes, no 0.0.0.0/0 SSH — so compliance is enforced by the pipeline rather than by a human remembering. And it is honest: the code matches reality, because drift is detected on a schedule and the console is effectively read-only for humans. Manual changes are the exception that triggers an alert, not the norm.

Almost every real-world IaC problem is a violation of one of those five properties, not a bug in Terraform or CDK. A team that says "Terraform is painful" usually means "we share one state file across five environments and our applies block each other" (an isolation failure) or "our root module is 4,000 lines and every change touches everything" (a blast-radius failure). The rest of this guide is organized around getting those five properties right, in roughly the order they bite.

the one-sentence test

If a senior engineer can open a pull request, read the plan, and predict the exact set of AWS resources that will change with high confidence — and a misconfiguration is rejected automatically before it merges — you have good IaC. Everything below is in service of making that sentence true.

the fork that matters

II. Choosing the tool: Terraform, OpenTofu, CDK, CloudFormation, Pulumi

There are five credible tools in 2026, and the honest framing is that the choice is mostly about your team and your estate, not about a technical winner. Below is what each is genuinely good at, what it costs you, and the situations where it is the right default.

The decision splits on two axes. First, declarative configuration (HCL, YAML) vs real programming languages (TypeScript, Python, Go): configuration languages are easier to read and review and harder to abuse; real languages give you loops, abstractions, and unit tests at the cost of unbounded complexity. Second, multi-cloud vs AWS-native: if you are AWS-only forever, native tools integrate more deeply; if you run or might run anything else, a cloud-agnostic engine avoids a rewrite.

The single most important rule across all five: pick one primary tool per estate and standardize on it. The most common avoidable mess in 2026 is a company running Terraform for networking, CDK for application stacks, and ad-hoc CloudFormation for "that one legacy thing," with no shared state strategy and three sets of conventions. A second tool is justified only when there is a hard boundary — for example, CDK for application teams who ship constructs, with a platform team owning the foundational accounts in Terraform — and even then the boundary must be explicit and the state stores must not overlap.

Terraform — the incumbent default

What it is: declarative HCL, provider-based, the most widely adopted IaC tool with the deepest module ecosystem and the largest hiring pool. The AWS provider is comprehensive and tracks new services quickly.

Strengths: enormous community, the Terraform Registry of reusable modules, mature tooling (tflint, terratest, tfsec, Checkov), and provider coverage well beyond AWS. Plans are readable and reviewable.

Costs: HCL hits a ceiling on dynamic logic; complex conditionals get awkward. Since the August 2023 license change, Terraform is under the Business Source License (BSL) rather than a true open-source license — a real consideration for some organizations and the reason OpenTofu exists.

Right default for: most teams, most estates, especially anything multi-cloud or where hiring matters.

OpenTofu — the open-source Terraform fork

What it is: the Linux Foundation fork of Terraform created after the BSL change, under the MPL-2.0 open-source license. It is HCL-compatible and a near drop-in for most Terraform configurations.

Strengths: genuinely open-source governance, no license risk, and feature parity for the vast majority of workflows. State-encryption support landed natively. The migration from Terraform is usually low-effort.

Costs: the ecosystem still references "Terraform" everywhere; some commercial tooling and modules track HashiCorp releases first. The two projects are diverging slowly, so deep edge-case features may differ.

Right default for: teams that want the Terraform workflow without BSL exposure, or that prioritize open-source governance. See the dedicated comparison below for the decision detail.

AWS CDK — real code, AWS-native object model

What it is: the AWS Cloud Development Kit lets you define infrastructure in TypeScript, Python, Java, Go, or C#; it synthesizes to CloudFormation under the hood and deploys via CloudFormation stacks.

Strengths: high-level constructs encode AWS best practices (a single L2 construct can provision a queue with a sensible dead-letter setup and IAM in a few lines); you get loops, type-checking, IDE autocomplete, and unit tests in a language your engineers already write.

Costs: you inherit CloudFormation's execution model, including its rollback behavior, per-stack resource limits, and slower convergence; the abstraction can hide what is actually being created; and it is AWS-only.

Right default for: AWS-committed teams with strong software engineers who want abstraction and testing, and who are comfortable on the CloudFormation engine.

CloudFormation — the AWS-native baseline

What it is: AWS's first-party declarative service, authored in YAML or JSON, with no external dependency and no separate state store (AWS manages state inside the stack).

Strengths: zero extra tooling, deep service integration, managed state and drift detection built in, StackSets for multi-account rollout, and it is the substrate CDK and SAM compile to.

Costs: verbose templates, weaker module ergonomics than Terraform, AWS-only, and rollback semantics that can strand a stack in a failed state requiring manual intervention.

Right default for: AWS-only shops that want no third-party dependencies, or teams already standardized on SAM/CDK where CloudFormation is implied.

Pulumi — real code with managed state

What it is: infrastructure in TypeScript, Python, Go, or C#, multi-cloud like Terraform, but with a hosted state backend (Pulumi Cloud) as the default and a strong secrets-encryption story.

Strengths: real languages plus multi-cloud plus managed state out of the box; good testing support; appealing to teams that find HCL limiting but do not want to be AWS-locked like CDK.

Costs: smaller community and module ecosystem than Terraform; the managed backend is a dependency (self-hosting is possible but less common); fewer engineers have it on their résumé.

Right default for: multi-cloud teams who want real code and are happy with a managed control plane.

the thing that bites hardest

III. Remote state and locking: get this wrong and nothing else matters

State is where IaC estates actually fail. The tool tracks what it believes exists in a state file; if two people apply at once, or the file lives on someone's laptop, or it has no isolation between environments, you get corruption, race conditions, and the worst outcome of all — an apply that destroys production because the state disagreed with reality.

For Terraform and OpenTofu on AWS, the canonical pattern is an S3 backend for the state object plus locking. Historically locking used a DynamoDB table; as of late 2024 the S3 backend supports native state locking via the use_lockfile option, which removes the separate DynamoDB dependency for new setups. Either way, the non-negotiables are the same: server-side encryption on the bucket, versioning enabled (so you can roll back a corrupted state), public access blocked (S3 Block Public Access plus a restrictive bucket policy), and a lock so that a second apply is refused or waits for the lock rather than racing.
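The locking requirement is easier to reason about with a toy model. The sketch below is a local-filesystem analogy only, not how the S3 backend is implemented: it assumes a lock is a file that may be created only if it does not already exist, and the function names, lock path and owner strings are hypothetical.

```python
import os
import tempfile


def try_acquire_lock(lock_path: str, owner: str) -> bool:
    """Create the lock file only if it does not already exist.

    O_CREAT | O_EXCL makes creation atomic, so exactly one caller succeeds.
    """
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another apply holds the lock; do not proceed
    with os.fdopen(fd, "w") as handle:
        handle.write(owner)  # record who holds the lock, for debugging
    return True


def release_lock(lock_path: str) -> None:
    """Remove the lock file so the next apply can proceed."""
    os.remove(lock_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        lock = os.path.join(workdir, "prod-networking.tflock")
        print(try_acquire_lock(lock, "ci-run-1"))  # first apply takes the lock
        print(try_acquire_lock(lock, "ci-run-2"))  # second apply is refused
        release_lock(lock)
        print(try_acquire_lock(lock, "ci-run-2"))  # lock is free again
```

The property that matters is that "check whether a lock exists" and "take the lock" happen as one atomic step; a check followed by a separate write would reintroduce the race.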

CDK and CloudFormation sidestep the state-file problem because AWS manages state inside the stack — there is no terraform.tfstate to lose. That is a genuine operational advantage. The flip side is that you inherit CloudFormation's drift-detection and rollback model, and you give up the granular plan/state introspection that Terraform users rely on. Pulumi defaults to its hosted backend, which similarly removes the "where does state live" decision at the cost of a managed dependency.

The isolation rule is universal and is the part teams most often get wrong: one state file per environment per component, never a single shared state for everything. A monolithic state where dev, staging, and production networking all live together means every plan locks every environment, blast radius is unbounded, and a fat-fingered destroy can cascade. Split state by environment first (dev / staging / prod in separate accounts and separate state objects), then by component within an environment (networking, data, application) once an environment's state grows large enough that plans become slow or scary.
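One way to make the isolation rule checkable is to derive every state key from the environment and component, then assert that no key is shared. This is a minimal sketch; the key layout and the function names are assumptions for illustration, not a convention the tools impose.

```python
from collections import Counter


def state_key(environment: str, component: str) -> str:
    """Build a state object key that is unique per environment and component."""
    return f"{environment}/{component}/terraform.tfstate"


def shared_state_keys(assignments: dict) -> list:
    """Return state keys claimed by more than one (environment, component) pair."""
    counts = Counter(assignments.values())
    return sorted(key for key, uses in counts.items() if uses > 1)


# Hypothetical layout: every pair gets its own key, so nothing is shared.
isolated = {
    (env, comp): state_key(env, comp)
    for env in ("dev", "staging", "prod")
    for comp in ("networking", "data", "application")
}

# The anti-pattern: every pair points at one monolithic state object.
monolith = {pair: "terraform.tfstate" for pair in isolated}

print(shared_state_keys(isolated))  # []
print(shared_state_keys(monolith))  # ['terraform.tfstate']
```

A check like this could run in CI against whatever inventory of root configurations the estate keeps, so a newly added component cannot quietly reuse an existing state object.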

remote state + locking patterns by tool · 2026

| Tool | State storage | Locking | Isolation unit | Key risk if done wrong |
| --- | --- | --- | --- | --- |
| Terraform / OpenTofu | S3 object (versioned, encrypted) | S3 native lockfile or DynamoDB | One state per env + component | Shared state → races, cascade destroys |
| AWS CDK | CloudFormation-managed (per stack) | CloudFormation handles concurrency | CloudFormation stack | Giant stacks → resource limits, slow rollbacks |
| CloudFormation | CloudFormation-managed (per stack) | Built in | Stack / StackSet | Failed stack stranded mid-rollback |
| Pulumi | Pulumi Cloud (or self-hosted) | Backend-managed | Stack (per project + env) | Managed-backend dependency, secrets sprawl |

The throughline: isolate state per environment and ideally per component. The "one state file to rule them all" anti-pattern is the single most common cause of scary, slow, dangerous applies on AWS.
reuse without breakage

IV. Module design and reuse: small, versioned, explicit

Modules are how you stop copy-pasting infrastructure. Done well, a module is a reliable building block; done badly, it becomes a god-object whose every change ripples unpredictably across environments. The difference is almost entirely about size, interface discipline, and versioning.

A good module does one thing and exposes a small, explicit interface: a clear set of input variables with types and sensible defaults, and a clear set of outputs. It should be possible to read a module's inputs and outputs and understand exactly how it composes into the rest of the estate without reading its internals. Resist the urge to build a single "platform" module that provisions networking, compute, data, and IAM in one call — that is a god-object, and it forces every consumer to upgrade in lockstep.

The reuse boundary that matters most is the line between foundational modules (a VPC, a baseline IAM setup, a logging configuration) that change rarely and are shared across the whole org, and composition (the root configuration for a given environment that wires foundational modules together with environment-specific values). Foundational modules should live in their own repository or registry, be semantically versioned, and be consumed by pinned version. Composition lives close to the environment it describes.

Versioning is the discipline that makes reuse safe. Pin module versions explicitly — never consume a shared module from a floating main branch, because then any merge to that module silently changes every environment that references it. Pin to a tag or a version constraint, bump deliberately, and let the change flow environment-by-environment through the normal review pipeline. The same applies to providers: pin the AWS provider version and the required tool version so that an unrelated upgrade does not surprise you mid-apply.
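The pinning rule can be enforced mechanically. The sketch below assumes modules are consumed through git-style source strings with a ?ref= parameter and treats only exact semantic-version tags as pinned; the example.com URLs and the function names are hypothetical.

```python
import re
from urllib.parse import parse_qs

# Accept tags such as v1.4.2 or 1.4.2; anything else counts as floating.
SEMVER_TAG = re.compile(r"^v?\d+\.\d+\.\d+$")


def module_ref(source: str):
    """Return the ?ref= value of a git-style module source, or None if absent."""
    _, _, query = source.partition("?")
    refs = parse_qs(query).get("ref", [])
    return refs[0] if refs else None


def is_pinned(source: str) -> bool:
    """True only when the source names an exact semantic-version tag."""
    ref = module_ref(source)
    return ref is not None and SEMVER_TAG.match(ref) is not None


sources = [
    "git::https://example.com/modules/vpc.git?ref=v1.4.2",  # pinned tag
    "git::https://example.com/modules/vpc.git?ref=main",    # floating branch
    "git::https://example.com/modules/vpc.git",             # default branch
]
floating = [s for s in sources if not is_pinned(s)]
print(floating)  # the two sources that are not pinned to a tag
```

This sketch is deliberately strict: a commit SHA is also a stable pin, but it would be flagged here. Whether to allow SHAs is a policy choice for the team.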

  • One module, one responsibility — A module provisions a coherent unit (a VPC, a service, a bucket-with-policy). If you cannot describe it in one sentence, split it.
  • Explicit typed inputs, explicit outputs — Every variable typed with a description and a default where safe; outputs that downstream modules actually consume. The interface is the contract.
  • Pin versions — never float on main — Consume shared modules and providers by version tag or constraint. Floating references turn one merge into an estate-wide change.
  • Separate foundational modules from composition — Shared building blocks live in their own versioned repo/registry; environment root configs wire them together with local values.
  • Prefer composition over deep inheritance — Wire small modules together at the root rather than nesting modules many layers deep — deep nesting hides blast radius and makes plans hard to read.
  • Keep environment differences in variables, not in forks — Dev and prod should run the same module code with different inputs. If they have diverged into separate copies, drift is inevitable.
structure + blast radius

V. Multi-account and multi-environment structure

On AWS, the account is the strongest isolation boundary there is — stronger than a VPC, stronger than an IAM policy. Serious 2026 estates use multiple accounts deliberately, and the IaC structure should mirror that account topology rather than fight it.

The widely adopted pattern is a multi-account landing zone: a management/organization account at the root, with separate accounts (or organizational units) for security/audit, shared services, logging, and then per-workload-per-environment accounts — a distinct account for production, another for staging, another for development. This is what AWS Control Tower and AWS Organizations are built to manage, and it gives you hard blast-radius boundaries: a runaway process in dev cannot touch production resources, and per-account billing makes cost attribution trivial.

IaC structure should follow that topology. Each account-environment gets its own state (for Terraform/OpenTofu) or its own set of stacks (for CDK/CloudFormation), and cross-account concerns — a centralized logging bucket, an organization-wide guardrail — are provisioned from the account that owns them and referenced by ID elsewhere. Resist the temptation to provision production resources from a state that also contains dev; the account boundary is only as strong as the isolation of the thing that writes into it.

Two structural anti-patterns are worth naming. The first is one giant account with everything in it, distinguished only by resource tags — tags are not a security boundary, and a single compromised credential or a single bad apply reaches everything. The second is directory-per-environment with copy-pasted code, where dev/ and prod/ contain forked copies of the same configuration that have quietly drifted apart. The healthy middle is shared module code consumed by thin per-environment root configurations, each pointing at its own account and its own state, differing only in variable values.

why the account boundary wins

IAM policies and VPC segmentation are necessary, but the AWS account is the only boundary that contains service quotas, blast radius, billing, and most security incidents at once. Structuring environments as separate accounts — and mirroring that in your IaC state/stack layout — is the highest-leverage structural decision in the whole estate.

plan/apply gates

VI. CI/CD for IaC: every change is a reviewed plan

The pipeline is what turns a pile of HCL or TypeScript into a governed system. The defining property of a healthy IaC pipeline is that nobody applies from a laptop: changes flow through pull requests, a plan is generated automatically and posted for review, and apply happens only from CI after the plan is approved.

The canonical flow for Terraform/OpenTofu is: a pull request triggers fmt, validate, a lint pass, a security scan, and a plan; the plan output is posted as a comment on the PR so a reviewer can read exactly what will change; on merge to the main branch, CI runs apply against the saved plan artifact — never a fresh plan, because re-planning at apply time can pick up changes the reviewer never saw. CDK and CloudFormation follow the same shape: synth/diff on PR, deploy on merge, with change sets giving CloudFormation users an equivalent of the plan-review step.

The apply gate is the load-bearing control. Production applies should require an explicit human approval after the plan is reviewed (a protected environment or a manual approval step), so that even a merged change pauses before it touches production. Many teams run apply automatically for dev, require review-then-auto-apply for staging, and require review-plus-manual-approval for production. The point is graduated friction that scales with blast radius.
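Graduated friction can be written down as data rather than tribal knowledge. The sketch below encodes the dev / staging / prod policy described above; the environment names, the Gate type and the may_apply helper are illustrative assumptions, and a real pipeline would read the review and approval flags from its CI system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    requires_plan_review: bool
    requires_manual_approval: bool


# Friction scales with blast radius.
GATES = {
    "dev": Gate(requires_plan_review=False, requires_manual_approval=False),
    "staging": Gate(requires_plan_review=True, requires_manual_approval=False),
    "prod": Gate(requires_plan_review=True, requires_manual_approval=True),
}


def may_apply(environment: str, plan_reviewed: bool, manually_approved: bool) -> bool:
    """Decide whether CI may run apply for this environment."""
    # Unknown environments fall back to the strictest gate (fail closed).
    gate = GATES.get(environment, GATES["prod"])
    if gate.requires_plan_review and not plan_reviewed:
        return False
    if gate.requires_manual_approval and not manually_approved:
        return False
    return True


print(may_apply("dev", plan_reviewed=False, manually_approved=False))   # True
print(may_apply("prod", plan_reviewed=True, manually_approved=False))   # False
```

Falling back to the production gate for an unrecognized environment means a typo or a newly added environment gets more friction, not less.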

A few execution details separate robust pipelines from fragile ones. Save the plan as an artifact and apply that exact artifact so review and apply cannot diverge. Use short-lived credentials — OIDC federation from the CI provider into an AWS IAM role, not long-lived access keys baked into CI secrets. Serialize applies per state so two merges cannot apply against the same state concurrently. And make plans legible by keeping state files small enough that a plan is reviewable in a minute, not a 600-line diff nobody reads.

The pipeline stages, in order

1. Format + validate: terraform fmt -check and terraform validate (or cdk synth) to fail fast on syntax and style.

2. Lint: tflint (or cfn-lint for CloudFormation) to catch provider misuse, deprecated arguments, and convention violations.

3. Policy + security scan: Checkov / tfsec against the source at this stage, with OPA / Sentinel policies evaluated against the plan as soon as step 4 has produced it; this is the policy-as-code gate described below.

4. Plan / diff: generate the plan, save it as an artifact, and post a human-readable summary on the PR.

5. Review: a second engineer reads the plan and approves the PR. The plan is the review object, not just the code.

6. Apply: on merge, apply the saved plan artifact from CI using short-lived OIDC credentials, with a manual gate for production.
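The human-readable summary in step 4 can be sketched as a small function over the plan's JSON form. This assumes the plan has been exported with terraform show -json and relies on its resource_changes list, where each entry carries an address and a change.actions list; the sample data and function names are made up for illustration.

```python
from collections import Counter


def summarize_plan(plan: dict) -> dict:
    """Count planned actions, treating delete+create pairs as one replace."""
    counts = Counter()
    for change in plan.get("resource_changes", []):
        actions = change["change"]["actions"]
        if actions in (["no-op"], ["read"]):
            continue  # nothing will be modified
        if "create" in actions and "delete" in actions:
            counts["replace"] += 1
        else:
            counts[actions[0]] += 1
    return dict(counts)


def destructive_addresses(plan: dict) -> list:
    """List resources that will be deleted or replaced; reviewers read these first."""
    return [
        change["address"]
        for change in plan.get("resource_changes", [])
        if "delete" in change["change"]["actions"]
    ]


# Hypothetical plan fragment, trimmed to the fields used above.
sample_plan = {
    "resource_changes": [
        {"address": "aws_vpc.main", "change": {"actions": ["no-op"]}},
        {"address": "aws_subnet.a", "change": {"actions": ["create"]}},
        {"address": "aws_instance.web", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["update"]}},
        {"address": "aws_db_instance.old", "change": {"actions": ["delete"]}},
    ]
}

print(summarize_plan(sample_plan))
print(destructive_addresses(sample_plan))
```

Surfacing the destructive addresses at the top of the PR comment is a cheap way to keep a long plan reviewable.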

policy-as-code + testing

VII. Policy-as-code, testing, and secrets

Review by a human catches design problems; it does not reliably catch a forgotten encryption flag at 6pm on a Friday. Policy-as-code and automated tests are how you make the non-negotiables non-negotiable — enforced by the pipeline rather than by vigilance.

Policy-as-code evaluates rules against your plan before apply and fails the build on a violation. The three common engines: Sentinel (HashiCorp's policy language, tightly integrated with HCP Terraform, formerly Terraform Cloud, and Terraform Enterprise), OPA/Rego (Open Policy Agent: vendor-neutral, works across tools and beyond IaC), and Checkov (an open-source scanner from Bridgecrew/Prisma with a large library of built-in checks for AWS misconfigurations). A typical policy set blocks public S3 buckets, unencrypted EBS volumes and RDS instances, security groups open to 0.0.0.0/0 on sensitive ports, IAM policies with wildcard actions on wildcard resources, and missing mandatory tags. The win is that these rules run on every plan automatically, so compliance becomes a property of the pipeline.
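To make the idea concrete without committing to any one engine, here is a hand-rolled sketch of two of those rules evaluated against the plan's JSON form (as exported by terraform show -json). It is not Sentinel, OPA or Checkov; it only illustrates the shape of a plan-time check. The port list, messages and sample plan are assumptions, and values unknown until apply are not handled.

```python
SENSITIVE_PORTS = (22, 3389)  # SSH and RDP; an assumed, illustrative list


def rule_exposes_sensitive_port(rule: dict) -> bool:
    """True if an ingress rule opens a sensitive port to the whole internet."""
    if "0.0.0.0/0" not in (rule.get("cidr_blocks") or []):
        return False
    if str(rule.get("protocol")) == "-1":  # "-1" means all protocols and ports
        return True
    low = rule.get("from_port") or 0
    high = rule.get("to_port") or 0
    return any(low <= port <= high for port in SENSITIVE_PORTS)


def violations(plan: dict) -> list:
    """Return human-readable violations; an empty list means the plan passes."""
    found = []
    for change in plan.get("resource_changes", []):
        after = change["change"].get("after")
        if after is None:  # resource is being deleted; nothing to check
            continue
        if change["type"] == "aws_ebs_volume" and not after.get("encrypted"):
            found.append(f"{change['address']}: EBS volume is not encrypted")
        if change["type"] == "aws_security_group":
            rules = after.get("ingress") or []
            if any(rule_exposes_sensitive_port(rule) for rule in rules):
                found.append(f"{change['address']}: sensitive port open to 0.0.0.0/0")
    return found


# Hypothetical plan fragment with one violation of each kind.
bad_plan = {
    "resource_changes": [
        {"address": "aws_ebs_volume.data", "type": "aws_ebs_volume",
         "change": {"actions": ["create"], "after": {"encrypted": False}}},
        {"address": "aws_security_group.ssh", "type": "aws_security_group",
         "change": {"actions": ["create"], "after": {"ingress": [
             {"cidr_blocks": ["0.0.0.0/0"], "protocol": "tcp",
              "from_port": 22, "to_port": 22}]}}},
    ]
}

for problem in violations(bad_plan):
    print(problem)
```

In CI the build would fail whenever the list is non-empty. A dedicated engine adds what this sketch lacks: a maintained rule library, exemptions, and reporting.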

Testing spans a spectrum. At the cheapest end, tflint and validate catch syntax and provider misuse in seconds. Static security scanners (tfsec, Checkov) catch misconfigurations without deploying anything. At the more expensive end, Terratest (a Go library) and Terraform's native test framework actually deploy resources into a sandbox account, assert that they behave correctly, and tear them down — real integration tests for infrastructure. CDK and Pulumi users get unit testing in their native language essentially for free, asserting on the synthesized template. Most teams should run lint plus static scanning on every PR and reserve full deploy-and-assert tests for the highest-value modules.

Secrets must never live in state in plaintext, in variable files committed to git, or in plan output. Terraform/OpenTofu state can contain sensitive values, so encrypt state at rest (S3 SSE, or OpenTofu/Pulumi native state encryption) and pull secrets at apply time from AWS Secrets Manager, SSM Parameter Store (SecureString), or a vault — referenced by name, not embedded. Mark sensitive variables as such so they are redacted from logs, and prefer IAM roles and short-lived credentials over any stored long-lived key. The cardinal sin is a long-lived AWS access key committed to a .tfvars file in the repo.
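A cheap guard against that cardinal sin is a pre-commit or CI scan for strings shaped like long-lived access key IDs. The sketch below assumes the well-known AKIA prefix followed by 16 uppercase alphanumerics; it is a heuristic, not a replacement for a dedicated secret scanner, and the file content shown is a made-up example using AWS's documented placeholder key.

```python
import re

# Long-lived IAM user access key IDs start with AKIA and are 20 characters long.
ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_access_key_ids(text: str) -> list:
    """Return (line_number, match) pairs for every suspected access key ID."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in ACCESS_KEY_ID.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


# Hypothetical .tfvars content containing AWS's documentation example key.
tfvars = '''region = "eu-west-1"
access_key = "AKIAIOSFODNN7EXAMPLE"
instance_count = 3
'''

print(find_access_key_ids(tfvars))  # [(2, 'AKIAIOSFODNN7EXAMPLE')]
```

A hit should fail the commit and trigger key rotation, since a key that has reached git history has to be treated as exposed.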

policy + testing layers for AWS IaC · 2026

| Layer | Representative tools | Runs when | Cost | Catches |
| --- | --- | --- | --- | --- |
| Lint / validate | tflint, cfn-lint, validate/synth | Every PR | Seconds | Syntax, deprecated args, convention breaks |
| Static security scan | Checkov, tfsec | Every PR | Seconds | Public buckets, missing encryption, open SGs |
| Policy-as-code | Sentinel, OPA/Rego | Every plan | Seconds | Org guardrails, tagging, IAM wildcards |
| Integration tests | Terratest, native test, CDK assertions | High-value modules / nightly | Minutes–hours | Real behavior in a sandbox account |

Run the top three on every change; reserve deploy-and-assert integration tests for foundational modules where a regression is expensive. The goal is that an insecure plan fails the build before a human ever sees it.
the slow killers

VIII. Drift detection and migrating off ClickOps

The two slow killers of an IaC estate are drift (reality diverging from code) and ClickOps (humans changing infrastructure in the console). They are the same disease: every manual change makes the code a little more of a lie until, eventually, nobody trusts a plan and people stop using the pipeline at all.

Drift happens when something changes a resource outside your IaC — an engineer tweaks a security group in the console during an incident, an auto-scaling process adjusts capacity, or a separate tool mutates a tag. The defense is detection on a schedule: run terraform plan (or CloudFormation drift detection, or the Pulumi equivalent) on a cron, and alert when the plan is non-empty against an unchanged codebase. Detected drift forces a decision — either the change was legitimate and should be codified, or it was a mistake and should be reverted by re-applying. What you must not do is let drift accumulate silently; a quarter of unaddressed drift turns every future plan into a minefield.
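For Terraform and OpenTofu, the scheduled check can lean on the plan command's -detailed-exitcode flag, which exits 0 when there is nothing to change, 2 when the plan is non-empty, and 1 on error. A minimal sketch of the cron job's core follows; the function names and the working-directory argument are assumptions, and alerting is left to whatever the team already uses.

```python
import subprocess


def classify_plan_exit(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "in-sync"  # code matches reality
    if code == 2:
        return "drift"    # plan is non-empty against unchanged code
    return "error"        # plan itself failed; investigate before trusting anything


def check_drift(workdir: str, binary: str = "terraform") -> str:
    """Run a plan in the given root configuration and classify the result."""
    result = subprocess.run(
        [binary, "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)


if __name__ == "__main__":
    for code in (0, 1, 2):
        print(code, classify_plan_exit(code))
```

Treating "error" as its own status matters: a drift job that silently fails looks identical to an estate with no drift.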

Migrating off ClickOps — adopting IaC for an estate that was built by hand — is the most common real-world starting point, and it is an incremental, one-way journey rather than a big-bang rewrite. The sequence that works: inventory what exists; write module code that describes the target state; import the existing resources into state so the tool adopts them without recreating them (Terraform/OpenTofu import blocks make this far less painful than the old per-resource CLI imports; CloudFormation supports importing existing resources into a stack); reconcile until plan shows no changes; then repeat for the next set of resources. Start with the highest-blast-radius, slowest-changing resources (networking, IAM, data stores) because those are where a manual mistake hurts most and where codification pays off fastest.
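The "highest blast radius first" ordering and the import step can be sketched together: sort an inventory by category, then emit one import block per resource. The category ranking, the inventory entries and the helper names below are illustrative assumptions; the rendered text follows the import block form (to / id) used by Terraform and OpenTofu.

```python
from dataclasses import dataclass

# Lower number imports first; the categories and their order are assumptions.
PRIORITY = {"networking": 0, "iam": 1, "data": 2, "application": 3}


@dataclass(frozen=True)
class ExistingResource:
    address: str   # address the resource will have in code, e.g. aws_vpc.main
    cloud_id: str  # identifier of the hand-built resource
    category: str


def import_order(resources: list) -> list:
    """Sort so slow-changing, high-blast-radius resources are imported first."""
    return sorted(resources, key=lambda r: PRIORITY.get(r.category, len(PRIORITY)))


def render_import_block(resource: ExistingResource) -> str:
    """Render the import block that adopts an existing resource into state."""
    return (
        "import {\n"
        f"  to = {resource.address}\n"
        f'  id = "{resource.cloud_id}"\n'
        "}\n"
    )


# Hypothetical inventory of a hand-built account.
inventory = [
    ExistingResource("aws_lambda_function.api", "api-handler", "application"),
    ExistingResource("aws_vpc.main", "vpc-0abc1234", "networking"),
    ExistingResource("aws_db_instance.primary", "app-db", "data"),
    ExistingResource("aws_iam_role.deploy", "deploy", "iam"),
]

for resource in import_order(inventory):
    print(render_import_block(resource))
```

Each batch is then reconciled until plan shows no changes before the next batch is imported, which keeps the messy interim state bounded.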

The decisive final step is closing the one-way door: once a resource is under IaC, take write access away from humans in the console. Scope human IAM to read-only for managed resource types and let the CI role be the only principal that can mutate them. Until you do this, drift will keep returning, because the path of least resistance during an incident is always the console. Making the pipeline the only thing that writes is what converts IaC from "we have some Terraform" into "infrastructure is genuinely code."
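One common shape for closing that door is a deny statement that blocks write actions for every principal except the CI role. The sketch below builds such a policy document. The role ARN, the account ID and the short action list are placeholders; a real guardrail would need a much fuller action list and testing in a sandbox before being attached as a service control policy or permissions boundary.

```python
import json


def deny_writes_except(ci_role_arn: str, write_actions: list) -> dict:
    """Build a policy that denies the given write actions unless the caller is the CI role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyWritesExceptCiRole",
                "Effect": "Deny",
                "Action": sorted(write_actions),
                "Resource": "*",
                # The deny applies to every principal whose ARN is not the CI role.
                "Condition": {"ArnNotLike": {"aws:PrincipalArn": ci_role_arn}},
            }
        ],
    }


# Hypothetical CI role and a deliberately small set of security-group write actions.
policy = deny_writes_except(
    "arn:aws:iam::123456789012:role/ci-apply",
    [
        "ec2:RevokeSecurityGroupIngress",
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:ModifySecurityGroupRules",
    ],
)

print(json.dumps(policy, indent=2))
```

Generating the document from the list of managed resource types keeps the guardrail in step with the migration: as each batch of resources comes under IaC, its write actions are added to the deny list.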

  • Detect drift on a schedule — Run plan / drift detection on a cron and alert on any non-empty diff against unchanged code. Never let drift accumulate silently.
  • Codify or revert — decide every time — Each detected drift is a fork: legitimate change → bring it into code; mistake → re-apply to revert. Leaving it unresolved is the failure mode.
  • Import the highest-blast-radius resources first — Networking, IAM, and data stores before stateless app resources. Use import blocks (Terraform/OpenTofu) or stack resource import (CloudFormation).
  • Accept a messy interim state — A half-migrated estate is normal and fine. Converge incrementally; do not block on a perfect big-bang cutover.
  • Close the door: read-only console for humans — Once a resource is managed, the CI role is the only principal that may write it. This is what makes the migration stick.
the reference checklist

IX. The IaC best-practices checklist

The compressed reference. If you are auditing an existing AWS IaC estate or standing one up from scratch, work down this list. Each item maps to one of the five properties from the opening section — reproducible, reviewable, isolated, governed, honest.

No estate hits all of these on day one, and that is fine — the list is a direction, not a gate. The ordering roughly tracks how badly each gap will hurt: state and isolation first, because they cause the catastrophic failures; review gates and policy next, because they prevent the recurring ones; testing and drift discipline last, because they are what keep a healthy estate healthy over years.

  • One primary tool, standardized — A single primary IaC tool per estate with documented conventions. A second tool only at an explicit, non-overlapping boundary.
  • Remote state, encrypted, versioned, locked — State in S3 (or managed by CDK/CloudFormation/Pulumi), encrypted at rest, versioned, with locking so concurrent applies cannot race.
  • State isolated per environment and per component — No single shared state across environments. Split by env first, then by component as state grows. Mirror your AWS account topology.
  • Small, versioned, explicitly-typed modules — One responsibility per module; pinned versions, never floating on main; foundational modules separated from environment composition.
  • Multi-account landing zone — Separate AWS accounts per environment (and for security/logging/shared services). Tags are not a security boundary; the account is.
  • PR-driven plan, CI-only apply — Every change is a PR with a machine-generated plan posted for review. Apply the saved plan artifact from CI — never from a laptop, never a fresh re-plan.
  • Graduated apply gates — Auto-apply dev, review staging, review-plus-manual-approval for production. Friction scales with blast radius.
  • Short-lived CI credentials (OIDC) — CI federates into an AWS IAM role via OIDC. No long-lived access keys in CI secrets or repos.
  • Policy-as-code on every plan — Sentinel / OPA / Checkov enforce org guardrails — no public buckets, no unencrypted volumes, no wildcard IAM, mandatory tags — automatically.
  • Tiered testing — Lint + static scan on every PR; deploy-and-assert integration tests (Terratest / native test / CDK assertions) for high-value modules.
  • Secrets out of state and out of git — Pull from Secrets Manager / SSM SecureString / vault at apply time; encrypt state; mark sensitive variables; never commit keys.
  • Scheduled drift detection + read-only console — Cron drift detection with alerting; codify-or-revert every diff; revoke human console write access once a resource is managed.
where most teams actually are

In practice, most teams have a tool and some modules but fail on three things: state isolation (one shared state), apply gates (people still apply from laptops), and the read-only-console door (ClickOps never stopped). Fixing those three moves an estate from fragile to durable faster than any tool migration.

tool selection at a glance

Terraform vs OpenTofu vs CDK vs CloudFormation vs Pulumi

The honest one-screen summary. There is no universal winner; there is a right default for your team and estate. Read across the row that matches your constraints.

| Dimension | Terraform | OpenTofu | AWS CDK | CloudFormation | Pulumi |
| --- | --- | --- | --- | --- | --- |
| Language | HCL (declarative) | HCL (declarative) | TS/Py/Java/Go/C# | YAML/JSON | TS/Py/Go/C# |
| License | BSL (source-available) | MPL-2.0 (open source) | Apache 2.0 | AWS-native | Apache 2.0 (core) |
| Cloud scope | Multi-cloud | Multi-cloud | AWS-only | AWS-only | Multi-cloud |
| State model | Self-managed (S3 + lock) | Self-managed (S3 + lock) | CloudFormation-managed | CloudFormation-managed | Managed backend (default) |
| Ecosystem / hiring | Largest | Large (Terraform-compatible) | Strong on AWS | AWS-native | Smaller |
| Best default for | Most teams; multi-cloud | Terraform workflow, no BSL | AWS-committed, strong devs | AWS-only, zero deps | Multi-cloud + real code |

Terraform and OpenTofu share a workflow; the choice between them is mostly governance/licensing (see the dedicated OpenTofu vs Terraform cornerstone). CDK and CloudFormation share an engine; CDK adds abstraction and testing on top. Pulumi is the multi-cloud, real-code, managed-state option. Standardize on one; running several without a clear boundary is the actual mistake.
a recent match

From ClickOps to a governed IaC estate — anonymized

inquiry · seed-stage b2b SaaS, 9 engineers, single AWS account
Seed-stage B2B SaaS, ~9 engineers, everything hand-built in one AWS account via the console

Situation: The entire production environment had been click-built over 18 months: one account, no separation between prod and experiments, security groups edited live during incidents, and zero state in code. A failed SOC 2 pre-assessment flagged the missing change control and the absent account isolation. The team had no one with deep Terraform or landing-zone experience and could not afford a 2-month internal detour from product.

What CloudRoute did: Routed within 24 hours to a vetted AWS partner with landing-zone and Terraform delivery experience. The partner stood up a multi-account landing zone (separate prod / staging / dev accounts plus security + logging), imported the highest-blast-radius existing resources (VPC, IAM, RDS) into Terraform with import blocks, wired remote state in S3 with native locking and encryption, and built a GitHub Actions pipeline with tflint + Checkov + a gated plan/apply flow using OIDC short-lived credentials. Human console access was reduced to read-only for managed resources.

Outcome: By the end of the engagement, every infrastructure change went through PR-with-plan review; the SOC 2 change-control and isolation gaps were closed; drift detection ran nightly with alerting. Because the work qualified under AWS partner-funding for foundational engagements, the customer paid $0: AWS funded the partner, and CloudRoute was paid a routing commission by the partner.

engagement window: ~6 weeks · accounts created: 4 · resources imported: highest-blast-radius first · console writes for humans: revoked · cost to customer: $0

faq

Common questions

Terraform or OpenTofu in 2026 — which should a new AWS project start with?
For most teams the workflow is identical, so the decision is governance, not capability. Choose OpenTofu if open-source licensing matters to your organization or you want to avoid the Business Source License entirely — it is MPL-2.0, HCL-compatible, and a near drop-in. Choose Terraform if you want the largest ecosystem, the deepest registry of modules, the biggest hiring pool, and first-class support from commercial tooling that still tracks HashiCorp releases. Either is a defensible default; both are far better than running neither.
When does AWS CDK make more sense than Terraform on AWS?
CDK is the stronger choice when your team is AWS-committed for the foreseeable future, has strong software engineers who prefer a real programming language, and values high-level constructs that encode AWS best practices plus native unit testing. You inherit the CloudFormation engine (its rollback model, stack limits, and slower convergence) and you give up multi-cloud portability and Terraform-style plan introspection. If you might run other clouds, or you want the largest module ecosystem and hiring pool, Terraform or OpenTofu is the safer default.
What is the single most common IaC mistake on AWS?
Sharing one state file (or one giant stack) across all environments. It makes every plan lock every environment, gives every apply an unbounded blast radius, and turns a fat-fingered destroy into a cascade across dev, staging, and production at once. The fix is to isolate state per environment first — ideally in separate AWS accounts — and then per component once a state grows large. This single change prevents most catastrophic IaC incidents.
How should remote state and locking be configured for Terraform/OpenTofu on AWS?
Store state in an S3 bucket with server-side encryption and versioning enabled, keep the bucket private (S3 Block Public Access plus a restrictive bucket policy), and enable locking so a second apply blocks rather than races. As of late 2024 the S3 backend supports native lockfile-based locking (the use_lockfile option), which removes the older separate DynamoDB lock-table dependency for new setups; existing DynamoDB-based locking remains valid. Versioning matters because it lets you roll back a corrupted state object.
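A minimal sketch of those backend settings, assuming a hypothetical bucket name, region, and a one-state-object-per-environment-and-component key layout. It emits Terraform's JSON configuration syntax (for example a backend.tf.json file) rather than HCL so the result can be generated and checked in code; encrypt and use_lockfile are S3 backend arguments, while the function names and defaults are illustrative only.

```python
import json


def s3_backend(env: str, component: str,
               bucket: str = "example-tfstate",    # hypothetical bucket name
               region: str = "us-east-1") -> dict:  # hypothetical region
    """Build an S3 backend block with one state object per environment and component."""
    if not env or not component:
        raise ValueError("env and component are both required")
    return {
        "terraform": {
            "backend": {
                "s3": {
                    "bucket": bucket,
                    # Separate keys keep each environment/component in its own state file.
                    "key": f"{env}/{component}/terraform.tfstate",
                    "region": region,
                    "encrypt": True,       # server-side encryption of the state object
                    "use_lockfile": True,  # native S3 lockfile locking, no DynamoDB table
                }
            }
        }
    }


def render_backend(env: str, component: str) -> str:
    """Serialize the backend block as JSON text suitable for a *.tf.json file."""
    return json.dumps(s3_backend(env, component), indent=2)
```

Bucket versioning and public-access blocking are properties of the bucket itself, so they are configured where the bucket is created, not in this backend block. Using separate AWS accounts per environment would additionally mean a separate bucket per account, which this sketch does not model.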
How do you keep secrets out of IaC and out of state?
Never commit credentials to .tfvars or any file in the repo, and never embed plaintext secrets in resource definitions. Pull secrets at apply time from AWS Secrets Manager or SSM Parameter Store SecureString (or a vault), referenced by name. Encrypt state at rest — S3 server-side encryption, or OpenTofu/Pulumi native state encryption — because state can contain sensitive values. Mark sensitive variables so they are redacted from logs, and use short-lived OIDC-federated IAM roles for CI instead of any long-lived access key.
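One way to back the "never commit credentials" rule is a small pre-commit check over .tfvars content. The sketch below is illustrative, not exhaustive: the two patterns and the function name are assumptions made for this example, and a dedicated secret scanner covers far more cases.

```python
import re

# AWS access key IDs start with AKIA (long-lived) or ASIA (temporary) plus 16 characters.
AWS_ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")

# Hypothetical heuristic: a variable whose name ends in a secret-like word
# and is assigned a quoted literal. Names such as db_secret_name do not match,
# because referencing a secret by name is the recommended pattern.
SECRET_ASSIGNMENT = re.compile(
    r'(?i)^\s*[\w-]*(?:password|secret|token|api_key)\s*=\s*"[^"]+"'
)


def find_plaintext_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, finding_kind) for each suspicious line in the text."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if AWS_ACCESS_KEY_ID.search(line):
            findings.append((lineno, "aws-access-key-id"))
        elif SECRET_ASSIGNMENT.search(line):
            findings.append((lineno, "literal-secret"))
    return findings
```

A non-empty result would fail the commit or the CI job; an empty result says only that these two patterns did not match, not that the file is clean.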
What does policy-as-code actually enforce, and which tool should I use?
Policy-as-code evaluates rules against your plan before apply and fails the build on a violation. Typical rules block public S3 buckets, unencrypted volumes and databases, security groups open to 0.0.0.0/0 on sensitive ports, wildcard IAM actions on wildcard resources, and missing mandatory tags. Sentinel is the right fit if you are on HCP Terraform (formerly Terraform Cloud) or Terraform Enterprise; OPA/Rego is the vendor-neutral choice that works across tools; Checkov is an easy open-source scanner with a large built-in AWS rule library. Many teams run Checkov on every PR and layer OPA or Sentinel for custom org guardrails.
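To make "evaluates rules against your plan" concrete, here is a hedged sketch of two such rules applied to the JSON plan representation (the output of terraform show -json on a saved plan). It is not how Sentinel, OPA, or Checkov are implemented; the port list and function name are assumptions for illustration, and only the resource_changes / change.after fields of the plan are read.

```python
SENSITIVE_PORTS = {22, 3389, 5432}  # illustrative choice of "sensitive" ports


def violations(plan: dict) -> list[str]:
    """Return human-readable violations found in a JSON plan document."""
    found = []
    for rc in plan.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after")
        if after is None:
            continue  # resource is being deleted; nothing to check
        addr, rtype = rc.get("address", "?"), rc.get("type")
        if rtype == "aws_security_group":
            for rule in after.get("ingress") or []:
                open_world = "0.0.0.0/0" in (rule.get("cidr_blocks") or [])
                lo = rule.get("from_port") or 0
                hi = rule.get("to_port") or 0
                if open_world and any(lo <= p <= hi for p in SENSITIVE_PORTS):
                    found.append(f"{addr}: sensitive port open to 0.0.0.0/0")
        elif rtype == "aws_ebs_volume" and not after.get("encrypted"):
            # Conservative: an unset or not-yet-known value is treated as a violation.
            found.append(f"{addr}: unencrypted EBS volume")
    return found
```

In a pipeline, the plan JSON would be loaded with json.load and the job would fail whenever the returned list is non-empty. Rules on standalone security group rule resources, S3, IAM, and tags are left out to keep the sketch short.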
How do you migrate an existing ClickOps AWS environment to IaC without downtime?
Incrementally, never big-bang. Inventory what exists, write module code describing the target state, then use import (Terraform/OpenTofu import blocks, or CloudFormation resource import) so the tool adopts the live resources without recreating them, and reconcile until plan shows no changes. Start with the highest-blast-radius, slowest-changing resources — networking, IAM, data stores. Accept a messy half-migrated interim state and converge. Crucially, once a resource is managed, revoke human console write access so the pipeline is the only thing that writes it — otherwise drift returns.
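The inventory-then-import step can be sketched as a small generator that turns an inventory into Terraform/OpenTofu import blocks. The resource addresses and IDs in the usage note are placeholders, and the helper names are hypothetical; the matching resource blocks still have to be written (or generated) separately before plan converges to no changes.

```python
def import_block(address: str, resource_id: str) -> str:
    """Render one import block that adopts a live resource into state."""
    if '"' in resource_id or "\n" in resource_id:
        raise ValueError("resource_id must not contain quotes or newlines")
    return f'import {{\n  to = {address}\n  id = "{resource_id}"\n}}\n'


def render_imports(inventory: list[tuple[str, str]]) -> str:
    """Render import blocks in inventory order (highest blast radius first)."""
    return "\n".join(import_block(addr, rid) for addr, rid in inventory)
```

For example, render_imports([("aws_vpc.main", "vpc-0123456789abcdef0")]) yields a block with to = aws_vpc.main and the VPC ID as id. Ordering the inventory with networking, IAM, and data stores first follows the sequencing described above.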
How should CI/CD gate IaC changes for production?
Every change is a pull request that triggers format/validate, lint, a policy + security scan, and a plan whose output is posted for human review. On merge, CI applies the saved plan artifact — not a fresh re-plan, which could pick up changes the reviewer never saw — using short-lived OIDC credentials. Production specifically should require a manual approval step after the plan is reviewed, so even a merged change pauses before touching prod. A common graduation is auto-apply for dev, review for staging, and review-plus-manual-approval for production.
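The two gates described above (apply only the plan that was reviewed, and require more approvals as the environment gets more sensitive) can be sketched as a single decision function. This is an illustration under stated assumptions, not a real CI system's API: the gate names, the environment map, and the idea of recording a SHA-256 digest of the plan file at review time are all choices made for this example.

```python
import hashlib

# Approvals required per environment, mirroring the dev/staging/prod graduation.
REQUIRED_GATES = {
    "dev": set(),                                 # auto-apply
    "staging": {"plan_reviewed"},                 # review
    "prod": {"plan_reviewed", "manual_approval"}  # review plus manual approval
}


def plan_digest(plan_bytes: bytes) -> str:
    """Fingerprint a saved plan artifact so it can be compared at apply time."""
    return hashlib.sha256(plan_bytes).hexdigest()


def may_apply(env: str, granted: set[str],
              recorded_digest: str, artifact_bytes: bytes) -> bool:
    """Allow apply only for the recorded plan and only with the gates the env needs."""
    if env not in REQUIRED_GATES:
        raise ValueError(f"unknown environment: {env}")
    if plan_digest(artifact_bytes) != recorded_digest:
        return False  # artifact differs from the plan that was posted for review
    return REQUIRED_GATES[env] <= set(granted)
```

The digest comparison is what rules out a fresh re-plan: if the artifact handed to apply is not byte-identical to the one whose output was posted on the pull request, the function refuses regardless of approvals.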

Want this foundation built right — without the 2-month detour?

CloudRoute routes you to a vetted AWS partner who builds the landing zone, remote state, modules, and gated pipeline to these exact standards. Often AWS-funded, so the customer pays $0. No procurement theater.

matched within: < 24h
typical engagement: 4–8 weeks
cost to you: $0