AWS gives you four well-defined DR strategies — Backup & Restore, Pilot Light, Warm Standby, and Multi-Site Active/Active — that trade money for recovery speed on a sliding scale. This page explains each one with honest RTO/RPO numbers and cost ranges, how to pick the right tier per workload, the services that actually implement them, and why a DR plan you have never tested is not a DR plan.
Every DR conversation collapses into two numbers. Get them right and the architecture chooses itself. Get them wrong — or never define them — and you either overspend on a second region you do not need or discover during an outage that "we have backups" meant something very different from "we can be back online in an hour."
RTO (Recovery Time Objective) is how long you can be down: the time from the moment a region, AZ, or critical service fails to the moment you are serving traffic again. If your RTO is four hours, anything that restores in three hours passes; anything that takes six fails. RTO is the variable the four DR strategies primarily compete on.
RPO (Recovery Point Objective) is how much data you can afford to lose. If your last good copy is from 15 minutes before the failure, everything written in that window is gone, which is acceptable only if your RPO is 15 minutes or more. The RPO you can actually meet is set almost entirely by your replication and backup cadence: continuous replication gives seconds, hourly snapshots give up to an hour, nightly backups give up to a day.
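The pass/fail logic behind both numbers can be sketched in a few lines of Python. This is a minimal illustration, not a tool: the `Objectives` class, the `meets` helper, and the sample numbers are all hypothetical, and the worst-case data loss is assumed to equal the full backup interval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objectives:
    rto_minutes: float  # maximum tolerable downtime
    rpo_minutes: float  # maximum tolerable data-loss window


def worst_case_data_loss_minutes(backup_interval_minutes: float) -> float:
    """Assumed worst case: the failure hits just before the next backup runs."""
    return backup_interval_minutes


def meets(obj: Objectives, recovery_minutes: float, backup_interval_minutes: float) -> bool:
    """True only if both the recovery time and the data-loss window fit the objectives."""
    loss = worst_case_data_loss_minutes(backup_interval_minutes)
    return recovery_minutes <= obj.rto_minutes and loss <= obj.rpo_minutes


# Hypothetical workload: RTO of four hours, RPO of one hour, hourly snapshots.
target = Objectives(rto_minutes=240, rpo_minutes=60)
print(meets(target, recovery_minutes=180, backup_interval_minutes=60))  # True
print(meets(target, recovery_minutes=360, backup_interval_minutes=60))  # False
```

The three-hour restore passes and the six-hour restore fails, matching the example in the text.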
The trap is treating these as a single company-wide setting. They are per-workload. The blast radius of losing an hour of payment records is not the blast radius of losing an hour of a staging environment. Mature teams write an RTO/RPO pair next to each tier-1 system, then map each system to the cheapest DR strategy that meets it. That mapping is the entire job — the AWS services are just implementation.
One honest caveat for 2026: vendors love quoting "RTO of minutes." Real RTO includes the unglamorous parts — DNS propagation, connection draining, warming caches and connection pools, re-pointing application config, and a human deciding to actually pull the trigger. A design that can technically fail over in 4 minutes often takes 25 in practice because step one is paging someone at 3am. Test to find your real number.
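To make the gap concrete, here is a small sketch that adds up the steps listed above. Every timing is an invented, illustrative number chosen to reproduce the "4 minutes on paper, 25 in practice" example; your own game day supplies the real values.

```python
# Illustrative step timings in minutes. All values are assumptions.
failover_steps = {
    "page on-call and get a human online": 8,
    "decide to declare the disaster": 5,
    "technical failover": 4,
    "DNS propagation and connection draining": 3,
    "warm caches and connection pools": 3,
    "re-point application config": 2,
}

technical_rto = failover_steps["technical failover"]
real_rto = sum(failover_steps.values())

print(f"technical RTO: {technical_rto} min, real RTO: {real_rto} min")
# prints "technical RTO: 4 min, real RTO: 25 min"
```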
For each tier-1 system, answer two questions in dollars: what does an hour of this being down cost us, and what does an hour of lost data cost us? If both answers are large, you are looking at Warm Standby or Active/Active. If lost data is expensive but some downtime is survivable, Pilot Light. If both are cheap, Backup & Restore is the correct, non-lazy answer.
AWS's own Well-Architected guidance defines four DR strategies. They form a ladder: as you climb, RTO and RPO shrink and the monthly bill grows, because you are paying to keep progressively more of the recovery region running before disaster strikes. Here is each one with what it really involves.
Read these as a spectrum, not four boxes. The dividing line between them is simply "how much of the second region is already running when the primary dies?" — nothing for Backup & Restore, just the data for Pilot Light, a small live stack for Warm Standby, a full live stack for Active/Active.
What runs in the recovery region: nothing, until you need it. You keep backups — AMIs, EBS snapshots, RDS snapshots, S3 data — copied to a second region (or at least a second AZ). When disaster hits, you provision fresh infrastructure from those backups.
RTO: hours to days, depending on how much you have automated the rebuild. With infrastructure-as-code (Terraform/OpenTofu/CloudFormation) the environment can stand back up in a few hours; without it, expect a long, error-prone day. RPO: the age of your last backup — typically 1–24 hours.
Cost: the cheapest tier by far. You pay for snapshot/object storage and cross-region transfer, not for idle compute. For most workloads this is single- to low-double-digit dollars per month over your normal bill.
Right for: internal tools, analytics, batch pipelines, dev/staging, and any tier-2/tier-3 system where a few hours offline is annoying but not existential. It is also the correct default for brand-new startups who cannot yet justify a second live region.
What runs in the recovery region: the core data layer, always on and continuously replicated — your database (via RDS/Aurora cross-region replicas) and critical object storage (via S3 Cross-Region Replication). The application and compute tier exist as definitions (AMIs, IaC, container images) but are switched off. The "pilot light" is lit; you scale up the rest on failover.
RTO: roughly 10–60 minutes — the time to launch and scale the compute tier and re-point traffic, since the slow part (restoring data) is already done. RPO: seconds to a few minutes, set by replication lag.
Cost: moderate. You pay continuously for the replicated database and storage in the second region, but not for idle application servers. Often 20–40% of running a full duplicate stack.
Right for: production systems where losing recent data is unacceptable but a 15–45 minute recovery is tolerable — many B2B SaaS apps, line-of-business systems, and workloads with strict RPO but relaxed RTO sit here comfortably.
What runs in the recovery region: a fully functional but scaled-down copy of the entire stack — data layer plus a minimal always-on compute tier. The standby can serve traffic immediately; on failover you scale it up to full capacity and shift traffic over.
RTO: minutes. There is no cold start — the application is already running, just small. RPO: seconds, via continuous replication. Cost: meaningfully higher than Pilot Light because you run real (if minimal) compute around the clock in two regions; typically 40–60% of a full second stack.
Right for: revenue-critical and customer-facing production systems where minutes of downtime cost real money or trust — payment processing, core auth, primary customer APIs. This is the most common landing spot for a funded startup's tier-1 services.
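One common way to shift traffic to a standby is Route 53 failover routing; that choice is an assumption here, not something the strategy itself mandates. The sketch below only builds the change batch that would be passed to Route 53's `ChangeResourceRecordSets` API (for example via boto3's `change_resource_record_sets`), so it runs without AWS credentials. The domain names, set identifiers, and health-check ID are hypothetical placeholders.

```python
import json


def failover_record(name, target, role, set_id, health_check_id=None, ttl=60):
    """Build one failover record-set change; role is 'PRIMARY' or 'SECONDARY'."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": set_id,
        "Failover": role,
        "TTL": ttl,  # a short TTL keeps the DNS share of RTO small
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        # The primary needs a health check so Route 53 knows when to fail over.
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


change_batch = {
    "Comment": "Warm Standby failover pair (hypothetical names)",
    "Changes": [
        failover_record("api.example.com", "primary-lb.example.com", "PRIMARY",
                        "primary-region", health_check_id="hc-primary-placeholder"),
        failover_record("api.example.com", "standby-lb.example.com", "SECONDARY",
                        "recovery-region"),
    ],
}

print(json.dumps(change_batch, indent=2))
```

Keeping the record definitions in code (or IaC) means the failover wiring exists before the outage rather than being typed into a console during one.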
What runs in the recovery region: everything, at full scale, actively serving production traffic. Two (or more) regions both take live requests behind Route 53 or a global load balancer. If one region fails, the other simply absorbs the load — there is no "failover event," just degraded capacity.
RTO: effectively zero (seconds of DNS/health-check reaction). RPO: near-zero, but this is the hard part — active/active forces you to solve multi-region data consistency (Aurora Global Database with write-forwarding, DynamoDB Global Tables, or app-level conflict handling).
Cost: the most expensive tier — you are running 150–200%+ of a single-region footprint, plus the engineering cost of genuinely multi-region-safe application code. Right for: systems where downtime is catastrophic or contractually forbidden — large-scale fintech, healthcare, trading, and anything with an SLA that leaves no room for a recovery window. Most startups do not need this on day one and should not pretend they do.
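As a rough worked example of the cost ladder, the sketch below turns the percentage ranges quoted above into a monthly bill. It assumes a full duplicate stack costs about the same as the primary region, so Pilot Light at 20–40% of a duplicate becomes 120–140% of the single-region bill, and so on. The $10,000 baseline is a made-up figure, and Backup & Restore is omitted because its cost is storage-driven rather than a percentage of compute.

```python
# (low %, high %) of the total monthly bill relative to a single-region baseline.
# Derived from the ranges in the text under the equal-cost assumption above.
TOTAL_BILL_PERCENT = {
    "Pilot Light": (120, 140),
    "Warm Standby": (140, 160),
    "Multi-Site Active/Active": (150, 200),
}


def monthly_range(baseline_usd: int, strategy: str) -> tuple[int, int]:
    """Return the (low, high) estimated total monthly bill in whole dollars."""
    low, high = TOTAL_BILL_PERCENT[strategy]
    return baseline_usd * low // 100, baseline_usd * high // 100


baseline = 10_000  # hypothetical single-region bill, USD per month
for strategy in TOTAL_BILL_PERCENT:
    low_usd, high_usd = monthly_range(baseline, strategy)
    print(f"{strategy}: ${low_usd:,} to ${high_usd:,} per month")
```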
The single most common DR mistake is picking one strategy for the whole company. The second most common is picking the most expensive one out of anxiety. The right design is almost always a mix, assigned workload by workload against the RTO/RPO pairs you wrote down earlier.
Start by sorting your systems into tiers. Tier-1 is anything whose outage directly stops revenue or breaks a contractual SLA — payments, auth, the core product API. Tier-2 is important but survivable for an hour or two — internal dashboards, secondary features, async workers. Tier-3 is everything you could lose for a day without a customer noticing — analytics, batch ETL, dev and staging.
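Then assign the cheapest strategy that meets each tier's RTO/RPO. The output for a typical funded startup is a blend, not a single choice: for example, Warm Standby for tier-1, Pilot Light for tier-2 services that need fresh data, and Backup & Restore for tier-3 and the rest of tier-2.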
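A minimal sketch of that assignment in Python: walk the ladder from cheapest to most expensive and stop at the first strategy whose worst-case RTO and RPO fit the workload. The ceilings in `LADDER` are illustrative readings of the ranges in this article (Backup & Restore is capped at 24 hours purely for the example), and the workloads and their targets are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float  # assumed worst-case recovery time
    rpo_minutes: float  # assumed worst-case data loss


# Ordered cheapest first. All ceilings are assumptions for illustration.
LADDER = [
    Strategy("Backup & Restore", rto_minutes=24 * 60, rpo_minutes=24 * 60),
    Strategy("Pilot Light", rto_minutes=60, rpo_minutes=5),
    Strategy("Warm Standby", rto_minutes=10, rpo_minutes=1),
    Strategy("Multi-Site Active/Active", rto_minutes=1, rpo_minutes=0.1),
]


def cheapest_strategy(rto_minutes: float, rpo_minutes: float) -> str:
    """Return the first (cheapest) strategy that meets both objectives."""
    for s in LADDER:
        if s.rto_minutes <= rto_minutes and s.rpo_minutes <= rpo_minutes:
            return s.name
    raise ValueError("no strategy on the ladder meets these objectives")


# Hypothetical workloads: (RTO minutes, RPO minutes)
workloads = {
    "payments API (tier-1)": (15, 1),
    "async workers (tier-2)": (120, 15),
    "batch ETL (tier-3)": (48 * 60, 24 * 60),
}

for name, (rto, rpo) in workloads.items():
    print(f"{name} -> {cheapest_strategy(rto, rpo)}")
```

With these inputs the tier-1 API lands on Warm Standby, the tier-2 workers on Pilot Light, and the tier-3 pipeline on Backup & Restore.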
Two practical notes. First, AZ failure and Region failure are different problems: a multi-AZ deployment (an RDS Multi-AZ instance, an Auto Scaling group spanning AZs) protects you from the far more common single-AZ outage for comparatively little cost and effort, and should be your baseline before you ever discuss cross-region DR. Second, do not over-engineer for a region-wide failure that is rare, but do not assume it cannot happen either. The honest framing is: multi-AZ is table stakes; cross-region DR is a deliberate, costed decision per tier-1 workload.
Each DR tier is assembled from a small, stable set of AWS services. Knowing which service does what — and where the sharp edges are — is the difference between a DR plan that works on the day and one that fails in a new and surprising way.
These are the load-bearing services for DR on AWS as of 2026. None of them is exotic; the skill is in wiring them together correctly and proving the seams hold.
The component teams forget is everything that is not the database: secrets, parameter store values, TLS certs, DNS records, IAM roles, KMS keys, and Auto Scaling/launch templates in the recovery region. A database that replicates perfectly is useless if the failover region has no decryption key, no certs, and no idea how to scale the app. Replicate — or IaC — the supporting plane too.
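A simple way to keep the supporting plane honest is to diff what the recovery region actually has against what it needs. The sketch below is only a checklist in code: the category names mirror the list above, and the `recovery_inventory` set is a hypothetical stand-in for whatever your IaC state or an inventory script reports.

```python
# Supporting-plane categories taken from the paragraph above.
REQUIRED = {
    "secrets",
    "parameters",
    "tls_certs",
    "dns_records",
    "iam_roles",
    "kms_keys",
    "launch_templates",
}


def missing_in_recovery(present: set[str]) -> list[str]:
    """Return the required categories absent from the recovery region, sorted."""
    return sorted(REQUIRED - present)


# Hypothetical inventory of the recovery region.
recovery_inventory = {"secrets", "iam_roles", "dns_records", "launch_templates"}

print(missing_in_recovery(recovery_inventory))
# prints ['kms_keys', 'parameters', 'tls_certs']
```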
This is where most DR programs quietly fail. The architecture diagram is correct, the replication is green, and nobody has ever actually failed over. Then a real outage arrives and the team discovers the runbook is three jobs out of date and the one person who understood it has left.
A DR runbook is the precise, ordered, copy-pasteable procedure to recover a system: who declares the disaster, what the failover steps are in order, the exact commands or console actions, how you verify the recovery region is healthy, and — critically — how you fail back once the primary returns. "Restore from backup" is not a runbook. A runbook is the literal sequence, written so a competent on-call engineer who did not build the system can execute it at 3am.
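One way to keep a runbook from decaying into "restore from backup" is to store it as structured data and lint it. The sketch below is an assumption about how such a structure might look, not a standard format; the `Runbook` and `Step` classes, the role name, and the sample steps are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    order: int
    action: str  # the exact command or console action
    verify: str  # how to confirm the step worked


@dataclass
class Runbook:
    system: str
    declared_by: str  # role that declares the disaster
    failover: list[Step] = field(default_factory=list)
    failback: list[Step] = field(default_factory=list)

    def problems(self) -> list[str]:
        """List the gaps that would hurt an on-call engineer at 3am."""
        issues = []
        if not self.declared_by:
            issues.append("nobody is named to declare the disaster")
        if not self.failover:
            issues.append("no failover steps")
        if not self.failback:
            issues.append("no fail-back procedure")
        for step in self.failover + self.failback:
            if not step.verify:
                issues.append(f"step {step.order} has no verification")
        return issues


draft = Runbook(
    system="core API",
    declared_by="incident commander",
    failover=[
        Step(1, "promote the database replica in the recovery region", "writes succeed"),
        Step(2, "scale the standby compute tier to full capacity", ""),
    ],
)

print(draft.problems())
# prints ['no fail-back procedure', 'step 2 has no verification']
```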
A game day is a scheduled, deliberate test where you actually execute the runbook against real (or production-like) infrastructure — ideally injecting a realistic failure (kill the primary database, black-hole a region, terminate the primary AZ) and recovering under a stopwatch. The goals are to (1) prove the real RTO/RPO, (2) find the broken/stale steps, and (3) build muscle memory so the real event is boring. AWS Fault Injection Service (FIS) is purpose-built for injecting these failures safely.
The two findings that come out of almost every first game day: backups that had never been restored end-to-end turn out to be subtly unusable (wrong encryption key, missing dependency, untested restore path), and the measured RTO is 2–4× the design RTO because of the human and DNS steps nobody timed. Both are cheap to fix once found and catastrophic to discover during a real outage.
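Recording each game day as a small, dated record makes the measured-versus-design gap explicit. The sketch below assumes a hypothetical `GameDayResult` structure; the system name, date, timings, and gaps are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class GameDayResult:
    system: str
    run_on: date
    design_rto_min: float
    measured_rto_min: float
    gaps: tuple[str, ...]

    @property
    def rto_ratio(self) -> float:
        """How many times slower the real recovery was than the design target."""
        return self.measured_rto_min / self.design_rto_min

    @property
    def passed(self) -> bool:
        return self.measured_rto_min <= self.design_rto_min


# Hypothetical first game day.
result = GameDayResult(
    system="core API",
    run_on=date(2026, 1, 15),
    design_rto_min=15,
    measured_rto_min=45,
    gaps=("restore failed: wrong KMS key", "DNS cutover was never timed"),
)

print(
    f"{result.system}: measured {result.measured_rto_min:g} min vs "
    f"design {result.design_rto_min:g} min ({result.rto_ratio:.1f}x), "
    f"passed={result.passed}"
)
# prints "core API: measured 45 min vs design 15 min (3.0x), passed=False"
```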
DR is not only about a region going dark. In 2026 the more probable disaster for many companies is ransomware or a malicious/compromised credential deleting your data — including your backups. A backup an attacker can encrypt or delete is not a backup. Immutability is what turns "we have backups" into "we can actually recover."
The classic discipline still holds: keep multiple copies, on more than one medium/account, with at least one copy isolated. On AWS, that translates into a concrete set of controls that make backups tamper-resistant even if an attacker holds production credentials.
Ask: "If an attacker had full admin in our production account right now, could they delete our backups?" If the answer is anything but a confident "no, the vault is locked and lives in a separate account," your backups are part of the blast radius — and your DR plan does not actually cover the most likely 2026 disaster.
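That question can be written down as a check over your backup inventory. The sketch below assumes you already know, for each backup copy, which account holds it and whether it is immutable (for example, held in a locked vault); the `BackupCopy` class and the account names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    account: str
    immutable: bool  # e.g. stored in a locked vault; taken from your own inventory


def survives_prod_compromise(copies: list[BackupCopy], prod_account: str) -> bool:
    """True if at least one copy is both immutable and outside the production account."""
    return any(c.immutable and c.account != prod_account for c in copies)


# Hypothetical inventory: snapshots live only in the production account.
before = [BackupCopy("nightly-snapshots", account="prod", immutable=False)]

# After adding an isolated, locked copy in a separate account.
after = before + [BackupCopy("vault-copy", account="backup", immutable=True)]

print(survives_prod_compromise(before, prod_account="prod"))  # False
print(survives_prod_compromise(after, prod_account="prod"))   # True
```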
For many companies DR stops being a nice-to-have and becomes a control an auditor will test. SOC 2, ISO 27001, HIPAA, PCI DSS, and increasingly customer security questionnaires all expect a documented, tested business-continuity and DR capability — not just a diagram.
The recurring theme across frameworks is the same three demands: you have defined RTO/RPO for critical systems, you have a documented recovery procedure, and you have evidence you have tested it. The thing that fails audits is rarely the absence of backups; it is the absence of proof that recovery has ever been exercised.
Practically, that means your game days double as audit evidence. A dated game-day report — what failed, the measured RTO/RPO, the gaps found, the remediation — is exactly the artifact a SOC 2 or ISO auditor wants to see. Backup immutability and retention policies map directly to common controls (data integrity, availability, backup), and AWS Backup's reports plus Vault Lock give you defensible, machine-generated evidence.
The honest sequencing for a startup heading into its first SOC 2: get multi-AZ in place (table stakes), define RTO/RPO for tier-1 systems, stand up AWS Backup with cross-region copy and Vault Lock, write the runbooks, and run one real game day before the audit window. That is a few weeks of focused work — and exactly the kind of bounded engagement CloudRoute routes to a partner, often AWS-funded for credit-eligible companies.
Knowing the four strategies is the easy part. Designing the right blend for your workloads, wiring up the services correctly, writing runbooks, and actually running a game day is a real chunk of senior platform-engineering work — the kind most startups cannot spare a person for. That is the gap CloudRoute fills.
CloudRoute routes you to a vetted AWS partner who does the work for you: assesses your workloads, sets honest RTO/RPO per tier, designs the cheapest DR architecture that meets them, implements it as infrastructure-as-code, writes the runbooks, and runs the first game day so your real RTO/RPO is measured rather than assumed. You get a tested DR capability without hiring a dedicated SRE.
The economics are the part founders do not expect. For credit-eligible companies, AWS frequently funds most or all of this engagement: the partner is paid through AWS partner-funding programs and your AWS consumption during the work is covered by credits, so the customer pays $0 or a low fee. For companies that are not credit-eligible, it is a vetted-partner referral that skips the hire-and-vet slog: you get a proven DR specialist without spending three months recruiting one. We are deliberately honest about which bucket you are in. The AWS-funded path applies to credit-eligible engagements; otherwise it is a straightforward, high-quality referral.
If DR is on your roadmap because of an audit deadline, a near-miss outage, or a customer security review, the fastest path is to let a partner who has built this dozens of times design and test it — rather than learn cross-region failover for the first time during your first real incident.
DR work is exactly the kind of engagement AWS partner funding and Activate credits are built to cover. If you have not claimed your credits yet, start there (see "$100K AWS credits and the startup path"), then have the partner put that funding toward designing and testing the DR your auditors (and your future 3am self) will thank you for.
The same four-tier ladder in one view. Read it top-to-bottom as increasing cost buying decreasing RTO/RPO, and remember the right answer is usually a blend assigned per workload tier.
| Strategy | RTO (real-world) | RPO | What runs in recovery region | Relative cost | Best for |
|---|---|---|---|---|---|
| Backup & Restore | Hours → days | Hours (1–24h) | Nothing — rebuild from backups on demand | $ (storage only) | Tier-2/3, internal tools, dev/staging, day-one startups |
| Pilot Light | ~10–60 min | Seconds → minutes | Core data live + replicated; compute cold | $$ (~20–40% of full duplicate) | Strict RPO, relaxed RTO production systems |
| Warm Standby | Minutes | Seconds | Scaled-down but live full stack | $$$ (~40–60% of full duplicate) | Revenue-critical tier-1 (payments, auth, core API) |
| Multi-Site Active/Active | Near-zero (seconds) | Near-zero | Full stack, full scale, serving live traffic | $$$$ (150–200%+ of single region) | Downtime catastrophic / contractually forbidden |
Situation: Everything ran in a single region with no tested DR. A prospect's security questionnaire demanded documented RTO/RPO and evidence of DR testing, and a SOC 2 + HIPAA audit was 10 weeks out. The team had "nightly RDS snapshots" but had never restored one end-to-end, no runbooks, and no spare engineer to own it. They also worried — correctly — that an attacker with prod credentials could delete the snapshots.
What CloudRoute did: Routed within a day to a US-East partner with a HIPAA and DR track record. The partner tiered the workloads, set RTO/RPO per tier, and designed a blend: Warm Standby in us-west-2 for the patient-facing API and Aurora (Aurora Global Database, sub-second replication), Pilot Light for secondary services, and Backup & Restore for analytics/dev. They implemented it as Terraform, added AWS Backup with cross-account Vault Lock (Compliance mode) so backups were immutable and isolated, wired Route 53 failover, wrote the runbooks, and ran a game day with AWS FIS killing the primary region.
Outcome: Measured RTO for the tier-1 API came in at 7 minutes (design target was 15); RPO under 5 seconds. The first restore test surfaced a wrong-KMS-key issue that would have made a real recovery fail — fixed before the audit. Dated game-day report became the SOC 2 / HIPAA evidence and cleared the prospect's questionnaire. The engagement was credit-eligible, so AWS funding covered the partner work and the AWS spend during the build — customer paid $0.
engagement window: 5 weeks · founder time: ~6 hours · tier-1 RTO achieved: 7 min · cost to customer: $0
CloudRoute routes you to a vetted AWS partner who sets your RTO/RPO, builds the right strategy as IaC, writes the runbooks, and runs the game day. Credit-eligible? Often AWS-funded — customer pays $0.