
sre aws · 2026 reference

Site reliability engineering on AWS — SLOs, error budgets, on-call, and the reliability patterns that actually hold.

SRE is the discipline of running production like an engineering problem: you set a reliability target, measure against it, spend an error budget instead of arguing about it, and reduce toil until on-call is survivable. This page explains the principles honestly — SLI/SLO/error budgets, blameless postmortems, incident command — then maps them onto the AWS toolchain, the reliability patterns that earn their keep, and the one question most teams get wrong: build the SRE function in-house, or buy fractional SRE and on-call coverage.


reliability target: an SLO · budget unit: error budget · on-call goal: < 2 pages/shift · credit-eligible cost: often $0

TL;DR

SRE turns reliability into a number you manage. You define SLIs (what you measure — latency, availability, error rate), set SLOs (the target, e.g. 99.9% of requests succeed over 28 days), and the gap between the SLO and 100% is your error budget. While budget remains you ship fast; when it is gone you stop feature work and fix reliability. That one mechanism settles most "ship vs. stabilize" arguments without anyone having to win them.
The day-to-day work is toil reduction and incident management, not heroics: alerts that map to user-visible SLO burn (not CPU at 80%), runbooks for the top failure modes, real incident command, and blameless postmortems that produce action items — so the same outage does not page you twice. On AWS the toolchain is CloudWatch + X-Ray + Managed Prometheus/Grafana + OpenTelemetry + Incident Manager, with AWS FIS for chaos game days.
Most startups should not hire a full SRE team at seed/Series-A — the on-call math does not work, and a single SRE hire burns out on a solo rotation. CloudRoute routes you to a vetted AWS partner that brings the SRE practice and a real 24/7 on-call roster: SLOs defined, alerting cleaned up, runbooks written, game days run. For credit-eligible companies it is often AWS-funded, so the customer pays $0 or close to it.

definition

I. What SRE actually is, and what it is not

SRE (Site Reliability Engineering) is the practice of applying software-engineering discipline to operations: you treat reliability as a measurable, budgeted property of the system rather than a vibe, and you reduce manual operational work (toil) by automating it away. It was formalized at Google; the version most teams need in 2026 is lighter than the book but built on the same load-bearing ideas.

The single most useful sentence about SRE is this: 100% reliability is the wrong target for almost everything. Chasing the last fraction of a nine costs exponentially more than the previous one, and users cannot tell 99.99% from 100% because their own network, browser, and ISP introduce more failure than your service does. So SRE starts by choosing a reliability target that is good enough for the business, then treats every fraction below 100% as a budget you are allowed to spend on shipping features.

That reframing is what makes SRE different from "ops" or "DevOps with extra steps." Ops keeps the lights on; SRE decides, with a number, how bright the lights need to be and how much engineering to invest in keeping them there — and it gives product and engineering a shared currency (the error budget) to negotiate speed against stability without it becoming a personality conflict.

SRE is also not a rebranding of the on-call pager. On-call is one input, but bolting a rotation onto an unstable system buys a burnout factory, not a reliability practice. The work that makes on-call humane — good alerts, runbooks, automated remediation, postmortems that actually change the system — is the SRE work; the pager is downstream of it. And none of this is a job title you must hire before you have a problem: a four-engineer startup can adopt SLOs, error budgets, and blameless postmortems in an afternoon. What does not scale down is a full-time, in-house, 24/7 on-call SRE team — exactly the gap a fractional SRE partner fills.

the core mechanism

II. SLIs, SLOs, and error budgets: the mechanism the whole practice runs on

Everything else in SRE is downstream of three definitions. Get these right and the rest of the practice — alerting, on-call, the ship-vs-stabilize decision — falls out almost mechanically. Get them wrong (or skip them) and you are back to arguing about reliability with opinions instead of numbers.

An SLI (Service Level Indicator) is a quantitative measure of one dimension of service health, expressed as a ratio of good events to total events. The ones that cover most services are availability (what fraction of requests succeeded), latency (what fraction were faster than a threshold), error rate (the complement of availability: the fraction of requests that failed), and, for data systems, freshness or correctness. The discipline is to measure as close to the user as possible: a load-balancer-level success rate is far more honest than a CPU graph.

An SLO (Service Level Objective) is the target you commit to for an SLI, over a window. "99.9% of HTTP requests return a non-5xx status within 300 ms over a rolling 28-day window" is complete: indicator, threshold, target, and window. The 28-day rolling window is the common default — long enough that one bad deploy does not blow the quarter, short enough that the budget refreshes usefully.
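As a minimal sketch of the example SLO above, the snippet below computes the SLI as good events over total events and compares it to the target. The `Request` record, the function names, and the choice to treat an empty window as compliant are illustrative assumptions, not part of any AWS API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the caller
    latency_ms: float  # time taken to respond, in milliseconds


def sli(requests: Sequence[Request], latency_threshold_ms: float = 300.0) -> float:
    """Ratio of good events to total events.

    A request is "good" when it is non-5xx AND answered within the threshold.
    """
    if not requests:
        return 1.0  # assumption: a window with no traffic counts as compliant
    good = sum(
        1 for r in requests
        if r.status < 500 and r.latency_ms <= latency_threshold_ms
    )
    return good / len(requests)


def meets_slo(requests: Sequence[Request], target: float = 0.999) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli(requests) >= target


# Four requests: one is too slow and one is a 5xx, so two of four are good.
sample = [
    Request(200, 120.0),
    Request(200, 450.0),
    Request(503, 80.0),
    Request(204, 299.0),
]
print(sli(sample), meets_slo(sample))
```

In a real system the requests would come from load balancer logs or metrics over the rolling 28-day window rather than from an in-memory list.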

The error budget is 100% minus the SLO, converted into a concrete allowance: a 99.9% availability SLO over 28 days allows roughly 40 minutes of unavailability. That 40 minutes is a budget the team may spend — on risky deploys, migrations, chaos experiments. While budget remains, the default is to keep shipping; when it is exhausted, the policy agreed in advance is to freeze risky changes and direct engineering at reliability until it recovers.
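The arithmetic behind that allowance can be sketched in a few lines. The function names are hypothetical; the only inputs are the SLO target and the window length.

```python
def error_budget_minutes(slo: float, window_days: int = 28) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60  # 28 days = 40,320 minutes
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, bad_minutes: float, window_days: int = 28) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - bad_minutes) / budget


for target in (0.995, 0.999, 0.9999):
    print(f"{target:.2%} over 28 days: {error_budget_minutes(target):.1f} min")
```

For 99.5%, 99.9%, and 99.99% this works out to roughly 201.6, 40.3, and 4.0 minutes per 28 days, which is where the "about 40 minutes" figure comes from.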

This works socially, not just technically, because it removes the loudest-voice dynamic: product need not argue that velocity matters and SRE need not argue that stability matters — the error budget already encodes the trade-off, and the only question left is factual (how much budget is left?). That is why SLOs and error budgets are the first thing a good SRE engagement establishes, before touching any infrastructure.

Picking SLO targets you can defend, and alerting on burn rate

The most common mistake is setting an SLO at 99.99% because it sounds responsible. Four nines allows about 4 minutes of downtime per 28 days, so one slightly slow deploy or a single bad node blows the whole 28-day budget and parks you in a permanent feature freeze. Pick the target from the business: what unreliability would a customer actually notice and churn over? For most B2B SaaS, 99.9% (about 40 min/28 days) is plenty, and internal/admin tools live happily at 99.5%. And set SLOs per critical user journey, not per microservice: "checkout succeeds" and "login succeeds" are felt by users; "the inventory service responded" is an implementation detail. A handful of journey-level SLOs maps to revenue, where two hundred per-service dashboards do not.

Once you have an SLO, alert on how fast you are spending the error budget, not on raw thresholds. A multi-window, multi-burn-rate alert pages a human only when the budget is being consumed fast enough to matter — a fast-burn alert (28-day budget gone in hours at the current rate) pages immediately, a slow-burn alert (gone in days) opens a ticket. This is the single biggest lever for reducing pager noise: it replaces dozens of "CPU > 80%" pages that do not correlate with user pain with a few alerts that always mean a user-visible problem is in progress.
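A burn rate is the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends the budget exactly over the window. The sketch below shows one way a multi-window, multi-burn-rate decision could be expressed. The window names and the thresholds (14.4 for paging, 1.0 for tickets) are illustrative assumptions to tune for your own service, not values taken from this page or from CloudWatch.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_rate / (1.0 - slo)


def hours_to_exhaustion(rate: float, window_days: int = 28) -> float:
    """Hours until a full budget is gone if the burn rate stays constant."""
    return window_days * 24 / rate


# (long window, short window, burn-rate threshold, action); assumed values.
RULES = (
    ("1h", "5m", 14.4, "page"),    # fast burn: wake a human
    ("3d", "6h", 1.0, "ticket"),   # slow burn: handle in working hours
)


def alert_action(error_rate_by_window: dict, slo: float) -> str:
    """Fire only when BOTH windows of a rule exceed its threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so a recovered incident stops alerting quickly.
    """
    for long_w, short_w, threshold, action in RULES:
        long_burn = burn_rate(error_rate_by_window[long_w], slo)
        short_burn = burn_rate(error_rate_by_window[short_w], slo)
        if long_burn >= threshold and short_burn >= threshold:
            return action
    return "none"


rates = {"1h": 0.02, "5m": 0.03, "3d": 0.002, "6h": 0.02}
print(alert_action(rates, slo=0.999))
```

With a 99.9% SLO, a 2% error rate over the last hour is a burn rate of about 20, so the sample above pages. If the 5-minute window had already recovered, the same hour of errors would not.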

the daily work

III. Toil reduction and blameless postmortems: where the real engineering lives

Between incidents, the SRE function does two things that compound: it removes toil so the system needs fewer humans to run, and it runs postmortems that convert each incident into a permanent improvement. Skipping either one is how teams end up firefighting the same problems forever.

Toil is the manual, repetitive, automatable operational work that scales with traffic and produces no lasting value — restarting a stuck worker by hand, rotating a credential manually, hand-editing a config to scale up before a sale. Google's rule of thumb caps toil at roughly 50% of an SRE's time; the practical small-team version is "if you have done this by hand three times, automate it before the fourth." The non-obvious payoff is not saved labor hours — it is reducing the chances a tired human fumbles a step at 3 a.m. Every manual step removed from incident response is a step that cannot go wrong under pressure, which is why automated remediation (auto-scaling, auto-healing, self-restarting tasks, automated failover) is core SRE work, not a nice-to-have.

Blameless postmortems are the other half. After any real incident you write up what happened, when, the impact, and — crucially — the contributing causes, assuming everyone acted reasonably given what they knew. The point is not to absolve carelessness; it is that systems fail for systemic reasons (a missing guardrail, a misleading dashboard, an alert that fired too late), and you only surface those if people can speak honestly without fear of being scapegoated. "Bob should be more careful" has found nothing; "the deploy had no automated rollback and the alert lagged 12 minutes behind the SLO burn" has found two fixable things.

The output is action items with owners and dates, tracked like any other work and prioritized against the error budget. The fastest way to tell whether a team actually practices SRE is to ask to see the action items from their last three incidents and count how many shipped. "We wrote a doc and moved on" is the ritual without the engine.

the reliability flywheel

SLOs tell you when reliability matters this week (budget remaining). Postmortems tell you what to fix (contributing causes). Toil reduction tells you how to make the fix stick (automate the manual step out of existence). Run all three and each incident leaves the system stronger; run none and you firefight the same five problems forever.

when it breaks

IV. On-call and incident management on AWS: making 3 a.m. survivable

On-call is where reliability practice meets human cost. A healthy rotation is the visible proof that the upstream SRE work is paying off: few pages, each one actionable, each one with a runbook, and a clear command structure when something genuinely large breaks. This section is the operational backbone, and it is the part CloudRoute partners most often run end-to-end.

A humane rotation has a few non-negotiables. It needs enough people that no one is on call more than roughly one week in four; a two-person rotation is a burnout guarantee, and unsustainable coverage is the single most common reason a lone in-house SRE hire quits. Every alert must be actionable, meaning a page corresponds to a real, user-visible problem with a runbook linked from the alert itself. And there must be an explicit policy that a non-actionable page is the highest-priority follow-up to fix or delete: noisy pagers are treated as incidents in their own right.

When something large breaks, ad-hoc heroics do not scale; an incident command structure does. Borrowed from emergency response, it separates fixing from coordinating: an Incident Commander runs the response and makes calls (explicitly not hands-on-keyboard), a Resolver lead changes the system, a Communications lead owns the status page and stakeholder updates, and a Scribe keeps the timeline. Early on one person may wear two hats, but separating "who decides" from "who types" is what keeps a severe incident from becoming five people debugging in parallel and overwriting each other.

Severity levels keep the response proportional: SEV1 is a full or near-full outage of a critical journey (all hands, exec comms, status page); SEV2 is significant degradation or one critical feature down (on-call plus one or two pulled in); SEV3 is minor or contained (handled in the rotation, no escalation). Write the definitions down in advance — debating "is this a SEV1?" mid-outage wastes the minutes that matter most.

On AWS the tooling is first-party and worth using. AWS Systems Manager Incident Manager manages on-call schedules, escalation, and engagement (paging via SMS/voice/chat or third-party tools), opens an incident record with a response plan, and attaches runbooks (Systems Manager Automation documents) that remediate automatically or with one click. CloudWatch alarms (including composite alarms) feed it, EventBridge routes events, and AWS Chatbot (renamed Amazon Q Developer in chat applications in 2025) into Slack/Teams gives responders a shared command surface. Many teams pair or replace it with PagerDuty or Opsgenie; the AWS-native path is credible, and the right choice depends on what your stack already uses.

The on-call starter kit, concretely

A workable setup for a small AWS team: SLO burn-rate alarms in CloudWatch as the primary page source; a paging tool (Incident Manager, PagerDuty, or Opsgenie) with a rotation of at least four; a runbook per top-5 failure mode linked from each alert; a Slack/Teams channel auto-created per incident via chatops; and a severity rubric pinned where everyone sees it. That is enough to make on-call defensible — automated remediation, dependency-aware alerting, and error-budget-driven freezes are refinements on top of this base.

engineering for failure

V. Reliability patterns on AWS: multi-AZ, graceful degradation, and chaos game days

SLOs measure reliability; architecture produces it. A handful of AWS patterns do most of the heavy lifting, and the ones that earn their keep are usually cheaper and less exotic than teams expect. The expensive mistake is reaching for multi-region active/active when multi-AZ plus graceful degradation would have hit the SLO.

Multi-AZ is the floor, not a feature. An AWS Region is several physically separated Availability Zones; spreading compute, database, and load balancing across at least two (ideally three) means a single data-center-level failure degrades capacity rather than causing an outage. Table stakes: RDS/Aurora Multi-AZ for the database, Auto Scaling groups or ECS/EKS services spread across AZs for compute, an Application Load Balancer fronting them. If production is single-AZ, no amount of SRE process compensates — the architecture has a built-in single point of failure.

Graceful degradation separates a resilient system from a brittle one: when a dependency fails, shed the non-essential and keep the core path alive rather than failing the whole request. Concretely — timeouts on every outbound call, circuit breakers that stop hammering a failing dependency, retries with exponential backoff and jitter (plus a retry budget so retries cannot amplify an outage into a self-inflicted DDoS), bulkheads that isolate one failing component, and sensible fallbacks (serve stale cache, hide the recommendations widget, queue the write for later) so a peripheral failure does not take down checkout. Most "total outages" are really a peripheral failure allowed to cascade because nothing isolated it.
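One element of that list, retries with exponential backoff, jitter, and a retry budget, can be sketched as below. This is an illustration under stated assumptions rather than a production library: `RetryBudget`, `backoff_delay`, and `call_with_retries` are hypothetical names, and the 10% budget ratio is an arbitrary example value. Timeouts, circuit breakers, and bulkheads are not covered here.

```python
import random
import time


class RetryBudget:
    """Caps retries at a fraction of total calls so retries cannot amplify an outage."""

    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio   # allowed retries per call, e.g. 0.1 = 10%
        self.calls = 0
        self.retries = 0

    def record_call(self) -> None:
        self.calls += 1

    def try_spend(self) -> bool:
        """Consume one retry if the budget allows it."""
        if self.retries + 1 > self.ratio * self.calls:
            return False
        self.retries += 1
        return True


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0,
                  rng=random.random) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(fn, budget: RetryBudget, max_attempts: int = 4,
                      sleep=time.sleep, rng=random.random):
    """Call fn, retrying on exceptions while attempts and budget remain."""
    budget.record_call()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            out_of_attempts = attempt == max_attempts - 1
            if out_of_attempts or not budget.try_spend():
                raise  # fail fast; the caller can serve a fallback instead
            sleep(backoff_delay(attempt, rng=rng))
```

When the budget is exhausted the call fails immediately instead of retrying, which is the point where a fallback (stale cache, hidden widget, queued write) keeps the core path alive. A real client would also catch only errors known to be retryable rather than every exception.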

Chaos engineering and game days are how you find out whether any of the above works before a real incident tests it for you. AWS Fault Injection Service (FIS) runs controlled experiments — terminate instances, inject latency or errors, throttle an API, simulate an AZ impairment, fail over an Aurora cluster — against a hypothesis ("if one AZ goes dark, error rate stays within SLO"). You schedule these as game days and verify monitoring caught it, alarms fired, automation kicked in, runbooks worked, and the SLO held. The value is not the breaking; it is finding the gap (an alarm that never fired, a wrong runbook step, an 8-minute failover) in daylight with everyone watching, not at 3 a.m. alone.
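For a sense of what such an experiment looks like in practice, here is a hedged sketch that builds the body of an FIS experiment template for the simplest case mentioned above: terminating one tagged EC2 instance, with a CloudWatch alarm as the stop condition. The function name, target and action labels, tag values, and ARNs are all hypothetical placeholders, and the field layout reflects my understanding of the FIS `CreateExperimentTemplate` request, so verify it against the current API reference before use.

```python
def instance_termination_template(role_arn: str, stop_alarm_arn: str,
                                  tag_key: str = "env",
                                  tag_value: str = "staging") -> dict:
    """Build a template that terminates one instance matching a tag.

    Hypothesis under test: losing one instance keeps error rate within SLO.
    """
    return {
        "description": "Game day: terminate one tagged instance",
        "roleArn": role_arn,  # IAM role FIS assumes to run the experiment
        # Abort automatically if the SLO burn alarm goes into ALARM.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": stop_alarm_arn}
        ],
        "targets": {
            "oneInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # pick a single matching instance
            }
        },
        "actions": {
            "terminate": {
                "actionId": "aws:ec2:terminate-instances",
                "targets": {"Instances": "oneInstance"},
            }
        },
    }


template = instance_termination_template(
    role_arn="arn:aws:iam::123456789012:role/example-fis-role",         # placeholder
    stop_alarm_arn="arn:aws:cloudwatch:us-east-1:123456789012:alarm:example-slo-burn",  # placeholder
)

# Creating it would look roughly like this (requires boto3 and AWS credentials):
#   import boto3, uuid
#   boto3.client("fis").create_experiment_template(
#       clientToken=str(uuid.uuid4()), **template)
```

The stop condition matters as much as the fault: tying it to the same burn-rate alarm that pages on-call means the experiment halts on its own if the hypothesis turns out to be wrong.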

On cost, these patterns sit on a ladder that mirrors the disaster-recovery ladder. Multi-AZ and graceful degradation are cheap and belong on essentially every production workload. Multi-region (warm standby or active/active) is materially more expensive and should be reserved for the journeys whose downtime genuinely justifies it — for most startups, a short list. The reliability budget, like the error budget, is spent deliberately, not maximally.

core AWS reliability patterns · what they buy and what they cost · 2026

Pattern	What it protects against	Key AWS services	Relative cost
Multi-AZ	Single data-center / AZ failure	RDS/Aurora Multi-AZ, ASG, ECS/EKS across AZs, ALB	Low — table stakes
Graceful degradation	A failing dependency cascading into a full outage	App-level: timeouts, circuit breakers, retries+jitter, SQS, ElastiCache	Low — mostly engineering time
Auto-healing / auto-scaling	Instance/task death; load spikes	Auto Scaling, ECS service auto-recovery, EKS + Karpenter, health checks	Low–medium
Chaos / game days	Unknown gaps in detection + recovery	AWS FIS, CloudWatch, Systems Manager runbooks	Low — time, not infra
Multi-region (warm/active-active)	Whole-region failure	Route 53 failover, Aurora Global, DynamoDB Global Tables, cross-region replication	High — reserve for top journeys

Most startups hit their availability SLO with the first four rows. Multi-region is a deliberate, expensive step up — justified per critical journey, not adopted wholesale. See the disaster-recovery reference for the full RTO/RPO ladder.

the stack

VI. The AWS SRE toolchain, mapped to the job it does

There is no single "SRE product" on AWS; there is a toolchain, and the skill is wiring it into a coherent practice. Here is the stack mapped to the function it serves, so you can tell what is load-bearing from what is optional.

The foundation is observability, because you cannot set or defend an SLO you cannot measure. Amazon CloudWatch is the metrics, logs, alarms, and dashboards backbone, and CloudWatch can host SLO/burn-rate alarms directly. AWS X-Ray provides distributed tracing so you can see where latency is actually spent across services. For teams standardizing on open standards, OpenTelemetry (via the AWS Distro for OpenTelemetry) instruments applications vendor-neutrally, and Amazon Managed Service for Prometheus plus Amazon Managed Grafana give you a Prometheus/Grafana stack without running the servers. Many teams add Datadog or Grafana Cloud on top; the AWS-native combination is fully credible on its own.

On top of observability sits the incident and automation layer: CloudWatch alarms (including composite alarms) as the page source, Amazon EventBridge to route events and trigger responses, AWS Systems Manager Incident Manager for on-call schedules, escalation, and incident records, and Systems Manager Automation runbooks to encode remediation. AWS Chatbot wires alarms and runbooks into Slack or Teams so responders share one surface. For resilience validation, AWS Fault Injection Service (FIS) runs the chaos experiments and game days. And underpinning the architecture itself: Auto Scaling and health checks for self-healing, Route 53 for health-based and failover routing, and the AWS Well-Architected Framework — especially its Reliability pillar — as the checklist you review against.

A useful mental model: observability (CloudWatch / X-Ray / OTel / Prometheus / Grafana) is how you see; alarms + Incident Manager + Chatbot are how you respond; FIS + game days are how you test; and Auto Scaling / Route 53 / multi-AZ are how the system heals itself. An SRE engagement is largely the work of assembling these into a loop that runs without heroics — which is why a good observability setup is a prerequisite for, not a substitute for, an SRE practice.

disambiguation

VII. SRE vs DevOps vs Platform Engineering: three roles people keep conflating

These three terms get used interchangeably, badly, and the confusion leads teams to hire for one when they needed another. They overlap heavily and the same partner often covers all three, but they answer different questions — and knowing which question you have tells you what you actually need.

DevOps is a culture and set of practices for shipping software fast and safely: collapsing the wall between dev and ops, automating the build/test/deploy pipeline (CI/CD), managing infrastructure as code, and shortening the loop from commit to production. The question DevOps answers is "how do we ship changes quickly and reliably?" It is broad and somewhat cultural rather than a single role.

SRE is, in the original framing, one specific way to implement the operations side of DevOps — "class SRE implements interface DevOps." Where DevOps says "automate operations," SRE is prescriptive about how: with SLOs, error budgets, toil limits, blameless postmortems, and a disciplined on-call. The question SRE answers is "how reliable is the system, and how do we keep it that way without burning people out?" SRE is the measurement-and-reliability discipline; it lives inside the DevOps philosophy.

Platform Engineering is the newest of the three and is about building an internal platform (often an Internal Developer Platform) that lets product engineers self-serve infrastructure, deployments, and environments through golden paths — so they do not each reinvent the wheel or file a ticket for every namespace. The question platform engineering answers is "how do we let our developers move fast safely without a human in the loop for every change?" It productizes the DevOps/SRE capabilities into something the rest of engineering consumes.

In practice the lines blur and, at startup scale, the same one or two people (or the same partner) do all three. The reason the distinction still matters is hiring and scoping: if your problem is "deploys are slow and manual," that is a DevOps/CI-CD problem; if it is "we keep having outages and on-call is killing us," that is an SRE problem; if it is "every team builds infra differently and onboarding takes weeks," that is a platform problem. CloudRoute partners cover the whole spectrum, but naming the actual problem gets you to the right fix faster.

the real decision

VIII. When to buy fractional SRE and on-call instead of hiring

This is the decision most of this page exists to inform. Standing up an in-house SRE function — and especially a 24/7 on-call rotation — has a minimum viable size that is larger than most startups realize, and getting it wrong is expensive in both dollars and people.

The hard constraint is on-call math. Sustainable 24/7 coverage needs roughly four to six engineers so no one carries the pager more than about one week in four with real recovery between shifts. A single SRE hire cannot provide round-the-clock coverage — they are either permanently on call (and quit within a year, taking the tribal knowledge with them) or you have multi-hour gaps with no owner. So the honest in-house number is not one SRE; it is a team, plus the months to hire and onboard them, plus a senior leader to set the practice.
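The on-call math is simple enough to write down. The sketch below assumes week-long primary shifts shared evenly, with the one-week-in-four ceiling from the text as the sustainability threshold; the function names are illustrative.

```python
def on_call_load(engineers: int, weeks_per_year: int = 52) -> dict:
    """Per-person load for a 24/7 rotation with evenly shared weekly shifts."""
    if engineers < 1:
        raise ValueError("a rotation needs at least one engineer")
    share = 1 / engineers
    return {
        "share_of_time_on_call": share,
        "weeks_on_call_per_year": weeks_per_year * share,
        "weeks_off_between_shifts": engineers - 1,
    }


def is_sustainable(engineers: int, max_share: float = 0.25) -> bool:
    """True when no one carries the pager more than max_share of the time."""
    return on_call_load(engineers)["share_of_time_on_call"] <= max_share


for n in (1, 2, 4, 6):
    load = on_call_load(n)
    print(n, load["weeks_on_call_per_year"], is_sustainable(n))
```

One engineer is on call all 52 weeks and two are on call 26 weeks each; the one-in-four ceiling is first met at four engineers (13 weeks each), which is why four is the floor of the range.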

At seed and Series-A that usually does not pencil out. Four-to-six experienced SREs is a large fraction of an early engineering org and a larger fraction of payroll, hired for a load (overnight incidents) that — if the upstream reliability work is done well — should be light. You would be staffing a full rotation to handle pages you are simultaneously trying to eliminate; the capital is almost always better spent on product and on the engineering that removes the incidents in the first place.

Fractional SRE-as-a-service exists for this gap. A vetted partner brings an existing, trained, multi-person on-call roster (already staffed and humane), an established SRE practice (SLOs, runbooks, postmortem discipline) they install rather than invent, and the AWS depth to wire up CloudWatch, Incident Manager, and FIS correctly the first time. You get coverage and a real reliability practice on day one, at a fraction of the cost and lead time of building the team — and you can graduate to an in-house team later, with a working practice to hand over.

The buy case is strongest when you have a production system that pages someone (or should), you cannot dedicate four-plus engineers to a rotation, you need coverage in weeks not quarters, and you would rather your seniors build product than sit overnight on-call. The build case strengthens with scale — past roughly Series-B/C with a system whose reliability is a core moat, owning it in-house starts to make sense. For most teams reading this page, fractional is the right first move; the comparison below lays out the trade.

the core trade-off

In-house SRE team vs fractional SRE-as-a-service

This is the decision, laid out honestly. Neither column is universally right — the answer turns on your stage, your reliability requirements, and whether you can realistically staff a humane rotation. For most pre-Series-B teams, fractional wins on every axis that matters; past that, owning it in-house starts to pay off.

Variable	In-house SRE team	Fractional SRE-as-a-service
Minimum viable size	4–6 engineers for sustainable 24/7 on-call	0 new hires — partner brings the roster
Time to coverage	3–9 months to hire + onboard a full rotation	Days to weeks — practice + roster already exist
24/7 on-call	Only once the rotation is fully staffed	Day one, already humane (week-in-four or better)
SRE practice maturity	You build SLOs / runbooks / postmortems from scratch	Installed from an existing, proven playbook
Typical cost	Several senior salaries + recruiting + management	A fraction of one team — and often AWS-funded if credit-eligible
Key-person / burnout risk	High: a solo hire, or anyone on a 1-in-2 rotation, burns out	Low: load is spread across the partner's roster
Best fit	Series-B/C+ where reliability is a core moat	Seed–Series-A (and most teams without a full rotation)

Most CloudRoute-routed SRE engagements are fractional: the customer gets SLOs, cleaned-up alerting, runbooks, game days, and a real 24/7 roster without hiring a team. For credit-eligible companies the engagement is frequently AWS-funded, so the customer pays $0 or close to it.

tired of a pager nobody can sustain?

Get a vetted partner to run your SRE practice and your on-call


a recent match

From solo on-call burnout to a real SRE practice — anonymized

inquiry · series-a b2b saas, single platform hire on permanent on-call

Series-A B2B SaaS, ~22 engineers, one platform/infra engineer carrying the pager solo, already on AWS (ECS Fargate + Aurora) at ~$9K/month

Situation: The lone platform engineer had been effectively on call 24/7 for nine months and was close to quitting. Alerting was raw CloudWatch thresholds (CPU, memory) that paged constantly but rarely mapped to user pain, so real incidents got lost in the noise. There were no SLOs, no runbooks, and the last two "total outages" were actually a single non-critical dependency cascading because nothing isolated it. Leadership wanted reliability fixed but could not justify hiring four SREs at their stage.

What CloudRoute did: Routed within ~20 hours to an AWS partner with a standing SRE practice and a multi-person on-call roster. The partner defined journey-level SLOs (login, the core write path, billing), replaced threshold alarms with multi-window burn-rate alerts in CloudWatch, stood up Systems Manager Incident Manager with a four-person rotation plus the in-house engineer as escalation-only, wrote runbooks for the top failure modes, added timeouts/circuit breakers/retry-with-jitter around the dependency that had been cascading, and ran two AWS FIS game days (AZ impairment + dependency latency) to verify the new alerting and failover actually held.

Outcome: Pages dropped from dozens a week to a handful, and every remaining page was actionable. The in-house engineer moved off the primary rotation to escalation-only and stayed. The login SLO held at 99.95% over the first full quarter; the dependency that used to cause "outages" now degraded gracefully. Because the company was credit-eligible, the engagement was AWS-funded — the SRE setup and the first quarter of on-call ran at $0 to the customer, with CloudRoute's commission paid by the partner from AWS engagement funding.

time-to-coverage: ~2 weeks · pages/week: dozens → single digits · primary on-call headcount kept: 1 (retained) · cost to customer: $0

faq

Common questions

What is the difference between SRE and DevOps in one sentence?

DevOps is the broad culture and practice of shipping software fast and safely (automation, CI/CD, infrastructure as code, dev+ops collaboration); SRE is a specific, prescriptive way to run the operations/reliability side of that — using SLOs, error budgets, toil limits, blameless postmortems, and a disciplined on-call. The common framing is "class SRE implements interface DevOps": SRE is one concrete implementation of DevOps's operational goals.

What is an error budget and how do I use it?

An error budget is 100% minus your SLO, expressed as a concrete allowance of failure over a window. A 99.9% availability SLO over 28 days allows roughly 40 minutes of unavailability — that 40 minutes is a budget. The rule, agreed in advance: while budget remains, you ship at full speed (and can spend it on risky deploys, migrations, or chaos experiments); when it is exhausted, you freeze risky changes and direct engineering at reliability until it recovers. It turns the "ship vs. stabilize" argument into a factual question about how much budget is left.

What SLO target should a startup actually set?

For most B2B SaaS, 99.9% (about 40 minutes of downtime per 28 days) on your critical user journeys — login, the core read/write paths, billing — is plenty, and internal/admin tools can live at 99.5%. Avoid defaulting to 99.99% because it sounds responsible: four nines allows only ~4 minutes per 28 days, so one slightly-slow deploy blows the whole budget and you live in a feature freeze. Set the target from what a customer would actually notice and churn over, per journey, not per microservice.

Which AWS services make up an SRE toolchain?

Observability: CloudWatch (metrics, logs, alarms, SLO/burn-rate alarms), X-Ray (tracing), OpenTelemetry via the AWS Distro, and Amazon Managed Prometheus + Managed Grafana. Incident response: CloudWatch alarms + EventBridge as the page source, Systems Manager Incident Manager for schedules/escalation/records, Systems Manager Automation runbooks for remediation, and AWS Chatbot into Slack/Teams. Resilience testing: AWS Fault Injection Service (FIS) for game days. Self-healing: Auto Scaling, health checks, Route 53 failover. Many teams add PagerDuty/Opsgenie and Datadog alongside the AWS-native stack.

How big does an on-call rotation need to be?

For sustainable 24/7 coverage, roughly four to six engineers, so no one carries the pager more than about one week in four with real recovery between shifts. A two-person rotation is a burnout guarantee, and a single SRE hire genuinely cannot provide round-the-clock coverage — they will either be permanently on call (and likely quit within a year) or leave multi-hour gaps with no owner. This on-call math is the main reason early-stage teams buy fractional on-call instead of trying to staff a rotation in-house.

What is a game day and why run one with AWS FIS?

A game day is a planned exercise where you deliberately inject a failure in a controlled window and verify your system actually responds the way you assume it does. AWS Fault Injection Service (FIS) runs the controlled experiments — terminating instances, injecting latency or errors, throttling an API, simulating an AZ impairment, failing over an Aurora cluster — against a hypothesis like "if one AZ goes dark, error rate stays within SLO." The value is finding the gaps (an alarm that never fired, a wrong runbook step, an 8-minute failover) in daylight with the team watching, instead of discovering them at 3 a.m. during a real outage.

When should I hire an in-house SRE team versus buying SRE-as-a-service?

Buy fractional when you have a production system that needs on-call but cannot dedicate four-plus engineers to a rotation, you need coverage in weeks not quarters, and you would rather your seniors build product than sit overnight. Build in-house once you are past roughly Series-B/C with a system whose reliability is a core competitive moat and the scale to justify a full, humane rotation plus a senior lead. For most pre-Series-B teams, fractional gives you a real practice and a staffed roster on day one — and you can graduate to in-house later with the practice already running.

How does CloudRoute's SRE engagement work, and what does it cost?

CloudRoute routes you to a vetted AWS partner that brings an existing SRE practice and a multi-person 24/7 on-call roster: they define SLOs, clean up alerting, write runbooks, set up Incident Manager, run FIS game days, and carry (or backstop) the pager. For credit-eligible companies the engagement is often AWS-funded (the partner is paid through AWS partner programs and your AWS usage is credit-covered), so the customer pays $0 or close to it. For companies that are not credit-eligible, it is a vetted-partner referral that skips the months of hiring and vetting a rotation yourself. Either way, CloudRoute is paid by the partner, not by you.

Want a real SRE practice and a 24/7 on-call roster — without hiring a team?

CloudRoute routes you to a vetted AWS partner that installs the SLOs, cleans up the alerting, writes the runbooks, runs the game days, and carries the pager. Often AWS-funded for credit-eligible companies, so the customer pays $0.


matched within: < 24h · time-to-coverage: days–weeks · credit-eligible cost: $0