
aws cost anomaly detection · 2026 setup + response

AWS Cost Anomaly Detection — the free ML alarm for the spike you never forecast.

Q: Is AWS Cost Anomaly Detection free?

Yes — it is a fully free, native AWS service. There is no per-monitor fee and no per-alert fee; it runs on the Cost & Usage data AWS already collects to bill you. The only real cost is operational: someone has to read the alerts and act on them. That makes it the highest-ROI thing you can enable in the Billing console, since the downside is zero and the upside is catching a five-figure surprise early.

Q: How is Cost Anomaly Detection different from AWS Budgets?

A budget is a threshold YOU set — "alert me at 85% of $20K" — which works for spend you can anticipate but cannot catch a surprise, because you cannot write a threshold for something you never predicted. Cost Anomaly Detection sets nothing: a machine-learning model learns each service's normal pattern and alerts when actual spend deviates, so it catches the unknown-unknowns (a leaked key, a misconfig, a deploy that 10×'d data transfer). Budgets also enforce on breach (Budget Actions); Anomaly Detection only alerts. They are complementary — run both.

Q: What types of monitors can I create?

Four. An AWS-services monitor (models every service independently — the one everyone should have); a linked-account monitor (anomalies scoped to specific accounts under your Organization); a Cost Category monitor (anomalies along a business dimension like Environment or Product, via AWS Cost Categories); and a Cost Allocation Tag monitor (anomalies filtered to a tag value like team=growth, for shared accounts). Start with one services monitor, then layer account / category / tag monitors as your structure matures.

Q: How do I stop Cost Anomaly Detection from sending false positives?

The biggest lever is the threshold expression: alert only when impact is both above an absolute dollar floor AND above a percentage of expected (the AND expression) — this kills both the trivial-dollar / huge-percent spike on a near-zero service and the large-dollar / tiny-percent wobble on a big steady service. Beyond that: scope monitors to coherent business dimensions (cleaner baselines), give the model a couple of weeks to relearn after a migration or launch, and provide feedback on flagged anomalies so relevance improves over time.

Q: How fast does Cost Anomaly Detection alert me?

Typically within about 24 hours of the anomalous spend beginning, because it runs on Cost & Usage data that lands on a lag — it is not a real-time circuit breaker. That is still fast enough to turn a multi-week leak into a one-day one: catching a leaked-key GPU spike on day one instead of on next month's invoice is the difference between a few hundred dollars and five figures. For hard, immediate enforcement (auto-stopping resources at a cap), pair it with AWS Budgets and Budget Actions.

Q: What should I do when I get an anomaly alert?

Run the triage loop. (1) Triage — did we cause this on purpose (a load test, a launch)? If yes, mark it and move on. (2) Read the anomaly record's attribution — it names the service, account, and usage type driving the impact. (3) Root-cause in Cost Explorer, filtered to that service and date range, grouped by usage type / account / tag, until "EC2 went up" becomes "an unbounded ASG launched 60 instances after the 14:00 deploy." (4) Fix the underlying resource — stop it, rotate the key, roll back the deploy. (5) Prevent recurrence with a Budget Action, SCP, or IAM guardrail.

Q: Can I send Cost Anomaly Detection alerts to Slack or PagerDuty?

Yes — both flow through Amazon SNS. Point an alert subscription at an SNS topic (granting the Cost Anomaly Detection service principal permission to publish), then for Slack either subscribe AWS Chatbot / Amazon Q Developer (no-code, formatted messages) or use SNS → Lambda → Slack webhook for a richer message with a Cost Explorer deep-link. For PagerDuty, wire the same SNS topic to PagerDuty's SNS integration so material anomalies page on-call. A common split: route everything to a #finops Slack channel, but only page PagerDuty above a higher dollar threshold.

Q: Does a partner really set this up — and run it — for free?

Often, yes — for qualifying accounts. AWS funds partner-led cost-optimization and Well-Architected engagements, and a Well-Architected Review can unlock remediation credits, so the customer frequently gets the detection setup (monitors, alert subscriptions, SNS / Slack / PagerDuty wiring) plus the anomaly-response loop and the underlying right-sizing / commitment work for $0. Honest framing: AWS-funding applies to qualifying engagements; where it does not, it is a vetted-partner referral that pays for itself out of the savings. CloudRoute is paid by the partner, not by you.

Cost Anomaly Detection is the AWS service that learns each service's normal spend pattern and alerts you when something deviates — a leaked key spinning up GPUs, a deploy that 10×'d data transfer, a forgotten cluster left running. It is free, it is machine-learning-based, and it catches the surprises a budget threshold structurally cannot. This guide covers every monitor type, how to set alert thresholds (absolute $ vs % vs both combined), how the model works and how to kill false positives, the triage-to-root-cause workflow, and when Anomaly Detection beats Budgets or Cost Explorer.


service cost: free · detection lag: ~24h · monitor types: 4 · partner setup: often $0

TL;DR

AWS Cost Anomaly Detection is a free, fully managed service that applies machine learning to your Cost & Usage data to learn each service's normal spend rhythm and alert when actual spend deviates beyond what the model expects. Unlike AWS Budgets — where you set a number — here you set nothing; the model defines "normal" for you and surfaces the unknown-unknowns: a misconfigured pipeline, a leaked credential, a deploy that quietly 10×'d NAT Gateway egress.
You create monitors that define what the ML watches — all AWS services, one or more linked accounts, a Cost Category, or a Cost Allocation Tag — then attach alert subscriptions that decide who gets told and at what size. The threshold can be an absolute dollar impact (e.g. alert above $100 of unexpected spend), a percentage deviation from expected, or a combined expression that requires both at once. Alerts go to email, or to an SNS topic that fans out to Slack, PagerDuty, or a Lambda remediation.
Detection is one half; response is the other. An anomaly alert tells you something moved; Cost Explorer tells you why, grouped by service / account / usage type / tag, so you can root-cause and fix it. Anomaly Detection catches the spike, Budgets enforces the number you committed to, Cost Explorer explains the move — run all three. A CloudRoute-matched AWS partner stands up detection and actually runs the anomaly-response loop, and because cost / Well-Architected work is often AWS-funded for qualifying accounts, you frequently get it for $0.

context

I. What Cost Anomaly Detection actually is — and why a budget can't do its job

Cost Anomaly Detection is a free, native AWS service that uses machine learning to model your normal spend per service and alert you when actual spend deviates beyond the model's expectation. It is the detection layer of FinOps on AWS — the alarm for the costs you never saw coming.

Mechanically, you do two things: create a monitor that scopes what the ML watches, and attach an alert subscription that decides who hears about a detected anomaly and how large it has to be. Behind that, AWS ingests your Cost & Usage data, builds a per-service baseline of expected spend, and continuously compares actual spend against it. When the deviation is large and confident enough, it raises an anomaly record with a start date, an estimated total impact in dollars, the root-cause service / account / usage-type breakdown it can attribute, and a chart of expected-versus-actual.

The service itself costs nothing — no per-monitor fee, no per-alert fee. It runs on the same Cost & Usage data AWS already collects to bill you, so enabling it is pure upside. The only real cost is operational: someone has to read the alerts and act on them, which is exactly the loop a CloudRoute partner can run for you.

Here is why this is a distinct tool and not a feature of Budgets. A budget is a threshold you set: "tell me at 85% of $20K." That works beautifully for spend you can anticipate — but you cannot write a threshold for a surprise. The night a leaked access key spins up a fleet of GPU instances in a region you never use, no budget you would have thought to create catches it; the spend is novel, in a service you do not normally touch, at a scale you never forecast. Anomaly Detection catches it precisely because it does not depend on you predicting it — the model knows that service has historically cost ~$0, so $4,000 of it overnight is, statistically, an obvious anomaly. Budgets enforce the knowns; Anomaly Detection surfaces the unknowns.

The honest boundary: it is not instant. Cost data lands on a lag, so an anomaly is typically surfaced within about 24 hours of the spend beginning — fast enough to stop a multi-day leak from becoming a multi-week one, but not a real-time circuit breaker. It also does not fix anything; it alerts. The fix happens in Cost Explorer (to find the why) and then in the console / IaC (to kill the resource, rotate the key, roll back the deploy). Anomaly Detection's one job is "tell me, fast and with low noise, when spend does something it has never done before."

the four monitor types

II. Setting up monitors — service, account, Cost Category, and tag

A monitor defines the lens the ML looks through. The lens you choose decides how precisely an anomaly gets attributed — and the single most common mistake is creating one broad monitor and calling it done. Here is what each monitor type watches, and the order a practitioner usually creates them.

AWS services monitor — the one everyone should have

Watches: every AWS service in the account (or, from the management account, across the whole Organization), modelling each service's spend pattern independently.

Use it for: the default, always-on monitor. Because it models each service separately, a spike in a service that normally costs almost nothing — Bedrock, SageMaker, a new region — stands out instantly instead of being drowned in the total. AWS recommends every account run one of these, and it is the first monitor to create.

Pro move: you usually only need one services monitor — it already decomposes by service internally, so creating several overlapping ones just multiplies alerts. Scope the precision with account / category / tag monitors instead.

Linked account monitor — anomalies per account

Watches: spend within one or more specific linked accounts under your AWS Organization.

Use it for: attributing an anomaly to a team or environment when you run an account-per-team structure. A monitor on the prod-payments account tells you immediately that that team's spend moved, not just that "AWS went up."

Pro move: pair account monitors with the org-wide services monitor — the services monitor tells you what service spiked; the account monitor tells you whose account it spiked in. Together that is most of your root-cause before you even open Cost Explorer.

Cost Category monitor — anomalies per business dimension

Watches: spend grouped by a Cost Category — AWS's rules engine for mapping raw line items into business buckets (e.g. Environment = prod / staging / dev, or Product = api / web / data-platform) regardless of which account or tag they came from.

Use it for: monitoring along the dimension your business actually thinks in. If "the data-platform product" spans three accounts and a dozen services, a Cost Category monitor watches it as one coherent thing.

Pro move: define Cost Categories first (they are also what makes Cost Explorer and chargeback legible), then point a monitor at the categories that matter. This is the most "FinOps-mature" monitor type.

Cost Allocation Tag monitor — anomalies per tag value

Watches: spend filtered to a specific Cost Allocation Tag key/value — e.g. team=growth or service=checkout — for teams running shared accounts rather than account-per-team.

Use it for: per-team detection without the overhead of separate accounts. The catch is identical to per-tag budgeting: it only works if your resources are actually tagged, and untagged spend falls into a blind spot no tag monitor catches.

Pro move: activate and enforce your tags (a deny-on-untagged policy helps) before leaning on tag monitors — otherwise the monitor quietly misses every untagged resource, which is exactly where stealth spend tends to hide. Cost Allocation Tags are the foundation under both tag monitors and per-team budgets.
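As a rough sketch, the four monitor types above map onto request payloads like the ones below. The field names (`MonitorType`, `MonitorDimension`, `MonitorSpecification`) follow the Cost Explorer `CreateAnomalyMonitor` API; the monitor names, the account ID, and the category and tag values are placeholders, and the helper functions are hypothetical. The boto3 call is shown only as a comment because it needs credentials.

```python
from typing import Any


def services_monitor(name: str) -> dict[str, Any]:
    """AWS-services monitor: one baseline per service, modelled independently."""
    return {
        "MonitorName": name,
        "MonitorType": "DIMENSIONAL",
        "MonitorDimension": "SERVICE",
    }


def custom_monitor(name: str, selector: str, key: str, values: list[str]) -> dict[str, Any]:
    """Custom monitor scoped to linked accounts, a Cost Category, or a tag."""
    if selector not in {"Dimensions", "CostCategories", "Tags"}:
        raise ValueError(f"unsupported selector: {selector}")
    return {
        "MonitorName": name,
        "MonitorType": "CUSTOM",
        "MonitorSpecification": {selector: {"Key": key, "Values": values}},
    }


# Placeholder names and values; swap in your own account IDs, categories, tags.
monitors = [
    services_monitor("all-services"),
    custom_monitor("prod-payments", "Dimensions", "LINKED_ACCOUNT", ["111122223333"]),
    custom_monitor("env-prod", "CostCategories", "Environment", ["prod"]),
    custom_monitor("team-growth", "Tags", "team", ["growth"]),
]

# Assumed usage (needs boto3 and credentials, so it is left as a comment):
#   ce = boto3.client("ce")
#   for m in monitors:
#       ce.create_anomaly_monitor(AnomalyMonitor=m)
for m in monitors:
    print(m["MonitorName"], m["MonitorType"])
```

Keeping the payloads as plain dictionaries makes them easy to review in a pull request or port to IaC before anything is created.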

the minimum viable detection setup

If you do nothing else today: create one AWS-services monitor covering the whole account (or whole Organization from the payer), and attach one individual-alert subscription with a sensible dollar threshold. That single pair catches the leaked-key / runaway-resource class of surprise — the most expensive kind, because nobody set a budget for a service that never cost anything before. Add account, category, and tag monitors as your structure matures.

alert subscriptions + thresholds

III. Alert subscriptions and thresholds — absolute $, percentage, and combined

A monitor with no subscription is a model talking to itself. The subscription decides who gets told, how often, and — critically — how big an anomaly has to be before it interrupts a human. Set the threshold too low and you train the team to mute; too high and you miss the slow leak.

An alert subscription attaches to one or more monitors and has three moving parts: a frequency, a threshold expression, and a set of recipients. The threshold is the whole game, and AWS gives you three ways to express it.

Absolute dollar impact — "tell me above $X of surprise"

The simplest and most-used threshold: alert when the anomaly's total estimated dollar impact exceeds an absolute figure — say $100, or $500 for a larger account. This is intuitive and ties directly to materiality: you do not want a page for $3 of unexpected spend, but $500 you never forecast deserves a look.

Tune the figure to the account size. On a ~$5K/month account, $100 is a reasonable floor; on a ~$200K/month account, $100 is noise — set it to $1,000+ so only genuinely material anomalies break through.

Percentage deviation — "tell me when it jumps X% above expected"

Alert when actual spend exceeds the model's expected spend by more than a chosen percentage — e.g. 40% above expected. This scales with the workload instead of a fixed dollar line, which suits a service whose normal spend is itself growing.

The trap: on a service that normally costs near-zero, a tiny absolute jump can be a huge percentage and fire a noisy alert. Which is exactly why AWS lets you combine the two.

Combined expressions — "$X AND Y% above expected"

The most robust setup uses an AND expression: alert only when the anomaly is both above an absolute dollar impact and above a percentage of expected spend. That single condition kills the two classic false-positive sources at once — the trivial-dollar-but-huge-percent spike on a near-zero service, and the large-dollar-but-tiny-percent wobble on a big steady service. The impact AWS reports is the gap between actual spend and the model's expected spend, so the dollar figure you threshold against is already an estimate of unexpected spend, not raw spend.

A practical default: total impact > $100 AND impact > 40% of expected, tuned up as the account grows. That one expression is the difference between an alert channel the team trusts and one they mute by week two.
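To make the AND rule concrete, here is a minimal sketch. `and_threshold` builds a `ThresholdExpression` in the shape the Cost Explorer API expects (the `ANOMALY_TOTAL_IMPACT_ABSOLUTE` and `ANOMALY_TOTAL_IMPACT_PERCENTAGE` keys with a `GREATER_THAN_OR_EQUAL` match, so ">" in the prose becomes ">=" here). `would_alert` is a hypothetical local approximation for reasoning about thresholds, not AWS's evaluation logic, and the spend figures are invented.

```python
def and_threshold(min_dollars: float, min_percent: float) -> dict:
    """Build an AND threshold: absolute impact floor plus percentage floor."""

    def clause(key: str, value: float) -> dict:
        return {
            "Dimensions": {
                "Key": key,
                "MatchOptions": ["GREATER_THAN_OR_EQUAL"],
                "Values": [f"{value:g}"],
            }
        }

    return {
        "And": [
            clause("ANOMALY_TOTAL_IMPACT_ABSOLUTE", min_dollars),
            clause("ANOMALY_TOTAL_IMPACT_PERCENTAGE", min_percent),
        ]
    }


def would_alert(expected: float, actual: float,
                min_dollars: float, min_percent: float) -> bool:
    """Local approximation of the AND rule (illustrative only)."""
    impact = actual - expected
    percent = impact / expected * 100 if expected > 0 else float("inf")
    return impact >= min_dollars and percent >= min_percent


expression = and_threshold(100, 40)

# Made-up (expected, actual) pairs for the three cases discussed above.
near_zero_blip = would_alert(expected=2, actual=8, min_dollars=100, min_percent=40)
steady_wobble = would_alert(expected=50_000, actual=50_600, min_dollars=100, min_percent=40)
leaked_key = would_alert(expected=5, actual=4_005, min_dollars=100, min_percent=40)
print(near_zero_blip, steady_wobble, leaked_key)
```

The $6 blip fails the dollar floor, the $600 wobble fails the percentage floor, and only the $4,000 jump on a near-zero service passes both.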

individual vs daily/weekly summary

Each subscription is either individual alerts (one notification per detected anomaly — use this with SNS for anything you might need to act on in real time) or a daily / weekly summary (a digest of everything detected — good for a finance stakeholder who wants the roll-up, not the pages). A common pattern: an individual-alert subscription to the on-call SNS topic for material anomalies, plus a weekly-summary email to finance for the long tail.
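The on-call-plus-finance pattern could be sketched as two subscription payloads. The field names follow the Cost Explorer `CreateAnomalySubscription` API; the pairing of `IMMEDIATE` with SNS and `DAILY` / `WEEKLY` with email is stated here as an assumption about how the service delivers each frequency, and every ARN, address, and name is a placeholder.

```python
# Assumption: individual (IMMEDIATE) alerts are delivered to SNS,
# while DAILY / WEEKLY summaries are delivered by email.
DELIVERY = {"IMMEDIATE": "SNS", "DAILY": "EMAIL", "WEEKLY": "EMAIL"}


def subscription(name: str, monitor_arns: list[str], frequency: str,
                 address: str, min_dollars: int) -> dict:
    """Build an AnomalySubscription payload with an absolute-dollar floor."""
    if frequency not in DELIVERY:
        raise ValueError(f"unknown frequency: {frequency}")
    return {
        "SubscriptionName": name,
        "MonitorArnList": monitor_arns,
        "Frequency": frequency,
        "Subscribers": [{"Type": DELIVERY[frequency], "Address": address}],
        "ThresholdExpression": {
            "Dimensions": {
                "Key": "ANOMALY_TOTAL_IMPACT_ABSOLUTE",
                "MatchOptions": ["GREATER_THAN_OR_EQUAL"],
                "Values": [str(min_dollars)],
            }
        },
    }


# Placeholder ARNs and addresses.
monitor_arn = "arn:aws:ce::111122223333:anomalymonitor/EXAMPLE"
on_call = subscription("on-call-material", [monitor_arn], "IMMEDIATE",
                       "arn:aws:sns:us-east-1:111122223333:finops-anomalies", 500)
finance = subscription("finance-weekly", [monitor_arn], "WEEKLY",
                       "finance@example.com", 50)

# Assumed usage: boto3.client("ce").create_anomaly_subscription(AnomalySubscription=on_call)
print(on_call["Subscribers"], finance["Subscribers"])
```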

under the hood

IV. How the ML works — and how to tune out false positives

You cannot tune what you do not understand. The model is not magic, and its failure modes are predictable — which means false positives are fixable rather than something you suffer. Here is the mental model, and the levers.

Cost Anomaly Detection builds a statistical baseline of each monitored dimension's spend over time, learning daily and weekly seasonality — weekday-versus-weekend patterns, month-boundary effects, the natural rhythm of your batch jobs. When new spend lands, it compares actual against the predicted range and scores how far outside the expected band it is. A large, confident deviation becomes an anomaly with an estimated dollar impact; small wobbles inside the expected range are ignored. The model also adapts: a one-off spike it flags today becomes part of the learned pattern if it persists, so a genuine, sustained step-up in spend stops alerting once it is the new normal.

Two consequences worth internalising. First, the model needs history — roughly several weeks — to predict well; on a brand-new account, or right after a major architecture change, expect more false positives until it relearns. Second, anomalies are about shape, not amount: a steady $50K/month service that stays steady never alerts, while a $20 service that suddenly costs $200 might — because the second is a deviation from its pattern and the first is not.
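The "shape, not amount" point can be illustrated with a toy baseline. This is deliberately not AWS's model (which is not public in detail); it is a plain mean-plus-k-standard-deviations check over recent daily spend, with a hypothetical `min_sigma` floor so a near-constant series does not alert on pennies. All figures are invented.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], actual: float,
                 k: float = 3.0, min_sigma: float = 0.5) -> bool:
    """Flag spend more than k deviations above the learned baseline."""
    baseline = mean(history)
    # Floor the spread so an almost-flat history still tolerates tiny moves.
    spread = max(stdev(history), min_sigma)
    return actual > baseline + k * spread


# Daily spend in dollars: a large steady service and a tiny one.
big_steady = [1650, 1670, 1660, 1680, 1665, 1655, 1675]
small_service = [0.60, 0.70, 0.65, 0.70, 0.60, 0.70, 0.65]

steady_flag = is_anomalous(big_steady, 1672)      # large spend, normal shape
small_flag = is_anomalous(small_service, 6.50)    # small spend, 10x jump
print(steady_flag, small_flag)
```

The large service spending $1,672 on a day it normally spends about $1,665 is not flagged, while the tiny service jumping from about $0.65 to $6.50 is, which is the behaviour the paragraph above describes.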

Raise the threshold expression — The first and bluntest lever. If you are getting noise, increase the absolute-dollar floor, the percentage, or both (the AND expression). Most "too many alerts" complaints are solved here in thirty seconds.
Use Cost Category / tag monitors instead of one giant monitor — A monitor scoped to a coherent business dimension has a cleaner, more predictable baseline than "all spend lumped together," so its anomalies are sharper and its false-positive rate lower.
Give it time after big changes — A planned migration, a new product launch, or a deliberate scale-up will trip alerts while the model relearns. That is expected — let it absorb the new pattern over a couple of weeks rather than fighting it.
Don't monitor what you don't care about — If a noisy low-value service keeps firing and you genuinely do not need to watch it, narrow the monitor's scope so the model isn't alerting on spend that has no business consequence.
Provide feedback on flagged anomalies — Each anomaly can be marked as a real issue or not — this signal helps tune relevance over time and keeps the team's trust in the channel high, because confirmed-false anomalies stop being treated as fires.
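For the feedback lever, a small sketch of how a triage verdict might be turned into a request. The `AnomalyId` and `Feedback` fields and the `YES` / `NO` / `PLANNED_ACTIVITY` values follow the Cost Explorer `ProvideAnomalyFeedback` API; the verdict labels and the anomaly ID are made up for illustration.

```python
# Hypothetical team vocabulary mapped to the API's feedback values.
FEEDBACK_VALUES = {
    "real_issue": "YES",
    "false_positive": "NO",
    "planned": "PLANNED_ACTIVITY",
}


def feedback_request(anomaly_id: str, verdict: str) -> dict:
    """Build the keyword arguments for a feedback call."""
    if verdict not in FEEDBACK_VALUES:
        raise ValueError(f"unknown verdict: {verdict}")
    return {"AnomalyId": anomaly_id, "Feedback": FEEDBACK_VALUES[verdict]}


request = feedback_request("example-anomaly-id", "planned")
# Assumed usage: boto3.client("ce").provide_anomaly_feedback(**request)
print(request)
```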

detection → response

V. Responding to an anomaly — the triage workflow to root cause and fix

Detection without a response runbook is just a louder way to be surprised by the invoice. The value is in the loop: an alert fires, a human triages it, root-causes it in Cost Explorer, and fixes the underlying resource. Here is the workflow a practitioner runs, start to finish.

When an individual alert lands, the anomaly record itself already gives you a head start: the start date (when the deviation began), the estimated total impact in dollars, the root-cause attribution AWS could infer (which service, which linked account, which usage type, sometimes which region), and an expected-versus-actual chart. That is usually enough to know whether this is a five-minute fix or a real incident. The disciplined response runs in five steps:

1 — Triage: real or expected? — First question: did we cause this on purpose? A planned load test, a launch, a new customer onboarding can all be legitimate. If it is expected, mark it and move on. If not, escalate. Thirty seconds of triage prevents most fire drills.
2 — Read the attribution on the anomaly itself — The record names the service, account, and usage type driving the impact. "EC2 / prod-data / BoxUsage:p4d.24xlarge in us-west-2" tells you almost everything: GPU instances in a region you don't use — likely a leaked key or a runaway job.
3 — Root-cause in Cost Explorer — Open Cost Explorer filtered to the anomaly's service and date range, then group by usage type, then by linked account, then by tag. This is where "EC2 went up" becomes "an unbounded auto-scaling group launched 60 c5.4xlarge after the 14:00 deploy." Cost Explorer is the microscope behind the alarm.
4 — Fix the underlying resource — Now act on the actual cause: stop / terminate the runaway resource, rotate and quarantine a leaked credential, roll back the deploy, right-size the over-provisioned cluster, or add the missing VPC endpoint so traffic to AWS services stops paying NAT Gateway data-processing charges. The fix lives in the console or your IaC, not in the anomaly tool.
5 — Prevent the recurrence — Close the loop so it cannot happen the same way twice: a Budget Action to auto-stop tagged non-prod resources at a cap, an SCP guardrail on the offending region, tighter IAM on instance launches, or a tag rule so the next occurrence is at least attributable. Detection finds it once; prevention stops the repeat.
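Step 3 of the workflow above can also be run through the Cost Explorer API. The sketch below builds the arguments for a `GetCostAndUsage` call filtered to one service and grouped by usage type and linked account. The function name, the look-back window, and the example service and date are assumptions; grouping by tag would need a second query, since the API accepts at most two group-by keys.

```python
from datetime import date, timedelta


def root_cause_query(service: str, anomaly_start: date,
                     days_before: int = 7, days_after: int = 2) -> dict:
    """Build GetCostAndUsage arguments around an anomaly's start date."""
    start = anomaly_start - timedelta(days=days_before)
    end = anomaly_start + timedelta(days=days_after)  # End date is exclusive
    return {
        "TimePeriod": {"Start": start.isoformat(), "End": end.isoformat()},
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost"],
        "Filter": {"Dimensions": {"Key": "SERVICE", "Values": [service]}},
        "GroupBy": [
            {"Type": "DIMENSION", "Key": "USAGE_TYPE"},
            {"Type": "DIMENSION", "Key": "LINKED_ACCOUNT"},
        ],
    }


# Example service name and date are placeholders.
query = root_cause_query("Amazon Elastic Compute Cloud - Compute", date(2026, 3, 10))
# Assumed usage: boto3.client("ce").get_cost_and_usage(**query)
print(query["TimePeriod"])
```

Including a week before the anomaly start gives the baseline days next to the spike, so the usage type that moved is visible at a glance.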

the loop, in one line

Anomaly Detection tells you something changed → the anomaly record tells you roughly what → Cost Explorer tells you exactly why → you fix the resource → a Budget Action / SCP / IAM guardrail stops the recurrence. Speed through that loop is the entire ROI; a CloudRoute partner runs it for you so an alert at 2am doesn't wait for Monday.

choosing the right tool

VI. Cost Anomaly Detection vs Budgets vs Cost Explorer — when to use each one

These three native tools get constantly confused, and teams waste effort forcing one to do another's job. They are complementary layers of a single practice, not competitors. Here is the clean mental model, with the full comparison table below.

Cost Anomaly Detection answers "did something just change that I did not predict?" A free ML monitor that learns each dimension's normal pattern and alerts on significant deviations — the unknown-unknowns, because you cannot set a threshold for a surprise. Best for a misconfigured service, a leaked key, or a deploy that 10×'d data transfer overnight. It detects; it does not enforce or explain.

AWS Budgets answers "are we on track against a number I set, and what should happen if we are not?" Proactive and threshold-driven — you define the target, Budgets enforces it and can even act on breach via Budget Actions (apply an SCP/IAM policy, stop EC2/RDS). Best for monthly cost caps, commitment-utilization floors, per-team guardrails, and automated stops on non-prod. It enforces the knowns; it cannot catch the spike you never forecast.

Cost Explorer answers "why did this number move, and where is the money going?" The investigation tool — interactive charts, group-by service / account / usage-type / tag, historical trends, plus where you build forecasts and RI/SP recommendations. It does not alert and it does not enforce. Best for root-causing an alert, planning commitments, and reporting. Anomaly Detection or a Budgets alert tells you something is wrong; Cost Explorer is where you find out why.

one-line rule of thumb

Anomaly Detection = catch the spike you never predicted. Budgets = enforce the number you committed to. Cost Explorer = explain why any number moved. Turn on Anomaly Detection and Budgets on day one — they are your two complementary alarms (the surprises, and the thresholds you chose) — then live in Cost Explorer whenever either alarm fires.

routing alerts

VII. Integrating with Slack and PagerDuty via SNS

An anomaly alert that lands in an inbox nobody watches is a detection system that fails silently. The value is in routing the signal to where the team already lives — and where it can page someone if it has to. Both flow through Amazon SNS.

Cost Anomaly Detection alert subscriptions can send to email directly, or — far more powerfully — to an Amazon SNS topic. SNS is the fan-out hub: one anomaly alert can simultaneously hit email subscribers, a Lambda function, an SQS queue, and chat / paging integrations. You must grant the Cost Anomaly Detection service principal permission to publish to the topic — a small SNS resource (access) policy that is the one step people miss, after which alerts flow.
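The access policy in question is short. A minimal sketch, assuming the service principal is `costalerts.amazonaws.com` (the principal AWS documents for Cost Anomaly Detection) and using a placeholder topic ARN and statement ID:

```python
import json


def anomaly_topic_policy(topic_arn: str) -> dict:
    """SNS access policy letting Cost Anomaly Detection publish to a topic."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCostAnomalyDetectionPublish",  # placeholder Sid
                "Effect": "Allow",
                "Principal": {"Service": "costalerts.amazonaws.com"},
                "Action": "SNS:Publish",
                "Resource": topic_arn,
            }
        ],
    }


topic_arn = "arn:aws:sns:us-east-1:111122223333:finops-anomalies"  # placeholder
policy_json = json.dumps(anomaly_topic_policy(topic_arn), indent=2)
# Assumed usage:
#   boto3.client("sns").set_topic_attributes(
#       TopicArn=topic_arn, AttributeName="Policy", AttributeValue=policy_json)
print(policy_json)
```

Setting the `Policy` attribute replaces the topic's whole access policy, so in practice this statement would be merged into whatever statements the topic already has.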

For Slack, two clean patterns. Simplest is AWS Chatbot (now part of Amazon Q Developer): subscribe it to the anomaly SNS topic, authorize your Slack workspace and channel, and anomalies render as formatted messages with no code to maintain. More flexible is SNS → Lambda → Slack incoming webhook, where a small function reshapes the alert into a rich message (service, account, dollar impact, a direct Cost Explorer deep-link to the affected service + date range) and posts to a channel webhook — so the responder can start step 3 of the triage workflow in one click.
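A hedged sketch of the SNS → Lambda → Slack path follows. The Lambda event wrapper (`Records[0].Sns.Message`) is the standard SNS shape; the anomaly field names (`impact.totalImpact`, `rootCauses`, `anomalyStartDate`, `anomalyDetailsLink`) are assumptions about the alert payload and should be checked against a real message before relying on them. `SLACK_WEBHOOK_URL` is a hypothetical environment variable; when it is unset the handler only formats the message.

```python
import json
import os
import urllib.request


def format_slack_message(anomaly: dict) -> dict:
    """Reshape an anomaly alert into a Slack incoming-webhook payload."""
    impact = anomaly.get("impact", {})
    top = (anomaly.get("rootCauses") or [{}])[0]  # first attributed root cause
    lines = [
        f"Cost anomaly: ${float(impact.get('totalImpact', 0)):,.2f} unexpected spend",
        f"Service: {top.get('service', 'unknown')}",
        f"Account: {top.get('linkedAccount', 'unknown')}",
        f"Usage type: {top.get('usageType', 'unknown')}",
        f"Started: {anomaly.get('anomalyStartDate', 'unknown')}",
    ]
    link = anomaly.get("anomalyDetailsLink")  # assumed field name
    if link:
        lines.append(f"Details: {link}")
    return {"text": "\n".join(lines)}


def handler(event: dict, context: object = None) -> dict:
    """Lambda entry point: parse the SNS record and post to Slack if configured."""
    anomaly = json.loads(event["Records"][0]["Sns"]["Message"])
    payload = format_slack_message(anomaly)
    webhook = os.environ.get("SLACK_WEBHOOK_URL")
    if webhook:
        request = urllib.request.Request(
            webhook,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(request, timeout=5).close()
    return payload
```

Keeping the formatting in a pure function means the message layout can be unit-tested without a Slack workspace or an AWS account.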

For PagerDuty (or Opsgenie), wire the same SNS topic to the provider's AWS/SNS integration so a material anomaly creates an incident and pages on-call — appropriate for the leaked-key class of event where a 2am $4K spike genuinely warrants waking someone. The usual split: route everything to a #finops Slack channel for visibility, but only page PagerDuty for anomalies above a higher dollar threshold, so on-call is woken for real fires, not the long tail.
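The Slack-for-everything, page-above-a-bar split reduces to a one-line rule. A minimal sketch, where the destination labels and the $1,000 default are illustrative placeholders rather than anything the service defines:

```python
def destinations(total_impact: float, page_above: float = 1000.0) -> list[str]:
    """Every anomaly goes to Slack; only material ones also page on-call."""
    targets = ["slack:#finops"]
    if total_impact >= page_above:
        targets.append("pagerduty")
    return targets


print(destinations(300))    # long-tail anomaly
print(destinations(4000))   # leaked-key class of event
```

The same split can be implemented without code by creating two alert subscriptions with different dollar thresholds, each pointing at its own SNS topic.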

Route both Cost Anomaly Detection and AWS Budgets alerts to that same #finops channel via SNS, so every spend signal — the surprises and the thresholds you set — lives in one place. That single channel is the heartbeat of a working FinOps practice.

practitioner playbook

VIII. Cost Anomaly Detection best practices that actually move the needle

A detection setup either becomes a trusted alarm the team acts on, or noise everyone mutes by week two — and muted detection is worse than none, because it gives false confidence. The difference is a handful of deliberate choices; these are the practices CloudRoute partners apply, ranked by impact.

Turn it on day one — it's free — There is no pricing reason to wait. Create at minimum one AWS-services monitor covering the whole account / Organization, with one individual-alert subscription. The most expensive surprises hit accounts that never enabled the one free tool built to catch them.
Use a combined $-AND-% threshold — The single biggest false-positive killer: alert only when impact is both above an absolute dollar floor AND above a percentage of expected. That one expression eliminates the trivial-dollar / huge-percent and large-dollar / tiny-percent noise that trains teams to mute the channel.
Scope monitors to how the business thinks — Layer a services monitor (always-on) with account, Cost Category, and tag monitors so anomalies are pre-attributed to a team / environment / product. Scoped monitors have cleaner baselines and sharper, lower-noise alerts than one giant lump.
Write the response runbook, not just the alert — Detection without a triage-to-root-cause-to-fix loop is just louder surprise. Document the five steps (triage → read attribution → root-cause in Cost Explorer → fix the resource → prevent recurrence) so an alert produces an action, not a shrug.
One channel for all spend signals — Fan Anomaly Detection and Budgets alerts into the same #finops Slack channel via SNS, and page PagerDuty only above a higher dollar bar. Signals scattered across inboxes get ignored; one watched channel becomes the FinOps heartbeat.
Close every anomaly with a guardrail — After a real anomaly, add the prevention — a Budget Action to auto-stop non-prod, an SCP on the offending region, tighter launch IAM, or a tag rule. Detection finds it once; the guardrail stops the same mistake twice.
Tag first, then trust tag monitors — Per-tag (and per-team) detection is only as good as your Cost Allocation Tags. Activate and enforce tags before relying on tag monitors, or untagged spend becomes a blind spot — the exact place stealth cost hides.
Expect noise after big changes, and let it learn — A migration, launch, or deliberate scale-up will trip alerts while the model relearns the new normal. Don't rip the monitor out — give it a couple of weeks to absorb the pattern, and the false positives subside on their own.

side by side

Cost Anomaly Detection vs Budgets vs Cost Explorer

The three native AWS cost tools, mapped to the question each one answers — so you stop forcing one to do another's job. Run all three; they are layers of one practice, not alternatives.

Dimension	Cost Anomaly Detection	AWS Budgets	Cost Explorer
Core question	Did something change I did NOT predict?	On track vs a number I set?	Why did this number move?
Posture	Proactive — ML, learns normal	Proactive — threshold-driven	Reactive — investigate & analyze
Who defines "normal"?	The model — you set nothing	You — you define the target	N/A — exploratory
Trigger	Statistically significant deviation	Actual or forecast crosses your threshold	You open it and explore
Can it act?	No — alert only	Yes — Budget Actions (SCP / IAM / stop)	No — analysis only
Latency	~24h after the anomaly begins	~8–12h (3× daily eval)	Historical, on demand
Cost	Free	Budgets themselves are free; first 2 action-enabled budgets free, then ~$0.10/day each	Free UI; API has a small per-request fee
Best for	Spikes, leaks, misconfigs you never forecast	Cost caps, commitment floors, per-team guardrails	Root cause, commitment planning, reporting

Turn on Cost Anomaly Detection and AWS Budgets on day one — they are your two complementary alarms (the spikes you could not predict, and the thresholds you chose). Reach for Cost Explorer the moment either alarm fires, to find the why before you fix.


a recent match

From a $19K leaked-key surprise to a 4-hour catch — anonymized

inquiry · seed-stage AI SaaS, ~$24K/mo AWS

Seed-stage AI SaaS, 11 engineers, ~$24K/month AWS across 6 accounts under one Organization, no anomaly detection enabled

Situation: A credential committed to a public repo was scraped and used to launch p4d GPU instances in two regions the company never operates in. With no Cost Anomaly Detection and only a single org-wide budget set well above run-rate, nothing fired — the spend accrued for nine days before the cloud lead happened to open the Billing console and found a ~$19K unexpected charge already booked. The internal cloud lead was full-time on the product and had no bandwidth to build or watch a detection layer.

What CloudRoute did: Routed within 20 hours to a US-based AWS partner with a FinOps / security-remediation track record. The partner ran a short cost audit, then stood up Cost Anomaly Detection: one org-wide AWS-services monitor, per-account monitors on the six linked accounts, and a Cost Category monitor for prod-vs-non-prod. Alert subscriptions used a combined threshold (impact > $150 AND > 40% of expected) fanned out via SNS to a new #finops Slack channel, with PagerDuty paging for anomalies above $1,000. They also rotated the leaked key, added an SCP denying instance launches outside the two approved regions, and tightened launch IAM. Newly enabled AWS Budgets (forecast alerts + non-prod auto-stop Budget Actions) landed in the same SNS pipeline.

Outcome: Two weeks later a misconfigured batch job 8×'d SageMaker spend overnight; the services monitor caught it within ~4 hours and paged on-call via PagerDuty, the responder root-caused it in Cost Explorer (an unbounded training loop) and killed it the same morning — estimated impact ~$300 instead of another five-figure surprise. Because the engagement qualified for AWS funding, the customer paid $0 for the setup and the first month of partner-run anomaly response; CloudRoute's commission came from the partner.

first leak (undetected): ~$19K / 9 days · next anomaly: caught in ~4h, ~$300 · detection live: <1 week · cost to customer: $0

faq

Common questions

Is AWS Cost Anomaly Detection free?

Yes — it is a fully free, native AWS service. There is no per-monitor fee and no per-alert fee; it runs on the Cost & Usage data AWS already collects to bill you. The only real cost is operational: someone has to read the alerts and act on them. That makes it the highest-ROI thing you can enable in the Billing console, since the downside is zero and the upside is catching a five-figure surprise early.

How is Cost Anomaly Detection different from AWS Budgets?

A budget is a threshold YOU set ("alert me at 85% of $20K"). That works for spend you can anticipate but cannot catch a surprise, because you cannot write a threshold for something you never predicted. Cost Anomaly Detection needs no threshold up front: a machine-learning model learns each service's normal pattern and alerts when actual spend deviates, so it catches the unknown-unknowns (a leaked key, a misconfig, a deploy that 10×'d data transfer). Budgets can also enforce on breach through Budget Actions; Anomaly Detection only alerts. They are complementary, so run both.

What types of monitors can I create?

Four. An AWS-services monitor (models every service independently — the one everyone should have); a linked-account monitor (anomalies scoped to specific accounts under your Organization); a Cost Category monitor (anomalies along a business dimension like Environment or Product, via AWS Cost Categories); and a Cost Allocation Tag monitor (anomalies filtered to a tag value like team=growth, for shared accounts). Start with one services monitor, then layer account / category / tag monitors as your structure matures.
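As a rough illustration, the four monitor types map onto request payloads for the Cost Explorer `create_anomaly_monitor` call along these lines. This is a sketch under assumptions: the monitor names, the account ID, and the `Environment` / `team` values are made up for the example, and the payload shapes reflect the Cost Explorer API as I understand it, so check them against the current API reference before relying on them.

```python
def services_monitor(name):
    """AWS-services monitor: models every service independently."""
    return {
        "MonitorName": name,
        "MonitorType": "DIMENSIONAL",
        "MonitorDimension": "SERVICE",
    }


def custom_monitor(name, specification):
    """Account, Cost Category, and tag monitors are all CUSTOM monitors."""
    return {
        "MonitorName": name,
        "MonitorType": "CUSTOM",
        "MonitorSpecification": specification,
    }


def linked_account_spec(account_ids):
    return {"Dimensions": {"Key": "LINKED_ACCOUNT", "Values": list(account_ids)}}


def cost_category_spec(category, value):
    return {"CostCategories": {"Key": category, "Values": [value]}}


def tag_spec(key, values):
    return {"Tags": {"Key": key, "Values": list(values)}}


# Hypothetical names and values, for illustration only.
MONITORS = [
    services_monitor("all-services"),
    custom_monitor("prod-account", linked_account_spec(["111111111111"])),
    custom_monitor("env-prod", cost_category_spec("Environment", "prod")),
    custom_monitor("team-growth", tag_spec("team", ["growth"])),
]


def create_monitors(monitors):
    """Create the monitors with boto3. Needs AWS credentials; not called here."""
    import boto3  # imported lazily so the payload builders run without it

    ce = boto3.client("ce")
    return [
        ce.create_anomaly_monitor(AnomalyMonitor=m)["MonitorArn"] for m in monitors
    ]


for monitor in MONITORS:
    print(monitor["MonitorName"], monitor["MonitorType"])
```

Keeping the payloads as plain dictionaries makes them easy to review in a pull request before anything is created in the account.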

How do I stop Cost Anomaly Detection from sending false positives?

The biggest lever is the threshold expression: alert only when impact is both above an absolute dollar floor AND above a percentage of expected (the AND expression) — this kills both the trivial-dollar / huge-percent spike on a near-zero service and the large-dollar / tiny-percent wobble on a big steady service. Beyond that: scope monitors to coherent business dimensions (cleaner baselines), give the model a couple of weeks to relearn after a migration or launch, and provide feedback on flagged anomalies so relevance improves over time.
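To make the AND expression concrete, here is a small sketch. `build_threshold_expression` produces the kind of `ThresholdExpression` an alert subscription accepts, using the $150 and 40% figures from the case study above. `would_alert` is my own local re-implementation of the rule for reasoning about examples; it is not how AWS evaluates anomalies internally, and its handling of a zero expected spend is an assumption.

```python
def build_threshold_expression(min_dollars, min_percent):
    """ThresholdExpression requiring BOTH a dollar floor and a percent floor."""

    def at_least(key, value):
        return {
            "Dimensions": {
                "Key": key,
                "MatchOptions": ["GREATER_THAN_OR_EQUAL"],
                "Values": [str(value)],
            }
        }

    return {
        "And": [
            at_least("ANOMALY_TOTAL_IMPACT_ABSOLUTE", min_dollars),
            at_least("ANOMALY_TOTAL_IMPACT_PERCENTAGE", min_percent),
        ]
    }


def would_alert(actual, expected, min_dollars, min_percent):
    """Local approximation of the AND rule, for reasoning only."""
    impact = actual - expected
    if impact < min_dollars:
        return False
    if expected <= 0:
        # Assumption: with no expected spend, treat the percentage as unbounded.
        return True
    return impact / expected * 100 >= min_percent


expression = build_threshold_expression(150, 40)

# Near-zero service: huge percentage, trivial dollars. Suppressed.
print(would_alert(actual=6.00, expected=0.50, min_dollars=150, min_percent=40))
# Big steady service: large dollars, tiny percentage. Suppressed.
print(would_alert(actual=10_400, expected=10_000, min_dollars=150, min_percent=40))
# Genuine spike: clears both floors. Alerts.
print(would_alert(actual=1_500, expected=500, min_dollars=150, min_percent=40))
```

The first two calls are the two false-positive classes the paragraph describes; only the third clears both floors.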

How fast does Cost Anomaly Detection alert me?

Typically within about 24 hours of the anomalous spend beginning, because it runs on Cost & Usage data that lands on a lag — it is not a real-time circuit breaker. That is still fast enough to turn a multi-week leak into a one-day one: catching a leaked-key GPU spike on day one instead of on next month's invoice is the difference between a few hundred dollars and five figures. For hard, immediate enforcement (auto-stopping resources at a cap), pair it with AWS Budgets and Budget Actions.

What should I do when I get an anomaly alert?

Run the triage loop. (1) Triage — did we cause this on purpose (a load test, a launch)? If yes, mark it and move on. (2) Read the anomaly record's attribution — it names the service, account, and usage type driving the impact. (3) Root-cause in Cost Explorer, filtered to that service and date range, grouped by usage type / account / tag, until "EC2 went up" becomes "an unbounded ASG launched 60 instances after the 14:00 deploy." (4) Fix the underlying resource — stop it, rotate the key, roll back the deploy. (5) Prevent recurrence with a Budget Action, SCP, or IAM guardrail.
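Step (3) of the loop can be sketched as a Cost Explorer query plus a small ranking helper. The sketch below only builds the parameters for `get_cost_and_usage` and ranks a response, so it runs without AWS credentials. The service name "Amazon SageMaker", the dates, and the usage-type strings in `SAMPLE_RESPONSE` are illustrative assumptions, not real billing data.

```python
from datetime import date, timedelta


def build_root_cause_query(service, anomaly_start, anomaly_end, group_by="USAGE_TYPE"):
    """Parameters for ce.get_cost_and_usage, scoped to one service and date range."""
    return {
        "TimePeriod": {
            "Start": anomaly_start.isoformat(),
            # Cost Explorer treats End as exclusive, so add one day.
            "End": (anomaly_end + timedelta(days=1)).isoformat(),
        },
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost"],
        "Filter": {"Dimensions": {"Key": "SERVICE", "Values": [service]}},
        "GroupBy": [{"Type": "DIMENSION", "Key": group_by}],
    }


def top_drivers(response, limit=3):
    """Sum cost per group across days and return the largest groups first."""
    totals = {}
    for day in response.get("ResultsByTime", []):
        for group in day.get("Groups", []):
            key = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[key] = totals.get(key, 0.0) + amount
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]


# Made-up response in the shape get_cost_and_usage returns.
SAMPLE_RESPONSE = {
    "ResultsByTime": [
        {"Groups": [
            {"Keys": ["training-gpu-hours"], "Metrics": {"UnblendedCost": {"Amount": "250.0", "Unit": "USD"}}},
            {"Keys": ["notebook-hours"], "Metrics": {"UnblendedCost": {"Amount": "10.0", "Unit": "USD"}}},
        ]},
        {"Groups": [
            {"Keys": ["training-gpu-hours"], "Metrics": {"UnblendedCost": {"Amount": "300.0", "Unit": "USD"}}},
            {"Keys": ["notebook-hours"], "Metrics": {"UnblendedCost": {"Amount": "5.0", "Unit": "USD"}}},
        ]},
    ]
}

query = build_root_cause_query("Amazon SageMaker", date(2026, 1, 10), date(2026, 1, 11))
print(query["TimePeriod"])
print(top_drivers(SAMPLE_RESPONSE))
```

With credentials in place, `boto3.client("ce").get_cost_and_usage(**query)` would return a real response to pass to `top_drivers`. Re-running the query with `group_by="LINKED_ACCOUNT"` narrows the same spike by account.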

Can I send Cost Anomaly Detection alerts to Slack or PagerDuty?

Yes — both flow through Amazon SNS. Point an alert subscription at an SNS topic (granting the Cost Anomaly Detection service principal permission to publish), then for Slack either subscribe AWS Chatbot / Amazon Q Developer (no-code, formatted messages) or use SNS → Lambda → Slack webhook for a richer message with a Cost Explorer deep-link. For PagerDuty, wire the same SNS topic to PagerDuty's SNS integration so material anomalies page on-call. A common split: route everything to a #finops Slack channel, but only page PagerDuty above a higher dollar threshold.
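A sketch of the two pieces described above: the SNS topic policy that lets Cost Anomaly Detection publish, and the "Slack for everything, PagerDuty above a threshold" split as it might appear inside an SNS → Lambda handler. Several things here are assumptions to verify: the topic ARN is a placeholder, `costalerts.amazonaws.com` is the service principal as I understand it from the AWS documentation, and the `impact.totalImpact` field name reflects my reading of the anomaly message format. The $1,000 paging threshold comes from the case study.

```python
import json

PAGE_THRESHOLD_USD = 1000.0  # paging threshold used in the case study


def build_topic_policy(topic_arn):
    """SNS access policy allowing Cost Anomaly Detection to publish to the topic."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCostAnomalyDetectionPublish",
                "Effect": "Allow",
                # Assumed service principal; confirm in the AWS documentation.
                "Principal": {"Service": "costalerts.amazonaws.com"},
                "Action": "SNS:Publish",
                "Resource": topic_arn,
            }
        ],
    }


def route_anomaly(sns_message_body, page_threshold=PAGE_THRESHOLD_USD):
    """Decide destinations for one anomaly message: always Slack, sometimes PagerDuty."""
    anomaly = json.loads(sns_message_body)
    # Assumed field name; a missing field is treated as zero impact.
    impact = float(anomaly.get("impact", {}).get("totalImpact", 0.0))
    routes = ["slack"]
    if impact > page_threshold:
        routes.append("pagerduty")
    return routes


# Placeholder ARN, for illustration only.
policy = build_topic_policy("arn:aws:sns:us-east-1:111111111111:cost-anomalies")
print(json.dumps(policy, indent=2))

print(route_anomaly(json.dumps({"impact": {"totalImpact": 320.0}})))
print(route_anomaly(json.dumps({"impact": {"totalImpact": 4200.0}})))
```

Keeping the routing decision in one small function makes the paging threshold easy to tune as the team learns which anomalies are worth waking someone for.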

Does a partner really set this up — and run it — for free?

Often, yes, for qualifying accounts. AWS funds partner-led cost-optimization and Well-Architected engagements, and a Well-Architected Review can unlock remediation credits, so the customer frequently gets the detection setup (monitors, alert subscriptions, SNS / Slack / PagerDuty wiring) plus the anomaly-response loop and the underlying right-sizing / commitment work for $0. Honest framing: AWS funding applies only to qualifying engagements; where it does not apply, the engagement is a vetted-partner referral that pays for itself out of the savings. CloudRoute is paid by the partner, not by you.

Want the spend-spike alarm — and someone to actually answer it — set up for you?

CloudRoute routes you to a vetted AWS partner who stands up Cost Anomaly Detection, wires Slack / PagerDuty, runs the anomaly-response loop, and does the underlying right-sizing / commitment work in one engagement. Often AWS-funded → you cut the bill for $0. Otherwise it pays for itself out of the savings.

Get matched in 24h →

→ see the startup persona detail

matched within: < 24h

typical bill cut: 20–40%

cost to you: often $0