Cost Anomaly Detection is the AWS service that learns each service's normal spend pattern and pages you when something deviates — a leaked key spinning up GPUs, a deploy that 10×'d data transfer, a forgotten cluster left running. It is free, it is machine-learning-based, and it catches the surprises a budget threshold structurally cannot. This guide covers every monitor type, how to set alert thresholds (absolute $ vs % vs ML-confidence), how the model works and how to kill false positives, the triage-to-root-cause workflow, and when Anomaly Detection beats Budgets or Cost Explorer.
Cost Anomaly Detection is a free, native AWS service that uses machine learning to model your normal spend per service and alert you when actual spend deviates beyond the model's expectation. It is the detection layer of FinOps on AWS — the alarm for the costs you never saw coming.
Mechanically, you do two things: create a monitor that scopes what the ML watches, and attach an alert subscription that decides who hears about a detected anomaly and how large it has to be. Behind that, AWS ingests your Cost & Usage data, builds a per-service baseline of expected spend, and continuously compares actual spend against it. When the deviation is large and confident enough, it raises an anomaly record with a start date, an estimated total impact in dollars, the root-cause service / account / usage-type breakdown it can attribute, and a chart of expected-versus-actual.
The service itself costs nothing — no per-monitor fee, no per-alert fee. It runs on the same Cost & Usage data AWS already collects to bill you, so enabling it is pure upside. The only real cost is operational: someone has to read the alerts and act on them, which is exactly the loop a CloudRoute partner can run for you.
Here is why this is a distinct tool and not a feature of Budgets. A budget is a threshold you set: "tell me at 85% of $20K." That works beautifully for spend you can anticipate — but you cannot write a threshold for a surprise. The night a leaked access key spins up a fleet of GPU instances in a region you never use, no budget you would have thought to create catches it; the spend is novel, in a service you do not normally touch, at a scale you never forecast. Anomaly Detection catches it precisely because it does not depend on you predicting it — the model knows that service has historically cost ~$0, so $4,000 of it overnight is, statistically, an obvious anomaly. Budgets enforce the knowns; Anomaly Detection surfaces the unknowns.
The honest boundary: it is not instant. Cost data lands on a lag, so an anomaly is typically surfaced within about 24 hours of the spend beginning — fast enough to stop a multi-day leak from becoming a multi-week one, but not a real-time circuit breaker. It also does not fix anything; it alerts. The fix happens in Cost Explorer (to find the why) and then in the console / IaC (to kill the resource, rotate the key, roll back the deploy). Anomaly Detection's one job is "tell me, fast and with low noise, when spend does something it has never done before."
A monitor defines the lens the ML looks through. The lens you choose decides how precisely an anomaly gets attributed — and the single most common mistake is creating one broad monitor and calling it done. Here is what each monitor type watches, and the order a practitioner usually creates them.
AWS services monitor. Watches: every AWS service in the account (or, from the management account, across the whole Organization), modelling each service's spend pattern independently.
Use it for: the default, always-on monitor. Because it models each service separately, a spike in a service that normally costs almost nothing — Bedrock, SageMaker, a new region — stands out instantly instead of being drowned in the total. AWS recommends every account run one of these, and it is the first monitor to create.
Pro move: you usually only need one services monitor — it already decomposes by service internally, so creating several overlapping ones just multiplies alerts. Scope the precision with account / category / tag monitors instead.
Linked account monitor. Watches: spend within one or more specific linked accounts under your AWS Organization.
Use it for: attributing an anomaly to a team or environment when you run an account-per-team structure. A monitor on the prod-payments account tells you immediately that the payments team's spend moved, not just that "AWS went up."
Pro move: pair account monitors with the org-wide services monitor — the services monitor tells you what service spiked; the account monitor tells you whose account it spiked in. Together that is most of your root-cause before you even open Cost Explorer.
Cost Category monitor. Watches: spend grouped by a Cost Category, AWS's rules engine for mapping raw line items into business buckets (e.g. Environment = prod / staging / dev, or Product = api / web / data-platform) regardless of which account or tag they came from.
Use it for: monitoring along the dimension your business actually thinks in. If "the data-platform product" spans three accounts and a dozen services, a Cost Category monitor watches it as one coherent thing.
Pro move: define Cost Categories first (they are also what makes Cost Explorer and chargeback legible), then point a monitor at the categories that matter. This is the most "FinOps-mature" monitor type.
Cost Allocation Tag monitor. Watches: spend filtered to a specific Cost Allocation Tag key/value (e.g. team=growth or service=checkout), for teams running shared accounts rather than account-per-team.
Use it for: per-team detection without the overhead of separate accounts. The catch is identical to per-tag budgeting: it only works if your resources are actually tagged, and untagged spend falls into a blind spot no tag monitor catches.
Pro move: activate and enforce your tags (a deny-on-untagged policy helps) before leaning on tag monitors — otherwise the monitor quietly misses every untagged resource, which is exactly where stealth spend tends to hide. Cost Allocation Tags are the foundation under both tag monitors and per-team budgets.
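The four monitor types above map onto request payloads for the Cost Explorer `CreateAnomalyMonitor` API. The following is a minimal sketch, assuming boto3's `ce` client is what you call it with; the monitor names, the account ID `111122223333`, and the category and tag values are placeholders rather than anything from a real environment.

```python
from typing import Any, Dict, List


def services_monitor(name: str) -> Dict[str, Any]:
    """AWS services monitor: one model per service, no extra scoping."""
    return {
        "MonitorName": name,
        "MonitorType": "DIMENSIONAL",
        "MonitorDimension": "SERVICE",
    }


def custom_monitor(name: str, spec: Dict[str, Any]) -> Dict[str, Any]:
    """Custom monitor scoped by a single Cost Explorer expression."""
    return {
        "MonitorName": name,
        "MonitorType": "CUSTOM",
        "MonitorSpecification": spec,
    }


def linked_accounts(account_ids: List[str]) -> Dict[str, Any]:
    """Scope for a linked account monitor."""
    return {"Dimensions": {"Key": "LINKED_ACCOUNT", "Values": account_ids}}


def cost_category(key: str, values: List[str]) -> Dict[str, Any]:
    """Scope for a Cost Category monitor."""
    return {"CostCategories": {"Key": key, "Values": values}}


def cost_allocation_tag(key: str, values: List[str]) -> Dict[str, Any]:
    """Scope for a Cost Allocation Tag monitor."""
    return {"Tags": {"Key": key, "Values": values}}


# Placeholder names and values, in the order the text suggests creating them.
MONITORS = [
    services_monitor("all-services"),
    custom_monitor("prod-payments-account", linked_accounts(["111122223333"])),
    custom_monitor("env-prod", cost_category("Environment", ["prod"])),
    custom_monitor("team-growth", cost_allocation_tag("team", ["growth"])),
]


def create_all(ce_client: Any) -> List[str]:
    """Create each monitor and return its ARN; ce_client is boto3.client("ce")."""
    return [
        ce_client.create_anomaly_monitor(AnomalyMonitor=m)["MonitorArn"]
        for m in MONITORS
    ]
```

Keeping the payloads as plain data makes them easy to review in a pull request before anything is created; `create_all(boto3.client("ce"))` is the only line that touches AWS.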
If you do nothing else today: create one AWS-services monitor covering the whole account (or whole Organization from the payer), and attach one individual-alert subscription with a sensible dollar threshold. That single pair catches the leaked-key / runaway-resource class of surprise — the most expensive kind, because nobody set a budget for a service that never cost anything before. Add account, category, and tag monitors as your structure matures.
A monitor with no subscription is a model talking to itself. The subscription decides who gets told, how often, and — critically — how big an anomaly has to be before it interrupts a human. Set the threshold too low and you train the team to mute; too high and you miss the slow leak.
An alert subscription attaches to one or more monitors and has three moving parts: a frequency, a threshold expression, and a set of recipients. The threshold is the whole game, and AWS gives you three ways to express it.
The simplest and most-used threshold: alert when the anomaly's total estimated dollar impact exceeds an absolute figure — say $100, or $500 for a larger account. This is intuitive and ties directly to materiality: you do not want a page for $3 of unexpected spend, but $500 you never forecast deserves a look.
Tune the figure to the account size. On a ~$5K/month account, $100 is a reasonable floor; on a ~$200K/month account, $100 is noise — set it to $1,000+ so only genuinely material anomalies break through.
Alert when actual spend exceeds the model's expected spend by more than a chosen percentage — e.g. 40% above expected. This scales with the workload instead of a fixed dollar line, which suits a service whose normal spend is itself growing.
The trap: on a service that normally costs near-zero, a tiny absolute jump can be a huge percentage and fire a noisy alert. Which is exactly why AWS lets you combine the two.
The most robust setup uses an AND expression: alert only when the anomaly is both above an absolute dollar impact and above a percentage of expected spend. That single condition kills the two classic false-positive sources at once — the trivial-dollar-but-huge-percent spike on a near-zero service, and the large-dollar-but-tiny-percent wobble on a big steady service. AWS's anomaly impact also carries a confidence weighting from the model, so the dollar figure you threshold against is already the model's best estimate of unexpected impact, not raw spend.
A practical default: total impact > $100 AND impact > 40% of expected, tuned up as the account grows. That one expression is the difference between an alert channel the team trusts and one they mute by week two.
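To see why the AND expression removes both false-positive classes, it helps to run the arithmetic locally. This is a sketch of the decision rule only, not AWS's implementation; the function names are hypothetical, and treating zero expected spend as an infinite percentage is an assumption of this example.

```python
import math


def impact_percentage(actual: float, expected: float) -> float:
    """Unexpected spend as a percentage of expected spend.

    Assumption of this sketch: when expected spend is zero, the percentage is
    treated as infinite, so only the dollar condition decides.
    """
    if expected <= 0:
        return math.inf if actual > 0 else 0.0
    return (actual - expected) * 100.0 / expected


def should_alert(
    actual: float,
    expected: float,
    min_dollars: float = 100.0,
    min_percent: float = 40.0,
) -> bool:
    """Alert only when BOTH the dollar impact and the percentage are exceeded."""
    impact = actual - expected
    return impact > min_dollars and impact_percentage(actual, expected) > min_percent


# Near-zero service: $1 expected, $4 actual. Huge percentage, trivial dollars.
print(should_alert(4.0, 1.0))
# Big steady service: $800 over a $50,000 baseline. Large dollars, tiny percentage.
print(should_alert(50_800.0, 50_000.0))
# Leaked key: $4,000 on a service that normally costs nothing.
print(should_alert(4_000.0, 0.0))
```

Only the third case passes both conditions, which is the behaviour the combined threshold is meant to give you.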
Each subscription is either individual alerts (one notification per detected anomaly — use this with SNS for anything you might need to act on in real time) or a daily / weekly summary (a digest of everything detected — good for a finance stakeholder who wants the roll-up, not the pages). A common pattern: an individual-alert subscription to the on-call SNS topic for material anomalies, plus a weekly-summary email to finance for the long tail.
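The on-call plus finance pattern can be sketched as two `CreateAnomalySubscription` payloads sharing one threshold expression. This assumes boto3's `ce` client; the subscription names, monitor ARNs, topic ARN, and email address are placeholders. Two API details are worth knowing: the threshold keys only accept the `GREATER_THAN_OR_EQUAL` match option, and immediate alerts go to SNS while daily and weekly summaries go to email.

```python
from typing import Any, Dict, List


def threshold_and(min_dollars: int, min_percent: int) -> Dict[str, Any]:
    """Combined threshold: dollar impact AND percentage of expected spend."""

    def dim(key: str, value: int) -> Dict[str, Any]:
        return {
            "Dimensions": {
                "Key": key,
                "MatchOptions": ["GREATER_THAN_OR_EQUAL"],
                "Values": [str(value)],
            }
        }

    return {
        "And": [
            dim("ANOMALY_TOTAL_IMPACT_ABSOLUTE", min_dollars),
            dim("ANOMALY_TOTAL_IMPACT_PERCENTAGE", min_percent),
        ]
    }


def individual_alerts(
    name: str, monitor_arns: List[str], sns_topic_arn: str, threshold: Dict[str, Any]
) -> Dict[str, Any]:
    """One notification per anomaly, delivered to an SNS topic."""
    return {
        "SubscriptionName": name,
        "MonitorArnList": monitor_arns,
        "Frequency": "IMMEDIATE",
        "Subscribers": [{"Type": "SNS", "Address": sns_topic_arn}],
        "ThresholdExpression": threshold,
    }


def weekly_summary(
    name: str, monitor_arns: List[str], emails: List[str], threshold: Dict[str, Any]
) -> Dict[str, Any]:
    """Weekly digest delivered by email."""
    return {
        "SubscriptionName": name,
        "MonitorArnList": monitor_arns,
        "Frequency": "WEEKLY",
        "Subscribers": [{"Type": "EMAIL", "Address": e} for e in emails],
        "ThresholdExpression": threshold,
    }


def create_subscription(ce_client: Any, payload: Dict[str, Any]) -> str:
    """ce_client is boto3.client("ce"); returns the new subscription ARN."""
    return ce_client.create_anomaly_subscription(AnomalySubscription=payload)[
        "SubscriptionArn"
    ]
```

A typical call would be `create_subscription(ce, individual_alerts("oncall", arns, topic_arn, threshold_and(100, 40)))`, followed by a second call with `weekly_summary` for the finance digest.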
You cannot tune what you do not understand. The model is not magic, and its failure modes are predictable — which means false positives are fixable rather than something you suffer. Here is the mental model, and the levers.
Cost Anomaly Detection builds a statistical baseline of each monitored dimension's spend over time, learning daily and weekly seasonality — weekday-versus-weekend patterns, month-boundary effects, the natural rhythm of your batch jobs. When new spend lands, it compares actual against the predicted range and scores how far outside the expected band it is. A large, confident deviation becomes an anomaly with an estimated dollar impact; small wobbles inside the expected range are ignored. The model also adapts: a one-off spike it flags today becomes part of the learned pattern if it persists, so a genuine, sustained step-up in spend stops alerting once it is the new normal.
Two consequences worth internalising. First, the model needs several weeks of history to predict well; on a brand-new account, or right after a major architecture change, expect more false positives until it relearns. Second, anomalies are about shape, not amount: a steady $50K/month service that stays steady never alerts, while a $20 service that suddenly costs $200 might, because the second is a deviation from its pattern and the first is not.
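The "shape, not amount" point can be illustrated with a toy baseline. This is deliberately not AWS's model, whose internals are not public; it is a plain mean-plus-standard-deviation band over recent daily spend, with a hypothetical `floor` parameter so the band never collapses to zero width on a flat series. It also ignores the weekly seasonality the real model learns.

```python
from statistics import mean, pstdev
from typing import List


def is_anomalous(
    history: List[float], today: float, z: float = 3.0, floor: float = 1.0
) -> bool:
    """Flag today's spend if it sits more than z deviations above the baseline.

    history: recent daily spend for one service, in dollars.
    floor:   minimum band width in dollars, so tiny services do not alert on cents.
    """
    baseline = mean(history)
    spread = max(pstdev(history), floor)
    return today > baseline + z * spread


# Large, steady service (about $50K/month): a normal day stays inside the band.
steady = [1650.0, 1680.0, 1660.0, 1670.0, 1655.0, 1675.0, 1665.0]
print(is_anomalous(steady, 1675.0))

# Small service (about $20/month) that jumps roughly tenfold in a day.
small = [0.60, 0.70, 0.65, 0.70, 0.60, 0.70, 0.70]
print(is_anomalous(small, 6.50))
```

The large service never trips the check despite spending thousands per day, while the small one does on a jump of a few dollars, because the test is relative to each service's own history.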
Detection without a response runbook is just a louder way to be surprised by the invoice. The value is in the loop: an alert fires, a human triages it, root-causes it in Cost Explorer, and fixes the underlying resource. Here is the workflow a practitioner runs, start to finish.
When an individual alert lands, the anomaly record itself already gives you a head start: the start date (when the deviation began), the estimated total impact in dollars, the root-cause attribution AWS could infer (which service, which linked account, which usage type, sometimes which region), and an expected-versus-actual chart. That is usually enough to know whether this is a five-minute fix or a real incident. The disciplined response runs in five steps:
Anomaly Detection tells you something changed → the anomaly record tells you roughly what → Cost Explorer tells you exactly why → you fix the resource → a Budget Action / SCP / IAM guardrail stops the recurrence. Speed through that loop is the entire ROI; a CloudRoute partner runs it for you so an alert at 2am doesn't wait for Monday.
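The hand-off from the anomaly record to Cost Explorer can be sketched as a function that turns one record into a `GetCostAndUsage` request. The field names follow the shape returned by the Cost Explorer `GetAnomalies` API as far as I know it (`AnomalyStartDate`, `AnomalyEndDate`, `RootCauses`); treat them as assumptions and check them against a real record. The seven-day lookback is an arbitrary choice to show the baseline before the spike.

```python
from datetime import date, timedelta
from typing import Any, Dict


def drilldown_request(anomaly: Dict[str, Any], lookback_days: int = 7) -> Dict[str, Any]:
    """Build a daily, per-usage-type cost query around one anomaly."""
    start_raw = anomaly["AnomalyStartDate"]
    # An ongoing anomaly may have no end date yet; fall back to the start date.
    end_raw = anomaly.get("AnomalyEndDate") or start_raw

    start = date.fromisoformat(start_raw[:10]) - timedelta(days=lookback_days)
    # The End of a Cost Explorer TimePeriod is exclusive, so add one day.
    end = date.fromisoformat(end_raw[:10]) + timedelta(days=1)

    request: Dict[str, Any] = {
        "TimePeriod": {"Start": start.isoformat(), "End": end.isoformat()},
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
    }

    # Narrow to the attributed service when the record names one.
    causes = anomaly.get("RootCauses") or []
    if causes and causes[0].get("Service"):
        request["Filter"] = {
            "Dimensions": {"Key": "SERVICE", "Values": [causes[0]["Service"]]}
        }
    return request
```

With boto3 this would be used as `ce.get_cost_and_usage(**drilldown_request(anomaly))`, which returns daily cost per usage type for the affected service so the spike and its baseline sit side by side.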
These three native tools get constantly confused, and teams waste effort forcing one to do another's job. They are complementary layers of a single practice, not competitors. Here is the clean mental model, with the full comparison table below.
Cost Anomaly Detection answers "did something just change that I did not predict?" A free ML monitor that learns each dimension's normal pattern and alerts on significant deviations — the unknown-unknowns, because you cannot set a threshold for a surprise. Best for a misconfigured service, a leaked key, or a deploy that 10×'d data transfer overnight. It detects; it does not enforce or explain.
AWS Budgets answers "are we on track against a number I set, and what should happen if we are not?" Proactive and threshold-driven — you define the target, Budgets enforces it and can even act on breach via Budget Actions (apply an SCP/IAM policy, stop EC2/RDS). Best for monthly cost caps, commitment-utilization floors, per-team guardrails, and automated stops on non-prod. It enforces the knowns; it cannot catch the spike you never forecast.
Cost Explorer answers "why did this number move, and where is the money going?" The investigation tool — interactive charts, group-by service / account / usage-type / tag, historical trends, plus where you build forecasts and RI/SP recommendations. It does not alert and it does not enforce. Best for root-causing an alert, planning commitments, and reporting. Anomaly Detection or a Budgets alert tells you something is wrong; Cost Explorer is where you find out why.
Anomaly Detection = catch the spike you never predicted. Budgets = enforce the number you committed to. Cost Explorer = explain why any number moved. Turn on Anomaly Detection and Budgets on day one — they are your two complementary alarms (the surprises, and the thresholds you chose) — then live in Cost Explorer whenever either alarm fires.
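For contrast with the anomaly monitors, here is what the "known threshold" side looks like as an AWS Budgets `CreateBudget` request: the "tell me at 85% of $20K" example expressed as data. This is a sketch assuming boto3's `budgets` client; the budget name, account ID, and topic ARN are placeholders.

```python
from typing import Any, Dict


def budget_request(
    account_id: str,
    sns_topic_arn: str,
    limit_usd: int = 20_000,
    alert_percent: float = 85.0,
) -> Dict[str, Any]:
    """Monthly cost budget that notifies SNS when actual spend passes a percentage."""
    return {
        "AccountId": account_id,
        "Budget": {
            "BudgetName": "monthly-cost-cap",  # placeholder name
            "BudgetType": "COST",
            "TimeUnit": "MONTHLY",
            "BudgetLimit": {"Amount": str(limit_usd), "Unit": "USD"},
        },
        "NotificationsWithSubscribers": [
            {
                "Notification": {
                    "NotificationType": "ACTUAL",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": alert_percent,
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [
                    {"SubscriptionType": "SNS", "Address": sns_topic_arn}
                ],
            }
        ],
    }
```

It would be applied with `boto3.client("budgets").create_budget(**budget_request(...))`. Note that every number in the request is one you chose, which is exactly why a budget cannot catch spend you never thought to write a number for.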
An anomaly alert that lands in an inbox nobody watches is a detection system that fails silently. The value is in routing the signal to where the team already lives — and where it can page someone if it has to. Both flow through Amazon SNS.
Cost Anomaly Detection alert subscriptions can send to email directly, or — far more powerfully — to an Amazon SNS topic. SNS is the fan-out hub: one anomaly alert can simultaneously hit email subscribers, a Lambda function, an SQS queue, and chat / paging integrations. You must grant the Cost Anomaly Detection service principal permission to publish to the topic — a small SNS resource (access) policy that is the one step people miss, after which alerts flow.
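The access policy people miss looks roughly like the following. The service principal for Cost Anomaly Detection is `costalerts.amazonaws.com`; the `Sid`, the topic ARN, and the account ID are placeholders, and the `aws:SourceAccount` condition is an optional hardening step rather than a requirement.

```python
import json
from typing import Any, Dict


def topic_policy(topic_arn: str, account_id: str) -> Dict[str, Any]:
    """SNS access policy letting Cost Anomaly Detection publish to one topic."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCostAnomalyDetectionPublish",  # placeholder Sid
                "Effect": "Allow",
                "Principal": {"Service": "costalerts.amazonaws.com"},
                "Action": "SNS:Publish",
                "Resource": topic_arn,
                # Optional: only accept publishes made on behalf of this account.
                "Condition": {"StringEquals": {"aws:SourceAccount": account_id}},
            }
        ],
    }


def apply_policy(sns_client: Any, topic_arn: str, account_id: str) -> None:
    """sns_client is boto3.client("sns"). This REPLACES the topic's whole policy."""
    sns_client.set_topic_attributes(
        TopicArn=topic_arn,
        AttributeName="Policy",
        AttributeValue=json.dumps(topic_policy(topic_arn, account_id)),
    )
```

Because `set_topic_attributes` overwrites the existing policy, merge this statement into whatever the topic already has if other publishers use it.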
For Slack, two clean patterns. Simplest is AWS Chatbot (now part of Amazon Q Developer): subscribe it to the anomaly SNS topic, authorize your Slack workspace and channel, and anomalies render as formatted messages with no code to maintain. More flexible is SNS → Lambda → Slack incoming webhook, where a small function reshapes the alert into a rich message (service, account, dollar impact, a direct Cost Explorer deep-link to the affected service + date range) and posts to a channel webhook, so the responder can start the Cost Explorer root-cause step of the triage workflow in one click.
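A minimal sketch of the SNS → Lambda → Slack path follows. The JSON field names read from the SNS message (`impact.totalImpact`, `rootCauses`, `anomalyStartDate`, `anomalyDetailsLink`, `accountId`) are my understanding of the alert payload and should be treated as assumptions; the code reads them defensively so a missing field degrades to "unknown" rather than failing. `SLACK_WEBHOOK_URL` is a hypothetical environment variable holding the incoming-webhook URL, and the sketch links to the anomaly details page rather than building a Cost Explorer deep-link.

```python
import json
import os
import urllib.request
from typing import Any, Dict


def slack_message(anomaly: Dict[str, Any]) -> Dict[str, str]:
    """Reshape one anomaly alert into a Slack incoming-webhook payload."""
    cause = (anomaly.get("rootCauses") or [{}])[0]
    impact = anomaly.get("impact", {}).get("totalImpact", 0.0)
    account = cause.get("linkedAccount", anomaly.get("accountId", "unknown"))
    lines = [
        f"Cost anomaly: ${impact:,.2f} unexpected spend",
        f"Service: {cause.get('service', 'unknown')}",
        f"Account: {account}",
        f"Started: {anomaly.get('anomalyStartDate', 'unknown')}",
    ]
    link = anomaly.get("anomalyDetailsLink")
    if link:
        lines.append(f"Details: {link}")
    return {"text": "\n".join(lines)}


def handler(event: Dict[str, Any], context: Any) -> None:
    """Lambda entry point: post every anomaly in the SNS event to Slack."""
    url = os.environ["SLACK_WEBHOOK_URL"]  # hypothetical variable name
    for record in event["Records"]:
        anomaly = json.loads(record["Sns"]["Message"])
        body = json.dumps(slack_message(anomaly)).encode("utf-8")
        request = urllib.request.Request(
            url, data=body, headers={"Content-Type": "application/json"}
        )
        urllib.request.urlopen(request, timeout=5)
```

Keeping the formatting in `slack_message` and the network call in `handler` means the message shape can be unit-tested without a webhook.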
For PagerDuty (or Opsgenie), wire the same SNS topic to the provider's AWS/SNS integration so a material anomaly creates an incident and pages on-call — appropriate for the leaked-key class of event where a 2am $4K spike genuinely warrants waking someone. The usual split: route everything to a #finops Slack channel for visibility, but only page PagerDuty for anomalies above a higher dollar threshold, so on-call is woken for real fires, not the long tail.
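The Slack-for-everything, page-for-fires split reduces to one comparison. The sketch below uses hypothetical destination labels and a $1,000 paging threshold as an example figure; tune it to what is worth waking someone for in your account.

```python
from typing import List


def destinations(total_impact: float, page_threshold: float = 1000.0) -> List[str]:
    """Every anomaly goes to Slack; only those above the threshold also page."""
    targets = ["slack:#finops"]
    if total_impact > page_threshold:
        targets.append("pagerduty")
    return targets


print(destinations(300.0))
print(destinations(4000.0))
```

If you would rather not run code for this, the same split can be had natively with two alert subscriptions on the same monitors: one with the lower threshold pointing at the Slack topic, one with the higher threshold pointing at the paging topic.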
Route both Cost Anomaly Detection and AWS Budgets alerts to that same #finops channel via SNS, so every spend signal — the surprises and the thresholds you set — lives in one place. That single channel is the heartbeat of a working FinOps practice.
A detection setup either becomes a trusted alarm the team acts on, or noise everyone mutes by week two — and muted detection is worse than none, because it gives false confidence. The difference is a handful of deliberate choices; these are the practices CloudRoute partners apply, ranked by impact.
The three native AWS cost tools, mapped to the question each one answers — so you stop forcing one to do another's job. Run all three; they are layers of one practice, not alternatives.
| Dimension | Cost Anomaly Detection | AWS Budgets | Cost Explorer |
|---|---|---|---|
| Core question | Did something change I did NOT predict? | On track vs a number I set? | Why did this number move? |
| Posture | Proactive — ML, learns normal | Proactive — threshold-driven | Reactive — investigate & analyze |
| Who defines "normal"? | The model — you set nothing | You — you define the target | N/A — exploratory |
| Trigger | Statistically significant deviation | Actual or forecast crosses your threshold | You open it and explore |
| Can it act? | No — alert only | Yes — Budget Actions (SCP / IAM / stop) | No — analysis only |
| Latency | ~24h after the anomaly begins | ~8–12h (3× daily eval) | Historical, on demand |
| Cost | Free | Free to monitor; first 2 action-enabled budgets free, then $0.10/day each | Free UI; API costs $0.01 per request |
| Best for | Spikes, leaks, misconfigs you never forecast | Cost caps, commitment floors, per-team guardrails | Root cause, commitment planning, reporting |
Situation: A credential committed to a public repo was scraped and used to launch p4d GPU instances in two regions the company never operates in. With no Cost Anomaly Detection and only a single org-wide budget set well above run-rate, nothing fired — the spend accrued for nine days before the cloud lead happened to open the Billing console and found a ~$19K unexpected charge already booked. The internal cloud lead was full-time on the product and had no bandwidth to build or watch a detection layer.
What CloudRoute did: Routed within 20 hours to a US-based AWS partner with a FinOps / security-remediation track record. The partner ran a short cost audit, then stood up Cost Anomaly Detection: one org-wide AWS-services monitor, per-account monitors on the six linked accounts, and a Cost Category monitor for prod-vs-non-prod. Alert subscriptions used a combined threshold (impact > $150 AND > 40% of expected) fanned out via SNS to a new #finops Slack channel, with PagerDuty paging for anomalies above $1,000. They also rotated the leaked key, added an SCP denying instance launches outside the two approved regions, and tightened launch IAM. Newly enabled AWS Budgets (forecast alerts + non-prod auto-stop Budget Actions) landed in the same SNS pipeline.
Outcome: Two weeks later a misconfigured batch job 8×'d SageMaker spend overnight; the services monitor caught it within ~4 hours and paged on-call via PagerDuty, the responder root-caused it in Cost Explorer (an unbounded training loop) and killed it the same morning — estimated impact ~$300 instead of another five-figure surprise. Because the engagement qualified for AWS funding, the customer paid $0 for the setup and the first month of partner-run anomaly response; CloudRoute's commission came from the partner.
first leak (undetected): ~$19K / 9 days · next anomaly: caught in ~4h, ~$300 · detection live: <1 week · cost to customer: $0
CloudRoute routes you to a vetted AWS partner who stands up Cost Anomaly Detection, wires Slack / PagerDuty, runs the anomaly-response loop, and does the underlying right-sizing / commitment work in one engagement. Often AWS-funded → you cut the bill for $0. Otherwise it pays for itself out of the savings.