Observability is how you answer "why is it slow / broken / on fire?" without SSHing into a box and guessing. This page walks the three pillars — metrics, logs, and traces — the AWS-native stack (CloudWatch, X-Ray, Managed Prometheus and Grafana) versus third-party (Datadog, Grafana Cloud, New Relic), why OpenTelemetry is the instrumentation layer you want underneath either, how to build dashboards and SLOs and alerts that do not page-fatigue your team, and how to stop logs from quietly becoming your second-biggest AWS line item.
The words "monitoring" and "observability" get used interchangeably, but they are not the same. Monitoring watches a fixed set of things you already knew to watch (CPU, error rate, a health-check endpoint) and fires when one crosses a line. Observability is the broader property: can you ask an arbitrary new question of your running system, after the fact, without shipping new code to answer it?
A useful framing: monitoring answers "is it broken?" against known failure modes. Observability answers "why is this specific thing slow for this specific cohort of users right now?" — questions you did not anticipate when you wrote the dashboards. The first is a subset of the second. You build monitoring on top of an observable system; you cannot bolt observability on after the incident.
The practical test is the 3am test. A request is timing out for one customer in eu-central-1 but not us-east-1, only on the checkout path, only since the last deploy. If your answer is "let me add some logging and redeploy," you have monitoring. If your answer is "let me filter the traces for that route and region and find the slow span," you have observability. The difference is whether the data to answer the question was already being collected.
This matters commercially because observability is what shrinks your MTTR — mean time to resolution. The cost of an incident is roughly its blast radius multiplied by how long it lasts, and observability attacks the second factor directly. A team that can localize a regression to a single downstream call in two minutes recovers in a fraction of the time of a team grepping logs across a fleet. For a revenue-critical system, that delta is the entire business case.
None of this requires a particular vendor. It requires that the three signals below are actually being emitted, stored long enough to be useful, and correlated so you can pivot from a metric spike to the logs and traces behind it. The rest of this page is how to get there on AWS without overspending or paging your engineers into burnout.
Observability is the ability to ask new questions of your system after it is already running — without deploying new instrumentation to answer them. Monitoring is the alerting layer you build on top once the system is observable. If answering a novel production question requires a code change and a deploy, you have monitoring, not observability.
Observability data comes in three shapes, each answering a different kind of question at a different cost. You need all three, and — critically — you need them linked, so a spike in one becomes a one-click pivot into the others. Here is what each pillar is for, and where each one is the wrong tool.
Think of them as zoom levels. Metrics are the wide shot — cheap, aggregated, always-on, great for "is something wrong and roughly where." Traces are the mid shot — one request's journey across services, great for "which hop is slow." Logs are the close-up — the exact event detail, great for "what precisely happened in that span." You move metrics → traces → logs as you narrow from symptom to root cause.
What metrics are: numeric measurements sampled over time, such as request rate, error rate, p50/p95/p99 latency, CPU, queue depth, and saturation. They are pre-aggregated and tiny, so you can keep them at high resolution for a long time cheaply, and they are what dashboards and most alerts are built on.
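To make the percentile vocabulary concrete, here is a minimal Python sketch that computes p50/p95/p99 from raw latency samples using only the standard library. The function name `latency_percentiles` and the sample data are illustrative assumptions; real metric backends typically derive percentiles from histograms rather than raw lists.

```python
import statistics


def latency_percentiles(samples_ms):
    """Return p50/p95/p99 for a list of latency samples in milliseconds."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Hypothetical data: 98 fast requests and 2 very slow ones.
samples = [20.0] * 98 + [900.0] * 2
summary = latency_percentiles(samples)
mean_ms = statistics.mean(samples)

print(summary, mean_ms)
```

In this made-up sample the mean lands near 37.6 ms, which describes neither the typical request nor the slow tail. That is why latency dashboards lean on percentiles rather than averages.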
What they are good for: the fast "is it healthy, and is it trending the wrong way?" question, and the four golden signals (latency, traffic, errors, saturation) for every service. Metrics are your always-on early-warning layer.
Where they fall short: metrics are aggregates, so they tell you that p99 latency doubled but not which requests or why. High-cardinality questions ("latency by customer ID by endpoint") get expensive or impossible in pure metrics — that is where traces and logs take over. On AWS, metrics live in CloudWatch Metrics and/or Amazon Managed Service for Prometheus.
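The cardinality problem is easiest to see as arithmetic. As a rough sketch, the upper bound on distinct time series for one metric is the product of the number of values each label can take. The label names and counts below are hypothetical.

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on distinct time series: product of per-label value counts."""
    return prod(label_cardinalities.values())


# A latency metric labeled by endpoint and status class only.
low = series_count({"endpoint": 20, "status_class": 5})

# The same metric once a customer ID label is added.
high = series_count({"endpoint": 20, "status_class": 5, "customer_id": 10_000})

print(low, high)
```

Adding one high-cardinality label multiplies the series count by the number of customers, which is the point at which a per-request signal (traces or logs) is usually the better home for that dimension.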
What logs are: timestamped records of discrete events, such as a request, an error with a stack trace, an audit entry, or a state change. Logs are the richest signal per event, and the one engineers reach for instinctively. Structured logs (JSON with consistent fields) are vastly more useful than free text, because you can filter and aggregate them.
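As an illustration of what "structured" means in practice, the sketch below emits one JSON object per log line using Python's standard `logging` module. The `JsonFormatter` class, the `fields` key passed through `extra`, and the field names are assumptions made for this example, not a prescribed schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object with consistent keys."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge caller-supplied structured fields, if any were passed via `extra`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout.demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.info(
    "payment authorized",
    extra={"fields": {"route": "/checkout", "region": "eu-central-1", "duration_ms": 182}},
)

line = stream.getvalue().strip()
print(line)
```

Because every line carries the same keys, a query such as "all records where `route` is `/checkout` and `duration_ms` exceeds some bound" becomes a filter rather than a regular-expression hunt.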
What they are good for: the exact detail of a specific event once you have localized the problem — the error message, the parameters, the stack trace, the "what was this code actually doing." Also the system of record for audit and security events.
Where they fall short: volume and cost. Logs are by far the most expensive pillar to ingest and store at scale, and "log everything at debug in production" is the single most common way teams blow up an observability bill. Logs are also bad at answering aggregate questions cheaply — that is metrics' job. On AWS, logs land in CloudWatch Logs (and often get queried with Logs Insights), and log-cost control (section VI) is its own discipline.
What traces are: the end-to-end record of a single request as it moves through your system (API gateway to service A to service B to the database), broken into timed "spans," one per operation, stitched together by a shared trace ID. Tracing is the pillar most teams have least of, and the one that pays off most in a microservices or serverless architecture.
What they are good for: "where did the latency actually go?" and "which downstream call failed?" in a distributed system. A trace turns "checkout is slow" into "the 380ms is 350ms waiting on the inventory service's database call" in one view. Indispensable once a request touches more than two or three services.
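To show how a trace answers "where did the latency go," here is a toy model of spans in plain Python. It is not the OpenTelemetry API; the `Span` dataclass, the span names, and the timings are hypothetical, chosen to mirror the checkout example above. It also assumes child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_time_ms(span, spans):
    """Time spent in this span itself, excluding its direct children.

    Assumes children do not overlap each other in time.
    """
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


# One request: checkout -> inventory service -> database query.
spans = [
    Span("t1", "a", None, "POST /checkout", 0.0, 380.0),
    Span("t1", "b", "a", "inventory-service GET /stock", 10.0, 370.0),
    Span("t1", "c", "b", "db SELECT stock", 15.0, 365.0),
]

slowest = max(spans, key=lambda s: self_time_ms(s, spans))
print(slowest.name, self_time_ms(slowest, spans))
```

Ranking spans by self time rather than total duration is what separates "the request was slow" from "this specific operation was slow"; the root span always has the largest total duration, so it tells you little on its own.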
Where they fall short: traces need instrumentation in your code (or auto-instrumentation agents), and at high traffic you sample them rather than keep every one, so a specific rare request may not have a stored trace. On AWS, traces go to AWS X-Ray natively, or to a third-party backend; either way, OpenTelemetry (section IV) is how you emit them portably.
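The sampling tradeoff can be sketched in a few lines. The function below makes a deterministic keep-or-drop decision from the trace ID, so every service that sees the same trace ID reaches the same decision. This illustrates the general idea of ratio-based head sampling; it is not the exact algorithm any particular SDK uses, and `keep_trace` is a hypothetical name.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministically keep roughly `sample_rate` of all trace IDs."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    # Integer comparison avoids float rounding at the edges of the range.
    return bucket < int(sample_rate * 2**64)


trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = [t for t in trace_ids if keep_trace(t, 0.10)]

print(len(kept), "of", len(trace_ids), "traces kept")
```

At a 10% rate roughly nine in ten requests leave no stored trace, which is the limitation described above: a specific rare request may simply not be there when you go looking.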
In practice the gap is almost always traces. Teams have metrics (CloudWatch ships many for free) and logs (everyone logs), but no distributed tracing — so the moment a request crosses several services, root-causing latency becomes archaeology. If you do one thing after reading this page, add tracing via OpenTelemetry. It is the pillar with the highest marginal return on an already-monitored stack.
AWS gives you a complete, first-party observability stack. Its strengths are deep integration (most AWS services emit to it automatically), no extra vendor relationship, and a single bill. Its weaknesses are a UX and cross-signal correlation that lag the best third-party tools, and a cost model that is cheap to start and easy to let sprawl. Here is what each piece does.
The load-bearing native services as of 2026 are CloudWatch (metrics and logs), X-Ray (traces), and Amazon Managed Prometheus and Managed Grafana. You will rarely use all of them: most startups run CloudWatch for metrics/logs plus X-Ray or an OTel-fed tracer, and reach for Managed Prometheus/Grafana specifically when they are running Kubernetes (EKS) and want the Prometheus ecosystem.
AWS-native is the right default when you are all-in on AWS, cost-sensitive, and your stack is mostly AWS managed services — because those services instrument themselves into CloudWatch for free and you avoid a second vendor and a second bill. It is also the natural choice for credit-eligible startups, since CloudWatch/X-Ray/AMP/AMG usage is AWS spend your credits can cover. The tradeoff you accept is weaker out-of-the-box correlation than Datadog-class tools — which good dashboards and OTel-linked signals largely close.
The single most important decision in an observability setup is not which backend you pick — it is how you instrument. Instrument with a vendor's proprietary agent and your telemetry is married to that vendor forever. Instrument with OpenTelemetry (OTel) and the same code emits to CloudWatch, X-Ray, Datadog, Grafana Cloud, or New Relic by changing a config, not your application.
OpenTelemetry is the vendor-neutral, CNCF-hosted standard for generating and shipping telemetry (metrics, logs, and traces) with one set of SDKs and a wire protocol (OTLP) that every serious backend now accepts. It has effectively won as the instrumentation layer; in 2026, building new services on proprietary-only agents is a self-inflicted lock-in.
Architecturally it has two halves. First, the SDKs / auto-instrumentation live in your application and produce the signals — many languages and frameworks get traces and metrics with little or no manual code via auto-instrumentation. Second, the OpenTelemetry Collector is a separate process (a sidecar, a DaemonSet on EKS, or a Lambda layer) that receives that data, processes it (batching, sampling, redaction, adding resource attributes), and exports it to one or more backends.
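The receive, process, export shape of the Collector can be modeled in a few lines of plain Python. This is a conceptual toy, not the Collector's configuration format or API: `run_pipeline`, `ListExporter`, the redacted keys, and the resource attributes are all hypothetical names chosen for illustration.

```python
def redact(record, sensitive_keys=("email", "card_number")):
    """Processor: mask sensitive values before they leave the process."""
    return {k: ("[REDACTED]" if k in sensitive_keys else v) for k, v in record.items()}


def add_resource(record, resource):
    """Processor: stamp each record with attributes describing its origin."""
    return {**resource, **record}


def batches(records, size):
    """Processor: group records so exporters send fewer, larger payloads."""
    for i in range(0, len(records), size):
        yield records[i:i + size]


class ListExporter:
    """Stand-in for a backend exporter; it just remembers what it was sent."""

    def __init__(self, name):
        self.name = name
        self.received = []

    def export(self, batch):
        self.received.append(list(batch))


def run_pipeline(records, resource, exporters, batch_size=2):
    processed = [add_resource(redact(r), resource) for r in records]
    for batch in batches(processed, batch_size):
        for exporter in exporters:  # fan out to every configured backend
            exporter.export(batch)


records = [
    {"message": "order placed", "email": "a@example.com"},
    {"message": "order placed", "email": "b@example.com"},
    {"message": "payment failed", "card_number": "4111-xxxx"},
]
primary = ListExporter("native-backend")
secondary = ListExporter("third-party-backend")
run_pipeline(records, {"service.name": "checkout"}, [primary, secondary])

print(len(primary.received), "batches sent to each exporter")
```

The detail worth noticing is the exporter list: adding a second backend is one more entry there, and nothing in the code that produced `records` changes.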
On AWS specifically, the AWS Distro for OpenTelemetry (ADOT) is Amazon's supported build of the Collector and SDKs, wired to land cleanly in CloudWatch, X-Ray, and AMP. So the native and "portable" paths are the same path: instrument with OTel/ADOT, point the Collector at CloudWatch/X-Ray today, and if you later adopt Datadog or Grafana Cloud you add an exporter in the Collector config — your application code never changes.
The strategic payoff is leverage. The reason teams stay on an overpriced observability vendor is that re-instrumenting hundreds of services to leave is brutal. With OTel that switching cost largely evaporates: the backend becomes a commodity you can shop, benchmark, and replace. It is the single highest-leverage thing you can do to keep observability costs honest over the life of the system.
Before adopting any observability tool, ask: "If we wanted to switch backends next year, would we have to re-instrument our code?" If the answer is yes, you are buying lock-in. Instrument with OpenTelemetry (via ADOT on AWS) and the answer becomes "no — we change a Collector exporter." That single decision is worth more over three years than which dashboard tool you pick on day one.
This is where most observability setups go wrong — not in the data, but in the alerting. The failure mode is alert fatigue: so many low-signal pages that engineers start ignoring them, and the one that mattered gets muted with the rest. The fix is to alert on user-facing symptoms tied to SLOs, not on every internal cause.
Start with dashboards organized around the four golden signals per service — latency, traffic, errors, and saturation. That is the SRE-standard top-level view: one screen per service that tells you in five seconds whether it is healthy. Build these before you build alerts, because the alert thresholds should come from what the dashboards show as normal versus abnormal.
Then define SLOs — Service Level Objectives — for the handful of user journeys that actually matter (checkout succeeds, the API responds under 300ms at p99, the page loads). An SLO is a target like "99.9% of checkout requests succeed over 28 days." The gap between that target and 100% is your error budget — the amount of failure you are explicitly allowed before it is a problem. This reframes alerting entirely: you do not page on every error, you page when you are burning the error budget too fast to make the target.
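The error-budget arithmetic is short enough to write out. The sketch below uses the 99.9% over 28 days target from the example above; the request volume and failure count are made-up numbers, and the function names are illustrative.

```python
def error_budget_requests(slo_target, total_requests):
    """Number of requests allowed to fail in the window without missing the SLO."""
    return (1.0 - slo_target) * total_requests


def error_budget_minutes(slo_target, window_days):
    """The same budget expressed as minutes of total outage in the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative means the SLO is blown).

    Assumes slo_target is below 1.0, so the budget is non-zero.
    """
    budget = error_budget_requests(slo_target, total_requests)
    return 1.0 - failed_requests / budget


# Hypothetical window: 2,000,000 checkout requests, 500 of which failed.
allowed_failures = error_budget_requests(0.999, 2_000_000)
allowed_downtime = error_budget_minutes(0.999, 28)
remaining = budget_remaining(0.999, 2_000_000, 500)

print(round(allowed_failures), round(allowed_downtime, 2), round(remaining, 2))
```

A 99.9% target over 28 days leaves roughly 40 minutes of full outage, or in this example about 2,000 failed requests. With 500 failures recorded, about three quarters of the budget is still available.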
That is the core trick to killing page fatigue: symptom-based, SLO-burn alerting. Page a human for things that mean users are being hurt right now (error-budget burn rate is high, checkout success is dropping). Route everything else — a disk filling slowly, a single node degraded, a cause that has not yet become a symptom — to a non-paging channel (a ticket, a Slack message) to look at in business hours. A useful rule of thumb: a 2am page must be both urgent and actionable; if it is not both, it is a notification, not an alert.
Implementation-wise, you can do SLO/burn-rate alerting natively with CloudWatch metric math and composite alarms (combine conditions so a flapping single metric does not page), with Managed Grafana alerting, or with the SLO features built into Datadog/Grafana Cloud/New Relic. The tool matters less than the discipline: few, meaningful, symptom-based alerts, each with a runbook link, each genuinely worth waking someone for.
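Whichever tool you use, the burn-rate logic underneath looks roughly like the sketch below. Burn rate is the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends the budget exactly over the full window. Requiring both a long and a short window to exceed the threshold plays the same role as a composite alarm: one flapping measurement does not page. The threshold of 10 and the window contents are placeholder assumptions, not recommended values.

```python
def burn_rate(errors, requests, slo_target):
    """Observed error rate as a multiple of the error rate the SLO allows."""
    if requests == 0:
        return 0.0
    return (errors / requests) / (1.0 - slo_target)


def route_alert(long_window, short_window, slo_target, page_threshold):
    """Return 'page', 'ticket', or 'ok' from (errors, requests) pairs per window."""
    long_burn = burn_rate(*long_window, slo_target)
    short_burn = burn_rate(*short_window, slo_target)
    # Page only when the budget is burning fast AND it is still happening now.
    if long_burn >= page_threshold and short_burn >= page_threshold:
        return "page"
    # Budget is draining faster than sustainable, but not urgently.
    if long_burn > 1.0:
        return "ticket"
    return "ok"


SLO = 0.999
fast_burn = route_alert((200, 10_000), (30, 1_000), SLO, page_threshold=10)
slow_burn = route_alert((20, 10_000), (0, 1_000), SLO, page_threshold=10)
healthy = route_alert((5, 10_000), (1, 1_000), SLO, page_threshold=10)

print(fast_burn, slow_burn, healthy)
```

The middle case is the one that cuts paging volume: the budget is being spent faster than planned, yet nothing is on fire right now, so it becomes a ticket for business hours instead of a page.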
Logs are the pillar that quietly bankrupts observability budgets. Metrics are cheap and bounded; traces are sampled; logs are charged largely by volume ingested and stored, and a chatty service at debug level in production can turn into a four- or five-figure monthly surprise that nobody decided to spend. Controlling it is a real and recurring discipline.
The cost driver to internalize: on CloudWatch (and on every third-party tool) you pay primarily for data ingested, then for storage and for querying. So the highest-leverage control is upstream — emit less low-value data — not downstream cleanup. The biggest offenders are almost always health-check and load-balancer access logs, verbose framework debug logs left on in production, and a few hot code paths logging on every request.
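Before cutting anything, it helps to measure which sources actually produce the bytes. The sketch below tallies ingested bytes per route from structured log lines. It assumes logs are JSON with a `route` field, which is an assumption of this example; the sample lines are fabricated purely to exercise the function.

```python
import json
from collections import Counter


def bytes_by_source(lines, key="route"):
    """Rank log sources by total UTF-8 bytes, largest first."""
    totals = Counter()
    for line in lines:
        record = json.loads(line)
        totals[record.get(key, "unknown")] += len(line.encode("utf-8"))
    return totals.most_common()


# Hypothetical sample: many tiny health-check lines, a few real errors.
health = json.dumps({"route": "/healthz", "level": "INFO", "message": "ok"})
checkout = json.dumps(
    {"route": "/checkout", "level": "ERROR", "message": "payment gateway timeout"}
)
lines = [health] * 500 + [checkout] * 3

ranking = bytes_by_source(lines)
print(ranking)
```

Each health-check line is small, but sheer repetition puts it at the top of the ranking, which is the usual pattern: the biggest offenders are rarely the logs anyone reads.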
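A practical, ordered playbook for getting log spend under control without going blind: set production log levels, sample health-check and access logs at the Collector, apply per-log-group retention, and tier cold logs to S3 + Athena.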
Logging in structured JSON with consistent fields is not just tidier — it is a cost lever. Structured logs let you filter, sample, and convert-to-metrics precisely (drop exactly the noisy field/route, count exactly the event you care about), instead of bluntly keeping or dropping whole free-text streams. Teams that adopt structured logging routinely cut ingestion meaningfully while improving what they can actually query.
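Here is a minimal sketch of "sample and convert to metrics" using a standard-library `logging.Filter`. It keeps one in every N records for noisy routes while still counting every record, so the count survives as a metric even though most lines are dropped. The `NoiseFilter` and `ListHandler` classes, the `route` attribute passed via `extra`, and the 1-in-10 rate are assumptions for illustration.

```python
import logging
from collections import Counter


class NoiseFilter(logging.Filter):
    """Count every record per route; keep only 1 in N for noisy routes."""

    def __init__(self, noisy_routes, keep_every_nth):
        super().__init__()
        self.noisy_routes = set(noisy_routes)
        self.keep_every_nth = keep_every_nth
        self.counts = Counter()  # route -> records seen (the derived metric)

    def filter(self, record):
        route = getattr(record, "route", None)
        self.counts[route] += 1
        if route in self.noisy_routes:
            # Keep the 1st, (N+1)th, (2N+1)th ... record for this route.
            return (self.counts[route] - 1) % self.keep_every_nth == 0
        return True


class ListHandler(logging.Handler):
    """Stand-in for a shipping handler; stores records that would be ingested."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


noise = NoiseFilter(noisy_routes=["/healthz"], keep_every_nth=10)
handler = ListHandler()
handler.addFilter(noise)

logger = logging.getLogger("api.sampling_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

for _ in range(100):
    logger.info("health check ok", extra={"route": "/healthz"})
for _ in range(2):
    logger.error("payment gateway timeout", extra={"route": "/checkout"})

print(len(handler.records), "records shipped;", noise.counts["/healthz"], "health checks counted")
```

The filter can only make this decision because the route is a discrete field on the record. With free-text lines the same policy would need fragile string matching, which is the cost-lever argument for structured logging in miniature.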
The recurring decision is CloudWatch-native versus a third-party platform (Datadog, Grafana Cloud, New Relic). There is no universally right answer — there is a right answer for your stage, stack, and budget. Here is the honest decision framework, before the side-by-side table below.
Choose AWS-native (CloudWatch + X-Ray + AMP/AMG) when you are cost-sensitive, your workload is mostly AWS managed services (which self-instrument into CloudWatch for free), and you would rather not add a vendor. It is the natural fit for credit-eligible startups, because the spend is AWS spend your credits can absorb. The cost: correlation and UX are good-not-great, and you will invest more effort to make dashboards and cross-signal pivots feel seamless.
Choose a third-party platform when correlation, breadth, and developer experience are worth a real second bill, typically once you have enough services and engineers that fast, unified root-causing across metrics/logs/traces directly saves expensive incident time. Datadog is the broad, best-in-class option (and the most likely to surprise you on cost at scale, which OTel plus careful sampling mitigates). Grafana Cloud is the managed open-source path: Prometheus/Loki/Tempo, OTel-native, and generally the friendliest on cost. New Relic's consumption-based pricing can be attractive for smaller teams. Crucially, because you instrument with OpenTelemetry, this choice is reversible: you can start native and graduate to a platform (or back) without re-instrumenting.
The pragmatic path most CloudRoute-routed startups take: instrument with OpenTelemetry/ADOT from day one, run AWS-native while small and credit-funded, and revisit a third-party platform when scale makes the correlation worth paying for. Because the instrumentation is portable, that is a config change later, not a re-platforming — which is exactly why the OTel decision in section IV matters more than the backend you start on.
Knowing the three pillars and the tool landscape is the easy part. Instrumenting a real codebase with OpenTelemetry, wiring the Collector, building golden-signal dashboards, defining SLOs, tuning symptom-based alerts, and getting log costs under control is a meaningful chunk of senior platform-engineering work — the kind most startups cannot spare a person for. That is the gap CloudRoute fills.
CloudRoute routes you to a vetted AWS partner who does the work end-to-end: instruments your services with OpenTelemetry/ADOT, stands up the backend (CloudWatch/X-Ray/AMP/AMG natively, or a third-party platform if that is the right call), builds the dashboards and the SLOs, tunes the alerting so it pages a human only when users are actually affected, and sets up log-cost controls before ingestion sprawls. You get a real observability capability — not just CloudWatch switched on — without hiring a dedicated SRE or platform engineer.
The economics are the part founders do not expect. For credit-eligible companies, a substantial share of this engagement is frequently AWS-funded: the partner is paid through AWS partner-funding programs, and your AWS consumption during the build (CloudWatch, X-Ray, AMP, AMG) is exactly the kind of spend your Activate credits cover, so the customer pays $0 or a low fee. For companies that are not credit-eligible, it is a vetted-partner referral that skips the hire-and-vet slog: a proven observability specialist without three months of recruiting. We are deliberately honest about which bucket you are in. AWS-funded applies to credit-eligible engagements; otherwise it is a straightforward, high-quality referral.
If observability is on your roadmap because incidents take too long to diagnose, an SLO target is now in a customer contract, or your CloudWatch bill is climbing for reasons nobody can name, the fastest path is to let a partner who has instrumented dozens of stacks set it up correctly — rather than discover, mid-incident, that the trace you needed was never being collected.
Observability work (CloudWatch, X-Ray, Managed Prometheus and Grafana) is exactly the kind of AWS-native spend that Activate credits and partner funding are built to cover. If you have not claimed your credits yet, start there (see "$100K AWS credits and the startup path"), then have the partner put that funding toward instrumenting your stack so the next incident is two minutes of reading traces instead of an hour of grepping logs.
Three representative ways to do observability on AWS, compared on the dimensions that actually drive the decision. Remember the OTel point: instrument once, and which column you pick becomes a backend choice you can revisit — not a one-way door.
| Dimension | CloudWatch-native (+ X-Ray) | Grafana/Prometheus (AMP + AMG / Grafana Cloud) | Datadog |
|---|---|---|---|
| Setup friction on AWS | Lowest — AWS services self-instrument into it | Moderate — scrape config + dashboards (managed AMP/AMG removes ops) | Low — strong agents + auto-instrumentation |
| Metrics | CloudWatch Metrics | Prometheus / PromQL (the ecosystem standard) | Best-in-class, high-cardinality |
| Logs | CloudWatch Logs + Logs Insights | Loki (Grafana Cloud) or external | Best-in-class, deeply correlated |
| Traces | X-Ray (now OTel-compatible) | Tempo (Grafana Cloud) / OTel | Best-in-class APM + trace correlation |
| Cross-signal correlation / UX | Good, not great — improves with AMG | Very good — unified Grafana pane | Excellent — the benchmark |
| Cost model | Per ingest/storage/query; cheap to start, can sprawl | Generally friendliest, esp. self-hosted/OSS roots | Powerful but the most likely to surprise at scale |
| Lock-in (with OpenTelemetry) | Low — ADOT lands native + portable | Low — OTel-native, OSS underneath | Low if OTel-instrumented; high if agent-only |
| Best for | AWS-all-in, cost-sensitive, credit-funded startups | Kubernetes/EKS teams, OSS-leaning, cost-conscious | Scale-ups where unified correlation is worth a real bill |
Situation: Metrics and logs existed (CloudWatch was on), but there was no distributed tracing, so every cross-service latency issue turned into hours of correlating logs by hand across services. A new enterprise contract added a p99-latency SLO they had no way to measure or alert on. Separately, the CloudWatch Logs bill had roughly tripled in two quarters and nobody could say why. The lone platform engineer was fully allocated to shipping product.
What CloudRoute did: Routed the company within a day to a US-East partner with an EKS and observability track record. The partner instrumented the services with OpenTelemetry via ADOT (DaemonSet Collector on EKS), sent traces to X-Ray and metrics to Amazon Managed Prometheus, and built golden-signal dashboards plus the contractual SLO in Amazon Managed Grafana with burn-rate alerting. They moved paging to symptom-based SLO alerts (routing cause-level signals to Slack), and ran a log-cost pass: set production log levels, sampled health-check and access logs at the Collector, applied per-log-group retention, and tiered cold logs to S3 + Athena.
Outcome: Cross-service latency issues went from a multi-hour log hunt to a two-minute trace filter. The p99 SLO was measured and alerting on error-budget burn within the build. Paging volume dropped sharply once cause-level noise stopped paging the on-call. CloudWatch Logs spend fell roughly 60% from the sampling, retention, and S3 tiering — without losing anything the team actually queried. The engagement was credit-eligible, so AWS funding covered the partner work and the AWS spend during the build — customer paid $0.
engagement window: 4 weeks · founder time: ~6 hours · log spend cut: ~60% · root-cause time: hours → minutes · cost to customer: $0
CloudRoute routes you to a vetted AWS partner who instruments with OpenTelemetry, builds the dashboards and SLOs, tunes alerting that does not page-fatigue, and controls log costs. Credit-eligible? Often AWS-funded — customer pays $0.