

“The Site Is Down”

Three words that strike fear into any on-call engineer at 2 AM. But what do they mean, exactly?

Is the site returning HTTP 500 errors? Is it timing out? Is it returning 200s with empty bodies? Is it working for users in Europe but not in Asia? Is the database slow, or the application, or the CDN? Is it affecting all users or just paid customers? And - most uncomfortably - how long was it broken before your monitoring noticed?

This is the observability problem. If your system cannot answer these questions in minutes from the data it produces, you do not have a monitoring problem. You have a design problem. Monitoring tells you when something is wrong. Observability tells you why.

The distinction, popularized by Charity Majors and the team at Honeycomb, is not semantic. A system that exposes only aggregate metrics can tell you that request latency spiked at 2:15 AM. It cannot tell you that the spike was caused by a specific user’s query pattern hitting an unindexed code path in the orders service, affecting 0.3% of users but 100% of your highest-value accounts. Knowing the difference between those two is the gap between a five-minute fix and a three-hour incident.

The History: From Ping to Philosophy

Early infrastructure monitoring was binary: is the machine up or down? You pinged a host, checked if port 80 responded, and paged someone if it didn’t. Tools like Nagios embodied this model.

The next generation added metrics. Graphite (2006), then Prometheus (2012), made it possible to collect time-series data at scale and graph it. You could now ask “how is the system behaving?” rather than just “is it up?”

But distributed systems broke the metrics model. When you have fifty microservices calling each other, a latency spike in the user-facing API might be caused by any one of them - or by an emergent behavior in their interaction. Metrics tell you each service’s throughput; they don’t tell you which call chain caused the slowdown.

Distributed tracing solved this: a trace ID propagates through every service a request touches, allowing you to reconstruct the entire call graph for any request. But traces, metrics, and logs evolved as separate systems with separate tools. The field is still consolidating around OpenTelemetry as the unified standard.

The Three Pillars

Metrics, logs, and traces answer different questions. You need all three because no single signal is sufficient.

Metrics: The Numbers Over Time

Metrics are numeric measurements sampled at regular intervals. They are efficient to store (just numbers), efficient to query, and ideal for alerting. Their weakness is cardinality: to ask a question that wasn’t anticipated when the metric was defined, you need a new label - and every new label dimension multiplies the number of time series.

The three fundamental types:

  • Counter: monotonically increasing, reset on restart. Total requests served, total errors, total bytes written. You take the rate to get a useful signal.
  • Gauge: a value that goes up and down. Current memory usage, active connections, queue depth.
  • Histogram: records the distribution of a value across configurable buckets. Essential for latency - you cannot compute a meaningful p99 from an average.
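
To make the histogram point concrete, here is a minimal pure-Python sketch of cumulative buckets and a bucket-based quantile estimate. It is similar in spirit to what a histogram query does, not Prometheus's actual implementation; the bucket bounds and the names `observe_all` and `estimate_quantile` are assumptions chosen for illustration.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in seconds; real services tune these.
BUCKET_BOUNDS = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5, float("inf")]


def observe_all(durations):
    """Return cumulative bucket counts, one per bound (inclusive upper bounds)."""
    counts = [0] * len(BUCKET_BOUNDS)
    for d in durations:
        counts[bisect_left(BUCKET_BOUNDS, d)] += 1
    # Make the counts cumulative: each bucket includes all smaller observations.
    for i in range(1, len(counts)):
        counts[i] += counts[i - 1]
    return counts


def estimate_quantile(q, cumulative_counts):
    """Estimate quantile q by linear interpolation inside the matching bucket."""
    rank = q * cumulative_counts[-1]
    for i, count in enumerate(cumulative_counts):
        if count >= rank:
            upper = BUCKET_BOUNDS[i]
            if upper == float("inf"):
                # Cannot interpolate into the open-ended bucket.
                return BUCKET_BOUNDS[i - 1]
            lower = BUCKET_BOUNDS[i - 1] if i > 0 else 0.0
            below = cumulative_counts[i - 1] if i > 0 else 0
            in_bucket = count - below
            if in_bucket == 0:
                return lower
            return lower + (upper - lower) * (rank - below) / in_bucket
    return BUCKET_BOUNDS[-2]


# 98 fast requests and 2 very slow ones.
durations = [0.04] * 98 + [2.0] * 2
mean = sum(durations) / len(durations)
p99 = estimate_quantile(0.99, observe_all(durations))
```

With these sample values the mean is about 0.08 s while the estimated p99 is 1.75 s: the average hides the slow tail entirely, and the histogram recovers it only to the precision of its bucket boundaries.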

Prometheus is the de facto standard. It uses a pull model: Prometheus scrapes an HTTP /metrics endpoint on each service at a configured interval. The pull model has a useful property - a failed scrape is itself a signal. When a target goes down, Prometheus records up = 0 for it at the next scrape, rather than silently receiving no data.
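
As a rough illustration of what a scrape target looks like, the sketch below renders counters in the Prometheus text exposition format and serves them on /metrics using only the standard library. The `REQUEST_COUNTS` dictionary, `render_metrics`, and `MetricsHandler` are hypothetical names; a real service would normally use an official Prometheus client library instead.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical in-process counters keyed by (method, status).
REQUEST_COUNTS = {("GET", "200"): 1027, ("GET", "500"): 3}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests served.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counts.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves GET /metrics so a Prometheus server can scrape it."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, `HTTPServer(("", 8000), MetricsHandler).serve_forever()` from `http.server` would expose the endpoint on port 8000 (the port is an arbitrary choice here).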

PromQL is Prometheus’s query language. It’s functional, composable, and powerful once you internalize the mental model:

# Request rate over last 5 minutes
rate(http_requests_total[5m])

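# p99 latency, aggregated across instances (keep the le label so the buckets survive)
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Error ratio (aggregate both sides first; otherwise the division only matches each 5xx series with itself)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))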

Google’s SRE book defines the Four Golden Signals as the minimum meaningful instrumentation for any service: latency (how long requests take), traffic (how many requests per second), errors (what fraction fail), and saturation (how close to capacity the system is). If you can only instrument four things, instrument these.

The Cardinality Problem

Here is the failure mode that has brought down production Prometheus clusters at multiple large companies.

Labels are what make Prometheus metrics useful - they let you slice http_requests_total by endpoint, status_code, region, user_tier. But each unique combination of label values creates a separate time series. If you have 100 endpoints × 10 status codes × 3 regions × 5 tiers, you have 15,000 time series for one metric. That’s fine.

Now someone adds user_id as a label. You have 10 million users. You now have 10 million × 100 × 10 × 3 × 5 = 150 billion time series. Prometheus runs out of memory. The metrics system is down. You have no visibility during an incident because your observability layer caused the incident.
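
The arithmetic is easy to check with a few lines of Python. This sketch computes the worst-case series count as the product of label cardinalities; the function name `max_series` is an assumption, and the result is an upper bound because not every label combination necessarily occurs in practice.

```python
from math import prod


def max_series(label_cardinalities):
    """Upper bound on time series for one metric: the product of label cardinalities."""
    return prod(label_cardinalities.values())


labels = {"endpoint": 100, "status_code": 10, "region": 3, "user_tier": 5}
before = max_series(labels)

# Adding a single high-cardinality label multiplies everything that came before.
after = max_series({**labels, "user_id": 10_000_000})

print(f"{before:,} series before, {after:,} after adding user_id")
```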

High-cardinality dimensions (user IDs, session IDs, trace IDs, request IDs) do not belong in metric labels. They belong in logs and traces, which are designed to store per-event data. Metrics are for aggregate signals. This is not a limitation to work around - it is the correct division of responsibility.

If you simply need to hold more time series than a single Prometheus server handles comfortably, VictoriaMetrics and Thanos scale storage and querying beyond vanilla Prometheus, but they do not change the fundamental constraint.

Logs: The Narrative

Logs are the narrative of what happened. They capture context that metrics cannot: the specific request payload that triggered an error, the user ID, the stack trace, the SQL query, the downstream response.

Unstructured logs - free-form text lines - are nearly useless at scale. They’re hard to parse programmatically, impossible to aggregate reliably, and expensive to search. Structured logs (JSON, with consistent field names) are indexable, filterable, and joinable.

Every log line should carry at minimum: timestamp (with millisecond precision and timezone), log level, service name, version, and a correlation ID tied to the originating request. The correlation ID is the thread that lets you reconstruct what happened across services:

{
  "timestamp": "2024-09-15T14:23:01.342Z",
  "level": "ERROR",
  "service": "orders-service",
  "version": "v2.4.1",
  "trace_id": "abc123def456",
  "user_id": "u_9k2m4",
  "message": "Payment gateway timeout",
  "duration_ms": 5032,
  "gateway": "stripe"
}

When an error surfaces in the payment service, you query all services for trace_id = "abc123def456" and see the full causal chain: the user request that came in, the order service that processed it, the payment gateway call that timed out, and the retry that succeeded - or didn’t.
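
One way to produce log lines of this shape in Python is a small custom formatter on top of the standard `logging` module. This is a minimal sketch, not a production logging setup: the `JsonFormatter` class and the `CONTEXT_FIELDS` tuple are hypothetical names, and the timestamp is emitted with a `+00:00` offset rather than the `Z` suffix.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical per-request fields that callers may pass via `extra=`.
CONTEXT_FIELDS = ("trace_id", "user_id", "duration_ms", "gateway")


class JsonFormatter(logging.Formatter):
    """Formats each record as one JSON object with consistent field names."""

    def __init__(self, service, version):
        super().__init__()
        self.service = service
        self.version = version

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "service": self.service,
            "version": self.version,
            "message": record.getMessage(),
        }
        # Copy only the known context fields that were attached to this record.
        for field in CONTEXT_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="orders-service", version="v2.4.1"))
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.propagate = False

logger.error(
    "Payment gateway timeout",
    extra={"trace_id": "abc123def456", "duration_ms": 5032, "gateway": "stripe"},
)
```

Passing the correlation ID through `extra=` on every call is error-prone at scale; in practice teams often attach it once per request (for example with a logging filter or context variable), which this sketch leaves out.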

The ELK stack (Elasticsearch, Logstash, Kibana) and its variant EFK (with Fluentd in place of Logstash) are the dominant self-hosted log aggregation choices. AWS CloudWatch Logs and GCP Cloud Logging are the managed alternatives. Grafana Loki indexes only log labels (not the full text), making it far cheaper than Elasticsearch for workloads where you know what you’re filtering on.

The log cost problem is real. AWS CloudWatch Logs pricing has surprised many teams at scale: ingestion, storage, and query costs compound quickly. A service emitting 10 KB/request at 10,000 requests/second generates 100 MB/s, or roughly 8.6 TB/day of logs. At CloudWatch pricing, that is several thousand dollars per day before query costs. Log sampling, filtering at the source (drop debug logs in production), and tiered storage (short retention in CloudWatch, long retention in S3 + Athena) are necessary at that scale.
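
A back-of-the-envelope estimate like this is worth scripting so you can plug in your own service's numbers. The sketch below is illustrative only: the function names are assumptions, the sample traffic figures are made up, and `price_per_gb` is a placeholder you would replace with your provider's current ingestion price.

```python
SECONDS_PER_DAY = 86_400


def daily_log_volume_gb(bytes_per_request, requests_per_second):
    """Daily log volume in decimal gigabytes (1 GB = 10**9 bytes)."""
    return bytes_per_request * requests_per_second * SECONDS_PER_DAY / 1e9


def daily_ingestion_cost(volume_gb, price_per_gb):
    """Ingestion cost only; storage and query costs come on top."""
    return volume_gb * price_per_gb


# Illustrative service: 2 KB of logs per request at 500 requests/second.
volume = daily_log_volume_gb(bytes_per_request=2_000, requests_per_second=500)

# price_per_gb is a hypothetical placeholder, not a quoted price.
cost = daily_ingestion_cost(volume, price_per_gb=0.50)
```

For the illustrative service this works out to 86.4 GB/day; the same two functions make it easy to see how the bill moves when you sample or drop a log level at the source.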

Distributed Tracing: The Call Graph

In a microservices architecture, understanding latency requires seeing the entire call graph, not individual service metrics. Distributed tracing is the mechanism for this.

A trace is a tree of spans. Each span represents one unit of work: handling an HTTP request, executing a database query, publishing a message to a queue. Every span has a start time, duration, service name, and arbitrary metadata. Spans are linked by two IDs: the trace ID (shared across all spans in a single request) and the parent span ID (linking child work to the span that initiated it).
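
The span model described above can be sketched as a small data structure. The `Span` fields mirror the ones named in the paragraph; the class itself and the `render_trace` helper are hypothetical, not any tracing library's API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None marks the root span
    name: str
    start_ms: float
    duration_ms: float


def render_trace(spans):
    """Rebuild the span tree from parent links and return indented text lines."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_span_id].append(span)

    lines = []

    def walk(span, depth):
        indent = "  " * depth
        lines.append(
            f"{indent}{span.name} start={span.start_ms:g}ms duration={span.duration_ms:g}ms"
        )
        for child in sorted(children[span.span_id], key=lambda s: s.start_ms):
            walk(child, depth + 1)

    for root in children[None]:
        walk(root, 0)
    return lines


spans = [
    Span("t1", "c", "b", "SELECT orders", 10, 80),
    Span("t1", "a", None, "GET /checkout", 0, 120),
    Span("t1", "b", "a", "orders-service", 5, 100),
]
print("\n".join(render_trace(spans)))
```

Spans can arrive in any order, which is why the tree is rebuilt from the parent IDs rather than from arrival order.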

The trace ID propagates through request headers. When service A calls service B, it adds traceparent: 00-{trace_id}-{span_id}-01 to the HTTP headers. Service B reads this header, creates a child span, and does the same when it calls service C. The result is a tree you can render as a Gantt chart showing exactly where time was spent.
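
The propagation step can be sketched in a few lines. This is a simplified illustration of the traceparent layout (2 hex digits of version, 32 of trace ID, 16 of parent span ID, 2 of flags), not a complete implementation of the specification; the function names are assumptions, and in practice an OpenTelemetry SDK does this for you.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return the trace context from a traceparent header, or None if invalid."""
    match = TRACEPARENT_RE.match(header.strip().lower())
    if not match:
        return None
    _version, trace_id, parent_span_id, flags = match.groups()
    # All-zero IDs are not valid identifiers.
    if trace_id == "0" * 32 or parent_span_id == "0" * 16:
        return None
    return {"trace_id": trace_id, "parent_span_id": parent_span_id, "flags": flags}


def start_child_span(incoming_header):
    """Create a span that continues the caller's trace, or starts a new one."""
    context = parse_traceparent(incoming_header) if incoming_header else None
    if context is None:
        # No usable context: this service becomes the root of a new trace.
        context = {"trace_id": secrets.token_hex(16), "parent_span_id": None, "flags": "01"}
    return {**context, "span_id": secrets.token_hex(8)}


def outgoing_traceparent(span):
    """Header to attach to downstream calls; our span becomes their parent."""
    return f"00-{span['trace_id']}-{span['span_id']}-{span['flags']}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
span = start_child_span(incoming)
downstream_header = outgoing_traceparent(span)
```

The trace ID is copied unchanged while the span ID is regenerated at every hop, which is what lets the backend stitch the hops into one tree.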

OpenTelemetry is the open standard that defines the APIs, SDKs, and wire format for traces, metrics, and logs. It is vendor-neutral: you instrument your code once with the OpenTelemetry SDK and send data to any compatible backend - Jaeger, Tempo, Zipkin, Honeycomb, AWS X-Ray, Datadog. This is the correct choice for any new system. Vendor-specific SDKs create lock-in without offering meaningful advantages.

Auto-instrumentation is remarkable: the OpenTelemetry Python agent, attached at startup, automatically instruments FastAPI, Django, SQLAlchemy, httpx, boto3, and dozens of other libraries without code changes. You get traces for free for most of the call surface.

AWS X-Ray is the AWS-native distributed tracing system. It integrates with API Gateway, Lambda, ECS, and EKS without SDK changes for entry points. The sampling rules are configurable per route. Its weakness is that it’s AWS-only - the trace context doesn’t propagate cleanly to external services unless you also instrument them with the X-Ray SDK.

SLOs and Error Budgets: Making Reliability a Business Decision

Reliability without a target is religion - everyone agrees it matters, nobody agrees how much. SLOs make it concrete.

  • SLI (Service Level Indicator): a measured quantity. “The fraction of requests that complete successfully in under 300ms, measured over a one-minute window.”
  • SLO (Service Level Objective): the target. “The SLI must be ≥ 99.9% over any 28-day rolling window.”
  • SLA (Service Level Agreement): the contract with customers, usually set lower than the SLO to create a buffer. Breaching an SLA has financial or legal consequences.
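
The SLI and SLO definitions above translate directly into code. In this sketch, "completes successfully" is assumed to mean a status code below 500, and an empty window is assumed to count as meeting the objective; both are choices you would need to make explicitly for a real service. The names `Request`, `compute_sli`, and `meets_slo` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int
    duration_ms: float


def is_good(request, threshold_ms=300.0):
    """A good event: not a server error, and faster than the latency threshold."""
    return request.status < 500 and request.duration_ms < threshold_ms


def compute_sli(requests, threshold_ms=300.0):
    """Fraction of good requests in the window (1.0 for an empty window, by assumption)."""
    if not requests:
        return 1.0
    good = sum(is_good(r, threshold_ms) for r in requests)
    return good / len(requests)


def meets_slo(sli, target=0.999):
    return sli >= target


window = [
    Request(200, 120),  # good
    Request(200, 450),  # too slow
    Request(500, 80),   # server error
    Request(201, 299),  # good
]
sli = compute_sli(window)
```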

The error budget is the key insight from Google’s SRE practice. If your SLO is 99.9%, you have 0.1% of requests that can fail - about 43 minutes of complete downtime per month, or proportionally more partial degradation. This budget is real and finite.
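
The budget arithmetic is simple enough to sketch. The function names below are assumptions; the first converts an SLO into minutes of allowed full downtime for a window, and the second reports how much of a request-based budget is left.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of complete downtime the SLO allows over the window."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


monthly = error_budget_minutes(0.999)                   # roughly 43.2 minutes
four_weeks = error_budget_minutes(0.999, window_days=28)  # roughly 40.3 minutes
left = budget_remaining(0.999, total_requests=1_000_000, failed_requests=400)
```

In the last line, 1,000,000 requests at 99.9% allow about 1,000 failures, so 400 failures leave roughly 60% of the budget.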

When the budget is healthy, the team can take risks: deploy new features, try experimental optimizations, run chaos experiments. When the budget is exhausted, reliability work takes priority over feature work. The error budget converts abstract reliability debates into concrete tradeoffs: “we can ship this risky refactor, but it costs X minutes of error budget.”

Burn rate alerts are the operationally sound way to alert on SLOs rather than on individual metrics. Burn rate is the observed error ratio divided by the error ratio the SLO allows, so a burn rate of 1 means you’re consuming the budget at exactly the rate that would exhaust it at the end of the window. An alert at burn rate 14.4 over 1 hour means: at this rate, you’ll exhaust a 30-day budget in about two days. An alert at burn rate 6 over 6 hours catches slower, longer degradations. This multi-window structure is from the Google SRE Workbook, which pairs these two paging alerts with a lower-urgency ticket at burn rate 1 over 3 days; together they cover the spectrum of failure modes while keeping false positives low.
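
A short sketch of the burn rate arithmetic, assuming a 30-day SLO window; the function names are hypothetical and the sample error ratio is made up for illustration.

```python
def burn_rate(error_ratio, slo):
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1 - slo)


def days_to_exhaustion(rate, window_days=30):
    """How long the whole budget lasts if the current burn rate continues."""
    return window_days / rate


def budget_consumed(rate, alert_window_hours, window_days=30):
    """Fraction of the total budget spent during the alert window."""
    return rate * alert_window_hours / (window_days * 24)


# Illustrative: 1.44% of requests failing against a 99.9% SLO.
rate = burn_rate(error_ratio=0.0144, slo=0.999)
days_left = days_to_exhaustion(rate)
spent_in_one_hour = budget_consumed(rate, alert_window_hours=1)
```

Here the burn rate comes out at about 14.4, the budget would last about two days, and a single hour at that rate consumes about 2% of the month's budget, which is why a fast-burn alert is worth a page.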

Alerting and the Fatigue Problem

Alert fatigue is the number one operational reliability problem at scale. It is more dangerous than gaps in metric coverage.

When PagerDuty fires 50 times a week, engineers learn to check the alert, decide it’s probably not real, acknowledge it, and go back to sleep. When the real incident arrives - the one that costs you users and money - it is treated like the other 49. It gets acknowledged and ignored.

The symptom-based alerting principle: alert on user-visible impact, not internal implementation details. “p99 latency > 1 second for 5 minutes” is a symptom. “CPU on web-server-3 above 80%” is a cause. Alert on symptoms; investigate causes. Internal resource metrics belong on dashboards for investigation, not in your paging rotation.

Every alert that fires should require a human decision. If the correct response to an alert is “wait and see,” the alert should not page. If the correct response is “run this script,” automate the script and remove the alert. If the alert fires and the engineer consistently decides no action is needed, delete the alert.

Quarterly alert reviews - examining every alert that fired over the past 90 days and deciding whether it should exist, be tuned, or be automated - are active reliability work, not administrative overhead.

Cloud Observability Ecosystems

AWS provides CloudWatch for metrics, logs, and alarms; X-Ray for distributed tracing; and Container Insights for ECS and EKS. AWS DevOps Guru uses ML to surface anomalies in CloudWatch metrics automatically, which is useful for catching patterns that no human thought to write an alert for. The weakness is resolution: standard CloudWatch metrics are stored at 1-minute granularity (EC2 basic monitoring reports only every 5 minutes), and high-resolution custom metrics, down to 1 second, cost extra. One-minute data is too coarse for fast-moving incidents.

Datadog is the dominant commercial observability platform. It unifies metrics, logs, traces, and synthetics in a single product. Its agent-based approach means it can instrument hosts that don’t expose Prometheus endpoints. The cost scales with host count and log volume, which makes it expensive at scale - a common complaint from teams that adopted it early and grew into large bills.

Grafana Stack (Grafana + Prometheus + Loki + Tempo) is the dominant open-source observability stack. It can run on-prem or in Grafana Cloud. The advantage is cost control; the disadvantage is operational burden for self-hosted deployments.

OpenTelemetry is the strategic bet that matters most: by instrumenting with OTel, you can switch backends without reinstrumenting. Teams that instrumented directly with Datadog’s proprietary SDK in 2020 face a significant migration cost if they want to switch.

Observability Is a System Property, Not a Tool

This is the uncomfortable truth: you cannot make an unobservable system observable by throwing tools at it.

If your application handles errors by returning HTTP 200 with an error object in the body, your HTTP metrics will show 100% success while users experience failures. If your database queries are dynamically constructed strings without parameterization, every statement in the slow query log looks unique, and you cannot group them to see which query shape is slow. If your microservices call each other without propagating trace context, your traces are disconnected fragments.
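
The first failure mode is easy to demonstrate. In this sketch the response dictionaries and both function names are hypothetical; the point is only that a status-code-based metric and the user's actual experience diverge when errors are tunneled through HTTP 200.

```python
def http_success_ratio(responses):
    """Success as a status-code-based metric sees it: anything below 500."""
    return sum(r["status"] < 500 for r in responses) / len(responses)


def actual_success_ratio(responses):
    """Success as the user sees it: no server error and no error object in the body."""
    ok = sum(r["status"] < 500 and "error" not in r["body"] for r in responses)
    return ok / len(responses)


responses = [
    {"status": 200, "body": {"order_id": 1}},
    {"status": 200, "body": {"error": "payment declined upstream"}},
    {"status": 200, "body": {"order_id": 2}},
    {"status": 200, "body": {"error": "inventory lookup failed"}},
]

seen_by_metrics = http_success_ratio(responses)
seen_by_users = actual_success_ratio(responses)
```

The metric reports 100% success while half of these sample requests failed from the user's point of view. Returning an appropriate 4xx or 5xx status is what lets the existing HTTP metrics and alerts pick the failures up.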

Observability requires design decisions made when writing the code: structured logging from the start, meaningful error types that propagate correctly, trace context in every outbound call, metrics that capture what the business actually cares about. Retrofitting observability onto a system that was designed without it is possible but expensive.

The best time to instrument a system is when it’s being written. The second best time is now.

Future: Profiling and ML-Based Detection

Continuous profiling is the emerging fourth pillar of observability. Where traces show which services are slow, profiling shows which lines of code are consuming CPU or memory. Tools like Pyroscope and Parca collect profiling data continuously in production (at low overhead) and allow you to correlate CPU spikes with specific code paths. This closes the loop between metrics (something is slow) and code (this function is why).

ML-based anomaly detection - as in AWS DevOps Guru and Datadog’s Watchdog - attempts to learn normal behavior and surface deviations without requiring humans to write specific alert thresholds. This is promising for detecting unknown failure modes but introduces a new challenge: false positive management. An algorithm that pages you for “anomalies” it doesn’t understand produces its own form of alert fatigue.

The field is moving toward unified observability: a single query language that joins metrics, logs, and traces together, so you can ask “show me all traces slower than 500ms, filtered by users in the EU, correlated with the log messages that appeared during that window.” Honeycomb pioneered this model; OpenTelemetry’s unified data model makes it possible for other backends to follow.

Summary

| Signal  | Answers                         | Tool                       | Failure Mode                                          |
|---------|---------------------------------|----------------------------|-------------------------------------------------------|
| Metrics | How much? How fast?             | Prometheus, CloudWatch     | Cardinality explosion from high-cardinality labels    |
| Logs    | What exactly happened?          | Loki, CloudWatch Logs, ELK | Cost at scale; unstructured logs are unsearchable     |
| Traces  | Where did the time go?          | Jaeger, X-Ray, Tempo       | Missing context propagation breaks traces             |
| SLOs    | Are we reliable enough?         | Any (it’s math)            | Poorly chosen SLIs that don’t reflect user experience |
| Alerts  | What needs human attention now? | PagerDuty, Alertmanager    | Alert fatigue - too many pages, all ignored           |
