Overview

Deploying software is not the end of the story - it is the beginning of running it. Production systems fail in ways that are impossible to predict in testing: traffic spikes, dependency outages, memory leaks that take days to surface, cascading failures across microservices. Without visibility into what a running system is doing, engineers are flying blind.

Observability is the ability to understand the internal state of a system from its external outputs. A system is observable if, when something goes wrong, you can ask arbitrary questions about its behavior and get answers - not just the questions you thought to ask in advance when you set up your dashboards.

The three pillars that make a system observable are metrics, logs, and distributed traces. You need all three because they answer different questions.

Metrics

Metrics are numeric measurements sampled over time. They answer questions like “how many requests per second is this service handling?” and “what is the 99th percentile response time?”.

There are three fundamental metric types (a short code sketch follows the list):

  • Counter: a monotonically increasing number, reset on restart - tracks things like total requests served or total errors
  • Gauge: a value that goes up and down - tracks things like current memory usage or active connections
  • Histogram: records the distribution of a value (e.g., request latency) across configurable buckets - allows computing percentiles
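
As a concrete sketch, here is what the three types look like with the official prometheus_client Python library (metric names, labels, and buckets are illustrative, not prescribed):

    from prometheus_client import Counter, Gauge, Histogram, start_http_server

    # Counter: only ever increases; resets to zero when the process restarts
    REQUESTS = Counter("http_requests_total", "Total HTTP requests served",
                       ["method", "endpoint", "status_code"])

    # Gauge: moves up and down with the current state of the system
    ACTIVE_CONNECTIONS = Gauge("active_connections", "Currently open connections")

    # Histogram: counts observations into buckets so percentiles can be derived later
    LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds",
                        buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5))

    start_http_server(8000)  # serves the /metrics endpoint Prometheus will scrape

    # Inside a request handler:
    REQUESTS.labels(method="GET", endpoint="/users", status_code="200").inc()
    ACTIVE_CONNECTIONS.inc()
    with LATENCY.time():     # records the elapsed time into the histogram buckets
        pass                 # ... handle the request here ...
    ACTIVE_CONNECTIONS.dec()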

Prometheus

Prometheus is the de facto standard metrics system for cloud-native applications. It uses a pull-based model: Prometheus scrapes an HTTP /metrics endpoint exposed by each service at a configured interval. This makes it easy to see when a scrape target goes down and avoids the need for each service to know where to push metrics.

Metrics are queried using PromQL (Prometheus Query Language), a functional language that supports filtering by labels, rate calculations, aggregations, and joins across metric series. Labels are key-value pairs attached to metrics - they allow you to slice a single metric by service, endpoint, status_code, region, and any other dimension that matters.
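
As a sketch of what this looks like in practice - assuming the http_requests_total counter from the earlier snippet and a Prometheus server on its default port - PromQL can also be run programmatically through the HTTP API:

    import requests

    # rate() converts a counter into per-second throughput over a 5-minute window;
    # the label matcher filters the series, and sum by () aggregates across the rest
    query = 'sum by (endpoint) (rate(http_requests_total{status_code="200"}[5m]))'

    resp = requests.get("http://localhost:9090/api/v1/query", params={"query": query})
    for series in resp.json()["data"]["result"]:
        print(series["metric"]["endpoint"], series["value"])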

The Four Golden Signals

Google’s Site Reliability Engineering book identifies four signals that matter most for any service:

  1. Latency: the time it takes to serve a request - distinguish between successful and failed requests
  2. Traffic: the demand on the system (requests per second, queries per second)
  3. Errors: the rate of failing requests - both explicit (HTTP 5xx) and implicit (HTTP 200 with wrong content)
  4. Saturation: how “full” the system is - CPU utilization, memory pressure, queue depth

If you can only instrument four things, instrument these four.
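
As a sketch, the four signals map directly onto PromQL expressions over the metrics defined earlier (node_cpu_seconds_total is the standard node_exporter CPU metric; all other names are illustrative):

    GOLDEN_SIGNALS = {
        # Latency: p99 derived from histogram buckets
        "latency": 'histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))',
        # Traffic: overall request throughput
        "traffic": 'sum(rate(http_requests_total[5m]))',
        # Errors: fraction of requests returning HTTP 5xx
        "errors": 'sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))',
        # Saturation: mean CPU utilization across nodes
        "saturation": 'avg(1 - rate(node_cpu_seconds_total{mode="idle"}[5m]))',
    }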

SLIs, SLOs, and SLAs

Reliability targets must be specific and measurable.

  • SLI (Service Level Indicator): the actual measured metric - e.g., “the fraction of requests that complete in under 200ms”
  • SLO (Service Level Objective): the target you commit to internally - e.g., “p99 latency < 200ms, measured over a 28-day rolling window”
  • SLA (Service Level Agreement): the contractual commitment to customers, usually with financial penalties for breach - typically looser than the internal SLO, so the objective is missed and fixed before the contract is violated

Error Budgets

If your SLO is 99.9% availability, you have an error budget of 0.1% - about 43 minutes of downtime per month. The error budget is the key insight of SRE: it makes reliability concrete and tradeable. When the budget is healthy, teams can take risks and ship fast. When the budget is exhausted, the team stops shipping new features and focuses on reliability work.
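
The arithmetic is worth making explicit (a 30-day window is assumed):

    slo = 0.999
    window_minutes = 30 * 24 * 60                # 43,200 minutes in a 30-day window
    budget_minutes = (1 - slo) * window_minutes  # 43.2 minutes of tolerable downtime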

Burn rate alerts catch problems before the budget is exhausted. A burn rate of 1 means the budget is being consumed at exactly the rate that would exhaust it at the end of the SLO window. An alert at burn rate 14 over 1 hour catches fast failures that would exhaust a monthly budget in about two days, while an alert at burn rate 3 over 6 hours catches slower, longer-running degradations.
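
A minimal sketch of the calculation, assuming an availability SLO expressed as a good/total request ratio:

    ALLOWED_ERROR_RATIO = 1 - 0.999      # what a 99.9% SLO tolerates on average

    def burn_rate(errors: int, total: int) -> float:
        """How many times faster than sustainable the budget is being spent."""
        return (errors / total) / ALLOWED_ERROR_RATIO

    # 1,400 errors out of 100,000 requests in the last hour:
    print(burn_rate(1_400, 100_000))     # 14.0 -> page immediately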

Logs

Logs are the narrative of what happened. They answer questions metrics cannot: “what was the exact request payload when this error occurred?” or “which user triggered this code path?”.

Structured logs (JSON objects with consistent fields) are far more useful than unstructured text. You can filter, aggregate, and join structured logs programmatically. Every log line should include at minimum: timestamp, log level, service name, and a correlation ID tied to the originating request.

The correlation ID is threaded through every downstream call a request makes. When an error surfaces in service C, you can filter all services' logs by that ID to reconstruct the full call chain.
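
A minimal sketch using only the standard library (the service name and field set are illustrative; most teams use a structured-logging library that handles this):

    import json
    import logging
    import uuid

    class JsonFormatter(logging.Formatter):
        def format(self, record):
            return json.dumps({
                "timestamp": self.formatTime(record),
                "level": record.levelname,
                "service": "checkout",
                "correlation_id": getattr(record, "correlation_id", None),
                "message": record.getMessage(),
            })

    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("checkout")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    # Generated at the edge, then forwarded to every downstream call in a header
    correlation_id = str(uuid.uuid4())
    logger.info("payment authorized", extra={"correlation_id": correlation_id})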

Loki (from Grafana Labs) indexes log streams by labels rather than full-text, making it efficient for querying structured logs alongside Prometheus metrics in Grafana dashboards.

Distributed Tracing

In a microservices architecture, a single user request may touch ten services. Metrics show that something is slow; logs show what happened in each service; distributed traces show the full picture - which services were called, in what order, how long each took, and where time was lost.

A trace is a tree of spans. Each span represents a unit of work (handling an HTTP request, executing a database query, publishing a message) with a start time, duration, service name, and metadata. Spans are linked by a shared trace ID passed in request headers.
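
Purely to illustrate the data model (a real tracing SDK manages all of this for you), a span can be pictured like this:

    import time
    import uuid
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class Span:
        name: str                         # the unit of work, e.g. "GET /checkout"
        trace_id: str                     # shared by every span in the request's tree
        span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
        parent_id: Optional[str] = None   # links this span to its caller
        start: float = field(default_factory=time.time)
        duration_ms: float = 0.0

    trace_id = uuid.uuid4().hex           # created at the entry point, sent in headers
    root = Span("GET /checkout", trace_id)
    child = Span("db.query", trace_id, parent_id=root.span_id)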

OpenTelemetry is the open standard for instrumentation - it defines APIs and SDKs for emitting traces (and metrics and logs) from applications in any language. Traces are sent to a backend like Jaeger or Tempo for storage and visualization.

Alerting

An alert fires when a metric crosses a threshold that requires human attention. Good alerts have two properties: they fire on symptoms (user-visible degradation) not causes (a specific server’s CPU spiked), and they are actionable - every page should require a human response.

Alert fatigue is one of the biggest operational risks. When alerts fire too often for non-urgent reasons, engineers start ignoring them, and a real outage gets missed. Reducing alert volume by raising thresholds, adding minimum duration windows, and deleting alerts that never require action is active reliability work.

Health endpoints are a prerequisite for alerting. Every service should expose /health (is the process alive?) and /ready (is the service ready to handle traffic?). Kubernetes uses these for liveness and readiness probes respectively.
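
A minimal sketch in FastAPI (the dependency check is a placeholder for real connectivity tests):

    from fastapi import FastAPI, Response

    app = FastAPI()

    def dependencies_ok() -> bool:
        return True                       # placeholder: check DB, cache, queues, ...

    @app.get("/health")
    async def health():
        return {"status": "ok"}           # the process is alive

    @app.get("/ready")
    async def ready(response: Response):
        if not dependencies_ok():
            response.status_code = 503    # orchestrator stops routing traffic here
            return {"status": "not ready"}
        return {"status": "ready"}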

Examples

Prometheus + Grafana dashboard for a web service. Instrument the service with the Prometheus Python client. Record a http_request_duration_seconds histogram with method, endpoint, and status_code labels. In Grafana, build panels showing request rate (rate(http_requests_total[5m])), error rate (rate(http_requests_total{status_code=~"5.."}[5m])), and p99 latency (histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))).
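
A sketch of that instrumentation as FastAPI middleware (names match the queries above; everything else is illustrative):

    import time
    from fastapi import FastAPI, Request
    from prometheus_client import Histogram, make_asgi_app

    app = FastAPI()
    app.mount("/metrics", make_asgi_app())    # the endpoint Prometheus scrapes

    REQUEST_DURATION = Histogram(
        "http_request_duration_seconds", "Request latency in seconds",
        ["method", "endpoint", "status_code"],
    )

    @app.middleware("http")
    async def record_metrics(request: Request, call_next):
        start = time.perf_counter()
        response = await call_next(request)
        REQUEST_DURATION.labels(
            method=request.method,
            endpoint=request.url.path,        # watch label cardinality in practice
            status_code=str(response.status_code),
        ).observe(time.perf_counter() - start)
        return response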

OpenTelemetry instrumentation in Python. Install opentelemetry-sdk and opentelemetry-instrumentation-fastapi. Configure the OTLP exporter to send traces to a local Jaeger instance. The auto-instrumentation patches FastAPI and httpx automatically, emitting spans for every incoming request and outgoing HTTP call with the trace context propagated in headers. Add custom spans for database queries with tracer.start_as_current_span("db.query").
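
A sketch of that setup (module paths follow the opentelemetry-python SDK; 4317 is the default OTLP gRPC port, so this assumes Jaeger runs with OTLP ingestion enabled):

    from fastapi import FastAPI
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

    provider = TracerProvider()
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
    )
    trace.set_tracer_provider(provider)

    app = FastAPI()
    FastAPIInstrumentor.instrument_app(app)    # spans for every incoming request

    tracer = trace.get_tracer(__name__)

    @app.get("/users/{user_id}")
    async def get_user(user_id: str):
        with tracer.start_as_current_span("db.query") as span:   # custom child span
            span.set_attribute("db.statement", "SELECT * FROM users WHERE id = ?")
            # ... run the query ...
        return {"id": user_id}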

Effective alert design. Write alert rules in three tiers: P1 (page immediately) for error budget burn rate > 14x over 1 hour; P2 (notify within 30 minutes) for burn rate > 3x over 6 hours; P3 (file a ticket) for sustained latency degradation within SLO. Link every alert to a runbook that explains what the alert means, how to investigate it, and the most common remediation steps. Review and prune alerts quarterly.
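
As one illustration, the P1 expression might look like this (PromQL held in a Python string; the threshold assumes a 99.9% availability SLO on http_requests_total):

    # error ratio over the last hour, compared against 14x the sustainable rate
    P1_BURN_RATE_EXPR = (
        'sum(rate(http_requests_total{status_code=~"5.."}[1h]))'
        ' / sum(rate(http_requests_total[1h]))'
        ' > (14 * (1 - 0.999))'
    )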

