Reliability Patterns - How Services Stay Up When Everything Around Them Falls Apart
Helpful context:
- Fault Tolerance - Building Systems That Survive Their Own Failures
- Microservices - Small Services, Large Coordination Problems
- Latency, Throughput & Queues - The Physics of System Performance
On December 24, 2012, Netflix experienced an outage. Not because Netflix’s own services failed, but because the Amazon Web Services Elastic Load Balancing service in us-east-1 had a problem. Netflix was running in a single region, and when ELB had issues, Netflix went down with it.
How Netflix approaches failures like this is interesting. They do not just fix the immediate problem. Well before this outage, they had built a system called Chaos Monkey - software that randomly terminates instances in their production environment, every day, on purpose. The reasoning: if we know our system can fail at any time, and we keep breaking it in controlled ways, we will be forced to build services that survive failures. A system that is never broken in testing will eventually be broken in production, at the worst possible time.
This is the philosophy behind reliability patterns: build systems that assume failure is constant, not exceptional.
Cascading Failures: How Small Problems Become Big Ones
In a monolith, a failure is usually contained. The service is down, everything it does is down, you restart it. In a microservices architecture, a failure in one service can cascade through the system in ways that are hard to predict.
The pattern is this: Service A calls Service B. Service B starts responding slowly - not failing, just slow. Service A’s thread pool fills up with requests waiting on B. New requests to A arrive and find no available threads. A starts timing out. Services that call A start filling up their thread pools. The slowness in B has propagated upward through the call chain until the entire system is degraded, even though most services are individually healthy.
The failure mode is amplification: a small problem in a leaf service becomes a large problem in the services that depend on it, and a larger problem in the services that depend on those. This is the cascading failure pattern, and it is the primary reason that distributed systems need reliability patterns beyond what a monolith needs.
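The thread-pool arithmetic behind this can be sketched with Little's law: the average number of requests in flight is roughly the arrival rate multiplied by the time each request is held. The pool size, request rate, and latencies below are illustrative assumptions, not measurements from any real system.

```python
def threads_in_use(requests_per_second: float, latency_ms: float) -> float:
    """Little's law: average requests in flight = arrival rate * time per request."""
    return requests_per_second * latency_ms / 1000

POOL_SIZE = 50  # hypothetical thread pool size for Service A
RATE = 200      # hypothetical requests per second arriving at A

for label, latency_ms in [("B healthy", 50), ("B slow", 2000)]:
    needed = threads_in_use(RATE, latency_ms)
    status = "ok" if needed <= POOL_SIZE else "pool exhausted"
    print(f"{label}: ~{needed:.0f} threads busy of {POOL_SIZE} -> {status}")
```

Traffic did not change between the two cases. Only B's latency did, and that alone is enough to push A past its pool size.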
Timeouts: Setting an Upper Bound on Waiting
The first and most important reliability primitive is the timeout. Every network call should have a timeout. If you call a service and it does not respond within N milliseconds, stop waiting and treat the call as a failure.
Without timeouts, threads wait indefinitely for responses that may never come. A slow dependency brings the caller to a halt - threads pile up, memory fills, the service grinds down.
Choosing a timeout value is not arbitrary. Too aggressive (1ms) and you reject valid slow requests. Too conservative (30 seconds) and threads queue up for 30 seconds before being released. The right value comes from measuring the actual latency distribution of the dependency. A service whose P99 latency is 200ms should have a timeout somewhere around 500ms-1000ms - far enough past the P99 to not trip on normal variation, close enough to release threads promptly on actual failures.
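One way to turn measured latencies into a starting timeout is sketched below. The function name `suggest_timeout_ms`, the 3x multiplier, and the synthetic sample are assumptions for illustration; real values should come from your own latency data.

```python
import statistics

def suggest_timeout_ms(latencies_ms, multiplier=3.0, floor_ms=50.0):
    """Return (p99, suggested timeout) in milliseconds from observed latencies."""
    # quantiles(n=100) returns 99 cut points; the last one is the 99th percentile.
    p99 = statistics.quantiles(latencies_ms, n=100)[-1]
    return p99, max(floor_ms, p99 * multiplier)

# Synthetic sample: mostly fast responses with a slow tail (illustrative only).
samples = [40.0] * 950 + [120.0] * 40 + [200.0] * 10
p99, timeout = suggest_timeout_ms(samples)
print(f"P99 = {p99:.0f} ms, suggested timeout = {timeout:.0f} ms")
```

The multiplier is a judgment call. A larger value tolerates more variation, a smaller one releases threads sooner.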
Cascading timeouts are a subtlety. Suppose Service A has a 5-second timeout calling Service B, Service B has a 4-second timeout calling Service C, and Service C has a 3-second timeout calling Service D. When D becomes slow, C’s request to D times out after 3 seconds, B’s request to C after 4 seconds, and A’s request to B after 5 seconds. A can wait up to 5 seconds even though the actual failure was at D. Timeouts should shrink as you go down the call chain: each layer’s total waiting budget, including any retries, should be less than the timeout its caller is using, so that a layer gives up and can return a useful error before the layer above abandons it.
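A small check like the following can catch timeout chains that do not shrink toward the leaves. It is a sketch: `find_timeout_violations` and the hop tuples are hypothetical, and the retry counts are added here to show how retries multiply a layer's worst-case wait.

```python
def find_timeout_violations(hops):
    """hops: list of (name, timeout_seconds, attempts), outermost call first.

    Returns a message for every hop whose worst-case wait (timeout * attempts)
    is not strictly less than the timeout of the hop that calls it.
    """
    violations = []
    for (outer, outer_timeout, _), (inner, inner_timeout, attempts) in zip(hops, hops[1:]):
        worst_case = inner_timeout * attempts
        if worst_case >= outer_timeout:
            violations.append(
                f"{inner} can take {worst_case}s but {outer} gives up after {outer_timeout}s"
            )
    return violations

# The 5s / 4s / 3s chain with a single attempt per hop passes the check.
chain = [("A->B", 5, 1), ("B->C", 4, 1), ("C->D", 3, 1)]
print(find_timeout_violations(chain))

# Adding one retry on the innermost hop doubles its worst case to 6s.
chain_with_retries = [("A->B", 5, 1), ("B->C", 4, 1), ("C->D", 3, 2)]
print(find_timeout_violations(chain_with_retries))
```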
Circuit Breaker
The circuit breaker is named after the electrical device. An electrical circuit breaker detects excessive current and opens the circuit, stopping the flow of electricity rather than allowing a surge to damage equipment. A software circuit breaker detects excessive failures from a dependency and stops sending requests to it.
The circuit breaker has three states:
Closed (normal operation): requests flow through normally. The breaker tracks the failure rate. If failures stay below the threshold, the circuit stays closed.
Open (dependency failing): the failure rate exceeded the threshold. The breaker stops sending requests to the dependency immediately and returns a fast failure to callers. This is key: a fast failure, not a slow one. The caller gets an error quickly rather than waiting for a timeout. Threads are not held waiting. The call chain does not cascade.
Half-open (recovery probe): after a configured cooldown period, the breaker allows a single test request through. If that request succeeds, the breaker closes and normal traffic resumes. If it fails, the breaker returns to open and the cooldown resets.
The circuit breaker serves two purposes. First, it protects the caller from slow failures - instead of waiting 30 seconds for a timeout, you get an immediate error you can handle. Second, it gives the failing dependency space to recover. A dependency that is failing under load gets fewer requests when the circuit is open, which may be exactly what it needs.
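The three states can be sketched as a small state machine. This is a minimal illustration under simplifying assumptions: it counts consecutive failures rather than tracking a failure rate, it is not thread-safe, and the class and parameter names are hypothetical rather than taken from any library.

```python
import time

class CircuitOpenError(Exception):
    """Raised immediately when the breaker is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock  # injectable so the cooldown can be tested without waiting
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.cooldown_seconds:
                raise CircuitOpenError("circuit open; failing fast")
            self.state = "half_open"  # cooldown elapsed: let a probe request through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success (including a half-open probe) closes the circuit.
        self._failures = 0
        self.state = "closed"

    def _on_failure(self):
        self._failures += 1
        # A failed probe reopens immediately; otherwise open at the threshold.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

A caller would wrap each dependency call, for example `breaker.call(fetch_inventory, item_id)` where `fetch_inventory` is whatever function performs the remote call, and handle `CircuitOpenError` as a fast failure.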
Netflix’s Hystrix library popularized circuit breakers in the JVM ecosystem. In modern microservices, circuit breaker logic often lives in the network layer instead, in a proxy such as Envoy or a service mesh such as Istio, which applies it transparently so that each service does not have to implement it in code.
Bulkhead
A bulkhead in a ship is a wall that divides the hull into compartments. If one compartment is breached and floods, the bulkhead prevents the flooding from reaching other compartments. The ship can survive partial flooding that would sink it without bulkheads.
The software pattern is the same: isolate resource pools so that failure in one does not drain resources from others.
The most common implementation is separate thread pools for different dependencies. If your service calls three external APIs - authentication, inventory, and payment - give each a separate thread pool. When inventory is slow and its thread pool fills up, authentication and payment are unaffected. Without bulkheads, all three share a single thread pool, and a slow inventory API fills the entire pool, starving authentication and payment even though they are healthy.
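The isolation idea can be sketched with one concurrency limit per dependency. Note that this uses semaphores rather than the separate thread pools described above, a lighter-weight variant of the same pattern. The `Bulkhead` class, the limit of 10, and the dependency names are assumptions for illustration.

```python
import threading
from contextlib import contextmanager

class BulkheadFullError(Exception):
    """Raised when a dependency's compartment has no free slots."""

class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Fail fast instead of queueing when this compartment is full.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            yield
        finally:
            self._slots.release()

# One compartment per dependency: a slow inventory API can only use up its own slots.
bulkheads = {
    name: Bulkhead(name, max_concurrent=10)
    for name in ("authentication", "inventory", "payment")
}
```

A call site would then look like `with bulkheads["inventory"].slot(): ...`, with the remote call inside the block.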
Bulkheads also apply to connection pools to databases and downstream services, to request rate limiters (different rate limits for different clients), and to resource limits in Kubernetes (different resource quotas for different workloads). The principle is the same: partition shared resources so that one consumer’s failure mode is contained.
Retry with Exponential Backoff and Jitter
Some failures are transient - a brief network hiccup, a momentary overload spike, a restarting instance. For these, retrying the request after a short wait often succeeds. But naive retry logic can make failures worse.
The two failure modes of naive retries:
Immediate retry loops: a client that retries immediately on failure can issue 100 requests per second instead of 1, turning a slightly overloaded service into a massively overloaded one. This is the retry storm described in the rate limiting post.
Synchronized retries: if many clients back off for the same fixed interval (say, all retry after 5 seconds), they all send their requests at the same moment, creating a load spike every 5 seconds.
Exponential backoff with jitter addresses both. The wait time grows exponentially with each retry (1s, 2s, 4s, 8s, 16s…) and is randomized within the interval (wait between 0 and 8 seconds, not exactly 8). This reduces load during outages and desynchronizes retries across clients.
Retries should also respect a maximum attempt count. A request that has failed five times in a row is not going to succeed on the sixth attempt if the underlying failure is persistent. Stop retrying and surface the error.
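Putting the pieces together, a retry helper with exponential backoff, full jitter, and an attempt cap might look like the sketch below. The function name, defaults, and the choice of which exceptions count as transient are assumptions. `sleep` and `rng` are injectable only so the behavior can be tested without real delays.

```python
import random
import time

def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       retry_on=(ConnectionError, TimeoutError),
                       sleep=time.sleep, rng=random.random):
    """Call func(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the error to the caller
            # Upper bound doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random time in [0, cap) to desynchronize clients.
            sleep(rng() * cap)
```

Only exceptions listed in `retry_on` are retried. Anything else propagates immediately, which keeps non-transient errors such as bad input from being retried at all.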
Backpressure
When a consumer cannot keep up with a producer, what should the producer do?
Without backpressure, the producer keeps producing. Requests queue up. The queue grows. Eventually something crashes - the consumer’s memory is exhausted, the queue overflows, or the consumer gets so far behind that requests time out before they are processed.
Backpressure is the mechanism by which a consumer signals to a producer to slow down. In synchronous systems, this is simple: the producer calls the consumer and waits for a response, so a slow consumer naturally throttles the producer. In asynchronous systems - queues, event streams - the producer and consumer are decoupled, and backpressure requires explicit signaling.
Reactive programming frameworks (Rx, Akka Streams, Project Reactor) build backpressure into their abstractions. A consumer signals how many items it can handle; a producer respects that signal and does not emit faster than the consumer can absorb. Kafka’s consumer group lag makes the same imbalance visible: if lag is growing, the consumer is falling behind the producer, and something needs to change.
In a simpler form: if your queue depth is growing beyond a threshold, start rejecting new items at the producer side rather than continuing to enqueue. It is better to reject new requests cleanly than to accept them into a queue that will deliver them too late or not at all.
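The simple form can be sketched with a bounded queue that rejects new work once it is full. The `Overloaded` exception, the `submit` helper, and the limit of 100 are hypothetical. In an HTTP service the rejection would typically be translated into a 429 or 503 response.

```python
import queue

class Overloaded(Exception):
    """Raised when the queue is at capacity and new work is rejected."""

def submit(work_queue: queue.Queue, item) -> None:
    """Enqueue item, or reject it immediately if the queue is full."""
    try:
        work_queue.put_nowait(item)
    except queue.Full:
        raise Overloaded(f"queue depth at limit ({work_queue.maxsize})") from None

# A bounded queue: depth can never exceed maxsize, so latency stays bounded too.
work_queue = queue.Queue(maxsize=100)
submit(work_queue, {"job": "resize-image", "id": 1})
```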
Graceful Degradation
When a dependency is unavailable, there are often two options: fail the entire request, or degrade gracefully and return something less than optimal.
A search service that depends on a recommendation engine for personalized results might return generic results when the recommendation engine is down, rather than returning an error. Users get slightly worse results instead of no results. A shopping cart that depends on a discount service for pricing might show full prices when the discount service is down, rather than blocking checkout. A social feed that depends on a content ranking service might return unranked chronological posts instead of nothing.
Graceful degradation requires thinking through the “what can we still do without this dependency?” question for each dependency. Not all dependencies have a good degraded mode - payment processing cannot be gracefully degraded away. But many can, and identifying which ones changes a complete outage into a partial degradation.
The implementation requires: knowing which dependencies each feature needs, having a fallback behavior defined for each optional dependency, and actually testing the fallback paths. A fallback that was defined in code eighteen months ago and never tested since is not a reliable fallback.
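As a sketch of the search example, a fallback path might look like this. The `search` function and its two callables are hypothetical stand-ins for the recommendation engine and a generic result source, and treating `ConnectionError` and `TimeoutError` as "dependency unavailable" is an assumption.

```python
import logging

logger = logging.getLogger("search")

def search(query, personalized, generic):
    """Return personalized results, falling back to generic ones if unavailable."""
    try:
        return {"results": personalized(query), "degraded": False}
    except (ConnectionError, TimeoutError) as exc:
        # Log the fallback so degraded responses are visible in monitoring.
        logger.warning("recommendation engine unavailable, using generic results: %s", exc)
        return {"results": generic(query), "degraded": True}
```

Returning a `degraded` flag is one way to keep the fallback path observable and testable. A test can force the dependency to fail and assert on the flag.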
Chaos Engineering
The practices above are easier to build if you know your failure modes. The problem is that failure modes in distributed systems are not obvious from reading the code. They emerge from the interaction of services, network conditions, load patterns, and timing in ways no single engineer can reason about completely.
Chaos engineering is the practice of deliberately introducing failures into a system to discover its failure modes before they are discovered by a production incident.
Netflix’s Chaos Monkey randomly terminates EC2 instances in production. Chaos Gorilla terminates entire availability zones. Chaos Kong takes out entire regions. The idea is that if your system is always experiencing a controlled version of these failures, the team is always working to fix the failures they observe, and the system becomes genuinely resilient rather than theoretically resilient.
The prerequisite is good monitoring and a stable enough system that you can distinguish “normal chaos experiment behavior” from “actual new problem.” Chaos engineering on a system that is already unstable is not useful.
Lesser forms of chaos engineering are accessible to any team: kill a service instance and watch how other services respond. Block a network connection and verify the circuit breaker opens. Fill a disk and verify the service degrades gracefully. These are not exotic experiments - they are basic hygiene for systems that need to be reliable.
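At the smallest scale, the same idea can be tried inside a test suite by wrapping a dependency call so that it fails some fraction of the time. This is an in-process stand-in for the experiments above, not how Chaos Monkey works. `inject_faults` and its parameters are hypothetical.

```python
import random

def inject_faults(func, failure_rate, error=ConnectionError, rng=random.random):
    """Wrap func so that a fraction of calls raise an injected error instead."""
    def wrapper(*args, **kwargs):
        # rng() is in [0, 1), so failure_rate=1.0 always fails and 0.0 never does.
        if rng() < failure_rate:
            raise error("injected fault")
        return func(*args, **kwargs)
    return wrapper

def fetch_inventory(item_id):
    """Hypothetical dependency call used for the experiment."""
    return {"item_id": item_id, "in_stock": True}

# Fail roughly 30% of calls, then observe how callers behave.
chaotic_fetch = inject_faults(fetch_inventory, failure_rate=0.3)
```

Pointing a circuit breaker, retry helper, or fallback path at the wrapped function is a cheap way to verify that it actually reacts to failures as intended.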
Summary
| Pattern | What it does | Failure it prevents |
|---|---|---|
| Timeout | Abandons slow calls after N ms | Thread exhaustion from waiting indefinitely |
| Circuit breaker | Stops sending to a failing dependency | Cascading failures; wasted resources on known-bad calls |
| Bulkhead | Isolates resource pools per dependency | One slow dependency exhausting shared resources |
| Retry + backoff + jitter | Retries transient failures with growing delay | Retry storms; synchronized thundering herds |
| Backpressure | Consumer signals producer to slow down | Queue overflow; consumer crash from overload |
| Graceful degradation | Returns reduced functionality instead of errors | Total outage when optional dependency fails |
| Chaos engineering | Deliberately introduces failures | Undiscovered failure modes that appear only in production |
None of these patterns is a silver bullet. A system that uses circuit breakers but not timeouts still fails when threads pile up waiting for slow dependencies. A system with bulkheads but no retry logic fails on transient errors that retries would have handled. Reliability comes from combining these patterns thoughtfully, based on understanding which failure modes your specific system faces.