Fault Tolerance - Building Systems That Survive Their Own Failures // Megha Bose

Helpful context:

Consistency & CAP - The Tradeoff Distributed Systems Cannot Escape

Google estimates that 0.5% of servers in a large datacenter fail every day. At 100,000 servers - a modest number for hyperscale infrastructure - that is 500 individual hardware failures every single day. Hard drives spin down. Memory modules flip bits. Network cards lose their link. Power supplies give out. And that’s just the hardware. Software crashes, network partitions, and the inevitable human configuration error add to the count.

The question is not “how do we prevent failures?” - you can’t, at this scale. The question is: how do you build systems that genuinely don’t care?

Failure Is a Spectrum, Not an Event

Understanding how things fail is a prerequisite to designing around failure. Failures are not uniform.

Crash failures are the cleanest type: a process stops and returns no responses. Other nodes detect the silence, route around it, and continue. Crash-stop is the easiest failure model to reason about and the one most distributed systems literature assumes.

Omission failures are subtler: a node is running but messages are silently dropped. The network swallows packets; a process’s socket buffer overflows; a thread pool is exhausted and new requests are discarded. The node isn’t dead - it’s just not communicating. Worse, from the outside, it looks identical to a crash.

Timing failures occur when responses arrive too late. A node is functioning but slow - overloaded, paused for garbage collection, waiting on a saturated disk. From the caller’s perspective, a timeout and a crash are indistinguishable. This is why “detect the failure” is harder than it sounds.

Byzantine failures are the nightmare case: a node sends actively wrong or contradictory information. Memory corruption produces valid-looking but incorrect data. A compromised node lies. Byzantine fault tolerance requires specialized algorithms (PBFT, used in some blockchain consensus protocols) and is practically never applied in conventional distributed systems. Most real systems assume crash failures only - and then add monitoring to catch the cases where that assumption is wrong.

The FLP Impossibility and Why We Use Timeouts Anyway

The FLP impossibility result (Fischer, Lynch, Paterson, 1985) proved that in a fully asynchronous distributed system, no deterministic algorithm can guarantee that consensus is reached if even a single process may crash. The intuition: in an asynchronous network, you cannot reliably distinguish a crashed node from a slow one.

In practice, systems work around FLP by using timeouts, which amounts to assuming the network is not fully asynchronous: wait long enough, declare the node dead, move on. This works because we’re willing to occasionally treat a slow node as dead (a false positive), elect a new leader, and then deal with the slow node rejoining once it recovers. The cost of false positives is manageable. The cost of waiting forever is not.

Cassandra uses the Phi Accrual Failure Detector: instead of a fixed timeout, it computes a continuous suspicion value $\phi$ based on the statistical distribution of recent heartbeat inter-arrival times. High $\phi$ means high suspicion of failure. This adapts to variable network conditions - a cluster with naturally high latency variance gets a higher effective threshold, reducing false positives. The same mechanism is used in Akka’s cluster monitoring.
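To make the suspicion value concrete, here is a minimal sketch of a phi accrual detector. It assumes heartbeat gaps are exponentially distributed, which is a simplification (the original phi accrual paper models them with a normal distribution), and the class name, window size, and method names are illustrative rather than any library's API. Under that assumption, the probability that a live node stays silent for `elapsed` seconds is $e^{-\text{elapsed}/\text{mean}}$, and $\phi$ is the negative base-10 logarithm of that probability.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Simplified phi accrual failure detector (exponential-gap assumption)."""

    def __init__(self, window_size=100):
        # Sliding window of recent heartbeat inter-arrival times, in seconds.
        self.intervals = deque(maxlen=window_size)
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Suspicion level at time `now`; higher means more likely failed."""
        if not self.intervals:
            return 0.0  # not enough history to suspect anything
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_heartbeat
        # P(gap > elapsed) = exp(-elapsed / mean); phi = -log10 of that.
        return elapsed / (mean * math.log(10))


detector = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):
    detector.heartbeat(t)
print(detector.phi(4.0))   # one mean interval of silence: low suspicion
print(detector.phi(23.0))  # twenty mean intervals of silence: high suspicion
```

In this model, $\phi = 1$ means a live node would be this quiet about 10% of the time, $\phi = 2$ about 1%, and so on. The caller picks the threshold at which a node is declared dead. Because the mean interval is learned from the window, a cluster with slower or noisier heartbeats needs a longer silence to reach the same $\phi$.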

Redundancy: Active-Active vs Active-Passive

The first principle of fault tolerance is to have more than one of everything.

Active-active redundancy runs multiple instances simultaneously, all serving live traffic. If one fails, the others absorb its load - failover is instantaneous because there’s no promotion required. Load balancers automatically stop routing to the failed instance. The cost: you must provision enough capacity in the surviving instances to absorb the failed instance’s share, which means steady-state over-provisioning.
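The over-provisioning cost can be made concrete with a little arithmetic. The sketch below assumes load is spread evenly across identical instances; the function name is hypothetical.

```python
def max_safe_utilization(instances, tolerated_failures=1):
    """Highest steady-state utilization per instance such that the survivors
    can absorb the full load after `tolerated_failures` instances die.

    Assumes identical instances and evenly balanced load.
    """
    survivors = instances - tolerated_failures
    if survivors <= 0:
        raise ValueError("need at least one surviving instance")
    return survivors / instances


for n in (2, 3, 5, 10):
    print(n, "instances ->", round(max_safe_utilization(n) * 100), "% max utilization")
```

With two instances, each can run at no more than 50% in steady state; with ten, each can run at 90%. Larger active-active pools pay proportionally less for the same single-failure tolerance.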

Active-passive redundancy runs a primary that serves traffic and one or more standbys that replicate data but sit idle. When the primary fails, a standby is promoted. Promotion takes time: leader election, health check propagation, DNS update. This means a brief outage - seconds to minutes depending on the system. The advantage: lower steady-state cost because standbys don’t need full serving capacity.

AWS Multi-AZ is the canonical active-passive pattern for databases. An RDS instance in us-east-1 has a synchronous standby in a different availability zone. AWS handles promotion automatically on failure; the failover typically completes in 60 - 120 seconds. This is table stakes for any production database - a single-AZ RDS instance is one hardware failure away from downtime.

AWS Multi-Region is the next level: separate deployments in geographically distinct regions (for example, us-east-1 and eu-west-1). Replication is asynchronous, because synchronous cross-region replication is expensive: every write would pay a cross-region round trip, on the order of 100ms between continents. Route 53 health checks and failover routing records implement the DNS-level switch. Recovery Time Objective (RTO) in this configuration is typically minutes, not seconds, because DNS TTLs and health check evaluation add delay.

The Circuit Breaker Pattern

A slow or failing downstream service is contagious. When service A calls service B, and B starts returning errors or timing out, A’s threads pile up waiting for B. A’s thread pool exhausts. A’s own response times spike. Services C and D, which call A, start to fail. The failure has propagated upstream through the call graph - a cascade.

The circuit breaker prevents this. Named after the electrical component that interrupts current flow when the circuit is overloaded, a software circuit breaker has three states:

Closed (normal operation): Requests flow through. The breaker counts recent failures. When the failure rate exceeds a threshold (say, 50% of requests in the last 10 seconds), the breaker trips.

Open (failing): Requests fail immediately with a local error - no network call is made. This gives the downstream service time to recover without upstream callers piling more load onto it. From the caller’s perspective, the failure is fast rather than slow - a fast fail at 1ms is far less damaging than a slow fail at 30 seconds, because threads aren’t tied up waiting.

Half-open (probing): After a configurable timeout, the breaker allows a limited number of test requests through. If they succeed, the circuit closes and normal operation resumes. If they fail, the circuit opens again and the timeout resets.
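The three states above can be sketched as a small state machine. This is a minimal, single-threaded illustration, not how Resilience4j or Polly implement it: the class, parameter names, and defaults are assumptions, the failure rate is computed over a count-based window rather than a time window, and half-open admits one probe at a time.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting downstream."""


class CircuitBreaker:
    def __init__(self, failure_rate=0.5, min_calls=10, window=20,
                 wait_seconds=30.0, clock=time.monotonic):
        self.failure_rate = failure_rate    # trip at or above this ratio
        self.min_calls = min_calls          # don't evaluate on tiny samples
        self.wait_seconds = wait_seconds    # time to stay open before probing
        self.clock = clock                  # injectable for testing
        self.results = deque(maxlen=window) # True = failure, False = success
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.wait_seconds:
                raise CircuitOpenError("failing fast, downstream not called")
            self.state = "half_open"        # wait elapsed: allow a probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state == "half_open":
            self.state = "closed"           # probe succeeded: resume traffic
            self.results.clear()
        else:
            self.results.append(False)

    def _on_failure(self):
        if self.state == "half_open":
            self._trip()                    # probe failed: reopen, reset timer
            return
        self.results.append(True)
        enough = len(self.results) >= self.min_calls
        if enough and sum(self.results) / len(self.results) >= self.failure_rate:
            self._trip()

    def _trip(self):
        self.state = "open"
        self.opened_at = self.clock()
        self.results.clear()
```

A caller wraps each downstream call, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and treats `CircuitOpenError` as the signal to use a fallback. A production breaker would also need locking and a cap on concurrent half-open probes.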

Resilience4j (Java) and Polly (.NET) implement this pattern. The critical parameters to tune: failure rate threshold, minimum number of calls before evaluation (prevents tripping on the first failure), and the wait duration before half-open. Getting these wrong creates either a hair-trigger breaker that opens unnecessarily or an unresponsive one that fails to protect.

The Thundering Herd Problem

Here is the failure mode that circuit breakers create if mishandled: when the downstream recovers and the circuits close, every caller whose circuit was open starts sending real requests again at the same time. If there were 10,000 clients with open circuits, all 10,000 send requests at the same moment. The recovering service is immediately overwhelmed and fails again. The circuits re-open. Wait. Try again. Same result.

This is the thundering herd - the circuit breaker’s paradox. The solution is retry jitter and staggered half-open probing. Instead of allowing all clients to probe simultaneously when the circuit half-opens, allow only a fraction (say, 5%) to send test requests. If those succeed, allow more. This gradual ramp prevents the re-overwhelm scenario.

The same pattern appears in retry logic: immediate retries on failure from thousands of clients simultaneously replicate the original load spike on the recovering service. Exponential backoff with jitter - wait $\min(\text{cap}, \text{base} \times 2^{\text{attempt}}) \times \text{random}(0.5, 1.0)$ - spreads retries across a time window, giving the recovering service a gradual ramp rather than another spike.

AWS SDKs implement this by default. If you’re rolling your own retry logic and not adding jitter, you’re building a thundering herd generator.
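If you do roll your own, the backoff formula above translates almost directly into code. The following is a minimal sketch: the function names, the default `base` and `cap` values, and the decision to retry on any exception are all assumptions made for illustration.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random.random):
    """Delay in seconds before retry number `attempt` (0-based).

    Implements min(cap, base * 2**attempt) * random(0.5, 1.0).
    """
    exponential = min(cap, base * (2 ** attempt))
    jitter = 0.5 + 0.5 * rng()  # uniform in [0.5, 1.0)
    return exponential * jitter


def retry(fn, max_attempts=5, base=0.1, cap=10.0, sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is reached."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, base, cap))
```

Real retry logic should retry only errors that are safe to retry (timeouts, throttling) and only operations that are idempotent; catching every exception here keeps the sketch short.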

Timeout Budgets

Timeouts are the most commonly omitted fault-tolerance primitive. Without them, a call to a slow service blocks a thread indefinitely - and eventually, all threads are blocked, the thread pool is exhausted, and the service stops accepting new requests.

Every network call must have a timeout. Every single one.

The right timeout is not arbitrary: it should be set relative to the caller’s own deadline. If an API request has a 1-second user-facing SLA, and it makes three serial downstream calls, each timeout should be a fraction of that second - with margin for processing time. A 30-second timeout on a downstream call inside a 1-second SLA request is nonsensical: the downstream’s timeout would never be hit; the upstream would time out to the user long before.

This is the concept of timeout budgets: each layer of a service call chain allocates part of the budget to the downstream call and keeps the rest for local processing. gRPC’s deadline propagation supports this directly - a deadline set on the outer RPC can be carried to nested RPCs (automatically in some language implementations, by explicit opt-in in others), and inner calls that would exceed the deadline fail fast.
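Outside of gRPC, the same idea can be approximated by passing a deadline object down the call chain and deriving each call's timeout from whatever time is left. The sketch below is one way to do that; the `Deadline` class, its method names, and the 50ms default reserve for local processing are assumptions for illustration.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when there is no budget left to start a downstream call."""


class Deadline:
    def __init__(self, budget_seconds, clock=time.monotonic):
        self.clock = clock  # injectable for testing
        self.expires_at = clock() + budget_seconds

    def remaining(self):
        """Seconds left before the overall request deadline."""
        return self.expires_at - self.clock()

    def timeout_for_call(self, reserve=0.05):
        """Timeout to give the next downstream call.

        Keeps `reserve` seconds back for local processing and fails fast
        if the budget is already spent.
        """
        left = self.remaining() - reserve
        if left <= 0:
            raise DeadlineExceeded("no budget left for downstream call")
        return left


# One deadline per incoming request, shared by every serial downstream call.
deadline = Deadline(budget_seconds=1.0)
first_timeout = deadline.timeout_for_call()
# ... make the first call with timeout=first_timeout, then ask again:
second_timeout = deadline.timeout_for_call()
```

Because each timeout is computed from the time actually remaining, a slow first call automatically shrinks the budget for the second and third, and a request that has already blown its budget never starts another network call.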

Chaos Engineering: Deliberately Breaking Production

Netflix began migrating to AWS in 2008 and soon faced a problem: how do you know if your fault-tolerance mechanisms actually work? The answer was counterintuitive. You break things on purpose, during business hours, while engineers are watching.

Chaos Monkey randomly terminates production EC2 instances. The premise: if instances are going to fail anyway (and they are, at 0.5%/day), it’s better to discover weaknesses during working hours when engineers can respond than at 3am when the on-call is alone. The payoff was an architecture that weathered the April 2011 AWS US-East outage far better than other AWS customers who relied on the same cloud but hadn’t stress-tested their recovery paths.

Modern chaos engineering goes further. Netflix’s FIT (Failure Injection Testing) injects failures at the service-call level - simulating a downstream service returning errors - to test whether circuit breakers, fallbacks, and degraded modes work as designed. AWS Fault Injection Simulator (FIS) is a managed service that injects EC2 terminations, network latency, disk errors, and more according to a structured experiment template. GCP has an equivalent through their reliability tooling.

The discipline is: define steady-state metrics before the experiment (P99 latency, error rate, throughput), hypothesize that the system will maintain those metrics under a given failure, inject the failure, measure, and fix any deviations. Chaos engineering is not “break things randomly.” It’s controlled hypothesis testing about your system’s failure behavior.

Graceful Degradation

When a dependency fails, returning a degraded but useful response is almost always better than returning an error. A product page that shows cached inventory levels (“in stock as of 5 minutes ago”) is more valuable to the user than a 503. A search page that returns results without personalization is better than no search at all.

Graceful degradation requires identifying which dependencies are on the critical path (the response is useless without them) and which are enhancement-only (the response is degraded but usable without them). Circuit breakers should be paired with fallback behaviors: return a cached value, return a default, or omit the dependent feature.
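The split between critical and enhancement-only dependencies can be sketched in a few lines. Everything here is hypothetical: the function names, the shape of the page dictionary, and the choice of a plain dict as the cache are illustration only.

```python
def call_with_fallback(primary, fallback):
    """Return (value, degraded). Uses fallback() if primary() raises."""
    try:
        return primary(), False
    except Exception:
        return fallback(), True


def product_page(product_id, get_product, get_inventory, inventory_cache):
    # Critical path: without the product itself the page is useless,
    # so errors here are allowed to propagate to the caller.
    product = get_product(product_id)

    # Enhancement-only: if the live inventory call fails, serve the last
    # cached level (None if nothing was ever cached) and flag it as stale.
    inventory, stale = call_with_fallback(
        lambda: get_inventory(product_id),
        lambda: inventory_cache.get(product_id),
    )
    return {"product": product, "inventory": inventory, "inventory_stale": stale}
```

The `inventory_stale` flag lets the UI render something like "in stock as of 5 minutes ago" instead of presenting stale data as live. In practice `get_inventory` would be wrapped in a circuit breaker and a timeout so that the fallback is reached quickly.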

The fallacy of distributed computing that most teams learn the hard way: “the network is reliable.” It’s not. The network will partition, drop packets, reorder messages, and add latency. Systems that assume network reliability degrade catastrophically when reality intrudes. Systems that assume network unreliability degrade gracefully.

eBPF and the Future of Failure Observation

eBPF (extended Berkeley Packet Filter) is a kernel-level technology that allows user-defined programs to run inside the Linux kernel - observing system calls, network traffic, and process behavior - without modifying kernel source code or loading kernel modules.

For fault tolerance, eBPF enables a category of observability that was previously impractical: detecting failure modes at the kernel level rather than the application level. A service that crashes silently, a TCP connection that gets stuck without the application noticing, network retransmissions that indicate upstream congestion - all observable via eBPF-based tools (Falco, Cilium, Pixie) without application instrumentation.

SRE error budgets formalize fault tolerance as an organizational practice: if you have a 99.9% availability SLO, you have about 8.76 hours of acceptable downtime per year (0.1% of 8,760 hours). When you exhaust that budget, you stop shipping features and fix reliability. When you’re well within budget, you can take more risk with deployments. Error budgets make the fault-tolerance conversation explicit and quantified, rather than a perpetual tension between “ship more” and “break less.”
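The budget arithmetic is simple enough to check directly. A minimal sketch, assuming a 365-day year and an availability target expressed as a fraction; the function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760; ignores leap years


def downtime_budget_hours(target, period_hours=HOURS_PER_YEAR):
    """Allowed downtime in hours for an availability target such as 0.999."""
    return (1.0 - target) * period_hours


def budget_remaining(target, downtime_so_far_hours, period_hours=HOURS_PER_YEAR):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = downtime_budget_hours(target, period_hours)
    return 1.0 - downtime_so_far_hours / budget


for target in (0.99, 0.999, 0.9999):
    print(target, "->", round(downtime_budget_hours(target), 2), "hours/year")
```

Each additional nine cuts the budget by a factor of ten: roughly 87.6 hours per year at 99%, 8.76 at 99.9%, and under an hour at 99.99%.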

Summary

Concept	Key Insight
Failure taxonomy	Crash → omission → timing → Byzantine; most systems only handle crash
FLP impossibility	Can’t distinguish slow from crashed; timeouts are the practical workaround
Active-active vs active-passive	Active-active: instant failover, higher cost. Active-passive: cheaper, recovery takes time
AWS Multi-AZ vs Multi-Region	Multi-AZ: 60 - 120s failover. Multi-Region: minutes; asynchronous replication
Circuit breaker	Three states: closed → open (fast fail) → half-open (probe); pair with fallback
Thundering herd	Circuit close + simultaneous retry = re-overwhelm; jitter prevents it
Timeout budgets	Set timeouts relative to the caller’s deadline, not arbitrarily
Chaos engineering	Controlled failure injection during business hours; test recovery paths
Graceful degradation	Cached/partial response > error; identify critical vs enhancement dependencies

Read Next:

Message Queues - Decoupling Services by Delaying Their Conversations