Prerequisite: Consistency & CAP Theorem


At scale, failure is not the exception - it is the steady state. A system with 10,000 servers, each with 99.9% uptime, will have roughly 10 servers unavailable at any moment. Disks fail, processes crash, networks partition, and cloud regions have outages. Building fault-tolerant systems means designing for failures as a first-class operational reality, not an edge case.
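
The arithmetic behind that estimate:

expected unavailable = N * (1 - availability) = 10,000 * (1 - 0.999) = 10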

Types of Failures

Crash failures are the cleanest: a node stops and does not respond. Other nodes eventually detect the absence and route around it. Crash-stop is the easiest failure model to reason about.

Omission failures occur when a node is running but messages are lost or delayed. The network drops packets, a process’s queue fills up, or a socket buffer overflows. These are harder to detect - the node isn’t dead, it’s just not communicating reliably.

Timing failures happen when responses arrive too late. A node is functioning but slow - overloaded, experiencing GC pauses, or waiting on a slow disk. From the caller’s perspective, a slow node and a crashed node look identical: both manifest as a timeout.

Byzantine failures are the worst case: a node behaves arbitrarily - sending corrupted data, contradicting itself, or actively colluding. Byzantine fault tolerance requires specialized algorithms (PBFT, used in some blockchains) and is rarely applied outside security-critical systems. Most distributed systems assume crash failures only.

Failure Detection

Detecting that a node has failed without perfect information is provably hard. The FLP impossibility result (Fischer, Lynch, Paterson, 1985) proves that in an asynchronous system, no deterministic algorithm can guarantee that consensus terminates if even one process may crash. In practice, systems use timeouts to approximate failure detection - accepting that they may occasionally suspect a healthy node incorrectly.

Heartbeats are periodic messages sent from each node to a monitor (or to peers). If no heartbeat arrives within a threshold, the node is suspected dead. The challenge: set the threshold too low and healthy nodes get falsely suspected during GC pauses; set it too high and recovery from real failures is slow.
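
A minimal monitor-side sketch (the node registry and the 5-second threshold are illustrative):

import time

HEARTBEAT_TIMEOUT = 5.0          # illustrative; tune against observed GC pauses
last_seen = {}                   # node id -> timestamp of last heartbeat

def on_heartbeat(node_id):
    last_seen[node_id] = time.time()

def is_suspected(node_id):
    # "Suspected", not "known", dead: the node may only be slow or partitioned.
    return time.time() - last_seen.get(node_id, 0.0) > HEARTBEAT_TIMEOUT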

Phi Accrual Failure Detector (used by Cassandra and Akka) avoids a fixed threshold. It computes a continuous suspicion value (phi) based on the distribution of recent heartbeat inter-arrival times. A phi above a configurable threshold triggers a suspect event. This adapts to variable network conditions.
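
A sketch of the idea, modeling inter-arrival times with a normal distribution as in the original paper (Cassandra’s implementation uses an exponential approximation):

import math, statistics

def phi(elapsed, intervals):
    # intervals: recent heartbeat inter-arrival times (needs at least 2 samples)
    mu = statistics.mean(intervals)
    sigma = statistics.stdev(intervals) or 1e-6    # guard against zero variance
    # Probability that a heartbeat this overdue is still merely delayed
    p_later = 0.5 * math.erfc((elapsed - mu) / (sigma * math.sqrt(2)))
    return -math.log10(max(p_later, 1e-12))        # suspicion grows as p shrinks

A phi of 1 corresponds to roughly a 10% chance the silence is normal; Cassandra’s default conviction threshold is 8.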

Redundancy and Replication

Active-active redundancy runs multiple instances simultaneously, all serving traffic. If one fails, the others absorb its load. Each instance must run with enough spare capacity to take on a failed peer’s share, so steady-state resource costs are higher, but failover is nearly instantaneous - the load balancer simply stops routing to the failed instance.

Active-passive runs a primary and one or more standby replicas that receive data but don’t serve traffic. When the primary fails, a standby is promoted. Promotion takes time (seconds to minutes for leader election and health check propagation), causing a brief outage.

Replication factor determines how many copies of data exist. Most distributed databases default to 3: with a replication factor of 3 and a quorum read/write requirement of 2, the system tolerates 1 node failure while maintaining consistency.
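
The consistency guarantee comes from quorum overlap - read and write quorums must share at least one replica:

R + W > N      (here: R = 2, W = 2, N = 3, and 2 + 2 > 3)

Every read therefore contacts at least one replica that accepted the latest write.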

Circuit Breaker Pattern

In a microservices system, a slow or failing downstream service can cause upstream callers to pile up blocked threads, exhaust their connection pools, and crash - spreading the failure sideways. The circuit breaker prevents this.

The circuit has three states:

  • Closed (normal): Requests flow through. The breaker counts failures.
  • Open (failing): After the failure threshold is exceeded, the breaker opens. Subsequent requests fail immediately with a local error - no network call made. This gives the downstream service time to recover.
  • Half-open (probing): After a timeout, the breaker allows a small number of requests through. If they succeed, the circuit closes. If they fail, it opens again.

Libraries like Resilience4j (Java), Polly (.NET), and pybreaker (Python) implement this. The key parameters: failure rate threshold, minimum number of calls before evaluation, wait duration before half-open.

Retry with Exponential Backoff and Jitter

Immediate retries on failure often make things worse. If 1,000 clients all retry simultaneously after a 1-second outage, they create a thundering herd that overwhelms the recovering service.

Exponential backoff doubles the wait between successive retries: 1s, then 2s, then 4s, then 8s, up to a cap. This reduces retry pressure over time.

Jitter adds randomness to the backoff interval, spreading retries across the time window instead of synchronizing them:

wait = min(cap, base * 2^attempt) * random(0.5, 1.0)

AWS recommends “full jitter” (multiply by a full random factor) or “decorrelated jitter” (base wait on the previous wait). The result: retries spread across the window, the recovering service sees a gradual ramp rather than a spike.
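
Sketches of both variants, following the formulas from the AWS Architecture Blog (parameter names are illustrative):

import random

def full_jitter(base, cap, attempt):
    # Full jitter: draw uniformly from the whole backoff window.
    return random.uniform(0, min(cap, base * 2**attempt))

def decorrelated_jitter(base, cap, previous_wait):
    # Decorrelated jitter: the next wait is based on the previous wait.
    return min(cap, random.uniform(base, previous_wait * 3))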

Bulkhead Pattern

Named after the watertight compartments in a ship’s hull: if one compartment floods, the ship stays afloat. In services, the bulkhead pattern isolates failures between subsystems by giving each a dedicated resource pool.

Instead of one thread pool shared by all downstream services, assign a separate thread pool to each: 20 threads for the payment service, 20 threads for the inventory service. If the payment service becomes slow and all 20 threads are blocked, inventory requests are unaffected. Without bulkheads, slow payment calls would exhaust the shared pool and prevent inventory requests from being processed.
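
A sketch with dedicated thread pools per dependency (call_payment_service and call_inventory_service are hypothetical stand-ins):

from concurrent.futures import ThreadPoolExecutor

# One pool per downstream service: payment stalls cannot starve inventory calls.
payment_pool = ThreadPoolExecutor(max_workers=20)
inventory_pool = ThreadPoolExecutor(max_workers=20)

def get_inventory(item_id):
    future = inventory_pool.submit(call_inventory_service, item_id)
    return future.result(timeout=2.0)    # bounded wait, never indefinite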

The same principle applies to connection pools, rate limit buckets, and semaphores.

Timeouts

The most commonly omitted fault-tolerance measure. Without timeouts, a call to a slow service can block indefinitely, consuming a thread and a connection until the caller’s resource pool is exhausted. Every network call must have a timeout. Every.

Appropriate timeout values come from measuring actual latency distributions. A timeout shorter than the P99 latency causes spurious failures under normal load. A timeout longer than the deadline your own caller is willing to wait is wasteful - the result will be discarded anyway.
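
For example, with the requests library (the URL is illustrative; the read timeout sits just above a measured P99):

import requests

# (connect timeout, read timeout) in seconds
resp = requests.get("https://inventory.internal/stock/42", timeout=(0.5, 2.0))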

Chaos Engineering

Netflix coined the term and built Chaos Monkey, a tool that randomly terminates production instances during business hours. The premise: failures are going to happen anyway, so it is better to trigger them deliberately while engineers are at their desks than to be surprised by them at 3am.

Modern chaos engineering goes beyond instance termination: network latency injection, CPU pressure, disk fill, DNS failures, and dependency outages. Tools include Netflix’s Chaos Monkey/FIT, Gremlin, and LitmusChaos for Kubernetes. The process: define steady-state metrics, hypothesize that the system will maintain them under a given failure, inject the failure, measure, fix, repeat.

Graceful Degradation and Idempotency

When a dependency fails, returning a cached or partial result is usually better than returning an error. A product page that shows cached inventory levels (“in stock as of 5 minutes ago”) provides more value than a 503. Graceful degradation requires identifying which dependencies are on the critical path and which can be skipped or approximated.
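
A sketch of the fallback path (inventory_client and cache are hypothetical):

def stock_level(item_id):
    try:
        level = inventory_client.get(item_id, timeout=0.5)   # hypothetical client
        cache.set(item_id, level)
        return {"stock": level, "stale": False}
    except TimeoutError:
        cached = cache.get(item_id)
        if cached is not None:
            return {"stock": cached, "stale": True}          # "as of N minutes ago"
        raise                                                # nothing left to degrade to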

Idempotency is the property that applying an operation multiple times produces the same result as applying it once. Idempotent operations are safe to retry without side effects. POST requests that create resources are not naturally idempotent; adding an idempotency key (a client-generated UUID sent with the request, stored server-side) makes them so. Stripe’s API uses this pattern for payment creation.
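
A server-side sketch of the idempotency-key pattern (idempotency_store and charge_card are hypothetical):

def create_payment(request):
    key = request.headers["Idempotency-Key"]     # client-generated UUID
    saved = idempotency_store.get(key)           # durable store, keyed by UUID
    if saved is not None:
        return saved                             # retry: replay the stored response
    response = charge_card(request.body)         # the side effect runs once per key
    idempotency_store.put(key, response)
    return response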

Examples

Circuit breaker implementation sketch - a self-contained, count-based version (real libraries such as Resilience4j evaluate a failure rate over a sliding window):

import time

class CircuitOpenError(Exception): pass

class Circuit:
    def __init__(self, fail_threshold=5, reset_after=30.0):
        self.failures, self.fail_threshold = 0, fail_threshold
        self.reset_after, self.opened_at = reset_after, None
    def call(self, fn):
        if self.opened_at is not None:                 # open or half-open?
            if time.time() - self.opened_at < self.reset_after:
                raise CircuitOpenError("fast fail")    # open: no network call
            self.opened_at = None                      # half-open: let one probe through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.fail_threshold:
                self.opened_at = time.time()           # trip: start the recovery timer
            raise
        self.failures = 0                              # success closes the circuit
        return result

Retry with jitter:

import random, time

def retry_with_jitter(call, max_retries=5, base=1.0, cap=60.0):
    # TransientError stands in for whatever exception marks a retryable failure.
    for attempt in range(max_retries):
        try:
            return call()
        except TransientError:
            if attempt == max_retries - 1:
                raise                        # retries exhausted: surface the error
            wait = min(cap, base * 2**attempt) * random.uniform(0.5, 1.0)
            time.sleep(wait)

Chaos Monkey at Netflix: When Netflix moved to AWS, they ran Chaos Monkey in production from the start. Forcing engineers to design for instance failure - rather than hope it doesn’t happen - led to an architecture that survived the 2011 AWS US-East outage far better than other companies in the same region that lacked the same failure discipline.


Read Next: Message Queues & Event-Driven Architecture