Helpful context:


Distributed systems have a fundamental problem: if you want a piece of data to survive node failures, you replicate it across multiple machines. But now you have multiple copies of the data, and they can disagree. If two nodes each think they are the authoritative writer, you get split-brain: conflicting updates, corrupted state, chaos. The question “how do distributed nodes reach agreement?” has occupied computer scientists since the 1980s. The answer is consensus algorithms. Paxos was the first complete answer. Raft was the readable one.

The Problem: Agreement in the Presence of Failure

Imagine three servers each holding a copy of a counter. A client increments the counter. You want all three servers to agree on the new value, even if one server is temporarily unreachable.

The naive approach is to write to all three and wait for all three to confirm. But if one server is down, you wait forever. So you write to two out of three and accept that as success - a majority. But now you have a timing problem: what if two clients write simultaneously? What if the network partitions and you have two groups each thinking they are the majority?

A consensus algorithm guarantees that all nodes agree on the same sequence of values, even in the presence of network delays, partitions, and node failures - as long as a majority of nodes are functioning and reachable to each other.

The formal requirements are:

  • Safety: all nodes that decide on a value decide on the same value. No two nodes reach different conclusions.
  • Liveness: if a majority of nodes are healthy, the system eventually makes progress. It does not stall forever.
  • Durability: once a value is agreed upon, it stays agreed upon even if some nodes crash.

Paxos: Correct, Unimplementable

Leslie Lamport published Paxos in 1989 (though the paper was famously difficult to get published and did not appear in print until 1998). Paxos is provably correct. It is also notoriously difficult to understand and even more difficult to implement correctly. Lamport himself wrote that “there is only one feasible protocol for implementing a fault-tolerant distributed service: the multi-decree parliament” - i.e., Paxos - and then acknowledged in later papers that nobody seemed able to implement it without introducing subtle bugs.

Google’s Chubby lock service, a foundational piece of the Google infrastructure stack, is built on Paxos. The team that built it wrote that despite the algorithm being well-studied, “Paxos' opaque description makes it difficult to build correct implementations.” Amazon’s DynamoDB, etcd, CockroachDB, and most production consensus systems ended up implementing Paxos variants with significant undocumented extensions.

This is the problem Raft was designed to solve.

Raft: Understandability as a Design Goal

Diego Ongaro and John Ousterhout published “In Search of an Understandable Consensus Algorithm” (2014) - the Raft paper. The explicit design goal, stated in the first sentence, was understandability. They decomposed the consensus problem into three semi-independent subproblems: leader election, log replication, and safety. Each has a clean specification.

They ran user studies comparing Raft and Paxos on graduate students with no prior exposure to either. Raft consistently produced better comprehension. This is the paper that made consensus practical: the algorithm is complete enough to implement from the paper alone, without undocumented extensions.

Core Concept: The Replicated Log

Raft models distributed consensus as maintaining a replicated log: an ordered sequence of commands that all nodes agree on, in the same order. Each command is a log entry. Once an entry is committed (agreed upon by a majority), it is applied to the state machine - the actual data the system is managing.

The state machine can be anything: a key-value store, a lock table, a counter, a database. The consensus layer only cares about ordering entries in the log; the application layer cares about what the entries mean.

If every node starts in the same initial state and applies the same log entries in the same order, every node ends in the same state. Consensus on the log gives you consistency of the state machine.

Leader Election

Raft uses a strong leader: all writes go through one designated leader. The leader receives client requests, appends them to its log, and replicates them to followers. This simplifies the algorithm enormously compared to Paxos, where any node can propose values.

Time in Raft is divided into terms, numbered sequentially. Each term begins with an election. If a follower does not hear from a leader for a while (the election timeout, randomized between 150ms and 300ms in typical configurations), it assumes the leader has failed, increments the term number, becomes a candidate, and requests votes from other nodes.

A node grants its vote if: the candidate’s term is at least as large as the voter’s current term, and the candidate’s log is at least as up-to-date as the voter’s log (defined precisely). A candidate that receives votes from a majority of nodes becomes the new leader for that term.

The randomized election timeout is why split votes are rare: if two nodes simultaneously decide to start an election, their random timeouts will usually differ by enough that one gets votes before the other even asks. In the rare case of a tie, both timeout again (with new random delays) and retry.

One leader per term is a fundamental safety invariant. Because a node only votes once per term, and majority vote is required, it is impossible for two different nodes to both win an election in the same term.

Log Replication

Once elected, the leader accepts client requests. Each request becomes a log entry appended to the leader’s log, tagged with the current term. The leader sends AppendEntries RPCs to all followers to replicate the entry.

When a majority of nodes (including the leader) have written the entry to their logs, the entry is committed. The leader applies it to its state machine and returns success to the client. The leader then notifies followers in subsequent AppendEntries messages that the entry is committed; followers apply it to their state machines.

The client never directly interacts with followers. All reads and writes go to the leader. Followers exist only to preserve durability (the entry is safe on multiple machines) and to be ready to take over if the leader fails.

What if a follower is slow or temporarily disconnected? The leader keeps track of how far each follower’s log extends. It retries AppendEntries indefinitely for slow followers. When a follower reconnects, it receives all the entries it missed, in order. The log is eventually consistent across all nodes that are reachable.

Safety: Why Stale Leaders Cannot Corrupt State

The critical safety property: if an entry is committed, no future leader can overwrite it with a different value.

This is guaranteed by the log up-to-date check in leader election. A candidate must have a log at least as up-to-date as any node it needs a vote from (up-to-date means: higher term on the last entry, or same term and longer log). Any node that voted for a committed entry will only vote for candidates that have that entry. Since a majority is needed to commit and a majority is needed to elect, the winning candidate necessarily intersects with the committed-entry-holders. It therefore has the committed entry.

This is the key insight that makes Raft safe: the quorum for committing and the quorum for electing always overlap by at least one node.

Network Partitions

What happens when the network splits into two groups?

If the leader is in the minority partition, it can no longer replicate entries to a majority of nodes. It continues running but cannot commit new entries. The majority partition elects a new leader with a higher term. The old leader, when the partition heals, discovers the higher term in an AppendEntries response and immediately steps down, becoming a follower and catching up its log.

Any entries the old leader “committed” (which, by definition, it could not have, since it lacked a majority) would not actually be durable. Entries it accepted from clients but could not commit are overwritten by the new leader’s log. Raft handles this by not returning success to the client until the entry is actually committed by a majority.

If the client retried during the partition, the entry may be submitted twice. This is why operations on a Raft-backed system should be idempotent, or the system should deduplicate by client-provided request IDs.

Raft in Production

etcd is the most prominent Raft implementation. It is the distributed key-value store that Kubernetes uses as its database - storing all cluster state (which pods are running, which nodes exist, which services are registered). etcd’s correctness is critical: a bug in its consensus layer could corrupt the entire cluster state.

CockroachDB uses Raft per-range: the database is partitioned into ranges (typically 64MB each), and each range has its own Raft group. This allows thousands of Raft instances to run in parallel across a cluster, each handling consensus for its slice of the keyspace.

TiKV (the storage engine of TiDB) uses multi-Raft, the same architecture: one Raft group per region.

Consul (HashiCorp’s service mesh and key-value store) uses Raft for its catalog and key-value state.

The pattern: Raft is used wherever you need a small number of servers to agree on a sequence of values with strong consistency. It is not designed for large clusters (the cost of a round of AppendEntries RPCs scales with the number of nodes). Most production systems run Raft groups of 3 or 5 nodes - enough to tolerate 1 or 2 node failures respectively.

Raft vs Paxos vs Multi-Paxos

Paxos as described by Lamport handles agreement on a single value. Multi-Paxos extends this to a sequence of values (a log) by electing a “distinguished proposer” who acts as the de-facto leader and bypasses the first phase of Paxos for subsequent entries. This is structurally similar to Raft.

The practical differences are subtle: Raft specifies leader election, log compaction (snapshots), and membership changes explicitly. Multi-Paxos leaves these as implementation details. Raft’s prescriptiveness is its advantage: you can implement it from the paper. Paxos variants require each implementer to independently solve the same underdocumented problems.

Whether you call it Raft or Multi-Paxos, the substance is similar: one leader per term, log replication to a majority, safety guaranteed by quorum intersection. The theoretical foundations are the same. The practical difference is that Raft came with a complete specification, reference implementations, and TLA+ formal proofs. That is why the systems built in the 2010s used Raft.

Read next: