Helpful context:


Physical clocks can be synchronized to within milliseconds, but milliseconds are an eternity in a system where CPUs execute billions of instructions per second. More fundamentally, physical time cannot answer the question that matters most in a distributed system: did event $A$ cause event $B$? Two events can have the same physical timestamp and be causally unrelated, or have timestamps 10ms apart and be causally linked through a chain of messages. Lamport’s 1978 paper introduced logical time: a way to assign timestamps to events that respects causality exactly, without any reference to physical clocks. The insight - that a message is a causal link, and clocks should respect causal links - launched a field.

The logical clock framework separates two concerns that physical time conflates: the order in which things happened, and the question of which things influenced which. Causality is the more fundamental notion. Once you have a crisp model of causality, you can reason about distributed algorithms precisely: detecting deadlock, taking consistent snapshots, ordering updates. None of these require knowing what time it is. They require knowing what happened before what.

Snapshotting raises its own subtleties. A global state recorded naively - each process writing down its own state at whatever physical moment the snapshot command arrives - can be inconsistent: it might record a process having received a message that, in the snapshot, was never sent. The Chandy-Lamport algorithm solves this with a beautiful use of markers. The global snapshot problem and the logical clock problem are deeply linked: both reduce to capturing a consistent cut of the partial order defined by causality.


Events, Processes, and Happens-Before

A distributed program is modeled as a collection of sequential processes $p_1, p_2, \ldots, p_n$ that communicate only by sending and receiving messages (no shared memory). Each process executes a sequence of events: local computation steps, send events, and receive events.

Lamport’s happens-before relation $\to$ is the smallest relation satisfying:

  1. If $a$ and $b$ are events in the same process and $a$ occurs before $b$, then $a \to b$.
  2. If $a$ is the send event of a message and $b$ is the corresponding receive, then $a \to b$.
  3. Transitivity: if $a \to b$ and $b \to c$, then $a \to c$.

If neither $a \to b$ nor $b \to a$, then $a$ and $b$ are concurrent ($a \parallel b$): they have no causal relationship.

The happens-before relation defines a partial order on events. It is the formal model of causality in distributed systems.


Scalar (Lamport) Clocks

Goal. Assign a non-negative integer $C(e)$ to each event $e$ such that $a \to b \Rightarrow C(a) < C(b)$. (The converse need not hold - Lamport clocks do not capture concurrency.)

Rules:

  • Each process $p_i$ maintains a counter $LC_i$, initialized to 0.
  • Before executing any event: $LC_i \leftarrow LC_i + 1$.
  • When $p_i$ sends a message: piggyback the current $LC_i$ as timestamp $ts$.
  • When $p_i$ receives a message with timestamp $ts$: $LC_i \leftarrow \max(LC_i, ts) + 1$.

Theorem. If $a \to b$ then $LC(a) < LC(b)$.

Proof. By structural induction on the definition of $\to$:

  • Same process: counter increments before each event, so earlier events have smaller timestamps.
  • Send-receive: receiver increments to $\max(\cdot) + 1$, which is strictly greater than the sender’s timestamp.
  • Transitivity: composition of strict inequalities is a strict inequality. $\square$

The converse fails. $LC(a) < LC(b)$ does not imply $a \to b$. Lamport clocks assign totally ordered timestamps (breaking ties by process ID), imposing a total order consistent with $\to$ but potentially relating concurrent events.

Strongly consistent clocks. A clock assignment is strongly consistent if $a \to b \Leftrightarrow C(a) < C(b)$. Lamport clocks are only weakly consistent ($\Rightarrow$ but not $\Leftarrow$). Vector clocks achieve strong consistency.


Vector Clocks

Goal. Assign timestamps such that $a \to b \Leftrightarrow VC(a) < VC(b)$, where $<$ is the componentwise partial order.

Construction (Fidge 1988, Mattern 1988). Each process $p_i$ maintains a vector $VC_i[1..n]$, initialized to all zeros.

Rules:

  • Before $p_i$ executes a local event: $VC_i[i] \leftarrow VC_i[i] + 1$.
  • When $p_i$ sends a message: increment $VC_i[i]$, piggyback the full vector $VC_i$.
  • When $p_i$ receives a message with timestamp $ts[1..n]$: $VC_i[j] \leftarrow \max(VC_i[j], ts[j])$ for all $j$; then $VC_i[i] \leftarrow VC_i[i] + 1$.

Comparison. For two timestamps $V, W$:

  • $V = W$ iff $V[i] = W[i]$ for all $i$.
  • $V \leq W$ iff $V[i] \leq W[i]$ for all $i$.
  • $V < W$ iff $V \leq W$ and $V \neq W$.
  • $V \parallel W$ (concurrent) iff neither $V \leq W$ nor $W \leq V$.

Theorem. $a \to b \Leftrightarrow VC(a) < VC(b)$.

$VC_i[j]$ represents the number of events process $p_i$ knows process $p_j$ has executed. A vector timestamp thus encodes the “knowledge frontier” of a process at the time of an event.

Cost. Vector clocks require $O(n)$ space and $O(n)$ communication overhead per message. For large $n$, this is expensive.


Matrix Clocks

Goal. Each process tracks not just what it knows about others, but what it knows about what others know. This enables garbage collection of message logs: process $p_i$ can discard a message once it knows all other processes have received it.

Construction. Each process $p_i$ maintains an $n \times n$ matrix $MC_i$ where:

  • $MC_i[i][j]$: what $p_i$ believes $p_j$ knows about $p_j$’s own events.
  • $MC_i[k][j]$: what $p_i$ believes $p_k$ knows about $p_j$.

Rules:

  • On a local event: $MC_i[i][i] \leftarrow MC_i[i][i] + 1$.
  • On send to $p_j$: piggyback the full matrix $MC_i$; increment $MC_i[i][i]$ and $MC_i[i][j]$ appropriately.
  • On receive from $p_k$ with matrix $M$: $MC_i[l][j] \leftarrow \max(MC_i[l][j], M[l][j])$ for all $l, j$; update own row.

Application. The minimum entry in column $j$ of $MC_i$ (i.e., $\min_k MC_i[k][j]$) is the number of events of $p_j$ that every process in the system has heard about. Messages older than this can be garbage-collected. Matrix clocks trade $O(n^2)$ overhead for the ability to reason about global knowledge.


Message Ordering

Different applications need different guarantees on message delivery order.

FIFO ordering. Messages from a single sender to a single receiver are delivered in the order sent. Easy to achieve with sequence numbers; TCP provides this automatically.

Causal ordering. If the sending of $m_1$ causally precedes the sending of $m_2$, then any process that delivers $m_2$ has already delivered $m_1$. Causally ordered delivery ensures that effects do not precede causes. Implemented by attaching vector clocks to messages and buffering messages until all causally prior messages have been delivered.

Total order broadcast (atomic broadcast). All processes deliver all messages in the same order (even messages from different senders). Much harder - requires consensus. Total order broadcast and consensus are computationally equivalent (each can be implemented using the other). ISIS’s CBCAST and ABCAST, Zookeeper’s Zab, and Kafka’s replication all implement variants.

Formal hierarchy: Total order $\Rightarrow$ Causal order $\Rightarrow$ FIFO order $\Rightarrow$ No ordering guarantee.


Consistent Cuts and Global State

A global state is a tuple $(LS_1, \ldots, LS_n, SC_{12}, SC_{13}, \ldots)$ where $LS_i$ is the local state of process $p_i$ and $SC_{ij}$ is the set of messages in transit from $p_i$ to $p_j$.

A consistent cut (or consistent global snapshot) is a set of events $C$ - one prefix per process - such that: if $e \in C$ and $e' \to e$, then $e' \in C$. No event in the cut causally depends on an event outside the cut. Equivalently: if a receive event is in the cut, the corresponding send event is also in the cut.

Inconsistent cut example. Process A sends $m$ to B. The cut includes B receiving $m$ but not A sending $m$. This means a message was received that was never sent - physically impossible, but a snapshot could capture this if done naively.

A consistent cut corresponds to a reachable global state - one the system could actually have been in.


Chandy-Lamport Snapshot Algorithm

Goal. Record a consistent global snapshot of a running distributed system without stopping it.

Assumptions. FIFO channels (messages arrive in order sent on each channel). At least one initiator. No messages lost.

Protocol:

  1. Any process $p_i$ can initiate by recording its own local state $LS_i$ and sending a marker on all outgoing channels.
  2. When process $p_j$ receives a marker on channel $C_{ij}$ for the first time:
    • Record local state $LS_j$.
    • Send a marker on all outgoing channels.
    • Start recording all messages arriving on all other incoming channels (not $C_{ij}$, since the marker separates “before” from “after”).
  3. When $p_j$ receives a marker on channel $C_{kj}$ after already having recorded its state:
    • Stop recording messages on $C_{kj}$. The recorded messages are the channel state $SC_{kj}$.

Correctness. The marker acts as a “snapshot now” signal that propagates through the system. By the FIFO assumption, any message sent before the marker arrives before the marker; any message sent after arrives after. The recorded local states and in-transit messages form a consistent cut.

Termination. In a strongly connected graph with FIFO channels, every process records its state within $O(d)$ message delays where $d$ is the diameter.

Applications. Detecting stable properties: deadlock, termination, garbage (properties that remain true once they become true). Not for non-stable properties - a snapshot showing no deadlock does not imply the system is currently deadlock-free.


Snapshots in Non-FIFO Channels (Lai-Yang)

With non-FIFO channels, the marker may arrive before messages sent before it. The Chandy-Lamport protocol breaks.

Lai-Yang algorithm (1987). Uses a coloring scheme instead of explicit markers:

  • Initially all messages are white.
  • When a process records its snapshot, it turns red: all subsequent messages it sends are red.
  • Red messages carry a flag; white messages do not.
  • A process records its snapshot when it sends or receives its first red message.
  • The state of channel $C_{ij}$ is all white messages received by $p_j$ after $p_j$ turns red but sent by $p_i$ before $p_i$ turned red.

To find these messages: each process $p_i$ appends to every red message it sends the count of white messages it has sent on each outgoing channel. Process $p_j$ collects the state of $C_{ij}$ as the white messages from $p_i$ it has received since turning red, up to the declared count.

No FIFO assumption is needed. The price is that each message carries additional state (the white message counts).


Summary

Mechanism What it captures Overhead Converse holds?
Lamport clock $a \to b \Rightarrow C(a) < C(b)$ $O(1)$ per message No
Vector clock $a \to b \Leftrightarrow VC(a) < VC(b)$ $O(n)$ per message Yes
Matrix clock Vector + knowledge-of-knowledge $O(n^2)$ per message Yes
Chandy-Lamport Consistent global snapshot FIFO channels required -
Lai-Yang Consistent global snapshot Non-FIFO safe; extra per-message state -

Read next: