Time in Distributed Systems - Why Clocks Disagree and What to Do About It
Helpful context:
- Networking - How Packets Find Their Way Across the Internet
- Operating Systems - The Software That Runs All Other Software
In a single-computer program, time is unambiguous: the operating system’s clock tells you when things happened, and instructions execute in a well-defined order. In a distributed system, neither of these holds. Clocks on different machines drift apart - even atomic clocks disagree after enough time - and there is no shared memory through which processes can observe each other’s state. A message sent at time $t$ on machine A might arrive before or after an event that happened at time $t$ on machine B, with no principled way to compare the two. Getting time right in distributed systems is not an engineering detail: it determines whether you can reason about the order of events, recover from failures, or even detect whether a failure occurred.
The difficulty is not merely practical. It is fundamental. Einstein’s special relativity tells us there is no universal notion of simultaneity - two events that appear simultaneous in one reference frame may not be in another. Distributed systems face an earthbound version of this problem: without a shared clock, “before” and “after” are not globally defined. The field has responded with two distinct strategies. Physical clock synchronization attempts to make all clocks agree on the same wall-clock time within some bounded error. Logical clocks abandon wall time entirely and instead track only the causal relationships that actually matter for correctness.
This post covers the physical side: how clocks drift, what it means to synchronize them, and the algorithms - Cristian’s, Berkeley, decentralized averaging, and NTP - that actually do it at scale. Understanding each algorithm requires understanding why the previous one was insufficient. Each one addresses a specific limitation - single point of failure, unavailability of a reference clock, internet-scale deployment - and the progression reveals what it actually costs to agree on time.
Why Physical Time Is Hard
A clock is a hardware oscillator counting cycles. No oscillator runs at exactly the right frequency. Clock drift is the rate at which a clock deviates from perfect time; typical computer clocks drift by 10-100 ppm (parts per million), meaning up to 8.6 seconds per day. Clock skew is the instantaneous difference between two clocks at a given real time.
For distributed applications, skew matters more than absolute accuracy. If process A sends a message with timestamp $t_A$ and process B receives it with timestamp $t_B < t_A$, the message appears to arrive before it was sent - a causality violation. File systems, databases, and distributed locks all break if timestamps cannot be trusted.
The goal of clock synchronization is to bound skew across all machines: ensure that for any two machines $i, j$ at real time $t$, $|C_i(t) - C_j(t)| < \delta$ for some acceptable $\delta$.
Cristian’s Algorithm
Setting. A client wants to synchronize its clock to a time server $S$ that has access to an accurate time source (GPS, atomic clock). The round-trip delay to $S$ introduces uncertainty.
Protocol (1989):
- At client time $T_0$, client sends a request to $S$.
- Server receives, reads its time, responds with the current time $T_s$.
- Client receives at client time $T_1$.
The round-trip time is $RTT = T_1 - T_0$. Assuming symmetric delays, the server’s time when the response arrives is approximately $T_s + RTT/2$. The client sets: $$C \leftarrow T_s + \frac{T_1 - T_0}{2}$$
Accuracy. If the minimum one-way delay is $d_{\min}$ (the time for network processing alone, excluding queuing), the true server time at the moment the client sets its clock lies in the interval: $$\left[T_s + d_{\min},\ T_s + (RTT - d_{\min})\right]$$ The error is at most $RTT/2 - d_{\min}$. For accuracy, $RTT$ should be small and $d_{\min}$ should be close to $RTT/2$ (symmetric delay).
Limitations: Single point of failure (the server). The server’s measurement of its own time has non-zero latency. Multiple samples can be averaged to improve accuracy.
Berkeley Algorithm
Setting. No machine has a reliable reference clock. We want all machines to agree on a common time, even if that time drifts from true UTC.
Protocol (Gusella-Zatti, 1989):
- A designated master (or coordinator) polls all slaves for their current clock values.
- Each slave responds with its local time.
- The master computes the offset of each slave from its own clock, discards outliers (clocks that differ too much - likely faulty), and computes the average offset of the remaining clocks relative to its own.
- The master sends each slave a correction: the amount by which it should adjust its clock (a signed delta, not an absolute time - this avoids revealing the master’s time to adversaries and allows each slave to apply the adjustment gradually).
- The master also adjusts its own clock by the average offset.
Why averaging. No single machine is trusted to have the correct time. The average of many clocks cancels out individual drifts in expectation - the group converges on a common time. This is appropriate for closed systems (local networks) where external reference is unavailable or unnecessary.
Limitation. A single master is a failure point. If the master fails, a new one must be elected (requires a separate election protocol).
Decentralized Averaging
Setting. No designated master. All peers want to converge to the same average time without any node playing a coordinator role.
Protocol: Each node periodically selects a random neighbor, exchanges clock values, and both nodes adjust to the average: $$C_i \leftarrow \frac{C_i + C_j}{2}, \quad C_j \leftarrow \frac{C_i^{\text{old}} + C_j}{2}$$
Repeated over many rounds, all nodes converge to the average of the initial clocks (the average is a conserved quantity). Gossip-based averaging protocols converge in $O(\log n)$ rounds with high probability.
Tradeoff. Fully decentralized and fault-tolerant (no single point of failure). Slower convergence than master-based schemes. Cannot synchronize to an external reference.
NTP: Network Time Protocol
NTP is the internet-scale solution. It provides clock synchronization to UTC within milliseconds over the public internet, and within tens of microseconds over a local network.
Stratum hierarchy. Time servers are organized in layers:
- Stratum 0: reference clocks (GPS receivers, atomic clocks). Not directly on the network.
- Stratum 1: servers directly connected to stratum 0 sources.
- Stratum 2: servers synchronized to stratum 1. And so on up to stratum 15 (stratum 16 = unsynchronized).
Clients synchronize to multiple servers and use statistical filtering to select the best sources and reject outliers (the intersection algorithm finds the set of servers whose offset estimates are mutually consistent).
NTP timestamp exchange. Same basic idea as Cristian’s: measure round-trip time, estimate one-way delay. But NTP uses four timestamps per exchange:
- $T_1$: client sends request (client clock)
- $T_2$: server receives request (server clock)
- $T_3$: server sends response (server clock)
- $T_4$: client receives response (client clock)
The clock offset estimate is: $$\theta = \frac{(T_2 - T_1) + (T_3 - T_4)}{2}$$ The round-trip delay is $\delta = (T_4 - T_1) - (T_3 - T_2)$.
Clock discipline. NTP never steps the clock backward (which would break applications). Instead, it slews the clock: speeds it up or slows it down by at most 500 ppm until the offset is corrected. This prevents time from ever going backward.
Accuracy. NTP over the internet: $\pm 10$ milliseconds typical. Over local LAN: $< 1$ millisecond. With GPS + PPS (pulse-per-second signal): $< 1$ microsecond.
Limitation of physical clocks. Even with perfect physical clock synchronization, the fundamental problem remains: message delays are variable, so you cannot determine the order of events at different machines purely from timestamps. If two events happen within the synchronization error window, their physical timestamps are unreliable for ordering. This motivates logical time.
Summary
| Algorithm | Setting | Who has reference | Accuracy |
|---|---|---|---|
| Cristian’s | Client-server | Server (GPS/atomic) | $RTT/2 - d_{\min}$ |
| Berkeley | LAN, no reference | Master averages peers | Relative consistency |
| Decentralized averaging | Fully peer-to-peer | None | Average convergence |
| NTP | Internet-scale | Stratum 0 (GPS) | $\pm 10$ ms internet, $< 1$ ms LAN |
Read next: