Distributed Mutual Exclusion - One at a Time, Without a Referee // Megha Bose

Helpful context:

In a single machine, mutual exclusion is enforced by the operating system: a mutex, a semaphore, a spinlock - all implemented using atomic hardware instructions that can compare-and-swap a memory location. In a distributed system, there is no shared memory and no central OS. A process entering a critical section must somehow convince every other process in the network that it has exclusive access - without any entity having global visibility. The history of distributed mutual exclusion is a sequence of algorithms, each reducing message complexity or eliminating a bottleneck, while maintaining safety (at most one process in the CS at a time) and liveness (every request is eventually granted).

Each algorithm in this sequence responds to a concrete deficiency in the previous one. Lamport gave the first correct solution but sent one message too many. Ricart and Agrawala noticed that the extra message was redundant and eliminated it. Maekawa asked why every process needed to hear from every other process, and showed that a carefully chosen quorum of $\sqrt{n}$ sufficed. Token-based algorithms took a different approach entirely - instead of asking for permission, they pass a token, reducing the steady-state cost to near zero. Understanding why each design choice was made requires understanding the cost model it was optimizing against.

The correctness argument for all of these algorithms rests on the same underlying insight: two processes can only enter the critical section simultaneously if they somehow both receive permission without the other knowing. Every algorithm below prevents this by constructing an intersection property - either in the set of processes that must reply, or in the structure of a voting quorum, or in the uniqueness of a token. The differences are entirely in how efficiently that intersection is enforced.

The Problem

$n$ processes share a resource that only one may use at a time. A process wanting the resource enters a request phase, waits until it gets permission, enters the critical section (CS), then executes a release phase. Requirements:

Safety. At most one process is in the CS at any time.
Liveness (no starvation). Every request is eventually granted.
Fairness. Requests are granted in the order they were made (often required, sometimes weakened).

Metrics:

Message complexity. Messages sent per CS entry/exit.
Synchronization delay. Time from a process releasing the CS to the next process entering.
Response time. Time from request to CS entry.

Lamport’s Algorithm (1978)

Approach: Permission-based. Each process maintains a request queue sorted by Lamport timestamp. A process enters the CS only when its own request is at the front of the queue and it has received messages with higher timestamps from all other processes.

Protocol:

Request: $p_i$ increments its clock, sends REQUEST(ts, i) to all others, enqueues its own request.
Receive REQUEST: Enqueue the request, send REPLY(ts, i) to the requester.
Enter CS: $p_i$ enters when (a) its request is at the front of its queue, and (b) it has received a reply (or later message) from every other process.
Release: $p_i$ sends RELEASE(ts, i) to all; all processes dequeue $p_i$’s request.

Correctness. If $p_i$ and $p_j$ both want the CS, one has a smaller timestamp. The one with the smaller timestamp gets replies from all others first (since the one with the larger timestamp sends a REPLY, which puts the smaller timestamp first). So both cannot enter simultaneously.

Message complexity. $3(n-1)$ messages per CS entry: $(n-1)$ requests, $(n-1)$ replies, $(n-1)$ releases.

Ricart-Agrawala Algorithm (1981)

Key insight: In Lamport’s algorithm, the REPLY can be deferred until the receiver has finished using the CS. If $p_i$ has a lower timestamp than $p_j$, then when $p_j$ receives $p_i$’s request it sends a REPLY immediately (because $p_i$ has priority). If $p_j$ has a lower timestamp, it defers sending REPLY to $p_i$ until $p_j$ finishes the CS. This eliminates the RELEASE message entirely.

Protocol:

Request: $p_i$ broadcasts REQUEST(ts, i) to all others.
Receive REQUEST from $p_j$:
- If $p_i$ is not interested in the CS, or its own timestamp is larger than $p_j$’s: send REPLY immediately.
- Otherwise (own request has smaller timestamp, or equal timestamp with smaller ID): defer the reply (add $p_j$ to a deferred set).
Enter CS: $p_i$ enters when it has received REPLY from all $n-1$ others.
Release: Send REPLY to all deferred requesters; clear the deferred set.

Message complexity. $2(n-1)$ messages per CS entry: $(n-1)$ requests and $(n-1)$ replies. One message saved per entry/exit compared to Lamport.

Carvalho-Roucairol Optimization (1983)

Key insight: Once $p_i$ has received a REPLY from $p_j$, it holds a “permission token” from $p_j$. If $p_j$ has not sent a request since $p_i$ last entered the CS, $p_i$ doesn’t need to ask $p_j$ again.

Optimization: Keep a set $S$ of processes from which permission is held. On a new CS request, only send REQUEST to processes in $N \setminus S$ (those not already granted). When receiving a REQUEST from $p_j$ that $p_i$ is holding permission from, send the permission to $p_j$ and remove $p_j$ from $S$.

Result: In best case (no contention), 0 messages if all permissions are still held. In worst case (all permissions expired), $2(n-1)$ messages - same as RA. Average-case improvement depends on workload.

Maekawa’s Algorithm (1985)

Motivation. Ricart-Agrawala requires permission from all $n-1$ others. Maekawa asks: do we need everyone’s permission, or just a large enough subset?

Voting sets. Assign each process $p_i$ a voting set (quorum) $V_i \subseteq \{p_1, \ldots, p_n\}$ such that:

$p_i \in V_i$ for all $i$ (self-inclusion).
$V_i \cap V_j \neq \emptyset$ for all $i \neq j$ (any two quorums intersect - ensures at most one can have permission simultaneously).
$|V_i| = K$ for all $i$ (equal load).

Optimal quorum size: $K \approx \sqrt{n}$. With a grid organization of $\sqrt{n} \times \sqrt{n}$ processes, each row-and-column quorum has size $2\sqrt{n} - 1$.

Protocol:

Request: $p_i$ sends REQUEST to all $j \in V_i$.
Receive REQUEST from $p_j$: If $p_i$ has not voted, send VOTE to $p_j$ and record that it has voted. If already voted, enqueue the request.
Enter CS: $p_i$ enters when it has received VOTE from all $j \in V_i$.
Release: $p_i$ sends RELEASE to all $j \in V_i$; each $j$ dequeues the next request and sends VOTE to it.

Message complexity. $3\sqrt{n}$ messages per CS entry: $\sqrt{n}$ requests, $\sqrt{n}$ votes, $\sqrt{n}$ releases. Significantly better than $O(n)$ for large $n$.

FAILED and INQUIRE messages. Maekawa’s basic algorithm can deadlock (two processes each waiting for overlapping quorums). To prevent deadlock, two additional message types are used:

INQUIRE: sent by a voter to a process it has voted for, asking it to return the vote if it hasn’t yet entered the CS.
FAILED: sent by a voter to a process requesting its vote, when the voter has already voted for someone with a smaller timestamp and the voter cannot retract that vote.
RELINQUISH: the voted-for process returns the vote (sends RELINQUISH) if it hasn’t entered the CS and it learns that a lower-timestamp request is waiting.

Starvation. The basic INQUIRE/FAILED protocol does not guarantee starvation freedom. Fixes exist (using timestamps in the queue discipline), but add complexity.

Suzuki-Kazami Algorithm (1985) - Token Based

Approach. Token-based: a unique token circulates among processes. Only the holder of the token may enter the CS.

Data structures:

Each process $p_i$ maintains an array $RN_i[1..n]$ where $RN_i[j]$ is the largest sequence number received from $p_j$ in a REQUEST message.
The token contains: array $LN[1..n]$ where $LN[j]$ is the sequence number of the last CS entry by $p_j$; a queue $Q$ of processes waiting for the token.

Protocol:

Request: $p_i$ increments $RN_i[i]$ and broadcasts REQUEST(i, RN_i[i]).
Receive REQUEST(j, sn): Update $RN_i[j] \leftarrow \max(RN_i[j], sn)$. If $p_i$ holds the token and is not in CS, send token to $p_j$.
Enter CS: $p_i$ holds the token.
Release: $LN[i] \leftarrow RN_i[i]$. For each $p_j$ not in $Q$: if $RN_i[j] = LN[j] + 1$, add $p_j$ to $Q$. If $Q$ is non-empty, dequeue head $p_m$ and send token to $p_m$. Else hold token.

Message complexity. 0 messages if token already held; otherwise 1 REQUEST broadcast ($n-1$ messages) + 1 token transfer = $n$ messages total. But broadcasts are expensive; effective per-entry cost depends on contention.

Advantage. No message overhead when a process reuses the CS consecutively. No voting, no synchronization messages, no possibility of deadlock.

Raymond’s Algorithm (1989) - Tree Based

Approach. Token-based, organized as a dynamic spanning tree. The tree structure routes token requests efficiently.

Each process maintains:

HOLDER: the direction (neighbor) toward the current token holder.
ASKED: whether it has sent a REQUEST toward the token.
A queue of pending requests.

Protocol:

Request: $p_i$ enqueues itself. If ASKED is false and $p_i$ doesn’t hold the token: send REQUEST to HOLDER; set ASKED = true.
Receive REQUEST from $p_j$: Enqueue $p_j$. If token held and CS not occupied: send token to head of queue, update HOLDER. Else forward REQUEST toward token (HOLDER direction) if not already asked.
Receive TOKEN: Become HOLDER. If queue non-empty: send token to head of queue; update HOLDER; send REQUEST toward new token holder.

Message complexity. $O(\log n)$ per CS entry on a balanced tree - much better than $O(n)$ or $O(\sqrt{n})$ for large systems.

Limitation. Tree topology must be maintained and updated. Token loss (node failure) requires recovery.

Comparison

Algorithm	Type	Messages/CS	Fault tolerant	Starvation-free
Lamport	Permission	$3(n-1)$	No	Yes
Ricart-Agrawala	Permission	$2(n-1)$	No	Yes
Carvalho-Roucairol	Permission	$\leq 2(n-1)$	No	Yes
Maekawa	Voting	$3\sqrt{n}$	No	With fixes
Suzuki-Kazami	Token	$0$ to $n$	No (token loss)	Yes
Raymond	Token (tree)	$O(\log n)$	No	Yes

Read next: