Distributed Deadlock - When Systems Wait for Each Other Forever // Megha Bose

Helpful context:

In a distributed system, a process may wait for a message that will never arrive because the sender is waiting for a reply from a third process, which is waiting for the first. No central entity observes this cycle - each process sees only its own blocked state. Deadlock detection in distributed systems requires reconstructing a global view from local observations, and the moment you take those observations determines whether the result is accurate. This is harder than it sounds: the system keeps running while you observe it, and a deadlock you detect may have already resolved, or a real deadlock may be invisible if your observations are too spread out in time.

The difficulty is not just algorithmic - it is epistemic. A snapshot of a distributed system is not a photograph taken at one instant; it is a collage assembled from messages that arrive at different times. Two processes can disagree about what the current state is, and both can be right from their own perspective. Any deadlock detection algorithm must decide how to handle this gap between observation and reality, and the different algorithms make very different tradeoffs.

Termination detection is the same problem in a different guise. Instead of asking “is there a cycle of waiting processes,” we ask “has the computation finished.” Both require answering a global question without a global observer, and both must contend with the fact that a process that looks idle may be about to receive a message that makes it active again.

Deadlock Models

Resource deadlock. Processes hold some resources and wait for others. Represented by a Wait-For Graph (WFG): directed graph where $P_i \to P_j$ means $P_i$ is waiting for $P_j$ to release a resource. Deadlock exists iff the WFG contains a cycle.

Communication deadlock. Process $P_i$ is blocked waiting to receive from $P_j$, which is blocked waiting to receive from $P_k$, forming a cycle. No resources are held - only messages are awaited.

AND-model vs OR-model:

AND-model: $P_i$ is deadlocked only if it waits for all processes in its wait-set. A cycle in the WFG means deadlock.
OR-model: $P_i$ is deadlocked only if it waits for any one process in its wait-set, and all are blocked. Deadlock requires a knot in the WFG (a set $S$ such that every node in $S$ has all its outgoing edges pointing to nodes in $S$, with no escape).

Phantom Deadlocks

A phantom deadlock is a false positive: the deadlock detection algorithm reports a cycle, but by the time the detection message is processed, the deadlock has been resolved (a process received what it was waiting for and is no longer blocked).

Phantom deadlocks arise because the WFG is observed asynchronously. The detection algorithm sees a snapshot of the WFG, but that snapshot may not correspond to any consistent global state: one process’s state is recorded at time $T_1$ and another’s at time $T_2 > T_1$, and between $T_1$ and $T_2$ a message was delivered that broke the cycle.

Two-phase algorithms and diffusion-based algorithms reduce (but cannot eliminate under all models) phantom deadlocks by ensuring the observed state is consistent.

One-Phase vs Two-Phase Detection

One-phase: The detector takes a single snapshot of the WFG and checks for cycles. Fast but prone to phantom deadlocks - the snapshot may be inconsistent.

Two-phase:

Phase 1: Detector takes a snapshot, finds a candidate deadlock cycle.
Phase 2: Detector re-checks the same set of processes. If they are still in the same waiting configuration, the deadlock is real.

Two-phase reduces phantoms (a transient deadlock resolves between phases), but increases detection latency and is still not always correct under all models.

Centralized Deadlock Detection (Chandy-Misra)

Setting. A central controller periodically (or on demand) collects local WFG information from all processes and checks for cycles globally.

Protocol:

Each process $P_i$ maintains its local WFG edges (who it waits for).
When an edge is added or removed (a process starts or stops waiting), $P_i$ reports the change to the controller.
The controller maintains a global WFG and checks for cycles on each update.

Problems:

Controller is a single point of failure and a bottleneck.
The global WFG at the controller may be inconsistent: an edge removal from $P_i$ and an edge addition from $P_j$ may arrive out of order relative to actual execution.
Phantom deadlocks can occur.

When to use. Acceptable for lightly loaded systems where the controller overhead is manageable and false positives can be tolerated (they trigger rollback/resolution, which is usually safe).

Chandy-Misra-Haas Algorithm (1982) - Diffusion Computation

Setting. No central controller. Deadlock detection by diffusion computation: special probe messages propagate through the WFG, and a deadlock is confirmed if a probe returns to the initiator.

For AND-model (process waits for all in wait-set):

Data structures. Each process $P_i$ has: engaged[i] (whether it is currently engaged in a detection initiated by some $P_k$), num_replies_pending[i][k] (how many replies $P_i$ is waiting for, in a detection for $P_k$).

Protocol initiated by $P_k$ suspecting deadlock:

$P_k$ sends PROBE(k, k, k) to all processes in its wait-set (format: PROBE(initiator, sender, receiver)).
On receiving PROBE(k, i, j) at $P_j$:
- If $P_j$ is actively waiting: if $P_j$ was not yet engaged in detection for $P_k$: mark engaged[j][k], set num_replies_pending[j][k] = |wait-set of $P_j$|, send PROBE(k, j, m) to all $m$ in $P_j$’s wait-set.
- If PROBE(k, j, j) arrives at $P_k$ (probe returned to initiator): $P_k$ is deadlocked.

Correctness. A probe returning to $P_k$ means there is a sequence of waiting processes forming a cycle in the WFG at the time the probe traveled. All processes on the path were blocking when the probe passed through them. If the AND-model holds, the cycle is a deadlock.

Phantom deadlocks in CMH. A process on the probe’s path may have been released between the time the probe passed through and the time it completes the cycle. This produces a phantom. The differential technique (below) addresses this.

Singhal-Kshemkalyani Differential Technique

Problem with CMH. CMH sends information about all WFG edges whenever a process reports. Most of the time, only a few edges have changed. The full WFG transmission is wasteful.

Differential technique. Each process maintains:

SV[i]: the sequence vector - SV[i][j] is the sequence number of the last information process $P_i$ sent about process $P_j$.
RV[i]: the receive vector - RV[i][j] is the sequence number of the last information $P_i$ received about $P_j$.

When $P_i$ sends information to the controller (or in a diffusion computation), it sends only the changes since last time - the edges where $SV[i][j] > RV[\text{receiver}][j]$. The receiver updates $RV$ accordingly.

Why this was needed for CMH. CMH in its original form sends probes through the full WFG. In large systems with many processes and many WFG edges, each probe message carries excessive data. The differential technique reduces each message to carry only the edges that have changed since the last communication between those two processes - turning a $O(n)$-per-message overhead into an $O(\delta)$-per-message overhead where $\delta$ is the number of changed edges.

Mitchell-Merritt Algorithm (1984) - Single Resource

For the special case where each process holds at most one resource and waits for at most one resource (single-resource model), the WFG is a collection of simple chains and at most one cycle. Detection is simpler.

Algorithm:

Each edge in the WFG is labeled with a tag.
To detect: process $P_i$ (at the tail of a wait-for chain) propagates a query forward along the chain.
Each process receiving a query forwards it, appending its own label.
If the query returns to the initiator, there is a cycle.

Message complexity. $O(d)$ messages where $d$ is the length of the deadlock chain. Much cheaper than general WFG algorithms for the single-resource case.

Limitation. Only correct under the single-resource model. Does not generalize to AND-model with multiple resources.

Termination Detection

Problem. A distributed computation is terminated when all processes are idle and no messages are in transit. How does an observer know when this has happened?

Why it’s hard. A process that appears idle may be about to receive a message that makes it active again. The observer cannot tell idle-with-pending-message from genuinely-terminated.

Dijkstra-Scholten Algorithm (Token-Based)

Model. One initiator process; all activity derives from it. Tree structure tracks which processes were activated by which.

Protocol:

The initiator spawns a computation tree: processes activated by the initiator are its children; processes they activate are their children, etc.
When a process becomes idle and all processes it activated are also idle (it has no children left), it sends a signal to its parent in the computation tree.
When the initiator receives signals from all its children and is itself idle: the computation is terminated.

Message complexity. At most 2 signal messages per message sent in the computation (one credit sent with each message, one returned on termination). No global state collection needed.

Permission-Based Termination Detection

Safra’s algorithm (1987). Processes are arranged in a logical ring. A token circulates. When a process passes the token: if it has been active since it last passed the token, it “colors” the token black (tainted). Otherwise it passes it white. The initiator detects termination only when a white token completes a full ring traversal while the initiator is itself idle and the token arrived white.

Invariant. A white token completing a full ring means no process became active after the token passed through it - which means all processes are idle and will remain idle (no pending messages can re-activate them).

Summary

Algorithm	Model	Approach	Key property
Centralized (Chandy-Misra)	General	Global WFG at controller	Simple; bottleneck; phantom-prone
CMH (diffusion)	AND-model	Probe propagation through WFG	Decentralized; phantom possible
Singhal-Kshemkalyani	AND-model	Differential WFG updates	Reduces message overhead to changed edges only
Mitchell-Merritt	Single resource	Tag-labeled query chains	$O(d)$ messages; limited to single resource
Dijkstra-Scholten	Termination	Computation tree signals	Initiator-based; $O(1)$ overhead per message
Safra (ring token)	Termination	Colored token on ring	Permission-based; no central coordinator

Read next: