Lock-Free Data Structures
Why Not Just Use a Mutex?
Mutexes seem like the obvious answer to concurrent access - lock, modify, unlock. The problem is that this convenience carries real costs that compound badly at scale.
Mutex overhead is not trivial. An uncontended futex-based lock on Linux is cheap - a single atomic instruction in userspace, no syscall - but a contended lock falls back to a futex syscall and kernel scheduling, which means context switches, cache-line invalidation across cores, and time spent in the kernel's wait queue. For a hot data structure hammered by tens of threads, this contention overhead can become the dominant bottleneck.
Priority inversion is a more subtle failure mode. A low-priority thread holds a lock; a high-priority thread blocks waiting for it; a medium-priority thread runs freely while the high-priority thread starves. The Mars Pathfinder lander famously experienced this in 1997, triggering repeated watchdog resets until engineers enabled priority inheritance via a software patch uploaded from Earth.
Deadlock is the classic hazard: thread A holds lock 1 and waits for lock 2, thread B holds lock 2 and waits for lock 1. Neither can proceed. Lock ordering disciplines and lock hierarchies mitigate this, but as systems grow, maintaining these invariants becomes increasingly fragile.
Lock-free data structures sidestep these problems by relying on hardware atomic primitives instead of kernel-mediated synchronization.
Progress Guarantees
Not all “lock-free” code is equal. There’s a precise hierarchy:
Blocking (mutex-based): if a thread is paused, other threads may be unable to make progress indefinitely. A thread holding a mutex can be preempted, and every other thread attempting to acquire that mutex is stuck.
Lock-free: the system as a whole always makes progress - at least one thread completes an operation in a finite number of steps. Individual threads may still starve under adversarial scheduling, but the data structure can never be globally stuck. This is the most common guarantee in practice.
Wait-free: every individual thread completes its operation in a bounded number of steps, regardless of what other threads do. This is the strongest guarantee and also the hardest to achieve. Wait-free queues exist but are considerably more complex to implement.
Compare-And-Swap: The Foundation
The primitive that makes lock-free programming possible is Compare-And-Swap (CAS):
bool CAS(int *addr, int expected, int new_val) {
    if (*addr == expected) {
        *addr = new_val;
        return true;
    }
    return false;
}
This executes atomically on the hardware - no other thread can observe a partial state. On x86, this is the LOCK CMPXCHG instruction. In C++, it appears as std::atomic<T>::compare_exchange_weak and compare_exchange_strong.
The pattern is: read a value, compute a new value based on it, then CAS the new value in only if the old value hasn’t changed. If the CAS fails, retry. This is an optimistic concurrency approach: assume no conflict, detect and retry if there was one.
#include <atomic>

std::atomic<int> counter{0};

void increment() {
    int old_val = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(
            old_val, old_val + 1,
            std::memory_order_release,
            std::memory_order_relaxed)) {
        // old_val is updated on failure - retry with current value
    }
}
compare_exchange_weak may spuriously fail (on architectures like ARM that use load-linked/store-conditional), which is why it’s used in a loop. compare_exchange_strong never spuriously fails but may be slower.
The ABA Problem
CAS introduces a subtle correctness hazard. Suppose thread T1 reads a pointer value A. Before T1 acts, thread T2 changes A to B and then back to A - perhaps by allocating a new node at the same address after freeing the old one. T1’s CAS succeeds because the value still looks like A, but the memory it read no longer refers to the same logical state.
Example scenario with a lock-free stack:
- Stack top points to node A (value 1). A.next = B (value 2).
- T1 reads top = A and is preempted.
- T2 pops A, pops B, pushes A back (same pointer, recycled). Stack: A (value 1), A.next = nullptr.
- T1 resumes, CAS(top, A, B) succeeds - but B has been freed. Now top points to freed memory.
Fix 1: Tagged pointers / version counter. There are two variants. On 64-bit systems with 48-bit virtual addressing, the upper 16 bits of a pointer are unused, so a small version counter can be packed there and the CAS stays 64 bits wide. Alternatively, pair the full pointer with a separate 64-bit counter and CAS the 128-bit pair using a double-width CAS (cmpxchg16b on x86). Either way, the counter is incremented on every update: even if the pointer value recycles, the counter will differ, so the stale CAS fails.
struct TaggedPointer {
    Node*    ptr;
    uint64_t tag;   // incremented on every update
};

std::atomic<TaggedPointer> head;  // 16 bytes: needs double-width CAS
Fix 2: Hazard pointers. Before accessing a node, a thread publishes a “hazard pointer” - a global declaration that this address must not be freed. Memory reclamation scans all hazard pointers before freeing. This adds overhead but is correct without requiring 128-bit CAS. Folly’s hazptr is a production implementation of this scheme.
Fix 3: Epoch-based reclamation. Threads announce which epoch they’re in. Memory is only freed when no thread is in an epoch that could have a reference to that memory. Used by production implementations such as Rust’s crossbeam-epoch.
The Treiber Stack
The simplest useful lock-free data structure is the Treiber stack (1986). It maintains a singly-linked list with atomic operations on the head pointer.
struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> head{nullptr};

void push(int val) {
    Node* new_node = new Node{val, nullptr};
    Node* old_head = head.load(std::memory_order_relaxed);
    do {
        new_node->next = old_head;
    } while (!head.compare_exchange_weak(
        old_head, new_node,
        std::memory_order_release,
        std::memory_order_relaxed));
}
#include <optional>

std::optional<int> pop() {
    Node* old_head = head.load(std::memory_order_acquire);
    while (old_head != nullptr) {
        Node* new_head = old_head->next;
        if (head.compare_exchange_weak(
                old_head, new_head,
                std::memory_order_acquire,
                std::memory_order_relaxed)) {
            int val = old_head->value;
            // DANGER: must use hazard pointers or SMR before delete
            delete old_head;
            return val;
        }
    }
    return std::nullopt;
}
Push: create a new node, set its next to the current head, CAS the head to the new node. If another thread pushed first, retry. Pop: read the head, read head->next, CAS head to head->next. The memory reclamation on pop is the hard part - in production code you would not delete immediately.
The Michael-Scott Queue
A lock-free FIFO queue requires two CAS operations per enqueue: one to link the new node, one to advance the tail pointer. The Michael-Scott queue (1996) handles this with a two-phase approach.
The key insight: the tail pointer is allowed to lag behind the actual end of the queue. Threads that observe this inconsistency help advance the tail before doing their own work - a technique called helping, which is what enables the lock-free progress guarantee.
Enqueue:
- Read tail and tail->next.
- If tail->next is non-null, advance tail (helping a previous enqueue) and retry.
- CAS tail->next from null to the new node.
- CAS tail from old tail to the new node (may be done by another thread).
This is the basis of java.util.concurrent.ConcurrentLinkedQueue and similar structures in production runtimes.
Memory Reclamation: The Hard Part
Lock-free data structures make safe memory reclamation genuinely difficult. The fundamental problem: thread T1 reads a pointer to a node and is about to dereference it. Thread T2 pops that node and frees it. T1 now dereferences freed memory - undefined behavior.
Read-Copy-Update (RCU) is the most performant solution for read-heavy workloads, used extensively in the Linux kernel. Readers execute in RCU read-side critical sections (essentially free - just a preemption disable on non-preemptible kernels). Writers make a copy of the data, modify the copy, and atomically swap the pointer. Memory is freed only after a grace period during which all readers that could have seen the old pointer have completed.
// Linux kernel RCU, read side
rcu_read_lock();
struct foo *p = rcu_dereference(gp);  // load with proper barriers
if (p)
    do_something(p->field);
rcu_read_unlock();

// Writer side
struct foo *new_p = kmalloc(...);
*new_p = *old_p;
new_p->field = new_value;
rcu_assign_pointer(gp, new_p);  // publish new version (release semantics)
synchronize_rcu();              // wait for pre-existing readers to finish
kfree(old_p);
Hazard pointers provide a more general solution for arbitrary data structures at the cost of higher overhead per operation.
Epoch-based reclamation batches frees into epochs, providing good throughput with lower per-operation cost than hazard pointers, at the cost of potentially delaying reclamation.
Examples
Treiber stack in action - watching CAS retries under contention:
std::atomic<int> cas_retries{0};

void push_instrumented(int val) {
    Node* new_node = new Node{val, nullptr};
    Node* old_head = head.load();
    new_node->next = old_head;  // must be set before the first CAS attempt
    while (!head.compare_exchange_weak(old_head, new_node)) {
        new_node->next = old_head;  // old_head was refreshed by the failed CAS
        cas_retries.fetch_add(1, std::memory_order_relaxed);
    }
}
// Under high contention, cas_retries climbs fast - lock-free ≠ contention-free
ABA demonstration with version counter:
struct TaggedPtr {
    Node* ptr;
    uintptr_t tag;  // incremented on every pop
};

std::atomic<TaggedPtr> head;  // 16 bytes: requires double-width CAS

bool pop(int& out) {
    TaggedPtr old_head = head.load(std::memory_order_acquire);
    while (old_head.ptr) {
        TaggedPtr new_head{old_head.ptr->next, old_head.tag + 1};
        if (head.compare_exchange_weak(old_head, new_head)) {
            out = old_head.ptr->value;
            return true;
        }
    }
    return false;
}
// Now recycled pointer at same address has different tag - CAS correctly fails
RCU read-side critical section (userspace via liburcu):
#include <urcu.h>

rcu_register_thread();  // once per thread, before first read-side section

rcu_read_lock();
struct config *cfg = rcu_dereference(global_config);
use_config(cfg);        // safe to read cfg fields here
rcu_read_unlock();
When Lock-Free Is Not the Answer
Lock-free data structures add significant complexity. Under low contention, a well-implemented spinlock or mutex is often faster because it avoids the retry loop overhead entirely. Under very high contention, lock-free structures can also degrade - threads waste CPU spinning on failed CAS operations rather than doing useful work.
The right approach is to measure. Profile your actual contention levels. perf lock can show mutex wait times. Only reach for lock-free structures when profiling identifies lock contention as a genuine bottleneck, and when you have the expertise to implement memory reclamation correctly.