Lock-Free Data Structures - Concurrency Without the Wait // Megha Bose

Helpful context:

Memory Models & Atomics - What Order Do Other Cores See Your Writes?

A mutex protects your data structure. It works. Under low contention - a handful of threads occasionally accessing shared state - a mutex is fast, simple, and correct. Then you scale up. Forty threads hammering the same queue. The lock becomes a serialization point: only one thread makes progress at a time while the others sleep in the kernel’s wait queue, burning CPU cycles in context switches, draining warm caches, waiting. The mutex that felt free at low contention now dominates your latency profile.

Lock-free data structures sidestep the serialization point. Instead of taking a lock, threads use hardware atomic operations to make progress directly. When two threads collide - both trying to update the same state simultaneously - one succeeds and the other retries, but neither waits in the kernel. The system always makes forward progress.

The catch: lock-free data structures are genuinely hard to get right. The correctness bugs are subtle, depend on specific memory orderings, and often manifest only under high contention on multi-socket hardware. The performance wins are real but conditional - under low contention, a mutex is often faster. And safe memory reclamation in a lock-free structure is a research problem that most implementations get wrong.

Do not reach for lock-free code because it sounds cool. Reach for it after the profiler has identified lock contention as a bottleneck on a system that cannot be restructured to reduce sharing.

Progress Guarantees: The Hierarchy

“Lock-free” is often used loosely. There is a precise hierarchy:

Blocking (mutex-based): if a thread holding the lock is paused - preempted, killed, or sleeping - all other threads attempting to acquire the lock are stuck indefinitely. The system has no forward progress guarantee when a thread stops making progress.

Obstruction-free: if a thread runs in isolation (no competing threads), it completes in finite steps. Under contention, it may be repeatedly blocked by others. Weakest non-blocking guarantee; rarely sufficient in practice.

Lock-free: the system as a whole always makes progress - at least one thread completes an operation in finite steps. Individual threads may retry indefinitely under adversarial scheduling, but the system never gets globally stuck. This is the standard goal for production lock-free code.

Wait-free: every individual thread completes its operation in a bounded number of steps, regardless of what other threads do. The strongest guarantee and the hardest to achieve. Wait-free queues exist in the literature but are considerably more complex than lock-free ones and rarely used in practice.

Most “lock-free” data structures in production libraries (Java’s ConcurrentLinkedQueue, C’s Linux kernel lists, LMAX Disruptor) are lock-free, not wait-free.

Compare-and-Swap: The Foundation

The hardware primitive that makes lock-free programming possible is Compare-and-Swap (CAS):

// Atomically:
bool CAS(T *addr, T expected, T new_val) {
    if (*addr == expected) {
        *addr = new_val;
        return true;
    }
    return false;
}

The entire operation - compare the current value, conditionally update it - executes as an indivisible unit. No other thread can observe a partial state. On x86, this is the LOCK CMPXCHG instruction. In C++:

std::atomic<int> value{0};
int expected = 0;
bool success = value.compare_exchange_strong(expected, 1);
// expected is updated to the current value on failure

compare_exchange_weak may spuriously fail - return false even when the value matched - on architectures using Load-Linked/Store-Conditional (LL/SC) instead of a true CAS (most ARM implementations). It must be used in a loop. compare_exchange_strong never spuriously fails but may be slower on those architectures. The naming is counterintuitive: use weak in retry loops, strong for a single non-looping check.

The lock-free pattern: read the current value, compute a new value, CAS the new value in only if the old value has not changed. If the CAS fails, someone else modified the value first - retry from the new current value.

The Treiber Stack: Lock-Free in 20 Lines

The simplest useful lock-free structure is the Treiber stack (1986). A singly-linked list with an atomic head pointer.

struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> head{nullptr};

void push(int val) {
    Node* node = new Node{val, nullptr};
    Node* old_head = head.load(std::memory_order_relaxed);
    do {
        node->next = old_head;
    } while (!head.compare_exchange_weak(
        old_head, node,
        std::memory_order_release,
        std::memory_order_relaxed));
}

std::optional<int> pop() {
    Node* old_head = head.load(std::memory_order_acquire);
    while (old_head) {
        Node* new_head = old_head->next;
        if (head.compare_exchange_weak(
                old_head, new_head,
                std::memory_order_acquire,
                std::memory_order_relaxed)) {
            int val = old_head->value;
            // Cannot safely delete old_head here - another thread may be reading it
            return val;
        }
    }
    return std::nullopt;
}

Push: create a node, set its next to the current head, CAS the head to the new node. If another thread pushed simultaneously, the CAS fails (head changed), and we retry with the new head. Pop: read the head, read head->next, CAS head to head->next. The acquire on the pop’s load ensures the value read from the popped node is visible.

The comment on delete old_head is where the real problem lives.

The ABA Problem

CAS checks that a value is still what you read. It cannot check whether the meaning of that value has changed.

Scenario with the Treiber stack:

Stack: A (value=1) → B (value=2) → null. Thread T1 reads head = A. T1 is preempted.
Thread T2 pops A. Pops B. Frees B. Pushes a new node that happens to be allocated at the same address as A (memory allocators reuse addresses). Stack: A (new, value=99) → null. But A the pointer is the same.
T1 resumes. CAS(head, A, B) succeeds - head is still A (same address). T1 sets head to B - which has been freed.

The structure is now corrupt. T1’s CAS cannot detect that A was replaced: pointer identity is the same, but the node’s semantic identity changed. This is ABA.

ABA is not hypothetical. Any lock-free structure that recycles memory (which is all of them in practice, since allocators reuse freed addresses) is vulnerable.

Solution 1: Tagged pointers (versioned pointers)

On 64-bit systems with 48-bit user-space addressing, the upper 16 bits of a pointer are unused. Pack a monotonically increasing version counter into the upper bits. Now the CAS operates on a 64-bit word that contains both the pointer and the version:

struct TaggedPtr {
    uintptr_t data;  // lower 48 bits: pointer, upper 16 bits: version

    Node* ptr() const { return reinterpret_cast<Node*>(data & 0x0000FFFFFFFFFFFF); }
    uint16_t tag() const { return data >> 48; }
    static TaggedPtr make(Node* p, uint16_t t) {
        return {(reinterpret_cast<uintptr_t>(p) & 0x0000FFFFFFFFFFFF) | (uint64_t(t) << 48)};
    }
};

std::atomic<TaggedPtr> head;

std::optional<int> pop() {
    TaggedPtr old = head.load(std::memory_order_acquire);
    while (old.ptr()) {
        TaggedPtr next = TaggedPtr::make(old.ptr()->next, old.tag() + 1);
        if (head.compare_exchange_weak(old, next, std::memory_order_acquire,
                                                   std::memory_order_relaxed)) {
            int val = old.ptr()->value;
            // Still cannot safely delete - use hazard pointers or epoch reclamation
            return val;
        }
    }
    return std::nullopt;
}

Even if A is recycled at the same address, the version counter will have changed, and the CAS will correctly fail. The 16-bit counter wraps at 65536, but in practice this is enough.

Solution 2: Hazard pointers

Before accessing a node, a thread publishes that pointer in a global hazard pointer slot - a globally visible declaration that “I am currently reading this node; do not free it.”

// Each thread has a hazard pointer slot
thread_local Node* my_hazard_ptr = nullptr;

std::optional<int> pop() {
    while (true) {
        Node* old_head = head.load(std::memory_order_acquire);
        if (!old_head) return std::nullopt;

        my_hazard_ptr = old_head;           // announce we are reading this node
        std::atomic_thread_fence(std::memory_order_seq_cst);

        // Re-check: did the head change between the load and the announcement?
        if (head.load(std::memory_order_relaxed) != old_head) continue;

        Node* new_head = old_head->next;
        if (head.compare_exchange_weak(old_head, new_head)) {
            my_hazard_ptr = nullptr;
            int val = old_head->value;
            // safe_retire(old_head): add to a list; free only after scanning all hazard slots
            safe_retire(old_head);
            return val;
        }
    }
}

Memory reclamation scans all threads' hazard pointer slots before freeing any retired node. If any thread has a hazard pointer to the node, it is not freed. The cost: per-operation overhead of publishing and revoking hazard pointers, plus periodic reclamation scans. Folly’s hazptr library implements this with production-grade performance.

Solution 3: Epoch-based reclamation

Group time into epochs. Threads announce which epoch they are in when they enter a data structure operation. Nodes are retired into the current epoch’s “to-free” list. A node can only be freed when all threads that could have a reference to it have moved to a later epoch - i.e., when the global epoch has advanced past the epoch in which the node was retired.

Epoch reclamation has lower per-operation cost than hazard pointers and is used in many production implementations (RCU in the Linux kernel is a variant of this idea; crossbeam-epoch in Rust is a clean userspace implementation).

The Michael-Scott Queue: Lock-Free FIFO

A lock-free FIFO queue is more complex than a stack because enqueueing requires updating the tail and dequeueing requires updating the head - two separate atomic operations on two separate pointers, with no way to do both atomically.

The Michael-Scott queue (1996) handles this with a two-pointer structure (dummy head node) and the “helping” technique: threads that observe an inconsistency assist other threads in completing their in-flight operations before proceeding.

struct Node {
    int value;
    std::atomic<Node*> next{nullptr};
};

std::atomic<Node*> head;  // points to a dummy sentinel
std::atomic<Node*> tail;  // may lag behind true tail

void enqueue(int val) {
    Node* node = new Node{val};
    while (true) {
        Node* t = tail.load(std::memory_order_acquire);
        Node* next = t->next.load(std::memory_order_acquire);
        if (t == tail.load(std::memory_order_relaxed)) {
            if (!next) {
                // Tail is truly the last node - try to link new node
                if (t->next.compare_exchange_weak(next, node,
                        std::memory_order_release, std::memory_order_relaxed)) {
                    // Try to advance tail; if this fails, another thread will do it
                    tail.compare_exchange_weak(t, node,
                        std::memory_order_release, std::memory_order_relaxed);
                    return;
                }
            } else {
                // Tail is lagging - help advance it
                tail.compare_exchange_weak(t, next,
                    std::memory_order_release, std::memory_order_relaxed);
            }
        }
    }
}

The tail pointer is allowed to lag. If a thread observes tail->next != null, it knows an enqueue is in progress, and it advances tail before doing its own work. This helping strategy is what makes the queue lock-free: some thread always completes the in-progress enqueue in finite steps, so the system always makes progress.

Java’s ConcurrentLinkedQueue is a direct implementation of Michael-Scott. It is what the Java standard library uses whenever you need a concurrent non-blocking FIFO.

Memory Reclamation: The Hard Part

The Treiber stack and Michael-Scott queue examples elide safe memory reclamation - the pop operations noted “cannot safely delete.” This is not an implementation detail; it is the central correctness problem in lock-free data structures.

The fundamental issue: Thread T1 reads a pointer to node N. Thread T2 dequeues N and calls free(N). Thread T1, which still holds the pointer, dereferences it - use-after-free, undefined behavior, and in practice: a crash or memory corruption.

With locks, this cannot happen: T1 holds the lock while dereferencing N, so T2 cannot dequeue and free N until T1 releases the lock. Lock-free structures have no such mutual exclusion, so they need an alternative mechanism to ensure N is not freed while T1 might be reading it.

Read-Copy-Update (RCU) is the most performant solution for read-heavy workloads. Readers declare “I am in a read-side critical section” (very cheap on non-preemptible kernels - effectively a preemption disable). Writers make a copy of the data, modify the copy, atomically swap the pointer, then wait for a grace period: a point in time when all threads that were in a read-side critical section at the time of the swap have completed. After the grace period, the old data can be safely freed - no reader can hold a reference to it.

RCU is used pervasively in the Linux kernel for data structures that are read millions of times per second (routing tables, process lists, module lists) and written rarely. The asymmetry is the point: reads are nearly free; writes pay the synchronization cost.

For general-purpose lock-free code where both reads and writes are common, hazard pointers or epoch-based reclamation are the standard choices.

High-Frequency Trading: Lock-Free Ring Buffers

The LMAX Disruptor is a lock-free ring buffer originally developed for low-latency financial trading. The core insight: most concurrent queues need coordination between producers and consumers on every operation. The Disruptor decouples them.

The Disruptor preallocates a fixed-size ring buffer, sized to a power of 2. Producers claim a sequence number with a CAS, write to the slot at that index, then publish the sequence. Consumers track which sequences have been published and read directly - no CAS on the consumer side at all.

On a single-producer / single-consumer configuration, the Disruptor eliminates all atomic operations from the producer path (a single-producer does not need to CAS its sequence claim) and reduces the consumer path to a single-reader load. At this point the queue is effectively a memory barrier pattern rather than a CAS-based structure.

Benchmarks show the Disruptor achieving 25-60 million operations per second per core, versus 1-5 million for ConcurrentLinkedQueue. The cost: fixed-size preallocated buffer; cannot grow; best for predictable, bounded throughput patterns. Folly’s MPMCQueue and ProducerConsumerQueue apply similar ideas for C++ systems programming.

When Lock-Free Is Not the Answer

Lock-free data structures are not automatically faster than mutex-based ones. Under low contention - few threads, infrequent access - a well-implemented std::mutex is faster. A mutex acquisition on the uncontended fast path is a single atomic CAS (the futex fast path), which is exactly what a lock-free structure’s retry loop also does. The lock-free structure additionally does register and memory management per operation that the mutex-protected structure does not.

Lock-free structures shine under high contention on multi-socket systems: when many threads are constantly competing for the same structure, the ability to make progress without kernel intervention (no context switch, no cache thrash from lock handoff, no priority inversion) provides substantial wins.

The right approach is to measure. perf lock reports mutex wait times and contention rates. If lock contention is not in your top bottlenecks, a lock-free replacement will not help. If it is, profile the specific structure, benchmark an alternative, and only then consider the implementation complexity cost of going lock-free.

The Future: Hardware Transactional Memory

Hardware Transactional Memory (HTM) - Intel TSX, IBM POWER’s TM - offers a different approach: speculatively execute a block of code as a “transaction,” automatically detecting conflicts with other threads' accesses, and retrying if a conflict occurs. The programmer writes sequential code inside a transaction; the hardware provides atomicity.

#ifdef RTM_SUPPORT
if (_xbegin() == _XBEGIN_STARTED) {
    // Transactional path: modify the data structure normally
    stack[top++] = value;
    _xend();
} else {
    // Fallback path: HTM aborted (conflict, capacity, etc.)
    std::lock_guard<std::mutex> lock(fallback_mutex);
    stack[top++] = value;
}
#endif

Intel TSX had a troubled history - hardware bugs led to it being disabled via microcode updates on most Skylake CPUs - but IBM POWER TM has been more stable. The long-term trajectory of HTM is unclear; the industry has not yet converged on it as a standard tool.

Software Transactional Memory (STM) - Haskell’s STM monad is the most successful implementation - provides transactional semantics in software, with conflict detection in the runtime. Haskell STM composes cleanly: you can combine transactions from separate data structures without worrying about lock ordering. The cost is higher than HTM and the write overhead is substantial. STM has found a comfortable niche in Haskell but has not crossed over to mainstream systems programming languages.

Summary

Concept	What It Is	When to Use It
CAS (compare-and-swap)	Hardware atomic: compare and conditionally update	Foundation of all lock-free structures
Lock-free	At least one thread always makes progress	High-contention shared structures on multi-core
Wait-free	Every thread completes in bounded steps	Safety-critical systems; very rare in practice
ABA problem	Pointer recycled at same address after free	Whenever memory is reclaimed in a lock-free structure
Tagged pointers	Version counter in unused pointer bits	Cheap ABA prevention with 64-bit pointers
Hazard pointers	Publish pointer before dereferencing; defer free	General-purpose safe reclamation
Epoch-based reclamation	Free when all threads in newer epochs	Better throughput than hazard pointers for bursts
RCU	Copy-modify-swap with grace period for readers	Read-heavy structures; Linux kernel staple
Treiber stack	CAS on head pointer of singly-linked list	Simplest lock-free structure; learning tool
Michael-Scott queue	Lock-free FIFO with helping	Java ConcurrentLinkedQueue; production queues

Read Next:

CPU Affinity & NUMA - When Cores Care Where Memory Lives