Memory Models & Atomics
Prerequisite:
Concurrent programming has a hidden adversary: the memory system. CPUs reorder memory operations for performance. Compilers reorder them too. Without explicit synchronisation, two threads reading and writing shared memory may observe events in a completely different order than you wrote them. The memory model is the formal contract that specifies which reorderings are permitted and how to prevent them.
Why Reordering Happens
Modern CPUs do not execute instructions in program order. They execute them in whatever order keeps the execution units busy - as long as the result is correct for a single-threaded observer. A store to one address followed by a load from a different address may execute as load-then-store if there is no dependency between them. Store buffers allow writes to retire from the core before they are visible to other cores.
Compilers perform similar reorderings during optimisation: two independent assignments may be swapped, a loop may be reordered relative to a flag check, a value may be cached in a register rather than re-read from memory.
For a single thread, all these reorderings are invisible - the CPU’s and compiler’s correctness guarantees cover single-thread behaviour. But when another thread is watching shared memory, it may see the reordered sequence, not the program-order sequence.
Sequential Consistency
Sequential consistency (SC) is the intuitive model: the result of any execution is as if all operations of all threads were executed in some sequential order, consistent with each thread’s own program order. Under SC, if thread A writes x = 1 before thread B reads x, thread B must see x = 1.
SC is easy to reason about but expensive to implement. Enforcing it requires that every write be immediately visible to all cores before the next operation proceeds - effectively disabling store buffers and out-of-order execution. Modern hardware does not give you SC by default.
Memory Ordering in C++
C++11 introduced std::atomic<T> and an explicit memory ordering model. When you perform an atomic operation, you specify not just the operation but also the ordering constraints it imposes.
memory_order_relaxed - the operation is atomic (no torn reads or writes), but it imposes no ordering relative to other memory operations. Two relaxed atomic increments of a counter will both happen, but their order relative to other loads and stores is unspecified. Use this for counters where only the final value matters, not the sequence.
memory_order_release (for stores) and memory_order_acquire (for loads) create a synchronisation relationship. If thread A stores to a flag with release, and thread B loads that flag with acquire and sees the new value, then all stores thread A performed before the release are guaranteed to be visible to thread B after the acquire. This is the fundamental pattern for lock-free producer-consumer communication.
memory_order_seq_cst is the strongest ordering - it provides sequential consistency. Every seq_cst operation appears in a single total order that all threads agree on. This is the default when you write atomic.store(1) without specifying an ordering. It is also the most expensive.
The Happens-Before Relationship
Happens-before is the formal underpinning of the C++ memory model. If operation A happens-before operation B, then B is guaranteed to observe all memory effects of A.
Within a single thread, program order defines happens-before: if A is sequenced before B in the code, A happens-before B. Across threads, a release store to an atomic happens-before an acquire load on the same atomic that observes the stored value. A mutex unlock happens-before a subsequent lock of the same mutex.
If neither A happens-before B nor B happens-before A, and they access the same memory (with at least one write), this is a data race - undefined behaviour in C++.
Volatile in C/C++
volatile tells the compiler “do not cache this value in a register; always re-read it from memory.” It was historically used for hardware registers and signal handlers. It is insufficient for inter-thread synchronisation because it prevents compiler reordering but does not prevent CPU reordering, and it does not communicate ordering intent to the memory model. A volatile flag used as a thread-synchronisation mechanism is a data race and undefined behaviour. Use std::atomic instead.
Memory Barriers
A memory barrier (or fence) is an instruction that prevents memory operations on one or both sides from being reordered across it. std::atomic_thread_fence(memory_order_release) ensures that all stores before the fence are complete before any store after the fence. Hardware provides explicit fence instructions: mfence, sfence, lfence on x86; dmb, dsb on ARM. The acquire/release semantics of atomic operations typically compile to these instructions (or their implicit equivalents) on the target architecture.
x86 has a relatively strong hardware memory model (total store order): ordinary loads already have acquire semantics and ordinary stores already have release semantics, so most acquire/release atomic operations compile to plain loads and stores with no extra fence instructions on x86. On ARM, the hardware model is weaker, so acquire loads and release stores compile to dedicated load-acquire/store-release instructions (ldar/stlr on ARMv8) or to explicit barriers.
CAS and Mutex Implementation
Compare-and-swap (CAS) is the atomic operation that underlies most lock-free algorithms:
// Pseudo-code: the entire body executes as one indivisible step.
bool CAS(int *ptr, int expected, int desired) {
    if (*ptr == expected) { *ptr = desired; return true; }
    return false;
}
In C++: atomic.compare_exchange_strong(expected, desired). Note that expected is passed by reference: on failure it is overwritten with the value actually found, which is what makes retry loops natural to write. CAS is the building block for implementing spinlocks, queues, and stacks without OS-level blocking.
A simple spinlock using CAS:
std::atomic<bool> lock{false};

void acquire() {
    bool expected = false;
    while (!lock.compare_exchange_weak(expected, true,
                                       std::memory_order_acquire,
                                       std::memory_order_relaxed))
        expected = false;   // a failed CAS overwrote expected; reset and retry
}

void release() {
    lock.store(false, std::memory_order_release);
}
The acquire on the CAS ensures that all loads and stores inside the critical section are ordered after the lock acquisition. The release on the store ensures they are ordered before the unlock.
The Java Memory Model and Python’s GIL
Java’s memory model (JMM), revised in Java 5, makes similar guarantees: volatile fields in Java provide acquire/release semantics (unlike in C/C++), and synchronized blocks provide mutual exclusion plus a happens-before edge from one thread’s release of a monitor to another thread’s subsequent acquisition of it.
CPython’s Global Interpreter Lock (GIL) takes a different approach: a single mutex that any thread must hold to execute Python bytecode. The GIL means CPython threads are never truly parallel for CPU-bound work, but it also means the CPython interpreter never has data races in its internal state. For I/O-bound work, threads release the GIL during blocking calls. Python’s multiprocessing module bypasses the GIL by using separate processes with separate memory spaces.
Examples
Double-checked locking without fences - broken. The classic broken version:
Singleton *instance = nullptr;

Singleton *get() {
    if (instance == nullptr) {          // check 1 (no ordering)
        lock();
        if (instance == nullptr) {      // check 2
            instance = new Singleton(); // create
        }
        unlock();
    }
    return instance;
}
The problem: the store to instance and the stores inside new Singleton() (to initialise the object’s fields) may be reordered. Another thread passing check 1 may see a non-null instance but uninitialised fields. Fix: use std::atomic<Singleton*> with release on the store and acquire on the load.
std::atomic counter. For a counter incremented by multiple threads, std::atomic<int> counter{0}; counter.fetch_add(1, memory_order_relaxed); is correct and efficient - no fences needed, just atomicity.
Acquire-release for producer-consumer. Thread A writes data to a buffer then does ready.store(true, memory_order_release). Thread B spins on while (!ready.load(memory_order_acquire)); then reads the buffer. The acquire-release pair guarantees thread B sees everything thread A wrote before the release.
Memory models are one of the most frequently misunderstood areas of systems programming. The key insight is that memory_order_seq_cst is not “correct” and everything else is “risky” - it is that each ordering level is correct for a specific class of use cases, and choosing the right one requires understanding the happens-before relationship you actually need.
Read Next: