Memory Models & Atomics - What Order Do Other Cores See Your Writes? // Megha Bose

Helpful context:

Two threads. Thread A writes x = 1, then writes ready = 1. Thread B loops until it sees ready == 1, then reads x. Thread B sees ready == 1. What is x?

On x86, the answer is 1. On ARM, the answer is: it depends. Thread B might see x == 0, even after observing ready == 1. The write to x may not have propagated to Thread B’s core yet when ready did. The two stores happened in program order in Thread A’s code, but the memory system is allowed to deliver them to other cores in a different order.

Welcome to the memory model problem. It is real, it happens on shipping hardware, and it has crashed satellites. The C11/C++11 memory model - and the std::atomic API it standardized - exists to give you the vocabulary to reason about and fix exactly this class of bug.

Why Reordering Happens: The Hardware Story

A modern CPU core does not execute instructions in program order. It executes them in whatever order keeps its execution units busy, subject to the constraint that the result appears correct to that core. This is the key: single-thread correctness is maintained; cross-thread visibility is not.

Two hardware mechanisms cause the writes-in-different-order problem:

Store buffers: when a core writes to memory, the write goes first into a small store buffer - a queue of pending writes from that core. The core does not wait for the write to propagate to the cache hierarchy; it continues executing. Another core that reads the same address may not yet see the write sitting in the first core’s store buffer. On x86 the buffer drains in order, which is why only a later load can overtake an earlier store there; on ARM, buffered stores to different addresses may also become visible out of order, which is one of the main sources of its weaker ordering.

Out-of-order execution: if two independent instructions are ready to execute (their inputs are available), the CPU may execute them in either order. A store followed by a load to a different address may execute as load-then-store if the load’s result is needed sooner. The CPU tracks dependencies and ensures single-thread correctness, but there is no cross-thread tracking.

The compiler adds its own reorderings: two independent assignments may be swapped during optimization; a loop variable may be cached in a register rather than re-read from memory on each iteration; a store the compiler proves has no visible effect may be eliminated entirely.

For a single thread, all of these reorderings are invisible - correctness is maintained. Another thread watching shared memory, however, may observe the reordered sequence.

The False Promise of Sequential Consistency

Sequential consistency (SC) is the intuitive model: all operations from all threads appear to execute in some global sequential order consistent with each thread’s own program order. Under SC, if Thread A stores to x then stores to ready, any thread that sees ready == 1 is guaranteed to also see x == 1. The store to x “happened” globally before the store to ready.

SC is easy to reason about. It is also expensive to enforce. Preventing a store from becoming visible to other cores until the prior store has propagated globally requires either very conservative hardware design or explicit synchronization after every store. Early symmetric multiprocessing (SMP) systems in the 1980s and early 1990s often provided SC by design - they were slow enough that the cost was manageable. Modern hardware is not.

x86 provides something close to SC: Total Store Order (TSO). Loads are never reordered with loads; stores are never reordered with stores; loads are never reordered before prior stores to the same address. Only one reordering is allowed: a load may appear to happen before a prior store to a different address (because the load gets its result from the L1 cache while the store sits in the store buffer). This is why the opening example “works” on x86 without explicit synchronization - TSO is close enough to SC for most producer-consumer patterns.

ARM provides a significantly weaker model - closer to the theoretical “no guarantees unless you ask for them” end of the spectrum. The two-thread example fails on ARM without explicit memory barriers.
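As a rough illustration of that failure, here is the opening example written as a small litmus test. The function name run_message_passing_relaxed is hypothetical, and both variables are relaxed atomics so that the program has no data race; that is an assumption made to keep the sketch well-defined, not something the opening example specifies. The language permits the consumer to read 0, and weakly-ordered hardware can actually produce it, although any single run may well return 1.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Message-passing litmus test (hypothetical helper, for illustration only).
// Relaxed ordering keeps each access atomic but does not order the store
// to x before the store to ready as seen by another thread.
int run_message_passing_relaxed() {
    std::atomic<int> x{0};
    std::atomic<int> ready{0};
    int observed = -1;

    std::thread a([&] {
        x.store(1, std::memory_order_relaxed);
        ready.store(1, std::memory_order_relaxed);
    });
    std::thread b([&] {
        while (ready.load(std::memory_order_relaxed) == 0) {
            std::this_thread::yield();
        }
        // Either 0 or 1 is a permitted result here.
        observed = x.load(std::memory_order_relaxed);
    });

    a.join();
    b.join();
    assert(observed == 0 || observed == 1);
    return observed;
}
```

A single run proves little: the interesting outcome is rare and depends on the hardware, so litmus tests like this are normally run many times.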

The C11/C++11 Memory Model

Before C11/C++11, C and C++ had no memory model at all. The standard defined program behavior for single-threaded programs. For multithreaded programs, you were on your own - using volatile (wrong), relying on platform-specific behavior (non-portable), or using pthread mutexes (correct but not expressive enough for lock-free code).

C11 and C++11 introduced two things simultaneously: atomic types (std::atomic<T> in C++, _Atomic and <stdatomic.h> in C) for atomically accessed variables; and a formal memory model specifying exactly what ordering guarantees each atomic operation provides.

The core concept is the happens-before relation. If operation A happens-before operation B, then B is guaranteed to observe all memory effects of A - every store A performed is visible to B. Happens-before is established by:

Sequenced-before: within one thread, if A is sequenced before B in program order, A happens-before B.
Synchronizes-with: a release store to an atomic A synchronizes-with an acquire load from the same atomic A that observes the stored value. The synchronizes-with edge, combined with sequenced-before on each side, establishes happens-before across threads.
Mutex lock/unlock: a mutex unlock happens-before the next lock of the same mutex.
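The mutex rule is the easiest of the three to see in code. The sketch below uses hypothetical names (read_after_publish, shared_value, published) and assumes a writer and a reader that touch two plain variables only while holding the same std::mutex. Each unlock happens-before the next lock, so once the reader sees published == true it must also see shared_value == 7.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical helper: plain variables shared safely because every access
// happens while holding the same mutex.
int read_after_publish() {
    std::mutex m;
    int shared_value = 0;    // protected by m
    bool published = false;  // protected by m
    int seen = -1;

    std::thread writer([&] {
        std::lock_guard<std::mutex> lock(m);
        shared_value = 7;
        published = true;
    });
    std::thread reader([&] {
        for (;;) {
            {
                std::lock_guard<std::mutex> lock(m);
                if (published) {
                    seen = shared_value;  // ordered after the writer's unlock
                    return;
                }
            }
            std::this_thread::yield();  // let the writer take the lock
        }
    });

    writer.join();
    reader.join();
    assert(seen == 7);
    return seen;
}
```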

If neither A happens-before B nor B happens-before A, both access the same memory location, at least one is a write, and at least one is not an atomic operation, this is a data race - undefined behavior in C++. Not “might produce wrong results.” Undefined behavior: the compiler is allowed to assume it never happens, and the resulting program may do anything.

The Ordering Levels

std::atomic<T> operations take an explicit memory ordering argument. From weakest to strongest:

memory_order_relaxed: the operation is atomic (no torn reads or writes - you either see the old value or the new value, never a partial update). But it imposes no ordering relative to any other memory operations. Two threads incrementing a relaxed atomic counter will both succeed atomically, but the increment gives neither thread any guarantee of observing the other’s prior stores to other locations.

Use relaxed for: counters where only the final value matters, and reference-count increments. The decrement that may drop a reference count to zero needs release/acquire ordering, so that the thread destroying the object sees every other thread’s last writes to it.
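A minimal sketch of the counter case, assuming a hypothetical helper count_events that spawns a few workers: every increment is a relaxed fetch_add, and the final value is read only after join(), which supplies the ordering that relaxed does not.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: each worker bumps a shared counter with relaxed
// ordering. No increment is lost, because fetch_add is still atomic.
long count_events(int num_threads, int per_thread) {
    assert(num_threads > 0 && per_thread >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;

    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) {
        w.join();  // join() orders the workers' increments before the read below
    }
    return counter.load(std::memory_order_relaxed);
}
```

Calling count_events(4, 100000) returns 400000 on every run. What relaxed does not give you is any promise about other data the workers wrote along the way.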

memory_order_acquire (loads) / memory_order_release (stores): the acquire-release pair is the minimal ordering needed for producer-consumer synchronization. A release store says “all my prior loads and stores are complete before this store is visible.” An acquire load says “all my subsequent loads and stores happen after this load.” When a thread observes a release-stored value via an acquire load, the full synchronizes-with relationship is established: everything the writer stored before the release is visible to the reader after the acquire.

This is what mutexes are built on. A mutex unlock is a release store to the lock variable. A mutex lock is an acquire load (specifically, a CAS loop with acquire success ordering). The acquire on lock acquisition ensures the critical section’s operations see everything the previous holder stored before releasing.
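To make that concrete, here is a toy spinlock along those lines. SpinLock and locked_increments are hypothetical names, and this is a teaching sketch rather than a substitute for std::mutex: lock() is a CAS loop with acquire ordering on success, and unlock() is a release store.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Toy spinlock (illustrative only; real mutexes also block and handle fairness).
class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        bool expected = false;
        // Acquire on success; relaxed on failure, since a failed CAS
        // protects nothing.
        while (!locked.compare_exchange_weak(expected, true,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed)) {
            expected = false;  // a failed CAS overwrote expected with the current value
            std::this_thread::yield();
        }
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};

// Hypothetical helper: a plain long protected only by the spinlock.
long locked_increments(int num_threads, int per_thread) {
    assert(num_threads > 0 && per_thread >= 0);
    SpinLock lock;
    long total = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&lock, &total, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                lock.lock();
                ++total;  // sees the previous holder's write
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return total;
}
```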

memory_order_acq_rel: used for read-modify-write operations (like fetch-and-add or compare-and-swap) that serve as both an acquire and a release simultaneously.
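A common place where acq_rel earns its keep is dropping a reference. The sketch below assumes a hypothetical RefCounted type with hand-written retain and release functions. The release half of the fetch_sub publishes this thread’s writes to the object, and the acquire half lets whichever thread reaches zero see everyone else’s writes before it deletes.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical intrusively reference-counted object.
struct RefCounted {
    std::atomic<int> refs{1};  // the creator holds the first reference
    int payload = 0;
};

// Taking another reference needs atomicity only.
void retain(RefCounted* obj) {
    assert(obj != nullptr);
    obj->refs.fetch_add(1, std::memory_order_relaxed);
}

// Drops one reference. Returns true if this call destroyed the object.
bool release(RefCounted* obj) {
    assert(obj != nullptr);
    // fetch_sub returns the previous count; 1 means this was the last reference.
    if (obj->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        delete obj;
        return true;
    }
    return false;
}
```

A usage sketch: after `RefCounted* o = new RefCounted; retain(o);`, the first release(o) returns false and the second returns true.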

memory_order_seq_cst: sequential consistency. All seq_cst operations appear in a single total order that all threads agree on. This is the default for std::atomic operations without an explicit ordering argument, and it is the safe choice when in doubt. It is also the most expensive: on x86 a seq_cst store needs a locked instruction or a fence (typically xchg) rather than a plain mov, and on weakly-ordered architectures such as ARM and POWER seq_cst operations need stronger barrier instructions than their relaxed counterparts.
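The classic case that needs the single total order is the store-buffering test: each thread stores to its own flag and then loads the other’s. The helper name both_read_zero is hypothetical. With seq_cst on all four operations, at least one thread must see the other’s store, so the function below returns false. With acquire/release or relaxed, both loads may return 0, even on x86.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Store-buffering litmus test (hypothetical helper). Returns true if some
// iteration had both threads read 0, which seq_cst forbids.
bool both_read_zero(int iterations) {
    assert(iterations > 0);
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0};
        std::atomic<int> y{0};
        int r1 = -1;
        int r2 = -1;

        std::thread t1([&] {
            x.store(1, std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_seq_cst);
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) {
            return true;  // impossible under a single total order
        }
    }
    return false;
}
```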

The discomfort: knowing which ordering to choose requires carefully reasoning about what happens-before relationship you actually need. seq_cst is safe when in doubt. relaxed is only safe when you need atomicity without any ordering. acquire/release covers most real-world producer-consumer patterns.

Acquire-Release: The Producer-Consumer Pattern

The canonical pattern for lock-free producer-consumer synchronization:

#include <atomic>
#include <cassert>

std::atomic<bool> ready{false};
int data = 0;

// Thread A (producer)
void produce() {
    data = 42;                               // plain store
    ready.store(true, std::memory_order_release);  // release store
}

// Thread B (consumer)
void consume() {
    while (!ready.load(std::memory_order_acquire));  // acquire load
    assert(data == 42);  // guaranteed to pass
}

Thread B’s acquire load sees ready == true, which means it observed the value Thread A stored with release. The synchronizes-with relationship is established. Thread A’s store to data - which was sequenced before the release store - happens-before Thread B’s read of data - which is sequenced after the acquire load. The assertion cannot fail.

Without the release/acquire pair, the assertion can fail. On ARM without barriers, the store to data might not be visible when Thread B reads it even though ready already flipped. On x86 hardware the two stores stay in order (TSO is strong enough for this pattern), so it tends to work by accident - but the compiler is still free to reorder or hoist plain accesses, the accident is not portable, and relying on it creates bugs that appear only on ARM servers.

Memory Barriers

Memory barriers (fences) are explicit instructions that prevent reordering of memory operations across them.

std::atomic_thread_fence(std::memory_order_release);  // C++ fence
std::atomic_thread_fence(std::memory_order_acquire);

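Fences are broader than acquire/release on an individual atomic operation: a release store orders prior memory operations only before that one store, whereas a release fence orders all prior loads and stores (to any memory, not just atomic variables) before every atomic store that follows the fence. A fence does not synchronize by itself; it still needs an atomic store and load on either side to carry the signal. Fences are often more expensive and less composable, so the acquire/release pattern on the atomic operation is usually preferable.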
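For comparison with the producer-consumer example, here is the same handoff written with fences. The names fence_handoff, payload, and flag are hypothetical. The flag itself is accessed with relaxed ordering, and the two fences supply the ordering: the release fence sits before the store, and the acquire fence sits after the load that observes it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: publish a plain int using relaxed atomics plus fences.
int fence_handoff() {
    int payload = 0;                // plain, non-atomic data
    std::atomic<bool> flag{false};
    int seen = -1;

    std::thread producer([&] {
        payload = 42;
        std::atomic_thread_fence(std::memory_order_release);  // orders the write above...
        flag.store(true, std::memory_order_relaxed);          // ...before this store
    });
    std::thread consumer([&] {
        while (!flag.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        std::atomic_thread_fence(std::memory_order_acquire);  // orders the read below after the load
        seen = payload;
    });

    producer.join();
    consumer.join();
    assert(seen == 42);
    return seen;
}
```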

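On x86, acquire loads and release stores compile to plain loads and stores with no fence instruction - TSO provides these semantics by default. On 64-bit ARM (AArch64), they compile to ldar (load-acquire) and stlr (store-release) instructions, which are lightweight one-way barriers. seq_cst loads and stores on AArch64 typically use the same ldar/stlr pair, while a standalone seq_cst fence compiles to dmb ish (a full data memory barrier); 32-bit ARMv7, which has no ldar/stlr, implements atomic ordering with dmb barriers around plain loads and stores.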

The `volatile` Keyword: What It Is and Is Not

volatile tells the compiler: do not cache this value in a register; re-read it from the memory location every time it is accessed. Historically used for memory-mapped hardware registers where each read is an observable side effect.

volatile is not sufficient for inter-thread synchronization. It constrains the compiler only with respect to other volatile accesses, does not make accesses atomic, and does not prevent hardware reordering. It does not establish any happens-before relationship. A volatile flag used as a synchronization primitive between threads is a data race and undefined behavior. Neither the C nor the C++ standard gives volatile any inter-thread meaning.

Use std::atomic<T> with memory_order_relaxed if you need atomicity without ordering. Use std::atomic<T> with acquire/release if you need ordering. Use volatile only for memory-mapped hardware registers or signal handlers where you genuinely want each access to hit memory.
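The stop flag that people often reach for volatile to write is a case where atomicity without ordering is enough. A minimal sketch, assuming a hypothetical run_until_stopped helper: the worker polls a relaxed std::atomic<bool>, so the compiler cannot hoist the load out of the loop and the access is not a data race.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: a worker spins until asked to stop.
// Returns how many iterations the worker completed.
long run_until_stopped() {
    std::atomic<bool> stop{false};
    std::atomic<long> iterations{0};

    std::thread worker([&] {
        // The flag guards no other data, so relaxed is sufficient.
        while (!stop.load(std::memory_order_relaxed)) {
            iterations.fetch_add(1, std::memory_order_relaxed);
        }
    });

    // Wait until the worker has made some progress, then ask it to stop.
    while (iterations.load(std::memory_order_relaxed) == 0) {
        std::this_thread::yield();
    }
    stop.store(true, std::memory_order_relaxed);
    worker.join();

    long total = iterations.load(std::memory_order_relaxed);
    assert(total > 0);
    return total;
}
```

If the flag also published data for the other thread to read, the store and load would need release and acquire instead.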

The Double-Checked Locking Pattern

The classic broken pattern for lazy initialization:

Singleton *instance = nullptr;
std::mutex mu;

Singleton *get() {
    if (instance == nullptr) {      // check 1 -- no ordering
        std::lock_guard<std::mutex> lock(mu);
        if (instance == nullptr) {  // check 2
            instance = new Singleton();  // create
        }
    }
    return instance;
}

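The bug: instance = new Singleton() does three things: (1) allocate memory, (2) run the constructor (which stores into the object’s fields), and (3) store the pointer into instance. Nothing orders step 3 after step 2 from another thread’s point of view: the compiler may reorder them, and on ARM the store to instance may become visible to other threads before the constructor’s stores to the object’s fields. A thread passing check 1 may see a non-null instance pointing to an uninitialized Singleton. In C++ terms, the unsynchronized read in check 1 races with the write made under the lock, so the behavior is undefined.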

The fix uses an atomic with acquire/release:

#include <atomic>
#include <mutex>

std::atomic<Singleton*> instance{nullptr};
std::mutex mu;  // serializes first-time construction

Singleton *get() {
    Singleton *p = instance.load(std::memory_order_acquire);
    if (!p) {
        std::lock_guard<std::mutex> lock(mu);
        p = instance.load(std::memory_order_relaxed);
        if (!p) {
            p = new Singleton();
            instance.store(p, std::memory_order_release);
        }
    }
    return p;
}

The release store ensures the constructor’s writes are ordered before the pointer becomes visible. The acquire load ensures the reader sees all those writes after observing the non-null pointer.
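When hand-written double-checked locking is not specifically required, C++11 offers a simpler route: initialization of a function-local static is guaranteed to happen exactly once, even if several threads reach it at the same time. This is an alternative technique, not the pattern shown above. The Singleton definition and the helper same_instance_from_two_threads below are hypothetical stand-ins for illustration.

```cpp
#include <cassert>
#include <thread>

// Hypothetical stand-in for the Singleton type used above.
struct Singleton {
    int value = 42;
};

// The language runtime performs the once-only, thread-safe initialization.
Singleton& get_instance() {
    static Singleton instance;
    return instance;
}

// Hypothetical helper: two threads race to be first and get the same object.
bool same_instance_from_two_threads() {
    Singleton* a = nullptr;
    Singleton* b = nullptr;
    std::thread t1([&] { a = &get_instance(); });
    std::thread t2([&] { b = &get_instance(); });
    t1.join();
    t2.join();
    assert(a != nullptr && b != nullptr);
    return a == b && a->value == 42;
}
```

Compilers typically implement this with a guard variable checked on each call, which is essentially double-checked locking done for you.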

Python’s GIL as a Memory Model

CPython’s Global Interpreter Lock (GIL) is often described as a mechanism to prevent data races in the interpreter’s internal state. But it also provides an implicit memory model: because any Python thread must hold the GIL to execute Python bytecode, and because the GIL is a mutex, every GIL release is a release store and every GIL acquisition is an acquire load.

The consequence: Python code protected by the GIL effectively runs under sequential consistency - you get seq_cst semantics “for free” because the GIL is always present. This is why Python programmers rarely think about memory ordering. The cost is that Python threads cannot execute Python bytecode in parallel on multiple cores.

Python’s threading.Lock() explicitly provides release/acquire semantics through the underlying platform synchronization primitive. Code that uses Lock correctly stays safe without the GIL, as in the optional free-threaded build of CPython introduced experimentally in Python 3.13 (PEP 703).

Rust’s Approach: Same Model, Enforced by the Type System

Rust uses the same memory model as C++ - std::sync::atomic in Rust corresponds directly to std::atomic in C++, with the same orderings (Ordering::Relaxed, Ordering::Acquire, Ordering::Release, Ordering::AcqRel, Ordering::SeqCst); only memory_order_consume has no Rust counterpart.

The difference: in safe Rust, the borrow checker together with the Send and Sync traits makes non-atomic shared mutable access a compile error. You cannot have a data race that gets to the hardware - you are forced to use a synchronizing type (Mutex<T>, RwLock<T>, or an atomic type such as AtomicBool or AtomicUsize) to share mutable state. The memory model still applies (you can still write Relaxed atomics incorrectly), but an entire class of errors - the ones that only fail on ARM at 3 AM - is caught at compile time.

The Ongoing Controversy

The C++ memory model has critics. The “data race = undefined behavior” rule is a correctness guarantee but also a licensing mechanism for aggressive optimization. A benign data race (two threads reading and writing a byte flag, where the worst case is an extra re-read) is technically undefined behavior, which means the compiler can generate code that does anything at all - including deleting the race entirely in a way that changes program semantics.

LLVM’s implementation of the C++ memory model has historically been “weak” in ways that occasionally generated correct per-spec code that nonetheless violated user expectations. The LLVM community has been working on strengthening guarantees around seq_cst operations in particular.

There is also ongoing debate about whether the acquire-release model is the right level of abstraction. The Java Memory Model (JMM) took a different approach - defining a smaller set of formally specified behaviors with fewer edge cases - and many researchers have argued it is easier to reason about in practice. The C++ model’s power (fine-grained control over ordering) comes at the cost of complexity that even expert programmers get wrong.

Summary

Ordering	Provides	Use For
`relaxed`	Atomicity only	Counters where only the final value matters; reference-count increments
`acquire` (load)	Sees all stores before the paired `release`	Lock acquisition, consuming a producer’s data
`release` (store)	All prior stores visible after paired `acquire`	Lock release, publishing a producer’s data
`acq_rel`	Both acquire and release on read-modify-write	CAS operations that synchronize bidirectionally
`seq_cst`	Global total order across all seq_cst operations	Default safe choice; highest cost on weak hardware
