You add a second CPU socket to your server. The benchmark you care about - a database query planner - shows the expected improvement: roughly 1.8x throughput on CPU-bound work. But the p99 latency, which was 2ms before, is now 4ms. You did not expect that. You added more compute, and latency got worse.

What happened: the operating system, doing its best to use all cores, started migrating threads between sockets. A thread that built up warm data in socket 0’s L3 cache gets moved to a core on socket 1. Its working set is now on the wrong side of the QuickPath Interconnect. A memory access that used to take about 40 cycles (L3 hit) now takes several hundred cycles (remote NUMA node). The cross-socket interconnect is the bottleneck, and the OS is the one creating the problem.

This is NUMA - Non-Uniform Memory Access - and it is the dominant memory architecture in server hardware. You do not need exotic hardware to encounter it: any dual-socket server, any large AWS instance, any modern AMD EPYC or Intel Xeon deployment has NUMA topology. And most software ignores it entirely, leaving performance on the table.

The History: From UMA to NUMA

The first symmetric multiprocessing (SMP) systems used a single shared memory bus: all processors connected to the same bus, accessing the same memory pool. This is Uniform Memory Access (UMA) - every processor sees the same latency to every memory address. It is easy to reason about. It scales poorly.

The shared bus is the bottleneck. With 4 processors, each gets one quarter of the bus bandwidth. With 8, one eighth. By the mid-1990s, large SMP systems hit a wall: each added processor increased bus contention more than it increased useful throughput.

The solution was to partition memory. Give each processor package its own local memory controller and its own pool of DRAM. Connect the packages with a high-speed point-to-point link rather than a shared bus. Each processor can access its local memory at full speed. Accessing another processor’s memory requires crossing the link - slower, but the link is dedicated and does not become a shared bottleneck.

AMD brought this architecture to mainstream x86 servers with HyperTransport in the Opteron processor (2003), which broke the x86 server market open. Intel followed with QuickPath Interconnect (QPI) in Nehalem (2008), replaced by UltraPath Interconnect (UPI) in Skylake server (2017). AMD’s current implementation is Infinity Fabric, shared between the CPU interconnect and the GPU interconnect in ROCm platforms. The marketing names change; the architecture is the same.

NUMA Topology: What Actually Exists

A dual-socket server has two NUMA nodes. Each node contains:

  • The processor package: all cores, their private L1/L2 caches, and the shared L3 cache
  • A memory controller attached to DIMMs on that node

Access latency on a typical dual-socket Intel server:

  • L1 cache: 4 cycles (~1.3ns)
  • L2 cache: 12 cycles (~4ns)
  • L3 cache: 40 cycles (~13ns)
  • Local DRAM: ~65ns
  • Remote DRAM (cross-socket): ~130ns

The 2x latency penalty for remote memory is a floor. Under bandwidth pressure - many threads accessing remote memory simultaneously - the interconnect becomes a bottleneck and effective latency climbs further.
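
To see how quickly remote accesses erode average latency, the figures above can be plugged into a weighted average. This is a back-of-the-envelope sketch, not a measurement. It assumes the 65ns and 130ns figures from the list, a 3GHz clock (implied by 4 cycles being about 1.3ns), and an uncongested interconnect. The function names are invented for illustration.

```python
def cycles_to_ns(cycles: float, clock_ghz: float = 3.0) -> float:
    """Convert core clock cycles to nanoseconds (3GHz is an assumed clock)."""
    return cycles / clock_ghz


def effective_dram_latency_ns(remote_fraction: float,
                              local_ns: float = 65.0,
                              remote_ns: float = 130.0) -> float:
    """Average DRAM latency when `remote_fraction` of accesses cross sockets.

    Ignores interconnect congestion, so treat the result as a lower bound.
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


if __name__ == "__main__":
    for fraction in (0.0, 0.1, 0.5, 1.0):
        latency = effective_dram_latency_ns(fraction)
        print(f"{fraction:4.0%} remote -> {latency:6.1f} ns average")
```

With half of all DRAM accesses going remote, the average is already 97.5ns, a 50% penalty before any bandwidth pressure is counted.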

Modern AMD EPYC processors complicate this further. EPYC uses a chiplet design: multiple CPU dies (CCDs) connected by Infinity Fabric within a single package. A 64-core EPYC has 8 chiplets of 8 cores each, with fabric latency between chiplets within the same socket that is measurably higher than L3 access within a chiplet. The result resembles what Intel calls sub-NUMA clustering: the OS may see a single NUMA node, but there is internal latency variation within that node. AMD lets you expose this through a firmware setting (NPS, "NUMA nodes per socket"), which presents each socket to the OS as multiple NUMA nodes.

Reading Your Topology

Before tuning anything, understand the system:

lscpu
# Architecture: x86_64
# CPU(s): 64
# Thread(s) per core: 2         (hyperthreading)
# Core(s) per socket: 16
# Socket(s): 2
# NUMA node(s): 2
# NUMA node0 CPU(s): 0-15,32-47
# NUMA node1 CPU(s): 16-31,48-63

numactl --hardware
# node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 ...
# node 0 size: 128768 MB
# node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 ...
# node 1 size: 128885 MB
# node distances:
# node   0   1
#   0:  10  21     ← local=10, remote=21 (2.1x slower)
#   1:  21  10
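
The same information is available programmatically. The sketch below reads the per-node cpulist and distance files that Linux publishes under /sys/devices/system/node. The function names are hypothetical, and the parser handles only the plain "0-15,32-47" list form shown in the lscpu output above.

```python
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a kernel CPU list such as '0-15,32-47' into individual CPU ids."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_numa_topology(root: str = "/sys/devices/system/node") -> dict[int, dict]:
    """Return {node_id: {'cpus': [...], 'distances': [...]}} from sysfs.

    Returns an empty dict when the directory does not exist (non-Linux
    systems or kernels built without NUMA support).
    """
    topology: dict[int, dict] = {}
    for node_dir in Path(root).glob("node[0-9]*"):
        node_id = int(node_dir.name[len("node"):])
        topology[node_id] = {
            "cpus": parse_cpulist((node_dir / "cpulist").read_text()),
            "distances": [int(d) for d in (node_dir / "distance").read_text().split()],
        }
    return topology


if __name__ == "__main__":
    for node_id, info in sorted(read_numa_topology().items()):
        print(f"node {node_id}: {len(info['cpus'])} CPUs, distances {info['distances']}")
```

Keeping the parser separate from the sysfs walk makes it reusable for any other kernel file that uses the same list format.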

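The numastat command shows per-node page-allocation counters, cumulative since boot, which reveal whether allocations are landing on the intended node:

numastat
#                            node0          node1
# numa_hit               412345678       18273654   ← pages allocated on the intended node
# numa_miss               8273456       45678901    ← pages placed here although another node was preferred
# numa_foreign           45678901        8273456    ← pages intended for this node but placed elsewhere

Every numa_miss on one node has a matching numa_foreign on another. High counts mean allocations are falling back to a non-preferred node, typically because the preferred node ran out of free memory. These counters track where pages are allocated, not where they are later accessed: a thread that migrates away from memory it already touched does not show up here, so treat numastat as a first check rather than a complete picture.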
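
For monitoring, the same counters can be read per node from /sys/devices/system/node/nodeN/numastat, which holds one "name value" pair per line. A minimal sketch follows. The function names are hypothetical, and the ratio it computes is just one reasonable way to summarize the counters.

```python
def parse_node_numastat(text: str) -> dict[str, int]:
    """Parse the 'name value' lines of a per-node numastat file."""
    counters: dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            counters[fields[0]] = int(fields[1])
    return counters


def miss_fraction(counters: dict[str, int]) -> float:
    """Fraction of pages allocated on this node that were meant for another."""
    hit = counters.get("numa_hit", 0)
    miss = counters.get("numa_miss", 0)
    total = hit + miss
    return miss / total if total else 0.0


if __name__ == "__main__":
    from pathlib import Path

    for path in sorted(Path("/sys/devices/system/node").glob("node[0-9]*/numastat")):
        counters = parse_node_numastat(path.read_text())
        print(f"{path.parent.name}: {miss_fraction(counters):.1%} of allocations were misses")
```

Because the counters are cumulative since boot, sampling them twice and comparing the deltas says more about the current workload than a single reading.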

lstopo (from the hwloc package) renders the full topology including cache sharing relationships and PCI device attachment - critical for knowing which NIC is on which NUMA node, which matters enormously for network-intensive workloads.

CPU Affinity: Keeping Threads Where They Belong

CPU affinity restricts a thread or process to run on a specified set of CPU cores, preventing the scheduler from migrating it. Migration is the enemy: moving a thread leaves its warm L1/L2 cache contents behind on the old core, and if the move crosses sockets, its working set is suddenly on remote DRAM.

From the shell:

# Launch a process pinned to CPU cores 0-7 (all on NUMA node 0)
taskset -c 0-7 ./database_process

# Pin an existing process (PID 12345) to core 4
taskset -cp 4 12345

# Run on all cores of NUMA node 0, allocate memory from node 0
numactl --cpunodebind=0 --membind=0 ./database_process

From C/C++:

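#define _GNU_SOURCE   // required before the includes for pthread_setaffinity_np and the CPU_* macros
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one logical CPU.
// Returns 0 on success, or an errno value (e.g. EINVAL for an invalid core).
int pin_thread_to_core(int core_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}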

From Python on Linux:

import os
# Pin current process to cores 0, 1, 2, 3
os.sched_setaffinity(0, {0, 1, 2, 3})

Pinning is not a substitute for understanding. Pin threads to the wrong cores - for example, two busy threads onto the two hyperthread siblings of one physical core - and you can make things worse. Pin threads but not memory, and remote DRAM accesses continue.
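
When only a section of code is latency-sensitive, it can help to pin temporarily and restore the previous mask afterwards. The sketch below wraps os.sched_setaffinity in a context manager. The name pinned_to is hypothetical, the code is Linux-only, and it assumes that silently dropping CPUs the process is not allowed to use is acceptable.

```python
import os
from contextlib import contextmanager


@contextmanager
def pinned_to(cpus):
    """Temporarily restrict the calling thread to `cpus` (Linux only)."""
    previous = os.sched_getaffinity(0)   # mask currently allowed for this thread
    target = set(cpus) & previous        # drop CPUs we may not use (e.g. cgroup limits)
    if not target:
        raise ValueError("none of the requested CPUs are available to this thread")
    os.sched_setaffinity(0, target)
    try:
        yield target
    finally:
        os.sched_setaffinity(0, previous)  # always restore the original mask


if __name__ == "__main__":
    first_cpu = min(os.sched_getaffinity(0))
    with pinned_to({first_cpu}) as active:
        print("running on", sorted(active))
    print("restored to", sorted(os.sched_getaffinity(0)))
```

Passing 0 as the first argument targets the calling thread, so each thread in a process can use this independently.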

NUMA-Aware Memory Allocation: The Other Half

Pinning a thread to socket 0’s cores is necessary but not sufficient. Linux uses a first-touch memory allocation policy by default: when a thread accesses a page for the first time, that page is allocated on the NUMA node the accessing thread is currently running on. If a thread on socket 0 allocates a large buffer but a thread on socket 1 first touches it (initializes it), the buffer ends up on socket 1’s memory - and socket 0’s threads pay the remote access penalty forever.

Explicit NUMA-aware allocation:

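#include <numa.h>      // link with -lnuma; check numa_available() != -1 before other calls
#include <numaif.h>
#include <sys/mman.h>

// Allocate on the local node of the current thread
void *buf = numa_alloc_local(size);

// Allocate on a specific node
void *buf2 = numa_alloc_onnode(size, 0);

// Bind existing memory to a node (mmap + mbind)
void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
unsigned long nodemask = 1UL << 0;  // bit N selects node N; here node 0
mbind(mem, size, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0);  // maxnode = bits in the mask

// Free (libnuma allocations need the size; mmap'd memory uses munmap)
numa_free(buf, size);
numa_free(buf2, size);
munmap(mem, size);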

numa_alloc_local() is especially useful in thread initialization code: if the thread is pinned to its NUMA node before allocating its working buffers, local allocation is automatic. The pattern: pin thread, then allocate.
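
The pin-then-allocate pattern can be sketched without libnuma by relying on first-touch alone. In the example below each worker thread pins itself, maps anonymous memory, and writes one byte per page so the kernel places those pages while the thread is on its chosen CPU. The helper names are hypothetical, the code is Linux-only, and it does not verify which node the pages landed on.

```python
import mmap
import os
import threading


def allocate_local(cpu: int, size: int, results: dict) -> None:
    """Pin the calling thread to `cpu`, then allocate and first-touch a buffer."""
    os.sched_setaffinity(0, {cpu})  # step 1: pin (pid 0 means the calling thread)
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    for offset in range(0, size, mmap.PAGESIZE):
        buf[offset] = 1             # step 2: a write fault makes the kernel place the page
    results[cpu] = buf


def run_workers(cpus, size: int = 4 * mmap.PAGESIZE) -> dict:
    """Start one pinned worker per CPU and collect each worker's buffer."""
    results: dict = {}
    threads = [threading.Thread(target=allocate_local, args=(cpu, size, results))
               for cpu in cpus]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    chosen = sorted(os.sched_getaffinity(0))[:2]
    buffers = run_workers(chosen)
    print({cpu: len(buf) for cpu, buf in buffers.items()})
```

The ordering is the point: if the main thread allocated and initialized the buffers before handing them to the workers, every page would sit on the main thread's node.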

Hyperthreading: The Shared Core Problem

Intel’s Hyperthreading (AMD: SMT) exposes two logical CPUs per physical core. The two logical cores share execution units, L1 and L2 caches, TLBs, and the memory controller interface - only the register files and some pipeline buffers are separate.

For I/O-bound or latency-bound threads that frequently stall waiting for memory or network, hyperthreading is a win: when one thread stalls on a cache miss, the other can run on the shared execution units. Utilization improves with little added contention.

For compute-bound or memory-bandwidth-bound workloads, hyperthreading often hurts. Both threads compete for the same execution ports and the same cache. A matrix multiply that fills the L2 cache on one hyperthread leaves no room for the other; both thrash each other’s working sets. Per-thread throughput on compute-bound work can drop 20-40% compared with giving each thread its own physical core.

Identifying hyperthread siblings:

# Show which logical CPUs share a physical core
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
# Output: 0,32  ← logical CPUs 0 and 32 share one physical core

In latency-critical systems, it is common to only use one logical CPU per physical core - leave one hyperthread idle, or assign it exclusively to OS interrupts and housekeeping. The idle sibling reduces interference with the hot-path thread.
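
Choosing one logical CPU per physical core can be automated from the thread_siblings_list files. The sketch below keeps the lowest-numbered sibling of each core. That choice and the function names are assumptions for illustration; any consistent pick per core would do.

```python
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a kernel CPU list such as '0,32' or '0-1' into CPU ids."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def primary_threads(siblings_by_cpu: dict[int, str]) -> list[int]:
    """Return one logical CPU (the lowest-numbered sibling) per physical core."""
    return sorted({min(parse_cpulist(s)) for s in siblings_by_cpu.values()})


def read_siblings(root: str = "/sys/devices/system/cpu") -> dict[int, str]:
    """Read thread_siblings_list for every online CPU from sysfs."""
    siblings: dict[int, str] = {}
    for path in Path(root).glob("cpu[0-9]*/topology/thread_siblings_list"):
        cpu_id = int(path.parent.parent.name[len("cpu"):])
        siblings[cpu_id] = path.read_text().strip()
    return siblings


if __name__ == "__main__":
    print("one CPU per core:", primary_threads(read_siblings()))
```

The resulting list can be passed straight to os.sched_setaffinity or formatted for taskset -c.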

False Sharing Across NUMA Nodes

False sharing - two threads writing different variables that happen to share a 64-byte cache line - is expensive within a socket (cache line bounces through the L3 coherence protocol). Across sockets, it is catastrophically expensive: each write requires the cache line to be invalidated on the remote socket and fetched back across the QPI/UPI interconnect. A false-sharing pair with one thread per socket can reduce throughput by 100x compared to a correctly padded version.

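#include <cstdint>

// WRONG: thread 0 and thread 1 write to adjacent counters on the same cache line
struct Counters {
    int64_t thread0_count;  // bytes 0-7
    int64_t thread1_count;  // bytes 8-15  ← same 64-byte line as thread0_count
};

// CORRECT: pad each counter to its own cache line
struct alignas(64) PaddedCounter {
    int64_t count;
    char _pad[56];  // explicit padding; alignas(64) already rounds sizeof up to 64
};
static_assert(sizeof(PaddedCounter) == 64, "one counter per cache line");
PaddedCounter counters[NUM_THREADS];  // NUM_THREADS: compile-time constant defined elsewhere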

On a multi-socket system, also ensure each thread’s counter is allocated on NUMA-local memory, not just padded.
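
The layout argument can be checked mechanically. The sketch below mirrors the two structs with ctypes and tests whether two byte offsets fall on the same cache line. It assumes 64-byte lines and a 64-byte-aligned base address, which ctypes itself does not guarantee. It checks layout only and measures nothing.

```python
import ctypes

CACHE_LINE = 64  # assumed line size; typical for current x86 server CPUs


class Counters(ctypes.Structure):
    """Unpadded layout: both counters share one cache line."""
    _fields_ = [("thread0_count", ctypes.c_int64),
                ("thread1_count", ctypes.c_int64)]


class PaddedCounter(ctypes.Structure):
    """Padded layout: one counter plus filler, 64 bytes in total."""
    _fields_ = [("count", ctypes.c_int64),
                ("_pad", ctypes.c_char * (CACHE_LINE - ctypes.sizeof(ctypes.c_int64)))]


def share_cache_line(offset_a: int, offset_b: int, line: int = CACHE_LINE) -> bool:
    """True if two byte offsets from a line-aligned base hit the same line."""
    return offset_a // line == offset_b // line


if __name__ == "__main__":
    unpadded = share_cache_line(Counters.thread0_count.offset,
                                Counters.thread1_count.offset)
    stride = ctypes.sizeof(PaddedCounter)
    padded = share_cache_line(0 * stride, 1 * stride)
    print("unpadded counters share a line:", unpadded)
    print("padded counters share a line:", padded)
```

A check like this fits well in a unit test, where it catches a later field addition that quietly pushes two hot counters back onto one line.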

Databases and NUMA Awareness

Databases are the canonical NUMA-sensitive application.

PostgreSQL has historically had no built-in CPU-affinity settings, but its processes (backends, bgwriter, checkpointer, autovacuum workers) can be bound externally with taskset, numactl, or cgroup cpusets, keeping I/O-intensive processes on the socket nearest their data. When configured naively on a NUMA system, PostgreSQL suffers from remote memory accesses in its shared buffer pool - a problem that tuning the kernel.numa_balancing and vm.zone_reclaim_mode sysctls can partially mitigate.

Oracle Database has explicit NUMA awareness options that partition the buffer cache across NUMA nodes and pin threads to their memory regions. For large in-memory workloads on 4-socket systems, Oracle’s NUMA configuration can make the difference between linear scaling and a 2x wall.

Apache Kafka achieves zero-copy message delivery via sendfile(), which moves data from the page cache directly to the NIC’s DMA buffer without passing through user space. On NUMA systems, this is only zero-overhead if the page cache pages and the NIC are on the same NUMA node - otherwise the sendfile() path crosses the interconnect. Kafka deployments that care about throughput align NIC placement with Kafka’s data directory NUMA node.
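
Finding out which node a NIC sits on is a one-file lookup on Linux: PCI-backed interfaces publish it at /sys/class/net/IFACE/device/numa_node, with -1 meaning the platform did not report a node. A minimal sketch follows. The function name and the interface name eth0 are placeholders.

```python
from pathlib import Path
from typing import Optional


def nic_numa_node(iface: str, sysfs_root: str = "/sys/class/net") -> Optional[int]:
    """Return the NUMA node of a network interface, or None if unknown.

    Virtual interfaces (loopback, bridges) have no 'device' directory, and
    some platforms report -1; both cases are mapped to None.
    """
    path = Path(sysfs_root) / iface / "device" / "numa_node"
    try:
        node = int(path.read_text().strip())
    except (OSError, ValueError):
        return None
    return node if node >= 0 else None


if __name__ == "__main__":
    node = nic_numa_node("eth0")  # 'eth0' is a placeholder interface name
    if node is None:
        print("NUMA node unknown; nothing to align against")
    else:
        print(f"NIC is on node {node}; consider: numactl --cpunodebind={node} --membind={node}")
```

The same numa_node attribute exists for other PCI devices, so the lookup carries over to NVMe drives and GPUs with a different sysfs path.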

Cloud NUMA: AWS and the Nitro Architecture

AWS instances have NUMA topology too, though it is less visible.

Memory-optimized instances (r-family) are explicitly NUMA-aware in their design. An r6i.32xlarge has 128 vCPUs and 1TB of RAM, structured with multiple NUMA nodes underneath. AWS exposes this via numactl --hardware inside the instance. Applications running on large instances that ignore NUMA topology see the same penalty as on bare metal.

The AWS Nitro architecture uses dedicated hardware for virtualization overhead - the Nitro card handles network and storage I/O, freeing host CPU cores for the guest. The Nitro card is attached to a specific PCIe bus, which is on a specific NUMA node. Network-intensive workloads benefit from pinning their threads to the NUMA node local to the Nitro card, reducing cross-fabric traffic.

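For ML training on AWS GPU instances (p4d, p4de), NUMA matters on the host side: each GPU and each network adapter attaches over PCIe to one CPU socket, so the processes that feed a GPU run best on that GPU’s local NUMA node. The NVLink fabric between GPUs is a separate topology; nvidia-smi topo -m shows both the GPU-to-GPU links and each GPU’s CPU and NUMA affinity. PyTorch distributed training on these instances benefits from NCCL’s topology detection, which takes NVLink, PCIe, and NUMA placement into account.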

Kernel Isolation for Extreme Cases

For the most latency-sensitive workloads, NUMA awareness and CPU pinning are not enough. The OS itself introduces jitter: scheduler ticks, RCU callbacks, timer interrupts, workqueue processing.

The full isolation stack on Linux:

# In /etc/default/grub:
GRUB_CMDLINE_LINUX="isolcpus=4-7 nohz_full=4-7 rcu_nocbs=4-7"
  • isolcpus=4-7: removes CPUs 4-7 from the scheduler’s general load-balancing pool. Ordinary tasks run there only if they are explicitly pinned, though some per-CPU kernel threads still do.
  • nohz_full=4-7: stops the periodic scheduling-clock interrupt (normally every 1ms or 4ms, depending on the kernel’s HZ setting) on these CPUs while a CPU has at most one runnable task. Eliminates a major source of regular latency spikes.
  • rcu_nocbs=4-7: offloads Read-Copy-Update callbacks to other CPUs, preventing RCU work from interrupting the isolated cores.

Combined with IRQ affinity (directing NIC interrupts to non-isolated CPUs), SCHED_FIFO priority for the hot-path thread, and mlockall() to keep the process’s pages resident in RAM, this configuration can achieve single-digit microsecond scheduling jitter on commodity hardware.
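
After a reboot it is worth confirming that the three parameters took effect and cover the same CPUs. The sketch below parses a kernel command line such as the contents of /proc/cmdline. The function names are hypothetical, and the parser ignores the optional flag words that isolcpus accepts before its CPU list.

```python
def parse_cpulist(text: str) -> list[int]:
    """Expand a kernel CPU list such as '4-7' or '2,4-5' into CPU ids."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def isolation_settings(cmdline: str) -> dict[str, set[int]]:
    """Extract the CPU sets given to isolcpus, nohz_full and rcu_nocbs."""
    wanted = ("isolcpus", "nohz_full", "rcu_nocbs")
    settings: dict[str, set[int]] = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in wanted:
            # Keep only the numeric parts; skip flag words such as 'domain'.
            numeric = [p for p in value.split(",") if p and p[0].isdigit()]
            settings[key] = set(parse_cpulist(",".join(numeric)))
    return settings


def consistently_isolated(settings: dict[str, set[int]]) -> bool:
    """True if all three parameters are present and name the same CPUs."""
    sets = [settings.get(k) for k in ("isolcpus", "nohz_full", "rcu_nocbs")]
    return all(s is not None for s in sets) and sets[0] == sets[1] == sets[2]


if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        found = isolation_settings(f.read())
    print(found, "consistent:", consistently_isolated(found))
```

A mismatch between the three sets is a common misconfiguration: a CPU that is isolated from the scheduler but still takes ticks or RCU callbacks will show jitter that is hard to trace.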

This level of tuning is appropriate for: high-frequency trading systems, industrial real-time control, network appliances processing line-rate traffic. For everything else, NUMA awareness (topology understanding, affinity pinning, NUMA-local allocation) delivers most of the benefit without the operational complexity.

The Honest Assessment

NUMA optimization is micro-optimization in most contexts. A web application serving database-backed requests is almost never bottlenecked on NUMA topology. The right order is: fix algorithmic complexity first, then reduce lock contention, then address memory access patterns, then - if you are still bottlenecked and profiling shows remote NUMA accesses - add NUMA awareness.

The contexts where NUMA awareness is not optional: databases with multi-socket deployments and large buffer pools, real-time data processing on multi-socket servers, ML training on large instances where interconnect bandwidth matters, kernel components and storage drivers. If you are building those, NUMA is not a micro-optimization - it is a first-class architectural concern that must be addressed at design time.

Summary

| Concept | What It Does | When It Matters |
|---|---|---|
| NUMA | Memory latency varies by socket proximity | Any multi-socket server workload |
| CPU affinity (taskset, sched_setaffinity) | Prevents thread migration across sockets | Latency-sensitive, cache-sensitive code |
| NUMA-local allocation (libnuma) | Allocates memory near the allocating thread | Prevents remote DRAM accesses |
| numactl | Binds process to CPU node and memory node | Simplest NUMA binding for existing programs |
| First-touch policy | Pages go to the node that first touches them | Must understand for correct NUMA-local init |
| Hyperthreading | Shares execution units per physical core | Hurt compute-bound; help I/O-bound |
| False sharing | Cache line bounced between threads on different sockets | Padding/alignment fixes 100x throughput regressions |
| isolcpus + nohz_full | Removes OS jitter from dedicated cores | HFT, real-time, line-rate networking |
