CPU Affinity & NUMA
The Topology Problem
Modern servers are not a flat pool of compute resources with uniform memory access. They are layered hierarchies of processors, caches, memory controllers, and interconnects - and treating them as uniform is one of the fastest ways to leave performance on the table.
Understanding CPU topology is not optional for systems that care about latency. The OS scheduler, left to its own devices, will happily migrate your thread across sockets, invalidate your warm cache, and route your memory accesses through a remote controller - all while claiming to be helpful.
Multi-Socket Systems and NUMA
A physical server with two CPU sockets contains two separate processor packages. Each package has its own cores, its own L1/L2/L3 caches, and its own memory controller attached to a subset of the system’s RAM. The two packages are connected by a high-speed interconnect - QPI/UPI (Intel QuickPath/UltraPath Interconnect) or Infinity Fabric (AMD).
This architecture is called NUMA: Non-Uniform Memory Access. Memory is not equidistant from all cores. A core on socket 0 accessing RAM attached to socket 0 sees approximately 100ns latency. The same core accessing RAM attached to socket 1 crosses the interconnect and sees approximately 200ns - twice the latency, plus bandwidth contention on the link.
     Socket 0                      Socket 1
┌─────────────────┐           ┌─────────────────┐
│ Core 0   Core 1 │   QPI/    │ Core 8   Core 9 │
│ Core 2   Core 3 │◄─────────►│ Core 10  Core 11│
│    L3 Cache     │    UPI    │    L3 Cache     │
│ Mem Controller  │           │ Mem Controller  │
└────────┬────────┘           └────────┬────────┘
         │                             │
     DIMM 0-3                      DIMM 4-7
  (NUMA Node 0)                 (NUMA Node 1)
For a workload running entirely on socket 0, remote memory accesses are pure overhead with no benefit. The OS does not always prevent this.
Inspecting Topology
Before tuning anything, understand what you have.
lscpu gives a quick summary:
$ lscpu
Architecture:        x86_64
CPU(s):              32
Thread(s) per core:  2          ← hyperthreading enabled
Core(s) per socket:  8
Socket(s):           2
NUMA node(s):        2
NUMA node0 CPU(s):   0-7,16-23
NUMA node1 CPU(s):   8-15,24-31
numactl --hardware shows NUMA distances - the normalized latency between nodes:
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 64345 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 64508 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
A distance of 10 is local; 21 is cross-socket (about 2.1x slower).
lstopo (from the hwloc package) renders the full topology graphically or as text, including which CPU IDs share L3 cache, which cores share a physical core under hyperthreading, and which PCI devices are attached to which NUMA node - critical for NIC placement in networking workloads.
CPU Affinity: Pinning Threads
CPU affinity restricts a thread or process to run only on a specified set of CPU cores. This prevents the scheduler from migrating the thread, which would invalidate warm caches and potentially move execution to a remote NUMA node.
taskset sets affinity for an existing process or at launch:
# Run a process on CPU 0 and 1 only
taskset -c 0,1 ./my_program
# Pin an existing process (by PID) to core 4
taskset -cp 4 12345
pthread_setaffinity_np does the same from within a C/C++ program:
#define _GNU_SOURCE   // required by glibc for pthread_setaffinity_np
#include <pthread.h>
#include <sched.h>

cpu_set_t cpuset;
CPU_ZERO(&cpuset);            // start from an empty CPU set
CPU_SET(3, &cpuset);          // pin to core 3
pthread_t thread = pthread_self();
int rc = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
// rc is nonzero on failure (e.g. core 3 outside the allowed set)
In Python, the equivalent is os.sched_setaffinity(pid, cpus), available on Linux; a pid of 0 means the calling process.
NUMA-Aware Memory Allocation
Pinning a thread to socket 0 cores is only half the job. If that thread allocates memory via the default allocator, the kernel may satisfy the request from socket 1’s memory banks - you’ve pinned the compute but not the data.
numactl can bind both CPU and memory at launch:
# Run on node 0 cores, allocate memory only from node 0
numactl --cpunodebind=0 --membind=0 ./my_program
From code, use libnuma:
#include <numa.h>   // link with -lnuma
// Allocate on the node local to the current thread's CPU
void *buf = numa_alloc_local(size);
// Allocate on a specific node
void *buf2 = numa_alloc_onnode(size, 0);
// Free
numa_free(buf, size);
numa_alloc_local() is particularly useful in thread startup code: if the thread is already pinned to a NUMA node, local allocation is automatic.
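Putting the two halves together - a minimal sketch, assuming libnuma is installed and that core 3 belongs to node 0 as in the example topology (compile with -lnuma -lpthread):

#define _GNU_SOURCE   // for pthread_setaffinity_np (glibc)
#include <numa.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg) {
    (void)arg;
    // Pin this thread to core 3 (a node-0 core in the example topology)
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(3, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);

    // The thread is now bound to node 0, so numa_alloc_local()
    // draws from node 0's memory banks.
    size_t size = 1 << 20;                // 1 MB working buffer
    void *buf = numa_alloc_local(size);
    // ... hot-path work on buf ...
    numa_free(buf, size);
    return NULL;
}

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma: NUMA not supported on this system\n");
        return 1;
    }
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}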
Hyperthreading: Shared Physical Core
Intel’s Hyperthreading (and AMD’s equivalent SMT) exposes two logical CPUs per physical core. The two logical cores share execution units, L1 and L2 caches, and TLBs - only the register files and some buffers are separate.
The consequence: two threads running on sibling hyperthreads compete for the same execution resources. For I/O-bound threads that frequently stall waiting for memory or network, hyperthreading is beneficial - one thread’s stall lets the other run. For compute-bound or memory-bandwidth-bound workloads, hyperthreading often hurts: both threads fight over the same cache and execution ports, and effective throughput can drop.
Identifying sibling pairs:
# For each physical core, shows which logical CPUs are siblings
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
# Output: 0,16 ← logical CPU 0 and 16 share a physical core
In latency-critical systems, it's common to use only one logical CPU per physical core - dedicate one hyperthread to the hot path and leave its sibling idle (or let it handle OS interrupts).
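To build such a one-logical-CPU-per-core set programmatically, the sysfs files above can be parsed directly. A sketch, assuming the comma-separated sibling format shown in the example output (some kernels report ranges like "0-1" instead, which this simple parser does not handle):

#include <fstream>
#include <iostream>
#include <set>
#include <sstream>
#include <string>

int main() {
    std::set<int> chosen;   // one logical CPU per physical core
    std::set<int> seen;     // logical CPUs already accounted for
    for (int cpu = 0;; ++cpu) {
        std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu)
                        + "/topology/thread_siblings_list");
        if (!f) break;                  // no such CPU: we've seen them all
        if (seen.count(cpu)) continue;  // sibling of a CPU we already chose
        std::string list, tok;
        std::getline(f, list);          // e.g. "0,16"
        std::stringstream ss(list);
        bool first = true;
        while (std::getline(ss, tok, ',')) {
            int sib = std::stoi(tok);
            seen.insert(sib);
            if (first) { chosen.insert(sib); first = false; }
        }
    }
    for (int c : chosen) std::cout << c << ' ';   // e.g. "0 1 2 ... 15"
    std::cout << '\n';
}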
False Sharing Across Sockets
False sharing - two threads modifying different variables that reside on the same 64-byte cache line - is damaging even within a single socket. Across sockets, it is catastrophically expensive. Each write requires the cache line to be invalidated on the remote socket and fetched back, crossing the QPI/UPI interconnect on every operation.
The fix is alignment:
// BAD: both counters share one cache line
struct Counters {
    int64_t thread0_count;
    int64_t thread1_count;
};

// GOOD: give each counter its own cache line
struct alignas(64) PaddedCounter {
    int64_t count;
    char _pad[56];   // pad the struct out to a full 64-byte line
};

PaddedCounter counters[NUM_THREADS];
On multi-socket systems, also ensure each thread’s counter lives on memory local to that thread’s socket.
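A sketch of that last point, assuming the PaddedCounter definition above and libnuma: allocate each counter on the owning thread's node and construct it in place.

#include <new>      // placement new
#include <numa.h>   // link with -lnuma

// Hypothetical helper: place one PaddedCounter on the given NUMA node.
// numa_alloc_onnode returns page-aligned memory, so the alignas(64)
// requirement is satisfied.
PaddedCounter *make_local_counter(int node) {
    void *mem = numa_alloc_onnode(sizeof(PaddedCounter), node);
    return mem ? new (mem) PaddedCounter{} : nullptr;
}

// Release with numa_free(counter, sizeof(PaddedCounter)) when done.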
Cache Topology and Sharing
L1 cache (typically 32-64KB): private to each physical core, shared by its hyperthreads. Fastest access, ~4 cycles.
L2 cache (typically 256KB-1MB): private to each physical core (shared between hyperthreads). ~12 cycles.
L3 cache (typically 4-64MB): shared across all cores within a socket. ~40 cycles. Not shared across sockets.
This matters for communication patterns. Threads on the same socket sharing data through L3 cache are far cheaper to synchronize than threads on different sockets, which must go through system memory and the interconnect.
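You can confirm the sharing boundaries on a particular machine via sysfs; on x86, index3 under each CPU's cache directory is usually the L3 (the level file in the same directory confirms it):

#include <fstream>
#include <iostream>
#include <string>

int main() {
    // Which logical CPUs share cpu0's last-level cache?
    std::ifstream f("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list");
    std::string cpus;
    std::getline(f, cpus);
    std::cout << "cpu0 shares its L3 with: " << cpus << '\n';  // e.g. 0-7,16-23
}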
Examples
Benchmarking NUMA-local vs. remote memory access:
# Allocate and access memory on NUMA node 0 from node 0 CPUs
numactl --cpunodebind=0 --membind=0 ./membench --size=1GB
# Same benchmark but access node 1 memory from node 0 CPUs
numactl --cpunodebind=0 --membind=1 ./membench --size=1GB
# Typical result: ~2x difference in bandwidth, ~2x in latency
Pinning ML training workers to NUMA nodes:
import os
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Each worker pinned to its own NUMA node
    numa_node = rank % 2   # assuming 2 NUMA nodes
    cores = list(range(numa_node * 8, (numa_node + 1) * 8))
    os.sched_setaffinity(0, cores)   # 0 = this process
    # Now data loading and gradient computation stay NUMA-local
    train(rank, world_size)

if __name__ == "__main__":
    mp.spawn(worker, nprocs=4, args=(4,))
Measuring hyperthreading impact on matrix multiply:
# Run two matrix multiply processes on sibling hyperthreads (same physical core)
taskset -c 0 ./matmul &
taskset -c 16 ./matmul &
wait
# vs. two physical cores on the same socket:
taskset -c 0 ./matmul &
taskset -c 1 ./matmul &
wait
# Expected: sibling pair is 20-40% slower for compute-bound workloads
IRQ Affinity and Isolating Cores
For the most latency-sensitive workloads, you can go further:
IRQ affinity: network interrupts can be directed away from latency-critical cores using /proc/irq/N/smp_affinity. This prevents NIC interrupts from firing on cores running your hot-path threads (a sketch follows at the end of this section).
isolcpus: a Linux kernel boot parameter that removes specified cores from the scheduler’s pool entirely. The OS will never schedule normal tasks there - only threads that explicitly set their affinity to those cores. This eliminates scheduler jitter from background tasks.
# In /etc/default/grub:
GRUB_CMDLINE_LINUX="isolcpus=4-7 nohz_full=4-7 rcu_nocbs=4-7"
The combination of isolcpus, nohz_full (disabling the scheduling-clock interrupt), and rcu_nocbs (offloading RCU callbacks) can reduce OS-induced latency jitter to the single-digit microsecond range.
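For the IRQ side, a minimal sketch, assuming a hypothetical IRQ number 24 for a NIC queue; it is equivalent to echoing a hex CPU mask into the smp_affinity file as root:

#include <fstream>

int main() {
    // Allow IRQ 24 only on CPUs 0-3 (mask 0x0f), keeping it off the
    // isolated hot-path cores 4-7 from the grub example above.
    std::ofstream f("/proc/irq/24/smp_affinity");
    f << "0f" << std::flush;
    return f.good() ? 0 : 1;   // fails unless run as root
}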