Cache-Friendly Code - Letting the Hardware Do What It Was Built For // Megha Bose

Helpful context:

Two programs. Same language, same algorithm, same number of arithmetic operations, same asymptotic complexity. One runs in 0.3 seconds. The other runs in 3 seconds.

The difference is memory access order.

Here is the concrete version. A square matrix of floats, 4096 × 4096 elements. Traverse every element and multiply by 2. Doing it row by row - touching elements in the order they sit in memory - completes in under half a second. Doing it column by column - jumping 4096 × 4 bytes = 16 KB between each access - takes 10× longer on the same hardware with the same CPU doing the same multiplication.

This is the memory wall, and understanding it is the difference between code that looks efficient and code that actually is.

Why the Gap Exists: The Memory Hierarchy

A modern CPU core can execute 3 - 5 billion operations per second. Main memory (DRAM) can deliver data at a rate that supports roughly 100 - 200 million cache-line reads per second. The computation-to-memory-delivery ratio is wildly mismatched.

CPUs bridge this gap with a hierarchy of caches:

Level	Typical Latency	Typical Size	Scope
Registers	~0 cycles	<1 KB	Per core
L1 cache	~4 cycles	32 - 64 KB	Per core
L2 cache	~12 cycles	256 KB - 1 MB	Per core
L3 cache	~40 cycles	8 - 64 MB	Shared across cores
DRAM	~200 cycles	GBs	Shared
NVMe SSD	~100,000 cycles	TBs	Shared

An L1 hit costs 4 cycles. A DRAM miss costs 200 cycles. That 50× difference determines whether your inner loop runs at 3 GFLOPS or 60 MFLOPS. The L3-to-DRAM boundary is where most real performance cliffs live.

Cache Lines: The Atom of Memory Transfer

Caches do not operate on individual bytes. They operate on cache lines - 64 bytes on virtually every modern CPU. When you access any byte in memory, the entire 64-byte block containing it is loaded into L1 cache.

This has enormous practical consequences:

If you access byte 0 of a 64-byte struct, bytes 1 - 63 are also in L1 cache for free. Access them within the same loop iteration and you pay nothing additional.
If you access memory in a pattern that touches a new cache line on every step, you pay a full DRAM latency on every access.

You can verify cache-line size on Linux:

cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
# 64

The practical implication: iterating through a C array with stride 1 and stride 8 (every 32 bytes) have nearly identical performance - both touch the same cache lines. Stride 16 (every 64 bytes, one per cache line) starts to reveal the miss penalty. Stride 64+ shows the full penalty.

Spatial Locality: Access Memory Like a Typewriter

Spatial locality means accessing addresses that are close together in memory. Sequential access is optimal - every access hits a cache line already loaded by the previous access.

The most teachable example is matrix traversal. In C, a 2D array is stored in row-major order: row 0 is contiguous, then row 1, then row 2. Column access jumps by an entire row’s worth of bytes on every step.

#define N 4096
float a[N][N];  // 64 MB

// Cache-friendly: row-major, sequential access
// ~0.3 seconds for N=4096
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        a[i][j] *= 2.0f;

// Cache-hostile: column-major, N*4 = 16384 bytes between accesses
// ~3 seconds for N=4096 - same operations, 10x slower
for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
        a[i][j] *= 2.0f;

The inner loop of the column-major version touches a new cache line on every iteration. For N=4096, that is 4096 × 4096 = 16 million cache misses in the inner loop alone. Each one costs ~200 cycles, turning a simple multiply into a memory-latency benchmark.

NumPy arrays are C-contiguous (row-major) by default. Iterating over rows is fast; iterating over columns is slow. Fortran-order arrays (column-major) are available via np.asfortranarray for cases where column access dominates.

Temporal Locality: Reuse What You Loaded

Temporal locality means accessing the same data multiple times in a short window, before it gets evicted from cache. An algorithm with good temporal locality does heavy computation on a small dataset that fits in L1 before moving on.

Matrix multiplication is the canonical example of temporal locality done right - or wrong. Naive matrix multiply:

// Naive ijk order: B is accessed column-wise - cache-hostile
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];

For large matrices, B[k][j] jumps by N × sizeof(float) bytes between iterations of the k loop. Every access to B potentially misses cache.

The fix is loop tiling (blocking): restructure the computation to work on small sub-blocks that fit in cache, completing all computation on each block before moving on.

#define BLOCK 64
for (int ii = 0; ii < N; ii += BLOCK)
  for (int kk = 0; kk < N; kk += BLOCK)
    for (int jj = 0; jj < N; jj += BLOCK)
      for (int i = ii; i < ii+BLOCK; i++)
        for (int k = kk; k < kk+BLOCK; k++)
          for (int j = jj; j < jj+BLOCK; j++)
            C[i][j] += A[i][k] * B[k][j];

The inner triple loop works on three 64×64 blocks: 64×64×4 bytes = 16 KB each, well within L1 (32 - 64 KB). The same data is reused repeatedly before eviction. Tiled matrix multiply is 5 - 10× faster than naive for large matrices, and the performance gap grows with matrix size.

This is why BLAS (Basic Linear Algebra Subroutines) libraries exist - they implement exactly these optimizations, plus SIMD vectorization, for every architecture. numpy.dot calls BLAS under the hood. Writing matrix multiply yourself and matching BLAS performance takes years of expertise.

Array-of-Structs vs Struct-of-Arrays

Consider a particle simulation: each particle has position, velocity, and mass. Two ways to represent 1 million particles:

// Array of Structs (AoS)
struct Particle {
    float x, y, z;
    float vx, vy, vz;
    float mass;
    float charge;       // 32 bytes total per particle
};
struct Particle particles[1000000];

// Struct of Arrays (SoA)
struct Particles {
    float x[1000000];
    float y[1000000];
    float z[1000000];
    float vx[1000000];
    float vy[1000000];
    float vz[1000000];
    float mass[1000000];
    float charge[1000000];
};

The AoS layout is intuitive - one particle, one struct. But if your inner loop only updates positions (x, y, z), each cache line loaded for one particle also contains vx, vy, vz, mass, charge - fields you never touch. Half the cache line is wasted.

The SoA layout keeps all x values contiguous, all y values contiguous, etc. A position-update loop reads exactly the data it needs, filling cache lines with useful values. SoA typically achieves 2 - 4× speedup for partial-field access.

SoA also enables SIMD vectorization: AVX2 can multiply 8 floats in a single instruction, but only if they are contiguous in memory. SoA makes x[i] through x[i+7] contiguous; AoS does not.

Game engines, physics simulators, and high-performance databases use SoA or a hybrid (structure-of-arrays-of-structs) layout for exactly this reason. This design philosophy is called Data-Oriented Design - organize data by access pattern, not by conceptual grouping.

Two threads, two independent counters, no shared state. Should be perfectly parallel. Yet:

long counter[2];  // 16 bytes total - both in the same cache line

void thread0() {
    for (int i = 0; i < 100000000; i++) counter[0]++;
}
void thread1() {
    for (int i = 0; i < 100000000; i++) counter[1]++;
}

Running this with two threads is slower than single-threaded. The CPUs share an L3 cache with a coherence protocol (MESI - Modified, Exclusive, Shared, Invalid). When thread 0 writes counter[0], it acquires exclusive ownership of that cache line - which includes counter[1]. Thread 1’s cached copy is invalidated. Thread 1 must fetch the line again, then invalidate thread 0’s copy. This ping-pong continues at cache-coherence speed, which is slower than running single-threaded.

The fix: pad each counter to its own cache line.

struct alignas(64) PaddedCounter {
    long value;
    // 56 bytes of padding to reach 64-byte alignment
    char _pad[56];
};
struct PaddedCounter counter[2];

// Now counter[0] and counter[1] live on separate cache lines.
// No false sharing. Both threads run at full speed.

False sharing is notoriously hard to diagnose because the code looks correct - there is no logical data race. The symptom is poor parallel scaling that only appears at runtime. Linux perf stat -e cache-misses and Intel VTune’s memory analysis can identify it.

Hardware Prefetching: Your Free Optimizer

The CPU contains a hardware prefetcher that monitors memory access patterns and speculatively loads cache lines before they are needed. For sequential access - stride-1 or even constant-stride patterns - the prefetcher is remarkably good. It detects the pattern and issues prefetch requests several cache lines ahead of the current position.

This means straightforward sequential code often runs close to theoretical memory bandwidth even without programmer intervention. The prefetcher saves you.

Where it fails: irregular access patterns. Pointer chasing (linked list traversal), hash table lookups, random permutations - the prefetcher cannot predict the next address because it depends on data, not pattern. These workloads are fundamentally memory-latency bound.

When you know a future access pattern ahead of time, you can issue explicit prefetches:

for (int i = 0; i < N; i++) {
    __builtin_prefetch(&data[i + 16], 0, 1);  // prefetch 16 iterations ahead
    process(data[i]);
}

The right prefetch distance depends on memory latency and loop body execution time. Too close: the data is not ready yet. Too far: you evict data you still need. Profile with perf stat -e cache-misses,cache-references before adding manual prefetches - the hardware prefetcher handles most sequential patterns without help.

NUMA: When “Memory” Isn’t Uniform

In multi-socket servers, each CPU socket has its own directly attached DRAM. Accessing local DRAM takes ~100 ns; accessing the other socket’s DRAM through the inter-connect (QPI, NVLink, Infinity Fabric) takes ~300 ns. This Non-Uniform Memory Access (NUMA) topology means where your memory is matters, not just how much you have.

AWS r5 and r6 instance families (memory-optimized) have multiple NUMA nodes. A process that allocates memory on NUMA node 0 but gets scheduled to run on a core belonging to NUMA node 1 pays the remote access penalty on every memory access - often 2 - 3× slower than local access.

numactl --hardware          # show NUMA topology
numactl --membind=0 ./prog  # bind memory allocation to node 0

Kubernetes and Linux’s NUMA-aware scheduling try to place processes on cores local to their memory. When this fails - due to memory pressure or scheduling errors - performance degrades silently. Database servers (especially PostgreSQL and MySQL) have NUMA configuration knobs precisely because this matters.

GPU memory bandwidth dwarfs CPU memory bandwidth by design. An NVIDIA A100 has 2 TB/s of HBM2e bandwidth vs. ~200 GB/s for a high-end CPU. GPUs are architected for massively parallel memory access - thousands of CUDA cores each issuing memory requests, with the memory controller designed to coalesce adjacent requests and hide latency. CPU code optimized for cache-friendly access is implicitly optimizing for a bandwidth ceiling that GPUs have already blown past, which is why the right answer for batch matrix operations is always “use the GPU.”

When NOT to Cache-Optimize

The most important advice about cache optimization: profile first.

Modern hardware prefetchers handle sequential access beautifully. Modern compilers auto-vectorize loops that have obvious patterns. The cases where manual cache optimization gives dramatic wins are real but specific: tight inner loops, large working sets, irregular access patterns.

Premature cache optimization produces code that is harder to read, harder to maintain, and - without profiling evidence - may not even be the bottleneck. A Python script that takes 2 seconds is almost certainly not cache-bound; it is interpreter-bound. A SQL query that takes 10 seconds is almost certainly not cache-bound; it is waiting on disk I/O or doing a full table scan.

The workflow for finding cache-related performance issues:

perf stat -e cache-misses,cache-references,L1-dcache-load-misses ./program
# cache-miss-rate = cache-misses / cache-references
# >5% for a compute-bound workload → investigate

Then use perf record and perf report to identify which source lines are generating the misses. Only then write cache-optimized code.

Concept	Rule of Thumb
Cache line size	64 bytes - the atom of memory transfer
L1 hit vs DRAM miss	~4 cycles vs ~200 cycles; 50× difference
Spatial locality	Access arrays sequentially; column-major is 10× slower
Temporal locality	Work on small blocks that fit in L1 before moving on
False sharing	Different data, same cache line - use 64-byte alignment padding
AoS vs SoA	SoA wins when you access one field at a time
Hardware prefetcher	Handles sequential patterns automatically; fails on pointer chasing
NUMA	Remote memory is 2 - 3× slower; pin processes to local NUMA nodes

Read Next: