Cache-Friendly Code
Prerequisite: Memory Management
A modern CPU core can execute several billion operations per second. Main memory can deliver data at a rate roughly equivalent to a few hundred million 64-byte chunks per second. This mismatch - often called the memory wall - means that for data-intensive code, the bottleneck is almost never the ALU. It is the memory subsystem. Writing cache-friendly code is often the single highest-leverage optimisation available.
The Cache Hierarchy
CPUs bridge the speed gap with a hierarchy of caches, each faster but smaller than the next:
| Level | Typical Latency | Typical Size | Shared? |
|---|---|---|---|
| L1 | ~4 cycles | 32–64 KB | Per core |
| L2 | ~12 cycles | 256 KB–1 MB | Per core |
| L3 | ~40 cycles | 8–32 MB | All cores |
| DRAM | ~200 cycles | GBs | All cores |
A cache miss that goes all the way to DRAM costs roughly 50× more than an L1 hit. If your inner loop misses to DRAM on every iteration, the core spends on the order of 98% of its time stalled on memory rather than computing.
Cache Lines
Caches do not store individual bytes. They operate on cache lines of 64 bytes. When you access one byte in memory, the entire 64-byte block containing it is loaded into L1. This has two implications:
- If you access nearby memory immediately after, it is already in cache (free).
- If you access memory scattered randomly, you pay a full cache-miss penalty for each access.
You can observe the cache-line effect directly. For an array of 4-byte ints, a traversal with stride 1 and one with stride 16 (one element per 64-byte line) take roughly the same total time, because both still touch every cache line and fetching those lines is what dominates. Only once the stride exceeds 16 elements do whole lines get skipped, so the total time falls even though every individual access is now a miss.
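A minimal C sketch of that experiment (the array size, strides, and timing scaffolding are illustrative; it assumes 4-byte ints and 64-byte lines):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define NUM_INTS (16 * 1024 * 1024)  // 16 M ints = 64 MB, far larger than L3
// Touch every stride-th int and return the elapsed time. With 4-byte ints,
// stride 16 advances exactly one 64-byte cache line per access.
static double touch(int *a, int stride) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < NUM_INTS; i += stride)
        a[i] *= 3;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}
int main(void) {
    int *a = malloc(NUM_INTS * sizeof(int));
    for (long i = 0; i < NUM_INTS; i++) a[i] = 1;  // fault the pages in first
    for (int stride = 1; stride <= 128; stride *= 2)
        printf("stride %3d: %.3f s\n", stride, touch(a, stride));
    free(a);
    return 0;
}
Total time should stay roughly flat up to stride 16 and then fall as whole cache lines start to be skipped.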
Spatial Locality
Spatial locality means accessing memory addresses that are close together in space. The canonical example is matrix traversal.
In C, a 2D array is stored in row-major order: a[0][0], a[0][1], ..., a[0][n-1], a[1][0], .... Iterating row by row accesses contiguous memory - every access is in the same or next cache line. Iterating column by column jumps by the row stride on every step, causing a cache miss on almost every access.
#define N 1024
float a[N][N];
// Cache-friendly: row-major access
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
a[i][j] *= 2.0f;
// Cache-hostile: column-major access (N² cache misses)
for (int j = 0; j < N; j++)
for (int i = 0; i < N; i++)
a[i][j] *= 2.0f; // every access skips N floats = N*4 bytes
For a $1024 \times 1024$ float matrix, the row-major version is typically 5–10× faster.
Temporal Locality
Temporal locality means reusing data that was recently accessed, while it is still in cache. Cache-oblivious algorithms (like recursive matrix multiplication) naturally achieve good temporal locality by working on small subproblems that fit in cache.
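The recursive multiply mentioned above can be sketched as a divide-and-conquer routine - a minimal sketch assuming square power-of-two matrices stored row-major with leading dimension ld; a real implementation would tune the base-case size:
// Cache-oblivious C += A*B on an n-by-n sub-matrix; ld is the leading
// dimension of the full row-major arrays
void matmul_rec(int n, int ld, const float *A, const float *B, float *C) {
    if (n <= 32) {  // base case: three 32x32 float blocks (~12 KB) fit in L1
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
        return;
    }
    int h = n / 2;  // split each matrix into four quadrants and recurse
    const float *A11 = A, *A12 = A + h, *A21 = A + h*ld, *A22 = A + h*ld + h;
    const float *B11 = B, *B12 = B + h, *B21 = B + h*ld, *B22 = B + h*ld + h;
    float *C11 = C, *C12 = C + h, *C21 = C + h*ld, *C22 = C + h*ld + h;
    matmul_rec(h, ld, A11, B11, C11); matmul_rec(h, ld, A12, B21, C11);
    matmul_rec(h, ld, A11, B12, C12); matmul_rec(h, ld, A12, B22, C12);
    matmul_rec(h, ld, A21, B11, C21); matmul_rec(h, ld, A22, B21, C21);
    matmul_rec(h, ld, A21, B12, C22); matmul_rec(h, ld, A22, B22, C22);
}
Every level of recursion halves the working set, so at some depth the blocks fit in L1 no matter what the machine's cache sizes are - hence "cache-oblivious".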
Loop tiling (blocking) restructures loop nests to operate on blocks that fit in L1 or L2:
// Cache-friendly matrix multiply: ikj loop order + tiling
// (assumes C starts zeroed and N is a multiple of BLOCK)
#define BLOCK 64
for (int ii = 0; ii < N; ii += BLOCK)
for (int kk = 0; kk < N; kk += BLOCK)
for (int jj = 0; jj < N; jj += BLOCK)
for (int i = ii; i < ii+BLOCK; i++)
for (int k = kk; k < kk+BLOCK; k++)
for (int j = jj; j < jj+BLOCK; j++)
C[i][j] += A[i][k] * B[k][j];
The inner triple loop works on three $64 \times 64$ float blocks - about 48 KB in total - which fit in a 48-64 KB L1 and easily in L2; shrink BLOCK on cores with a 32 KB L1. Without tiling, even this ikj order reloads all of B for every row of A, costing roughly $N^3/16$ L1 misses, because B is far too large to stay resident; the naive ijk order is worse still, walking B down a column and missing on essentially every access.
Array of Structs vs Struct of Arrays
Consider a particle simulation with position, velocity, and mass:
// Array of Structs (AoS) - common, but cache-unfriendly for partial access
struct Particle { float x, y, z, vx, vy, vz, mass; };
struct Particle particles[N];
// Struct of Arrays (SoA) - cache-friendly when processing one field at a time
struct Particles {
float x[N], y[N], z[N];
float vx[N], vy[N], vz[N];
float mass[N];
};
If a pass touches only positions - say, translating the whole system or streaming them out for rendering - the AoS layout drags velocity and mass through the cache anyway, wasting 57% of every cache line (16 of the 28 bytes per particle are never used). The SoA layout keeps x, y, z in dense arrays, so every byte fetched is useful. SoA also matters for SIMD vectorisation - the CPU can load 8 floats with a single AVX instruction only when they are contiguous.
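A pass that only shifts positions makes the difference concrete (the function names and the constant offset are illustrative):
// AoS: every 64-byte line fetched also carries vx, vy, vz, mass that this loop never reads
void shift_aos(struct Particle *p, int n, float dx, float dy, float dz) {
    for (int i = 0; i < n; i++) {
        p[i].x += dx;
        p[i].y += dy;
        p[i].z += dz;
    }
}
// SoA: x[], y[], z[] are dense, so every byte fetched is used and the compiler
// can auto-vectorise each loop with contiguous SIMD loads
void shift_soa(struct Particles *p, int n, float dx, float dy, float dz) {
    for (int i = 0; i < n; i++) p->x[i] += dx;
    for (int i = 0; i < n; i++) p->y[i] += dy;
    for (int i = 0; i < n; i++) p->z[i] += dz;
}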
False Sharing in Multithreaded Code
Two threads writing to different variables can still interfere if those variables share a cache line. This is false sharing: the hardware sees writes to the same 64-byte line and invalidates the other core’s copy, forcing expensive coherence traffic.
// False sharing: counter[0] and counter[1] on the same cache line
long counter[2]; // 16 bytes total - both in one cache line
void thread0() { for (int i = 0; i < 1e8; i++) counter[0]++; }
void thread1() { for (int i = 0; i < 1e8; i++) counter[1]++; }
With two threads pinned to different cores, this typically runs slower than the single-threaded version: the shared line ping-pongs between the cores on every increment. The fix is padding:
// Pad each counter to its own cache line
struct PaddedCounter {
long value;
char pad[56]; // 8 + 56 = 64 bytes
} counter[2];
Or use alignas(64) in C11/C++11. With the fix, both threads run at full speed.
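A sketch of the alignas variant (the type name is illustrative):
#include <stdalign.h>
// alignas(64) on the member forces 64-byte alignment, so the struct's size is
// rounded up to 64 bytes and each array element sits on its own cache line
struct AlignedCounter {
    alignas(64) long value;
} aligned_counter[2];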
Prefetching
The hardware prefetcher automatically detects sequential or strided access patterns and fetches cache lines before they are needed. When the pattern is irregular (trees, hash tables, pointer chasing) or the prefetcher cannot keep up, you can issue explicit prefetch hints yourself; the simplest form looks a fixed distance ahead in an array:
// Prefetch the data you'll need 20 iterations from now
for (int i = 0; i < N; i++) {
__builtin_prefetch(&data[i + 20], 0, 1); // read, low temporal locality
process(data[i]);
}
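For genuine pointer chasing, the only address known ahead of time is the next node's, so the usual pattern is to prefetch one hop ahead while working on the current node - a sketch with a hypothetical Node type; it only pays off when there is enough per-node work to overlap with the fetch:
struct Node { struct Node *next; long value; };  // hypothetical node type
long sum_list(const struct Node *head) {
    long total = 0;
    for (const struct Node *n = head; n != NULL; n = n->next) {
        if (n->next)
            __builtin_prefetch(n->next, 0, 1);  // start fetching the next node early
        total += n->value;  // work on the current node overlaps the fetch
    }
    return total;
}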
Over-prefetching wastes bandwidth; under-prefetching leaves the CPU stalled. Profiling with perf stat -e cache-misses tells you whether prefetching is warranted.
Examples
Row-major vs column-major in NumPy:
import numpy as np
import time
N = 4096
A = np.random.rand(N, N).astype(np.float32)
# Row-major traversal (C-contiguous, natural for NumPy)
t0 = time.perf_counter()
for i in range(N):
_ = A[i, :].sum()
print(f"Row traversal: {time.perf_counter()-t0:.3f}s")
# Column-major traversal - strides across rows
t0 = time.perf_counter()
for j in range(N):
_ = A[:, j].sum()
print(f"Col traversal: {time.perf_counter()-t0:.3f}s")
# Column traversal is typically 5-10x slower
Measuring the cache-line effect with a stride benchmark:
import numpy as np
import time

def stride_benchmark(stride):
    # np.ones forces real physical pages; a fresh np.zeros array can be backed by
    # shared copy-on-write zero pages, which would make every read a cache hit
    arr = np.ones(64 * 1024 * 1024 // 8, dtype=np.float64)  # 64 MB
    t0 = time.perf_counter()
    total = 0.0
    for i in range(0, len(arr), stride):
        total += arr[i]
    elapsed = time.perf_counter() - t0
    return elapsed / (len(arr) // stride)  # normalise to time per access

for stride in [1, 2, 4, 8, 16, 32, 64, 128]:
    t = stride_benchmark(stride)
    print(f"stride {stride:4d}: {t * 1e9:6.1f} ns per access")
# In principle the per-access cost rises until the stride spans a full cache line
# (8 float64s = 64 bytes) and then levels off, because beyond that point every
# access fetches its own line. In pure Python the interpreter overhead (and the
# hardware prefetcher) mutes the step; the same experiment in C shows it clearly.
Memory access patterns account for a surprisingly large fraction of real-world performance differences. A sorting algorithm with slightly worse complexity but better cache behaviour often wins in practice. When your profiler points at a hot loop, the question is not just “is this $O(n)$ or $O(n \log n)$” but “how many cache lines does each iteration touch?”