SIMD & Vectorization - Doing the Same Operation to Many Values at Once // Megha Bose

Helpful context:

Adding eight floats one at a time takes eight operations. Adding all eight simultaneously takes one. That is not a trick - it is a feature of every modern CPU, and your programs use it constantly without you knowing.

SIMD (Single Instruction, Multiple Data) is the instruction set extension that enables this. One instruction names multiple data elements packed into a wide register, and the hardware executes the operation on all of them in parallel. NumPy’s np.dot is fast not because Python is fast, but because it ultimately calls a BLAS routine that executes AVX-512 instructions adding 16 floats at a time. PyTorch’s CPU inference is fast for the same reason. The computation your code requests and the computation your hardware executes are very different things - and SIMD is most of the gap.

A Brief History of SIMD ISA Proliferation

Intel introduced MMX in 1996 - 64-bit registers holding two 32-bit integers or eight 8-bit integers. The goal was to accelerate multimedia workloads: video decoding, audio processing. MMX reused the floating-point register file, which caused expensive context switches and prevented mixing MMX and floating-point code. The design was a compromise that satisfied nobody fully.

SSE (1999) introduced dedicated 128-bit xmm registers - 4 floats or 2 doubles side by side. SSE2 (2001) added integer SIMD in the same registers. SSE3, SSSE3, SSE4.1, SSE4.2 followed, each adding specific operations. By the time AVX arrived in 2011 (256-bit ymm registers), the instruction set had dozens of overlapping and partially redundant instructions accumulated over 15 years of incremental additions.

AVX2 (2013) generalized AVX to integer operations. AVX-512 (2017, initially only on Xeon Phi, later on mainstream Intel from Ice Lake onward) doubled again to 512-bit zmm registers. Each extension maintains backward compatibility: ymm0 is the lower 256 bits of zmm0; xmm0 is the lower 128 bits of ymm0.

ARM NEON provides 128-bit SIMD, widely deployed on mobile and Apple Silicon. ARM SVE (Scalable Vector Extension) is architecturally different: the vector length is not fixed at 128 or 256 bits but is specified by the hardware and discoverable at runtime. The same SVE binary runs on cores with 128-bit to 2048-bit vector units. This is a cleaner design than x86’s ISA fragmentation, but backward compatibility constraints mean it will take decades to dominate.

The proliferation problem is real: writing SIMD-optimized code that runs correctly on SSE2, SSE4.2, AVX2, and AVX-512 requires either multiple code paths with runtime dispatch (compile all variants, select at startup based on CPUID) or giving up performance on older hardware. Most numerical libraries implement this - look at NumPy’s CPUID dispatch or glibc’s string functions - and it is a maintenance burden proportional to the number of supported ISA levels.

The SIMD Register File

The registers and their widths determine how many elements you can process per instruction:

Extension	Register	Width	float32 lanes	float64 lanes	int32 lanes
SSE	xmm0 - xmm15	128 bits	4	2	4
AVX	ymm0 - ymm15	256 bits	8	4	-
AVX2	ymm0 - ymm15	256 bits	8	4	8
AVX-512	zmm0 - zmm31	512 bits	16	8	16

One vaddps ymm0, ymm1, ymm2 instruction adds 8 float32 pairs simultaneously. Theoretical throughput: 8x over scalar. In practice, real speedups are 4 - 6x for compute-bound loops due to overhead from loop control, horizontal reductions, and the occasional scalar epilogue for array lengths not divisible by 8.

Auto-Vectorization: The Compiler Does It For You

Modern compilers (GCC, Clang, MSVC) automatically translate scalar loops into SIMD instructions when they can prove it’s safe. The optimization is called auto-vectorization, and it’s enabled by default at -O2 and above when combined with -march=native (which allows the compiler to use any instruction the current CPU supports).

// This loop is an ideal auto-vectorization target:
void add_arrays(float *restrict a, const float *restrict b, int n) {
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}

With gcc -O3 -march=native, this emits AVX2 instructions, processing 8 floats per iteration. The restrict keyword is load-bearing: it tells the compiler that a and b don’t overlap in memory. Without restrict, the compiler must conservatively assume aliasing and cannot legally reorder loads/stores - vectorization is blocked.

To see what the compiler actually produced:

gcc -O3 -march=native -fopt-info-vec-optimized -S myfile.c
# Reports which loops were vectorized and with what vector width

Or use Compiler Explorer : paste the code, select GCC with -O3 -march=x86-64-v3, see the assembly with SIMD instructions highlighted.

Why Auto-Vectorization Fails

Auto-vectorization is not magic. It is a dependency analysis problem. The compiler must prove, statically, that processing multiple iterations simultaneously is equivalent to processing them sequentially. Several constructs block this:

Pointer aliasing. Without restrict, the compiler assumes a and b might overlap. Writing a[i] might change b[i+1]. No vectorization is safe.

Loop-carried dependencies. If iteration $i+1$ uses the result of iteration $i$:

// Not vectorizable: a[i] depends on a[i-1]
for (int i = 1; i < n; i++)
    a[i] = a[i-1] + b[i];  // prefix sum - inherently sequential

Parallel prefix sum algorithms exist (used in GPU computing), but the compiler won’t discover them automatically.

Non-contiguous memory access. a[i*stride] for arbitrary stride requires gather instructions (load from non-contiguous addresses). Gathers exist in AVX2 (_mm256_i32gather_ps) but are significantly slower than contiguous loads on most microarchitectures. The compiler often chooses not to vectorize if it can only produce a gather.

Function calls inside the loop. If a function call is opaque to the compiler (not inlined, not annotated as pure), the compiler cannot vectorize across it - the function might have side effects.

Conditional logic. if (a[i] > 0) b[i] = a[i]; else b[i] = 0; can be vectorized using masked stores or blend instructions, but the compiler must recognize the pattern. Complex conditionals block vectorization entirely.

When auto-vectorization fails, the compiler’s report (via -fopt-info-vec-missed) tells you why. Common messages: “not vectorized: can’t determine if loop is safe to vectorize due to possible aliasing,” “not vectorized: loop contains function calls or data references that cannot be analyzed.”

Intrinsics: Direct SIMD Programming

When auto-vectorization fails or you need precise control, write SIMD directly using intrinsics - C function wrappers that map one-to-one onto hardware instructions. The generated assembly is exactly the instruction named by the intrinsic.

#include <immintrin.h>

// Vectorized dot product using AVX2 FMA
float dot_product_avx2(const float *a, const float *b, int n) {
    __m256 acc = _mm256_setzero_ps();       // zero accumulator: 8 floats = 0.0
    int i;
    for (i = 0; i <= n - 8; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i); // load 8 floats, unaligned OK
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc); // acc += va * vb (fused multiply-add)
    }
    // Handle remaining elements (n not divisible by 8)
    float tail = 0.0f;
    for (; i < n; i++) tail += a[i] * b[i];

    // Horizontal reduction: sum 8 lanes into 1 scalar
    __m128 lo   = _mm256_castps256_ps128(acc);
    __m128 hi   = _mm256_extractf128_ps(acc, 1);
    __m128 sum4 = _mm_add_ps(lo, hi);
    __m128 sum2 = _mm_add_ps(sum4, _mm_movehl_ps(sum4, sum4));
    __m128 sum1 = _mm_add_ss(sum2, _mm_shuffle_ps(sum2, sum2, 1));
    return _mm_cvtss_f32(sum1) + tail;
}

_mm256_fmadd_ps is Fused Multiply-Add: computes a*b + c in a single instruction with one rounding step instead of two. FMA gives both higher throughput (1 instruction instead of 2) and better numerical accuracy (less rounding error). Intel’s FMA throughput is 2 operations per cycle (2 FMA units on most modern cores). This is the theoretical peak for floating-point computation.

The horizontal reduction at the end (collapsing 8 lanes into one scalar sum) is the awkward part. SIMD is designed for vertical operations (lane $i$ of A + lane $i$ of B → lane $i$ of C). Horizontal operations (summing all 8 lanes of one register) require shuffling and adding, which is why the reduction code looks baroque. This overhead is amortized over large arrays - for $n = 1024$, the horizontal reduction is 128 element-pairs worth of work amortized over 1024, which is negligible.

Memory Alignment: A Performance Cliff

SIMD loads and stores are fastest when the memory address is aligned to the vector width. For AVX2, this means 32-byte alignment: the address must be divisible by 32.

_mm256_load_ps requires 32-byte alignment and will fault (segfault) on misaligned addresses. _mm256_loadu_ps handles unaligned addresses but was historically slower on Sandy Bridge-era Intel chips. On modern Intel (Haswell and later), unaligned loads have no throughput penalty when the data is actually cache-line-aligned; there is a small penalty only when a load crosses a cache-line boundary (every 64 bytes).

To guarantee alignment:

// C: posix_memalign or _mm_malloc
float *buf;
posix_memalign((void **)&buf, 32, n * sizeof(float));
// or:
float *buf = (float *)_mm_malloc(n * sizeof(float), 32);

// C++:
alignas(32) float buf[256];   // stack allocation
std::vector<float> v(n);      // heap - no alignment guarantee by default
// Use aligned allocator for std::vector if you need guaranteed alignment

For modern code on modern hardware, loadu is the pragmatic choice - alignment hints from the compiler or runtime alignment of allocations usually ensures performance is fine, and you avoid the segfault risk of load.

AoS vs SoA: Data Layout for SIMD

How you lay out data in memory determines whether SIMD can load contiguous elements. Consider a physics simulation with particles:

// Array of Structs (AoS) - natural object-oriented layout
struct Particle { float x, y, z, mass; };
Particle particles[N];
// In memory: x0 y0 z0 m0 | x1 y1 z1 m1 | x2 y2 z2 m2 | ...

// Struct of Arrays (SoA) - SIMD-friendly layout
struct Particles {
    float x[N];
    float y[N];
    float z[N];
    float mass[N];
};
// In memory: x0 x1 x2 x3 x4 x5 x6 x7 | ... | y0 y1 y2 ...

If you want to update all x-positions using a vectorized computation, AoS requires strided access - load x from particles[0].x, then particles[1].x, 16 bytes later. Strided gather loads are 3 - 5x slower than contiguous loads on AVX2. SoA stores all x-values contiguously: particles.x[0..7] is one vmovups instruction.

For any computation that operates on one field across all particles (update all positions, compute all forces), SoA is dramatically better for SIMD. The cost: code that reads all fields of one particle at once (access pattern along the particle axis) needs to load from four separate arrays, which can be cache-unfriendly if particles don’t fit in cache.

The standard recommendation for SIMD-optimized code: prefer SoA unless the access pattern is clearly particle-centric. Many high-performance simulation libraries (OpenFOAM, Gromacs) use SoA internally even though their public API presents objects.

Why NumPy Is Fast (and When It Isn’t)

NumPy’s core numerical operations (np.dot, np.add, np.sort) call into optimized C extensions. For linear algebra, NumPy uses BLAS (Basic Linear Algebra Subprograms) - either OpenBLAS (open-source) or Intel MKL (proprietary but freely available via conda). These BLAS implementations are hand-tuned at the assembly level, using AVX2 or AVX-512 with carefully crafted loop structures to maximize cache efficiency and register reuse.

np.dot(a, b) for two length-1024 arrays is fast because:

No Python overhead (the function drops into C immediately)
OpenBLAS dispatches to an AVX2 or AVX-512 inner loop
The inner loop uses FMA instructions, 8 - 16 floats per cycle

A Python loop sum(a[i] * b[i] for i in range(n)) is slow because:

Every multiplication and addition is a Python function call
Python function calls are ~100ns each
For $n = 1024$, that’s ~2048 function calls vs 128 AVX2 instructions

The speedup of NumPy over Python loops is typically 50 - 200x for such operations. The lesson: the bottleneck is almost always the Python interpreter, not the arithmetic.

When NumPy isn’t fast: custom per-element operations. np.vectorize(your_python_function) is a convenience wrapper, not a performance tool - it still calls your Python function once per element. For custom elementwise operations, Numba (JIT compilation with SIMD) or Cython (C extension with type annotations) are the practical solutions.

The AVX-512 Controversy: Frequency Throttling

AVX-512 doubles the number of floating-point operations per cycle compared to AVX2. It also consumes significantly more power, which causes Intel CPUs to reduce their clock frequency (“throttle”) when executing AVX-512 code.

The throttling is tiered:

No SIMD or SSE: base frequency (e.g., 3.8 GHz)
AVX/AVX2: light downclocking (e.g., 3.5 GHz on some Xeons)
AVX-512: heavier downclocking (e.g., 3.0 GHz on some Xeons)

The penalty is hardware-model-specific and workload-specific. For sustained floating-point computation (matrix multiply, FFT), the throughput increase from AVX-512’s wider registers usually outweighs the frequency reduction. For mixed workloads where SIMD and scalar code alternate, the frequency transitions can reduce overall performance - the CPU takes hundreds of microseconds to ramp back up to base frequency after AVX-512 instructions.

This is why Linus Torvalds famously criticized AVX-512 and why some cloud providers disable AVX-512 on their VMs. Linux kernel code avoids AVX-512 to prevent frequency throttling from affecting non-SIMD workloads.

Apple Silicon takes a different approach: M-series chips have dedicated SIMD units that don’t share power headroom with the scalar cores. AMX (Apple Matrix Extension) is a separate engine entirely. No frequency throttling from SIMD. This architectural choice is part of why Apple’s performance per watt is exceptional for numerical workloads.

The Right Level of Abstraction

SIMD intrinsics are unreadable. _mm256_fmadd_ps communicates nothing about intent, only about implementation. Code written with intrinsics is:

Non-portable (AVX2 intrinsics don’t work on ARM)
Unreadable to anyone not familiar with the specific ISA extension
Fragile (register pressure, alignment, pipeline stalls are all manual concerns)
Often unnecessary (the compiler can do it)

The practical hierarchy:

Write clear scalar code first and let the compiler vectorize. This works for the majority of numerical loops.
Help the compiler with restrict, __builtin_assume_aligned, and SoA data layout when needed.
Use a library (Eigen, xsimd, highway) that provides portable SIMD abstractions.
Write intrinsics only for hand-optimized inner loops where the compiler demonstrably fails and the speedup matters.

The “SIMD lottery” - whether your compiler vectorizes correctly - is real. Always verify with the vectorization report (-fopt-info-vec-optimized) or by inspecting the assembly output. Never assume.

Future: GPU as Extreme SIMD

GPU architecture is SIMD taken to its logical conclusion. An NVIDIA A100 has 108 streaming multiprocessors, each with 64 CUDA cores (scalar ALUs). All cores in a warp (32 cores) execute the same instruction on different data simultaneously - SIMD over 32 elements, with 32 warps per multiprocessor. The A100 has 6,912 CUDA cores total, processing 6,912 scalar or 432 FP32 SIMD operations per cycle.

Tensor cores on the A100 perform matrix-matrix multiply on 16×16×16 fp16/bf16 matrices in a single operation - SIMD at the matmul level rather than the element level. This is why GPU training is fast: the dominant operation (attention + feedforward in Transformers, which reduces to matmul) maps perfectly onto tensor core hardware.

The limitation is the same as CPU SIMD: memory access must be regular. Irregular access (sparse attention, graph neural networks with varying degree) cannot fully utilize tensor cores and falls back to slower execution paths. This is the hardware reason why dense attention is faster than sparse attention on current GPU hardware up to sequences of several thousand tokens.

Summary

Feature	SSE2	AVX2	AVX-512	ARM NEON	ARM SVE
Register width	128 bits	256 bits	512 bits	128 bits	128 - 2048 bits (runtime)
float32 lanes	4	8	16	4	4 - 64
Available since	2001	2013	2017 (mainstream)	2011	2019
Frequency throttle	No	Minor	Yes (Intel)	No	No

Auto-Vectorization Blocker	Fix
Pointer aliasing	Add `restrict` keyword
Non-contiguous access	Restructure to SoA layout
Loop-carried dependency	Use parallel prefix algorithms or restructure
Function calls in loop	Inline the function or mark as pure
Unknown loop bounds	Provide upper bound to compiler hints

Read Next:

How Compilers Work - From Source Code to Something a CPU Understands