Most of the arithmetic throughput available on a modern CPU goes unused by scalar code. A single AVX2 instruction can add eight 32-bit floats in the time a scalar addss adds one. SIMD - Single Instruction, Multiple Data - is the mechanism that exposes this parallelism, and understanding it is the difference between code that uses 12% of a processor's arithmetic capacity and code that uses 90%.

The SIMD Register Hierarchy

Intel’s SIMD extensions have grown in stages. SSE (Streaming SIMD Extensions) introduced 128-bit xmm registers, capable of holding four 32-bit floats or two 64-bit doubles simultaneously. AVX widened these to 256-bit ymm registers. AVX-512 doubled again to 512-bit zmm registers.

With AVX2, you have sixteen ymm registers (ymm0 through ymm15), each 256 bits wide. A single ymm register can hold eight float32 values or four float64 values packed side by side. One vaddps ymm0, ymm1, ymm2 instruction adds all eight float pairs in parallel - the theoretical arithmetic throughput is 8× compared to scalar code.
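
In intrinsics code these registers surface as C vector types from <immintrin.h>. A quick illustration of the correspondence (the variable names here are mine):

#include <immintrin.h>

__m128  f4;   // one xmm register: four float32 lanes
__m128d d2;   // one xmm register: two float64 lanes
__m256  f8;   // one ymm register: eight float32 lanes
__m256d d4;   // one ymm register: four float64 lanes
__m512  f16;  // one zmm register: sixteen float32 lanes (needs AVX-512)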

ARM has NEON (128-bit) and SVE/SVE2 (scalable vectors from 128 to 2048 bits) as equivalents.

Auto-Vectorization

The easiest way to use SIMD is to let the compiler handle it. With -O2 or -O3 combined with -march=native (which tells the compiler it can use any instruction the current CPU supports), GCC and Clang will automatically vectorize loops that meet certain conditions.

void add_arrays(float *restrict a, const float *restrict b, int n) {
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}

This loop is a perfect auto-vectorization target: no loop-carried dependencies, no aliasing (restrict tells the compiler the pointers do not overlap), and unit stride. The compiler will emit AVX2 or SSE instructions automatically.

You can verify this with gcc -O3 -march=native -fopt-info-vec, which reports each loop that was successfully vectorized; the companion flag -fopt-info-vec-missed reports the loops that were not, along with the reason.

Barriers to Auto-Vectorization

Pointer aliasing. Without restrict, the compiler must assume that a and b might overlap, which makes reordering the loads and stores illegal. Adding restrict removes this assumption and often unlocks vectorization.
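
For contrast, a sketch of the same loop without restrict (the function name is my own): GCC and Clang can often still vectorize it, but only by emitting a runtime overlap check that falls back to a scalar path, which costs code size and a check on every call.

void add_arrays_may_alias(float *a, const float *b, int n) {
    // Without restrict the compiler must allow a and b to overlap, so it
    // typically versions the loop: runtime check, vector body, scalar fallback.
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}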

Loop-carried dependencies. If iteration i+1 depends on the result of iteration i, the loop cannot be parallelized. A prefix sum a[i] = a[i-1] + b[i] has this structure; the compiler cannot vectorize it straightforwardly.
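
A minimal sketch of that structure - every iteration consumes the value the previous one produced, so the eight additions of a vector step cannot proceed independently:

void prefix_sum(float *a, const float *b, int n) {
    for (int i = 1; i < n; i++)
        a[i] = a[i-1] + b[i];   // loop-carried: needs a[i-1] from the last iteration
}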

Non-unit stride. Accessing every fourth element (a[i*4]) requires a gather operation, which is far less efficient than loading a contiguous block. Rearranging data into a Struct-of-Arrays (SoA) layout often resolves this; when it cannot, the gather sketch below at least expresses the pattern.
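
When the stride genuinely cannot be removed, AVX2 does provide gather intrinsics. A minimal sketch (load_stride4 is my own name); note that a gather still performs one memory access per lane internally, so it buys expressiveness rather than speed:

#include <immintrin.h>

// Pull a[0], a[4], a[8], ..., a[28] into one ymm register.
__m256 load_stride4(const float *a) {
    __m256i idx = _mm256_setr_epi32(0, 4, 8, 12, 16, 20, 24, 28);
    return _mm256_i32gather_ps(a, idx, 4);   // scale 4 = sizeof(float)
}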

Loop bounds that are unknown or not a multiple of the vector width. The compiler must generate scalar epilogue code to handle leftover elements, which it can do automatically, but very short loops may not be worth vectorizing at all.

Writing Intrinsics

When auto-vectorization fails or you need precise control, you can write SIMD directly using intrinsics - C functions that map one-to-one onto SIMD instructions.

#include <immintrin.h>

float dot_product_avx2(const float *a, const float *b, int n) {
    __m256 acc = _mm256_setzero_ps();       // zero accumulator
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i); // load 8 floats, unaligned
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc); // fused multiply-add: acc += va * vb
    }
    // horizontal sum of the 8 lanes
    __m128 low  = _mm256_castps256_ps128(acc);   // lanes 0-3
    __m128 high = _mm256_extractf128_ps(acc, 1); // lanes 4-7
    __m128 sum4 = _mm_add_ps(low, high);
    __m128 sum2 = _mm_add_ps(sum4, _mm_movehl_ps(sum4, sum4));
    __m128 sum1 = _mm_add_ss(sum2, _mm_shuffle_ps(sum2, sum2, 1));
    float result = _mm_cvtss_f32(sum1);
    for (; i < n; i++)                      // scalar tail: n need not be a multiple of 8
        result += a[i] * b[i];
    return result;
}

_mm256_fmadd_ps is FMA - fused multiply-add - which computes a*b + c in a single instruction with only one rounding step, giving both higher throughput and better numerical accuracy than separate multiply and add.

Memory Alignment

SIMD loads and stores are fastest when the address is aligned to the vector width - 32 bytes for AVX2. _mm256_load_ps requires 32-byte alignment and will fault on unaligned addresses. _mm256_loadu_ps handles unaligned addresses but may be slower on older hardware.

To allocate aligned memory:

float *buf = (float *)_mm_malloc(n * sizeof(float), 32);
// ... use buf, then release it with the matching call:
_mm_free(buf);

// or, for a fixed-size buffer (C11 with <stdalign.h>, or C++11):
alignas(32) float buf[256];

The -ffast-math Flag

-ffast-math enables a set of mathematically unsafe transformations: reassociation of floating-point operations, the assumption that no NaNs or infinities occur, replacement of division by multiplication with the reciprocal, and more. These can dramatically improve auto-vectorization. Floating-point addition is not associative, so the compiler normally cannot reorder (a + b) + c into a + (b + c); -ffast-math lifts that restriction, which is exactly what a vectorized reduction needs.
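
A simple reduction makes the point concrete (sum_array is my own example). Under strict IEEE semantics every addition depends on the previous one, so at -O3 alone GCC keeps this loop scalar; with -ffast-math (specifically its -fassociative-math component) the compiler may split the chain into independent per-lane partial sums and combine them at the end:

float sum_array(const float *a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i];              // a strict left-to-right chain under IEEE 754
    return s;
}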

The trade-off is loss of IEEE 754 conformance. Code that depends on exact rounding, NaN propagation, or signed-zero semantics will produce different results. Use it where numerical precision requirements are loose; avoid it in financial or scientific code with strict correctness requirements.

AoS vs SoA Memory Layout

Consider a particle system with x, y, z positions and a mass:

// Array of Structs (AoS) - common but SIMD-unfriendly
struct Particle { float x, y, z, mass; };
struct Particle particles[N];

// Struct of Arrays (SoA) - SIMD-friendly
struct Particles { float x[N], y[N], z[N], mass[N]; };

AoS interleaves the fields, so loading eight consecutive x values requires strided access. SoA packs all x values together, so eight x[i] values can be loaded contiguously with a single vmovups instruction. For any computation that operates on one field at a time - updating all positions, scaling all masses - SoA is far better for SIMD, as the sketch below shows.
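
A sketch of that difference in code, reusing the structs above (the function names and the dx offset are my own illustration):

// AoS: consecutive x values are 16 bytes apart - stride-4 access
void shift_x_aos(struct Particle *p, float dx, int n) {
    for (int i = 0; i < n; i++)
        p[i].x += dx;
}

// SoA: x values are contiguous - unit stride, a vmovups/vaddps loop
void shift_x_soa(struct Particles *p, float dx, int n) {
    for (int i = 0; i < n; i++)
        p->x[i] += dx;
}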

Examples

Checking auto-vectorization. Compile with:

gcc -O3 -march=native -fopt-info-vec-optimized -c dot.c

Lines like dot.c:5:5: optimized: loop vectorized using 32-byte vectors confirm that vectorization happened. To see why a loop was not vectorized, compile with -fopt-info-vec-missed instead.

AVX2 dot product vs scalar. On a modern CPU, the AVX2 intrinsic version of an inner product over 1024 floats runs roughly 4–6× faster than a scalar loop in practice (less than the theoretical 8× due to horizontal reduction overhead and memory bandwidth).
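
A minimal harness for this kind of measurement (a sketch under stated assumptions: it reuses dot_product_avx2 from the intrinsics section, uses POSIX clock_gettime, and is compiled with something like gcc -O2 -mavx2 -mfma):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

float dot_product_avx2(const float *a, const float *b, int n); // from above

float dot_scalar(const float *a, const float *b, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    enum { N = 1024, REPS = 100000 };
    float *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b);
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 0.5f; }

    volatile float sink = 0.0f;   // keep the calls from being optimized away
    double t0 = now_sec();
    for (int r = 0; r < REPS; r++) sink += dot_scalar(a, b, N);
    double t1 = now_sec();
    for (int r = 0; r < REPS; r++) sink += dot_product_avx2(a, b, N);
    double t2 = now_sec();

    printf("scalar %.3fs  avx2 %.3fs  speedup %.1fx\n",
           t1 - t0, t2 - t1, (t1 - t0) / (t2 - t1));
    free(a); free(b);
    return 0;
}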

NumPy’s hidden SIMD. A Python loop over a NumPy array in pure Python will be slow. np.dot(a, b) hands the work to an optimized BLAS routine that uses AVX2 or AVX-512 internally. The speedup over a Python scalar loop is typically 50–200×. The lesson: the bottleneck is almost always the scalar Python interpreter, not the arithmetic.

SIMD is the practical engine behind fast numerical code: linear algebra, image processing, audio DSP, neural network inference, and high-frequency trading signal processing all rely on it. Understanding when the compiler can vectorize automatically, and how to write intrinsics when it cannot, is what allows you to fully exploit the hardware you already have.

