Compiler Optimizations - Making Your Code Faster Without Touching It // Megha Bose

Helpful context:

You write a for loop that multiplies a million numbers by 2. Clean, obvious C - a loop index, a load, a multiply, a store, repeat a million times. You compile with -O3 and look at the assembly. The compiler did not emit a million multiplies. It emitted something completely different: eight iterations of a 32-wide AVX-512 vector instruction, each multiplying thirty-two floats simultaneously, followed by a small scalar cleanup loop for the remainder. You wrote C. The compiler wrote SIMD assembly. This gap - between what you write and what runs - is what compiler optimization is about.

The gap is not magic. It is a sequence of formal transformations, each one provably preserving program behavior under stated assumptions. When those assumptions break (and they do break, in security code, in floating-point math, in timing-sensitive paths), the transformations become dangerous. Understanding what the compiler is doing, and why, is the difference between writing code that “seems fast” and code that actually is.

Why Optimization Is a Separate Phase

Early compilers were one-pass: parse a line, emit code for that line, repeat. The result was correct but slow - each statement was compiled in isolation with no knowledge of what came before or after. An expression computed twice was computed twice in the assembly. A variable written but never read was still written. A function called once was still called; its stack frame was set up, arguments were shuffled into registers, and control was transferred, even if the function was three lines that could be inlined without any of that overhead.

The insight that changed everything: computation is not just about correctness; it is about equivalence. Two programs are equivalent if they produce the same observable outputs for the same inputs. The optimizer’s job is to find a faster equivalent program. This is why the optimization stage sits between parsing and code generation - it operates on an abstract representation, free from the surface syntax of the source language, where transformations are cleanest to reason about.

The LLVM Pipeline: IR and SSA

LLVM’s optimizer works on its Intermediate Representation (IR), a low-level, strongly typed assembly-like language. The compiler frontend (Clang for C/C++, rustc for Rust) translates source code into LLVM IR. The backend translates optimized IR into machine code. The optimizer - the middle-end - sees only IR and knows nothing about C syntax or x86 instructions.

LLVM IR is written in Static Single Assignment (SSA) form: every value is defined exactly once, and every use refers to exactly one definition. When control flow merges - after an if-else, for instance - SSA uses a phi instruction to select between values from different incoming branches.

; LLVM IR (simplified): int abs(int x)
define i32 @abs(i32 %x) {
entry:
  %is_neg = icmp slt i32 %x, 0
  br i1 %is_neg, label %neg, label %pos
neg:
  %negated = sub i32 0, %x
  br label %merge
pos:
  br label %merge
merge:
  %result = phi i32 [ %negated, %neg ], [ %x, %pos ]
  ret i32 %result
}

SSA makes analyses trivially easy. To find all uses of a value: look for uses of that SSA name - there is exactly one definition to trace. To check if a value is constant: look at its definition. This is why virtually every modern optimizer (GCC, LLVM, V8’s TurboFan, Java’s HotSpot C2) uses SSA internally.

You can inspect LLVM IR directly:

clang -O2 -emit-llvm -S -o foo.ll foo.c   # emit IR as text
opt --print-passes                          # list all optimization passes
opt -S -O2 foo.ll -o optimized.ll          # run passes on IR

Constant Folding and Propagation

The most basic optimization: if an expression’s operands are known at compile time, evaluate it at compile time. int x = 3 * 7 + 2; becomes int x = 23;. No runtime arithmetic at all. This extends transitively: if x = 23 and y = x + 1, the compiler knows y = 24 and can substitute that everywhere y is used.

Combined with constant propagation - substituting known values into downstream expressions - and iterated to a fixed point, this can collapse surprisingly large computation trees. A loop with a compile-time-constant bound and body that only manipulates constants can sometimes vanish entirely, replaced by its final value.

Dead Code Elimination

Dead code elimination (DCE) removes instructions whose results are never observed. After constant folding collapses a branch condition to false, DCE can delete the unreachable branch. After propagation determines a variable is never read, DCE removes the write.

These two passes feed each other: constant folding creates unreachable branches; DCE removes them, exposing more constants for folding. Modern compilers run them in a loop until no further progress is made.

A subtle and important consequence: DCE can remove security checks. If the compiler determines that a condition is always true or always false (perhaps via undefined behavior reasoning - more on this below), it will eliminate the check. This is how null pointer checks, integer overflow guards, and timing-delay loops have been silently deleted by optimizers in security-sensitive code.

Function Inlining: The Most Important Optimization

Function inlining replaces a call site with a copy of the callee’s body. The direct benefits: eliminating call overhead (saving/restoring registers, stack frame setup, the call and ret instructions). The indirect benefit - far larger - is that after inlining, all other optimizations can operate across the call boundary. Constants can be propagated into the callee. Dead code in the callee that depends on caller state can be eliminated. Loops in the callee can be fused with loops in the caller.

Inlining is the single most impactful optimization the compiler performs. It is also the one most likely to go wrong.

The compiler uses a budget - a heuristic limit on how much a function can grow via inlining. Functions exceeding the budget are not inlined. The budget exists for a good reason: aggressive inlining increases the size of .text, which increases instruction cache (icache) pressure. A CPU that must fetch instructions from many different places in the code exhausts its icache faster; icache misses are expensive (50-200 cycles each). There is a real regime where more inlining makes code slower, not faster, because the icache advantage of compact code outweighs the call overhead savings.

You can override the heuristic with __attribute__((always_inline)) (force inlining) or __attribute__((noinline)) (prevent it). Judicious use of noinline on cold paths (error handlers, rare branches) keeps hot code compact.

Loop Optimizations

Loops are where programs spend most of their time, so the optimizer devotes disproportionate attention to them.

Loop-Invariant Code Motion (LICM) identifies computations inside a loop that produce the same result on every iteration and hoists them before the loop. If a loop computes strlen(s) every iteration but s never changes, LICM moves the call outside. This requires the compiler to prove s is not modified inside the loop - alias analysis matters here.

Loop unrolling replicates the loop body multiple times per iteration, reducing branch overhead and exposing more instruction-level parallelism. Unrolling by 4 means the branch is taken every 4 iterations instead of every 1. The compiler must emit a scalar cleanup loop for the remaining iterations when the trip count is not a multiple of the unroll factor. Over-unrolling increases code size and can hurt icache.

Loop fusion merges two adjacent loops that iterate over the same range into one, improving data locality (the data read by the first loop is still warm in cache when the second loop needs it). Loop fission does the opposite - splitting a loop to improve vectorizability or reduce register pressure. Both are applied where they improve memory access patterns.

Loop interchange swaps the nesting order of nested loops to achieve column-major vs row-major access. For a C 2D array (row-major), the innermost loop should walk columns; swapping row/column iteration order turns each cache line access from strided (cache-unfriendly) to sequential.

Autovectorization

This is what happened to your multiply-by-two loop. Autovectorization transforms a scalar loop into one that processes multiple data elements per instruction using SIMD (Single Instruction, Multiple Data) registers and operations.

AVX-512 on modern x86 provides 512-bit vector registers - sixteen 32-bit floats per register. The vectorizer transforms:

for (int i = 0; i < n; i++)
    out[i] = in[i] * 2.0f;

into something like: load 16 floats at once into a ZMM register, multiply all 16 by 2.0 with a single vmulps instruction, store 16 results. Then repeat for the next 16. The loop trip count drops by a factor of 16; throughput rises by the same factor.

For the compiler to vectorize, it needs to prove: (1) loop iterations are independent - one iteration does not depend on a previous iteration’s result; (2) memory accesses do not alias - out and in do not overlap; (3) the loop body is vectorizable - not all operations have SIMD equivalents.

Condition 2 is where C’s pointer aliasing rules cause problems.

Alias Analysis and the `restrict` Keyword

Two pointers alias if they may refer to overlapping memory. Without proof that pointers do not alias, the compiler cannot reorder loads and stores between them. This blocks vectorization, LICM, and many other optimizations.

void scale(float *out, const float *in, float factor, int n) {
    for (int i = 0; i < n; i++)
        out[i] = in[i] * factor;
}

The compiler sees: out and in are both float *. They might point to overlapping memory. If out[0] is the same memory as in[1], then storing out[0] changes in[1], and writing iterations out of order produces different results. The compiler must assume this could happen. Vectorization blocked.

The restrict keyword tells the compiler: “I promise these pointers do not alias. Any aliasing is my bug, not yours.”

void scale(float *restrict out, const float *restrict in, float factor, int n) {
    for (int i = 0; i < n; i++)
        out[i] = in[i] * factor;
}

Now the compiler can vectorize. On a large array, this is a 4-16x throughput improvement. The restrict keyword is one of the highest-leverage annotations in performance-critical C.

C++ lacks restrict in the standard, but GCC and Clang support __restrict__ as an extension.

Optimization Levels: -O2 vs -O3

-O0: no optimization. Every statement compiles independently. Variables are always read from and written to the stack. This makes debugging exact - the debugger can see every variable at every line - at the cost of 5-10x runtime performance.

-O1: basic optimizations. DCE, simple inlining, LICM. Compile time increases modestly.

-O2: the standard production level. Enables the full optimization pipeline: inlining with budget, aggressive DCE, CSE, LICM, loop unrolling, register allocation. Respects standard floating-point semantics.

-O3: enables optimizations that -O2 conservatively avoids - most notably autovectorization and more aggressive inlining. But -O3 also enables unsafe floating-point transformations such as reassociation ((a + b) + c rewritten as a + (b + c)), reciprocal approximations (replacing x / y with x * (1/y)), and fused multiply-add synthesis. These transformations change floating-point results, because floating-point arithmetic is not associative. For financial calculations or numerical stability-sensitive code, -O3 can silently introduce errors.

The specific flag that controls this is -ffast-math (implied by -O3). You can use -O3 -fno-fast-math to get -O3’s other benefits without floating-point unsafety.

-Os: optimize for binary size, disabling transformations that grow the binary (like unrolling). Used for embedded targets where flash storage is the constraint.

Undefined Behavior as an Optimization Lever

C and C++ specify that certain programs have undefined behavior (UB): signed integer overflow, null pointer dereference, use of uninitialized memory, out-of-bounds array access. The compiler is permitted to assume these situations never occur. If they do occur, the program’s behavior is literally undefined - the compiler is allowed to generate any code it likes.

This sounds like a footnote. It is not. Compilers use UB as a license to eliminate code.

The classic example: signed integer overflow is UB. Therefore x + 1 > x is always true for signed x (if it were false, overflow would have occurred, which is UB, which can be assumed not to happen). The compiler may optimize away any check of the form if (x + 1 < x) - including integer overflow security checks in C code. They vanish, silently, producing binaries that accept malformed inputs.

A second example: if the compiler sees *ptr used unconditionally, it can conclude ptr != NULL is always true and eliminate any subsequent null check - including a security check added by a security-aware developer who did not realize the UB.

UB-based misoptimizations are a real source of security vulnerabilities. The mitigations are: use -fsanitize=undefined during testing to catch UB, use Rust or other languages with defined overflow semantics in security-critical paths, and - when using C - compile security-sensitive functions with -fwrapv to give signed overflow defined wraparound semantics.

Profile-Guided Optimization

The compiler’s inlining budget, branch prediction hints, and code layout decisions are based on static heuristics: guesses about what will be hot. Profile-Guided Optimization (PGO) replaces guesses with measurements.

The workflow: compile with -fprofile-generate, run the binary on representative workloads (it records branch frequencies and call counts into a .profdata file), then recompile with -fprofile-use. The compiler now knows which functions are hot (inline aggressively), which branches are taken 99% of the time (lay out the hot path without branch instructions), and which code paths are dead (shrink or eliminate).

PGO consistently delivers 10-20% speedups on real workloads. The caveat: the training workload must be representative. PGO-optimized code tuned on one workload can regress on a different workload. This is the core limitation: PGO is a snapshot of one workload’s behavior, not a guarantee about all workloads.

Link-Time Optimization

Each .c file is compiled to a .o file independently. The compiler cannot inline across translation unit boundaries - at compile time, it sees only one file. A call to a helper function in another .o file always crosses an opaque call boundary: arguments are shuffled into calling-convention registers, a call is emitted, no further optimization is possible.

Link-Time Optimization (LTO) defers final code generation until link time. Instead of machine code, the compiler emits IR into the .o file. The linker combines all the IR, runs the full optimization pipeline across the entire program, and then generates machine code. Cross-module inlining, interprocedural dead code elimination, and global constant propagation all become possible.

LTO is enabled with -flto. Typical speedup: 5-15% on programs with many small functions spread across files. The cost: link time increases substantially - the linker now does what was previously spread across all compilation steps. ThinLTO (-flto=thin) parallelizes this with a summary-based approach, preserving most of the speedup with acceptable link times.

The Compiler in the ML Stack

Modern machine learning frameworks have their own compiler pipelines that apply similar passes.

torch.compile (PyTorch 2.0+) uses the Inductor backend, which traces PyTorch operations and compiles them to Triton kernels - a Python-embedded DSL for writing GPU kernels. Inductor performs operator fusion (analogous to loop fusion: merging multiple elementwise operations into one kernel to reduce memory bandwidth), layout optimization (choosing the right tensor memory layout for each operation), and tiling (partitioning work across GPU thread blocks for cache locality). The result: kernels that outperform hand-written PyTorch in many cases.

XLA (Accelerated Linear Algebra), used by JAX and TensorFlow, operates on HLO (High Level Operations) IR. XLA’s optimizer performs constant folding, dead code elimination, algebraic simplification (replacing x * 1.0 with x), and operation fusion - the GPU-compute equivalent of loop fusion, merging adjacent operations into single kernels to eliminate intermediate memory writes.

The conceptual parallels are exact: XLA’s HLO optimizer and LLVM’s optimizer solve the same fundamental problem with the same fundamental techniques, applied to different computation substrates.

The Optimization-Debugging Tension

Optimized code is hard to debug. Variables optimized into registers do not appear in gdb. Code reordered for performance does not match source line numbers in stack traces. Inlined functions do not appear as separate frames. Functions the compiler proved unreachable are gone - including, sometimes, the one you wanted to step into.

This is not a bug. It is the point. An optimized binary is a different program from the source code, related by equivalence under specified assumptions. When those assumptions are violated (UB, aliasing), or when you need to inspect intermediate state the compiler proved was unnecessary, you need -O0.

The practical resolution: develop and test with optimizations on (or at least -O2), debug with -O0 -g. Use sanitizers (-fsanitize=address,undefined) to catch bugs without hiding them. Never use an unoptimized binary for performance benchmarks.

The deeper lesson: the compiler is smarter than you about most optimizations. Write clear, idiomatic code. Profile first. Optimize only what the profiler identifies as hot. When you do need to help the compiler - add restrict, add __builtin_expect, reach for intrinsics - do so with a benchmark in hand that confirms the change made things better.

Summary

Optimization	What It Does	When It Matters
Constant folding/propagation	Evaluates known-at-compile-time expressions	Compile-time constants, template instantiation
Dead code elimination	Removes unused computations and unreachable branches	After inlining, UB reasoning
Function inlining	Copies callee body to call site	Small, hot functions; enables other optimizations
LICM	Hoists loop-invariant computation before loop	Loops with expensive invariant operations
Loop unrolling	Replicates loop body to reduce branch overhead	Tight loops with small bodies
Autovectorization	Converts scalar loops to SIMD	Embarrassingly parallel data processing
Alias analysis / `restrict`	Proves pointers do not overlap	Enabling vectorization in C/C++
PGO	Uses runtime profiles to guide heuristics	10-20% on real workloads with representative training
LTO	Enables inter-module optimization at link time	Programs with many small functions across files

Read Next:

Linkers & Loaders - The Last Step Before Your Code Runs