Compiler Optimizations
Prerequisite:
A compiler is not a transcriptionist. Between your source code and the binary that runs, the compiler applies dozens of analysis and transformation passes, each probing for opportunities to produce faster or smaller code without changing its observable behaviour. Understanding what these passes do - and what limits them - tells you how to write code the compiler can actually optimise, and when you need to help it along.
The Compiler Pipeline
The compiler’s work divides into three stages.
Frontend parses source code into an Abstract Syntax Tree (AST), then lowers the AST into an intermediate representation (IR). The frontend is language-specific: Clang’s frontend understands C/C++, rustc’s understands Rust.
Middle-end (the optimiser) operates entirely on the IR and is language-agnostic. LLVM’s middle-end applies a configurable sequence of passes - each pass analyses or transforms the IR - then hands the optimised IR to the backend.
Backend translates IR into machine code for a specific target: x86-64, ARM64, RISC-V. It handles instruction selection, scheduling, and register allocation.
LLVM IR and SSA Form
LLVM’s IR is in Static Single Assignment (SSA) form: every variable is assigned exactly once, and every use of a variable refers to exactly one definition. When control flow merges (at the join point of an if-else), SSA uses a special phi instruction to select between the incoming values.
SSA makes many analyses trivially easy. To find all uses of a value, just search for uses of that SSA name. To check if a variable is constant, look at its one definition. This is why virtually every modern optimising compiler uses SSA internally.
Key Optimisation Passes
Constant folding evaluates expressions whose operands are known at compile time. int x = 3 * 7 + 2; becomes int x = 23; - no runtime arithmetic at all. The compiler extends this transitively: if x = 23 and y = x + 1, then y = 24.
Dead code elimination (DCE) removes instructions whose results are never used. After constant folding collapses a branch condition to false, DCE can delete the unreachable branch entirely. These two passes are typically run together in a loop until no further progress is made.
Function inlining replaces a call site with a copy of the callee’s body. This eliminates call overhead (saving/restoring registers, the call/ret instructions, stack frame setup), but more importantly it enables all other optimisations to operate across the call boundary - the inlined body can have its constants folded, its dead branches eliminated, its values propagated. Inlining is the most powerful single optimisation the compiler performs, which is also why __attribute__((always_inline)) and __attribute__((noinline)) exist for when you need to override the heuristic.
Loop-Invariant Code Motion (LICM) identifies computations inside a loop that produce the same result on every iteration, and hoists them outside the loop. If a loop computes strlen(s) on every iteration but s does not change, LICM moves the strlen call before the loop and stores the result in a temporary.
Common Subexpression Elimination (CSE) recognises when the same computation appears multiple times and replaces the duplicates with a single shared result. a[i*stride + j] and a[i*stride + k] share the subexpression i*stride; CSE computes it once.
Loop unrolling replicates the loop body so that each iteration of the rewritten loop does the work of several original iterations, reducing branch overhead and exposing more instruction-level parallelism to the CPU’s out-of-order engine. Unrolling by 4 means the loop counter is checked every 4 iterations instead of every 1. The compiler must also emit cleanup code for the remaining iterations when the trip count does not divide evenly.
Register Allocation
Registers are the fastest storage on a machine, but there are only sixteen general-purpose registers on x86-64. Register allocation decides which values live in registers and which must be spilled to the stack. The problem is equivalent to graph colouring: construct a graph where each live value is a node and edges connect values that are live simultaneously; the number of colours needed equals the number of registers required. When the graph requires more colours than registers, values are spilled.
Spills hurt performance because they turn register-speed access into cache-speed access. Writing code with fewer live variables - fewer temporaries, shorter variable lifetimes - reduces register pressure and produces better allocation.
Alias Analysis
Two pointers alias if they may refer to overlapping memory. Alias analysis determines whether two memory accesses can interfere. When the compiler cannot prove that pointers do not alias, it must assume they might, which blocks many reorderings.
void scale(float *out, const float *in, float factor, int n) {
    for (int i = 0; i < n; i++)
        out[i] = in[i] * factor;
}
Without a restrict qualifier, the compiler must consider that out and in might overlap. With restrict on both parameters, it knows they do not, enabling vectorization and other reorderings. Alias analysis is one of the biggest limiters of automatic optimisation in C and C++.
Link-Time Optimisation (LTO)
Normally, each .c file is compiled to a .o file independently. The compiler cannot inline across translation unit boundaries because at compile time it sees only one file at a time.
LTO defers final code generation until link time. Instead of machine code, the compiler emits IR into the .o file. The linker then combines all the IR, runs the full optimisation pipeline across the entire program, and generates machine code. Cross-module inlining, inter-procedural dead code elimination, and global constant propagation all become possible. LTO is enabled with -flto and can produce meaningful speedups (5–15% is common) for programs with many small functions spread across files.
Profile-Guided Optimisation (PGO)
The compiler makes decisions - inline this function or not? lay out this branch for the likely or unlikely case? - based on static heuristics. PGO replaces guesses with measurements.
The workflow: compile with -fprofile-generate, run the binary on representative inputs (it records branch frequencies and call counts into profile data files; Clang’s raw profiles are merged into a .profdata file with llvm-profdata), then recompile with -fprofile-use. The compiler now knows which functions are hot, which branches are taken 99% of the time, and which code paths are dead. It can inline aggressively where it matters, arrange code so hot paths fall through without jumping, and shrink cold paths. PGO typically yields 10–20% speedups on compute-heavy workloads.
Optimisation Levels
-O0: no optimisation; each statement compiles independently, making debugging exact.
-O1: basic optimisations that do not significantly increase compile time (DCE, simple inlining).
-O2: the standard production level; enables the full optimisation pipeline including inlining, LICM, CSE, and loop unrolling.
-O3: more aggressive transformations, including auto-vectorization and more aggressive inlining.
-Os: optimise for size, disabling transformations that grow the binary.
Undefined Behaviour as Optimisation Opportunity
C and C++ specify that certain programs have undefined behaviour (UB): signed integer overflow, null pointer dereference, use of uninitialised memory, out-of-bounds array access. The compiler is permitted to assume these situations never occur. This lets it eliminate range checks, remove impossible branches, and fold conditions - but it also means that programs containing UB can be “optimised” into something completely unexpected.
The classic example: code that dereferences ptr and only afterwards checks if (ptr != NULL) { ... }. Because dereferencing a null pointer is UB, the compiler concludes ptr cannot be NULL at the check and eliminates it entirely. A security check vanishes because of UB earlier in the function - possibly inside an inlined callee the author never looked at. UB-based misoptimisations are a real source of security vulnerabilities.
Examples
Godbolt.org for live exploration. Paste any C function into godbolt.org, select GCC or Clang, set the optimisation flag, and read the assembly output. Toggle between -O0 and -O2 on a loop with a constant bound and watch it collapse to a single mov instruction.
LICM in action. Write a loop that calls strlen on a string that does not change. At -O2, check objdump - the call will appear before the loop, not inside it. At -O0, it will be inside.
Why UB is dangerous. Compile a signed integer overflow check like if (x + 1 < x) abort(); with -O2. The compiler will remove the entire check because signed overflow is UB, so the condition is provably false. The bug now silently passes.
A compiler optimiser is a collection of formal proofs about code equivalence. When you understand which proofs the compiler can and cannot construct - and why - you can write code that is simultaneously readable and fast, and you know exactly when to reach for restrict, __builtin_expect, or LTO.
Read Next: