How Compilers Work - From Source Code to Something a CPU Understands
Helpful context:
- Context-Free Grammars - Languages With Recursive Rules
- Automata Theory - What Machines Can and Cannot Compute
You write x = a + b * 2. The compiler turns this into machine code that:
- Evaluates
b * 2beforea + b * 2because multiplication has higher precedence - Computes
b * 2using addition rather than multiplication, because addition is cheaper on most hardware - Eliminates the variable
xentirely if it’s never used - Inlines the function call if you wrapped this in a tiny function
How does a text transformation tool - which is all a compiler is - produce code that is faster than most programmers could hand-write? The answer is a pipeline of increasingly abstract representations, each designed to make a different class of optimization obvious. Understanding that pipeline is what separates engineers who treat compilation as a black box from engineers who can predict what the compiler will do and write code that cooperates with it.
The Historical Context: Why Compilers Exist
Before compilers, programs were written in machine code or assembly directly. Adding two numbers meant knowing the binary encoding of the add instruction, knowing the register numbers, encoding the operands. Grace Hopper, working on the UNIVAC in the early 1950s, observed that programmers kept writing the same routines - square root, logarithm, output formatting. She built A-0 (1952), a system that could retrieve these routines from a catalog and combine them. This was not quite a compiler in the modern sense - it was more a linker, a tool that takes separately compiled pieces of code and combines them into a single executable - but it established the idea: code that writes code.
The first true compiler, in the sense of translating a high-level language to machine code, is usually credited to John Backus’s FORTRAN compiler at IBM (1957). The radical claim was that a compiler could produce code as fast as an expert human programmer. Most computer scientists at the time believed this was impossible. The FORTRAN team proved them wrong - and established compiler optimization as a research field.
The next revolution was LLVM (2003), developed by Chris Lattner as a University of Illinois PhD project. Before LLVM, building a new language required implementing a complete compiler backend - the component that translates an intermediate representation into machine instructions for a specific CPU. This meant writing instruction selection, scheduling, and register allocation (covered in Phase 6 below) from scratch for every new language and every target architecture. LLVM provided a reusable, language-agnostic backend. You only need to compile your language to LLVM’s intermediate representation (IR) - a structured, language-independent form that LLVM can optimize and translate to machine code - and LLVM handles the rest. Clang, Rust, Swift, Julia, Kotlin Native, Emscripten, and dozens of other compilers all use LLVM. This is one person’s PhD thesis reshaping the compiler landscape for twenty years.
Phase 1: Lexical Analysis
The compiler’s first job is to read the character stream and break it into tokens - the meaningful units of the language. The lexer doesn’t care about structure; it only cares about boundaries.
int x = 3 + y * 2;
→ [KW_INT] [IDENT "x"] [EQ] [NUM 3] [PLUS] [IDENT "y"] [STAR] [NUM 2] [SEMI]
The lexer discards whitespace and comments. It classifies character sequences: a sequence of digits is NUM, a sequence of letters and digits starting with a letter is IDENT, single characters like + and = are themselves tokens.
The formal machinery underlying lexers is the deterministic finite automaton (DFA) - a state machine that reads input characters one at a time and transitions between states, accepting when it reaches a final state. Each state represents what the machine has recognized so far; reading a digit in the “reading a number” state keeps it in that state, while reading a space terminates the number token. The lexer’s job is exactly what a DFA computes: recognize strings in a regular language. Token types are specified as regular expressions; the lexer generator (lex, flex, re2c) converts these to a combined DFA that recognizes all token types simultaneously and tells you which one matched.
In practice, hand-written lexers are common for production compilers (GCC’s lexer is hand-written; so is Clang’s) because they handle edge cases more explicitly and are easier to tune for performance. The conceptual model is the same either way.
Phase 2: Parsing
The parser takes the token stream and determines whether it conforms to the language’s grammar, simultaneously building an Abstract Syntax Tree (AST) - a tree that represents the hierarchical structure of the program.
Context-free grammars describe what valid programs look like. A grammar for arithmetic expressions:
Expr → Expr + Term | Expr - Term | Term
Term → Term * Factor | Term / Factor | Factor
Factor → ( Expr ) | NUM | IDENT
Each line is a production - a grammar rule defining how a non-terminal symbol (like Expr) can be rewritten as a sequence of symbols. This grammar encodes operator precedence structurally: multiplication (Term * Factor) nests inside addition (Expr + Term), so multiplication will always sit deeper in the parse tree, evaluated first.
The AST for 3 + y * 2:
ADD
/ \
3 MUL
/ \
y 2
MUL is deeper, so it is evaluated first. The tree encodes precedence without any special runtime checks.
The Discomfort: Why Parsing Is Hard
Parsing is where most compiler courses lose students. The formalism - LL parsers, LR parsers, shift-reduce conflicts, FIRST and FOLLOW sets (the sets of tokens that can legally appear at the start of or after a given non-terminal, used to guide parsing decisions) - is dense and the connection to the intuition is non-obvious.
Here is the intuition, without the full formalism.
A recursive descent parser is a direct translation of the grammar into mutually recursive functions. Each non-terminal in the grammar becomes a function. The parseExpr function handles additions and subtractions. The parseTerm function handles multiplications and divisions. The parseFactor function handles atoms (numbers, identifiers, parenthesized expressions).
def parse_expr(tokens):
left = parse_term(tokens)
while tokens.peek() in ('+', '-'):
op = tokens.consume()
right = parse_term(tokens)
left = BinaryOp(op, left, right)
return left
def parse_term(tokens):
left = parse_factor(tokens)
while tokens.peek() in ('*', '/'):
op = tokens.consume()
right = parse_factor(tokens)
left = BinaryOp(op, left, right)
return left
def parse_factor(tokens):
if tokens.peek() == '(':
tokens.consume('(')
expr = parse_expr(tokens)
tokens.consume(')')
return expr
elif tokens.peek().type == 'NUM':
return Number(tokens.consume())
else:
return Identifier(tokens.consume())
This is LL(1) parsing - top-down (it starts from the top-level rule and works down to tokens), left-to-right (it reads tokens left to right), using one token of lookahead (it only needs to look one token ahead to decide which production to apply). GCC, Clang, and the Rust compiler all use hand-written recursive descent parsers.
Left-recursive productions - like Expr → Expr + Term, where Expr appears on both sides - cause infinite recursion in recursive descent. They are converted to right recursion or to an iterative while loop, as in the parse_expr example above.
LR parsing (bottom-up, right-to-left) handles a strictly larger class of grammars and is more powerful, but it is essentially impossible to write by hand - you generate it using tools like yacc, bison, or LALR parser generators. The generated parser uses a stack-based state machine. It is efficient but opaque. Error messages from LR-generated parsers are notoriously bad (“unexpected token at line 42”), which is why most modern production compilers prefer hand-written recursive descent with custom error recovery.
Phase 3: Semantic Analysis
Having established that the token stream is syntactically valid, the compiler determines whether it is semantically meaningful. Syntax answers “is this grammatical?” Semantics answers “does this make sense?”
The symbol table maps every identifier to its declaration: type, scope, storage class. When the parser sees x = 3 + y * 2, semantic analysis looks up x and y in the symbol table to find their types, checks that the types are compatible with + and *, and checks that x has been declared (not just used).
Nested scopes work as a chain of symbol tables. A function body’s symbol table has a parent pointer to the file scope’s symbol table. Looking up a name walks the chain outward until a declaration is found - this is the implementation of variable shadowing.
Type checking annotates every AST node with its type. For 3 + y * 2:
3has typeintyhas type (from the symbol table) - sayint2has typeinty * 2has typeint(int × int → int)3 + (y * 2)has typeint
Type errors are detected here. Assigning a string to an int variable, passing 3 arguments to a function that expects 2, using an undeclared identifier - all caught in semantic analysis, before any code generation.
Type inference (Haskell, Rust, OCaml, modern C++ auto) extends this: the compiler solves a system of type equations generated from the AST, inferring types that were not written. The Hindley-Milner algorithm - which works by assigning a fresh type variable to each expression, then solving the resulting constraints through unification (matching types together and propagating what is known) - solves this for a large class of languages in essentially linear time. When Rust infers that let x = vec.iter().map(|y| y * 2).collect() produces a Vec<i32>, it is running Hindley-Milner on the expression tree.
The output of semantic analysis is an annotated AST: every node tagged with its type, every identifier resolved to its declaration, every scope boundary marked.
Phase 4: Intermediate Representation (IR)
Compilers don’t translate AST directly to machine code. An intermediate representation is a language-independent, target-independent form that is easier to analyze and transform than either source or machine code.
Before discussing LLVM IR, it helps to understand what a CPU register is. A register is a tiny, extremely fast storage location built directly into the CPU - faster than any memory, because it is part of the processor itself. Modern CPUs have a small fixed number of them (typically 16-32 general-purpose registers on x86-64). The register file is the complete set of physical registers the CPU makes available. Every arithmetic instruction operates on registers: add rax, rbx adds the value in register rbx to register rax.
The most influential IR design is LLVM IR - a textual format that resembles assembly but operates on an infinite set of typed virtual registers rather than the CPU’s finite register file. These virtual registers are not real hardware registers; they are placeholders that the compiler later maps to real ones.
LLVM IR for a simple function:
define i32 @add_mul(i32 %a, i32 %b) {
entry:
%mul = mul i32 %b, 2 ; b * 2
%add = add i32 %a, %mul ; a + (b * 2)
ret i32 %add
}
The key features:
- Typed: every value has an explicit type (
i32,float,ptr). This enables type-based aliasing analysis and catches bugs in lowering. - Infinite virtual registers: no spilling or register allocation yet. Every sub-expression gets its own register (
%mul,%add). - SSA form: each register is assigned exactly once. This is the key that makes optimization tractable.
SSA (Static Single Assignment) form means every variable is defined in exactly one place. When control flow merges (after an if/else), the variable has multiple possible values. SSA handles this with a phi instruction - named after the mathematical notation φ used in early SSA formalisms - that selects between values based on which branch was taken:
define i32 @abs(i32 %x) {
entry:
%cmp = icmp slt i32 %x, 0 ; x < 0?
br i1 %cmp, label %neg, label %pos
neg:
%negx = sub i32 0, %x ; -x
br label %merge
pos:
br label %merge
merge:
%result = phi i32 [ %negx, %neg ], [ %x, %pos ] ; pick based on which branch
ret i32 %result
}
SSA form makes many optimizations trivially easy to implement. Def-use chains - links from each definition of a value to every place that uses it - are explicit in SSA: every use of a register traces directly back to its unique definition. Dead code elimination: a register with no uses can be deleted. Constant folding: if a register is defined as a constant, substitute the constant at every use. These require complex analysis on non-SSA code; on SSA, they reduce to simple traversals.
Phase 5: Optimization Passes
The middle-end applies a sequence of optimization passes to the IR. Each pass is a transformation: it reads the IR, applies an analysis, and writes modified IR. Because all passes operate on the same IR representation, they compose freely.
Constant folding: evaluate expressions whose operands are known at compile time. 2 * 3 * 4 → 24. This sounds trivial, but combined with other optimizations, it cascades: if a branch condition becomes a constant, the dead branch can be eliminated.
Dead code elimination (DCE): remove instructions whose results are never used. DCE is straightforward in SSA: any register with no uses is dead and can be removed. Removing it may make other registers dead, so DCE iterates until no more dead code exists.
Inlining: replace a function call with the body of the function, substituting arguments. Inlining eliminates call overhead (saving/restoring registers, branch prediction penalty) and enables further optimizations - the inlined function’s code can now be folded with the caller’s. The trade-off is code size: inlining every function call would produce enormous binaries and thrash the instruction cache. Compilers use heuristics (function size, call frequency, hot path indicators) to decide when to inline.
Loop-invariant code motion (LICM): move computations out of loops when the result doesn’t change across iterations.
// Before LICM:
for (int i = 0; i < n; i++)
a[i] = b[i] * (x + y); // x + y computed n times
// After LICM:
float xy = x + y;
for (int i = 0; i < n; i++)
a[i] = b[i] * xy; // x + y computed once
Strength reduction: replace expensive operations with cheaper equivalents. Multiplying by a constant power of two becomes a bit shift (×8 → left shift by 3). An induction variable - a variable whose value changes by a fixed amount each loop iteration, like i in a for loop or a derived quantity like i * stride - can have its multiply replaced by a running addition. Each iteration simply adds the stride to the previous value rather than recomputing i * stride from scratch.
Common subexpression elimination (CSE): compute a shared sub-expression once. A basic block is a straight-line sequence of instructions with no branches in or out - execution always enters at the top and exits at the bottom. Within a basic block, if a * b + c appears twice, the compiler computes a * b once, saves the result in a register, and reuses it.
Alias analysis: determine which memory locations might refer to the same address. Without alias information, the compiler must conservatively assume any load might return a value just written by any store. Alias analysis allows the compiler to prove “these two pointers can’t alias” and eliminate spurious dependencies, enabling more aggressive reordering and elimination.
Modern compilers (LLVM, GCC) run dozens of passes, some multiple times, in a carefully tuned order. The pass order matters: inlining before CSE allows CSE to eliminate subexpressions revealed by inlining; running DCE after constant folding removes branches that folding made constant. Getting the pass order right is one of the deep expertise areas of compiler engineering.
Phase 6: Code Generation
The backend translates optimized IR into machine instructions for a specific target. Three sub-problems:
Instruction selection: which machine instruction implements each IR operation? For add i32, use addl. For mul i32 with a constant power of two, use shll (shift left). For multiplication by a small constant like 3, compilers often use a load effective address (leal) instruction - originally designed to compute memory addresses (base + index × scale + offset) but repurposed for cheap arithmetic, since it can compute expressions like x + x*2 in a single cycle without a true multiply. These decisions are made by pattern matching: a pattern matcher maps IR subtrees to instruction sequences. In LLVM, instruction patterns are specified in TableGen files - a domain-specific language used within LLVM to describe target architectures and their instruction encodings. In GCC, they are specified using RTL (Register Transfer Language) patterns - a representation that describes computations as transfers of values between abstract registers, making instruction-level pattern matching systematic.
Instruction scheduling: reorder instructions to maximize pipeline utilization. A pipeline is the hardware mechanism by which a CPU overlaps the execution of multiple instructions - while one instruction is in its execution stage, the next is being decoded, and the one after is being fetched from memory. Modern CPUs are superscalar (they can execute multiple instructions per cycle simultaneously, by having multiple independent execution units) and out-of-order (they can execute instructions in a different order than they appear in code, as long as data dependencies are respected). A hazard occurs when an instruction needs the result of a previous instruction that has not finished yet - if instruction A produces a value and instruction B uses it, and A has 4 cycles of latency, placing B immediately after A stalls the pipeline for 4 cycles. The compiler pre-schedules instructions to fill those gaps with independent work.
Register allocation: assign virtual IR registers to physical CPU registers. When there are more live values than registers, some must be spilled - stored to the stack and reloaded when needed. A value is live at a point in the program if it will be used again before being overwritten. Register allocation is NP-hard in general - meaning no known polynomial-time algorithm exists and the problem grows exponentially in the worst case, making exact optimal solutions impractical for large programs. The standard formulation is graph coloring: build an interference graph where each live range is a node and two nodes share an edge if their live ranges overlap (meaning they cannot occupy the same register simultaneously); then assign colors (physical registers) such that no two adjacent nodes share a color, using as few colors as possible.
In practice, two approaches dominate. Linear scan allocation, used by many JIT compilers for speed, treats live ranges as intervals on a number line, scans left to right, assigns registers greedily, and spills the interval with the longest remaining extent when registers run out. Graph-coloring allocation, used by GCC and LLVM for code quality, operates directly on the interference graph and produces better register assignments at the cost of more compile time.
The LLVM IR: Why It Revolutionized the Ecosystem
Before LLVM, building a new language required implementing a complete compiler backend: instruction selection, scheduling, and register allocation for each target CPU. This was a multi-year project, which is why most research languages (ML, Haskell, OCaml) had backends for only one or two architectures.
LLVM changed this by providing a high-quality reusable backend with a clean IR interface. Compiling to LLVM IR requires weeks of work, not years. The resulting compiler targets every architecture LLVM supports (x86-64, ARM, RISC-V, PowerPC, WebAssembly, and more), with competitive code quality.
The languages that owe their practical viability to LLVM:
- Clang (C/C++/Objective-C): the compiler Apple ships with Xcode
- Rust: Rust’s compiler emits LLVM IR; without LLVM, Rust’s backend would have required years more work
- Swift: uses LLVM; without it, Apple would have needed to build a backend from scratch
- Julia: LLVM enables just-in-time compilation of Julia to native code for any LLVM target
- Kotlin Native: compiles Kotlin to native binaries via LLVM
- Emscripten: compiles C/C++ to WebAssembly by compiling to LLVM IR, then to Wasm
LLVM’s success has created its own form of lock-in: there is now strong pressure for new languages to target LLVM, because doing so gives you a production-quality backend “for free.” Languages that don’t use LLVM (V8 for JavaScript, CPython for Python, OpenJDK’s JIT for Java) have invested enormous resources into their own backends - resources only feasible for languages with massive ecosystems and corporate backing.
JIT vs AOT: Two Compilation Philosophies
Ahead-of-time (AOT) compilation is the traditional model: the full pipeline runs before the program is distributed. The user receives a binary that starts immediately, with no compilation overhead at runtime. GCC, Clang, and the Rust compiler all do AOT.
Just-in-time (JIT) compilation compiles code at runtime. The key insight is that real programs spend most of their time in a small fraction of the code - the hot paths, the code paths that execute frequently (the roughly 20% of code that runs 80% of the time). JIT compilers identify these hot paths by profiling at runtime and compile only those, while interpreting the rest. The trade-offs differ from AOT:
Advantages of JIT:
- Can optimize using runtime information: concrete types, actual branch frequencies, observed value distributions
- Can deoptimize - discard compiled machine code and fall back to slower interpreted execution - and recompile if a previous assumption (like “this value is always an integer”) turns out to be false at runtime
- Works for languages where types and code are not fully known at compile time, including those using dynamic dispatch (the mechanism that selects which method to call at runtime based on the actual type of an object, rather than its declared type) or monkey-patching (modifying or replacing functions and classes at runtime - replacing
json.loadswith a custom version, for example)
Disadvantages of JIT:
- Startup latency (cold code paths are slow until warmed up)
- Memory overhead (the program must store both the intermediate representation and the compiled machine code simultaneously)
- Optimization is time-limited (you can’t spend seconds optimizing a single function at runtime)
CPython compiles Python to bytecode (.pyc files) - a compact instruction set for the CPython virtual machine, sitting between source code and machine code. Bytecode instructions are higher-level than CPU instructions (one bytecode instruction might do what several machine instructions do) and are platform-independent. CPython’s VM is an interpreter: it reads each bytecode instruction and executes it in a C switch statement. This is simple and portable but slow.
PyPy takes a different approach: it is a JIT compiler for Python implemented in RPython (a subset of Python). PyPy identifies hot loops at runtime, compiles them to machine code, and applies speculative optimizations based on observed types. Typical speedup over CPython: 5-10x for CPU-bound code. The limitation: PyPy lags CPython in compatibility and is less effective for code that spends most of its time in C extension modules (NumPy, PyTorch) - those are already compiled and JIT doesn’t help them.
Java’s HotSpot JVM is the most mature JIT compiler. It starts interpreting bytecode, identifies hot methods (those called more than a threshold number of times), compiles them with the client JIT (fast compile, moderate optimization), and recompiles the hottest methods with the server JIT (slow compile, aggressive optimization). The result: Java code that runs for minutes often achieves performance close to C++ equivalents.
JAX’s Tracing: Compilation as a Library Call
JAX’s approach to compilation is conceptually interesting and worth understanding separately from traditional compilers.
When you call jax.jit(f)(x), JAX traces f - it runs the function with abstract “tracer” values that record the operations performed without carrying concrete numbers. The trace produces an XLA HLO computation graph. XLA (Accelerated Linear Algebra) is a domain-specific compiler for array computations, used in JAX and TensorFlow. HLO (High-Level Optimizer) is XLA’s intermediate representation - a graph of high-level array operations like matrix multiply, convolution, and element-wise arithmetic. XLA compiles that HLO graph to machine code and caches it. Subsequent calls with the same input shapes skip Python entirely and execute the compiled binary.
This is not a compiler in the traditional sense - there is no lexer, no parser, no AST. The “source code” is a Python function; JAX captures the computation graph by executing it with abstract values.
The limitation: Python control flow that depends on concrete values is frozen at trace time. if x > 0 where x is a traced JAX value evaluates the condition with an abstract value, producing an arbitrary result. The branch taken at trace time is baked into the compiled binary.
torch.compile (PyTorch 2.0) solves this differently: it captures the Python bytecode executed during a real forward pass using a bytecode interpreter called Dynamo, converts the captured computation to a graph IR called an FX graph (a directed graph of PyTorch operations with explicit data flow between them), and compiles that IR using Inductor - PyTorch’s compiler backend, which lowers FX graphs to optimized C++/CUDA or Triton GPU kernels. The advantage: it handles arbitrary PyTorch code with fallback to eager execution when compilation fails. The disadvantage: less aggressive optimization than XLA because the compilation boundary is less strict.
The Critique: Compilers vs ML-Guided Compilation
Traditional compiler optimizations are rule-based: each pass applies a transformation whenever a specific pattern is recognized. The patterns are written by experts who understand hardware and have analyzed benchmarks. This works well for common patterns but is brittle for unusual code.
ML-guided compilation uses machine learning to make optimization decisions. MLGO (ML-Guided Optimization), deployed in Google’s production LLVM builds, uses a reinforcement learning policy to make inlining decisions - whether to inline a given function call at a given call site. The RL policy outperforms the heuristic-based inlining decisions by 3-5% on benchmark suites.
Autotuning (used by TensorFlow XLA, CUDA libraries, and Halide) searches over a space of implementation choices (loop tile sizes, operation orderings) by empirically benchmarking different configurations and selecting the best. This is slower than rule-based optimization (it requires actual execution) but can find configurations no human-written rule would discover.
The longer-term question is whether ML will eventually outperform rule-based compilers across the board. The evidence so far is mixed: ML-guided optimization outperforms heuristics for specific decisions (inlining, tile size selection) but doesn’t yet replace entire optimization passes. The main obstacle is the cost of training: generating high-quality training data for compiler optimization requires compiling large codebases many times with different decisions and measuring the resulting performance.
Summary
Two key data structures appear in the summary table that are worth naming explicitly. A Control Flow Graph (CFG) is a graph where each node is a basic block and edges represent possible jumps between them - an if/else creates two outgoing edges, one for each branch; it is the primary structure for reasoning about program flow. A call graph is a graph where nodes are functions and edges represent calls between them, used by inter-procedural optimizations like inlining to understand which functions call which others.
| Phase | Input | Output | Key Data Structures |
|---|---|---|---|
| Lexical analysis | Character stream | Token stream | DFA, token table |
| Parsing | Token stream | AST | Recursive descent functions, LR automaton |
| Semantic analysis | AST | Annotated AST | Symbol table, type environment |
| IR generation | Annotated AST | SSA IR (LLVM IR) | CFG, def-use chains |
| Optimization | SSA IR | Optimized SSA IR | Call graph, alias analysis, loop tree (a tree of nested loops, used to drive LICM and strength reduction) |
| Code generation | Optimized IR | Machine code | Instruction selection patterns, interference graph |
| Compiler | Language | Backend | AOT/JIT |
|---|---|---|---|
| GCC | C/C++/Fortran | GCC RTL | AOT |
| Clang | C/C++/ObjC/CUDA | LLVM | AOT |
| rustc | Rust | LLVM (via MIR) | AOT |
| Swift | Swift | LLVM (via SIL) | AOT |
| Julia | Julia | LLVM | JIT |
| CPython | Python | CPython VM | Interpreted |
| PyPy | Python | RPython/LLVM | JIT |
| JAX (XLA) | Python subset (traced) | XLA HLO | JIT |
| V8 (Turbofan) | JavaScript | V8 Turbofan | JIT |
Read Next: