A compiler translates a program written in one language into an equivalent program in another. Usually that means high-level source code goes in and machine code comes out, but the same architecture applies whenever one formal language is mechanically transformed into another. What makes compilers interesting is that they must do this translation correctly, efficiently, and in a way that preserves the meaning of the original program exactly - even as they produce code that looks nothing like it.

The Phases of Compilation

A compiler works in a pipeline of phases, each transforming one representation into another. The clean separation between phases is what makes compilers tractable to build and reason about.

Lexical Analysis

The first phase is the lexer (also called the scanner or tokenizer). It reads the raw character stream and groups characters into tokens - the minimal meaningful units of the language.

int x = 3 + y * 2;
→ [INT] [IDENT:x] [EQUALS] [NUM:3] [PLUS] [IDENT:y] [STAR] [NUM:2] [SEMI]

The lexer discards whitespace and comments (which carry no meaning for later phases) and classifies each remaining group of characters by its token type. Lexers are typically specified using regular expressions, and the underlying recognition machinery is a deterministic finite automaton (DFA). Tools like lex and flex take a set of regular expression rules and generate the DFA automatically.
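
As a sketch of this regular-expression style of specification, here is a minimal tokenizer in Python. The rule table and token names are illustrative, not from any real tool, and Python's re engine stands in for the generated DFA:

import re

# One (name, pattern) rule per token type - a miniature lex specification.
TOKEN_SPEC = [
    ("NUM",    r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("PLUS",   r"\+"),
    ("STAR",   r"\*"),
    ("EQUALS", r"="),
    ("SEMI",   r";"),
    ("SKIP",   r"\s+"),            # whitespace: matched, then discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    # finditer walks the input left to right, preferring the leftmost
    # alternative at each position (it silently skips unmatched
    # characters; a real lexer would report them as errors).
    for match in MASTER.finditer(source):
        if match.lastgroup != "SKIP":
            yield (match.lastgroup, match.group())

print(list(tokenize("x = 3 + y * 2;")))
# [('IDENT', 'x'), ('EQUALS', '='), ('NUM', '3'), ('PLUS', '+'),
#  ('IDENT', 'y'), ('STAR', '*'), ('NUM', '2'), ('SEMI', ';')]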

Parsing

The parser takes the token stream and checks that it conforms to the language’s grammar, simultaneously building an Abstract Syntax Tree (AST) that captures the hierarchical structure of the program.

Context-free grammars describe the syntactic structure of programming languages. A grammar for arithmetic might include:

Expr → Expr + Term | Expr - Term | Term
Term → Term * Factor | Term / Factor | Factor
Factor → NUM | IDENT | ( Expr )

There are two main parsing strategies. LL(1) parsing (top-down: left-to-right scan, leftmost derivation, one token of lookahead) is what recursive descent parsers implement - each non-terminal in the grammar maps to a function in the parser. These are easy to write by hand and are used in GCC, Clang, and rustc. Note that the grammar above is left-recursive (Expr appears first in its own rule), which a top-down parser cannot handle directly; in practice the left recursion is rewritten as iteration, as in the recursive descent example later in this section. LR(1) parsing (bottom-up) is more powerful - it handles a larger class of grammars, including left-recursive ones directly - and is the approach taken by parsers generated from tools like yacc and bison (which by default produce LALR(1), a compact variant).

The AST for 3 + y * 2 looks like:

    +
   / \
  3   *
     / \
    y   2

The tree encodes operator precedence structurally: multiplication binds tighter because it sits deeper in the tree.

Semantic Analysis

Having established that the token stream is syntactically valid, the compiler must determine whether it is semantically meaningful. Semantic analysis answers questions that the grammar cannot: are variables declared before use? are types compatible? does this function call provide the right number of arguments?

The symbol table is the central data structure here. It maps identifiers to their declarations, including type, scope level, and storage class. Nested scopes (function bodies inside class bodies inside namespaces) are represented as a chain of symbol tables, with inner scopes shadowing outer ones.
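
A minimal sketch of that scope chain in Python (the Scope class and its fields are illustrative, not from any particular compiler):

class Scope:
    """One symbol table; 'parent' links to the enclosing scope."""
    def __init__(self, parent=None):
        self.symbols = {}
        self.parent = parent

    def declare(self, name, info):
        self.symbols[name] = info          # e.g. type, scope level, storage

    def lookup(self, name):
        # Walk outward through enclosing scopes; the first hit wins,
        # so inner declarations shadow outer ones.
        scope = self
        while scope is not None:
            if name in scope.symbols:
                return scope.symbols[name]
            scope = scope.parent
        raise NameError(f"use of undeclared identifier {name!r}")

global_scope = Scope()
global_scope.declare("x", {"type": "int"})
body_scope = Scope(parent=global_scope)
body_scope.declare("x", {"type": "float"})   # shadows the outer x
print(body_scope.lookup("x")["type"])        # float
print(global_scope.lookup("x")["type"])      # int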

Type checking verifies that every expression has a consistent type. In statically typed languages like C++ or Java, the type checker annotates every node in the AST with its type and rejects programs where types do not match. In languages with type inference (Haskell, Rust, OCaml), the compiler uses a Hindley-Milner-style algorithm to infer types that were never written down - a set of type equations is generated from the AST and solved by unification.
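
The unification step can be sketched in a few lines of Python. This is an illustrative toy, not a full inference engine: by convention here, strings beginning with 't' are type variables, other strings are concrete types, and the occurs check is omitted:

def is_var(t):
    # Toy convention: 't0', 't1', ... are type variables;
    # other strings ('int', 'bool') are concrete base types.
    return isinstance(t, str) and t.startswith("t")

def resolve(t, subst):
    # Follow variable bindings to the current representative.
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def unify(a, b, subst):
    """Solve the type equation a = b, extending the substitution.
    Constructed types are tuples, e.g. ('->', arg, result)."""
    a, b = resolve(a, subst), resolve(b, subst)
    if a == b:
        return subst
    if is_var(a):
        subst[a] = b                       # bind the free variable
        return subst
    if is_var(b):
        subst[b] = a
        return subst
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):             # unify component-wise
            unify(x, y, subst)
        return subst
    raise TypeError(f"cannot unify {a} with {b}")

# The identity function applied to an int:
#   ('->', 't0', 't0') = ('->', 'int', 't1')
print(unify(("->", "t0", "t0"), ("->", "int", "t1"), {}))
# {'t0': 'int', 't1': 'int'}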

The output of semantic analysis is an annotated AST: the original tree with type information and resolved symbol references attached to every node.

Intermediate Code Generation

Compilers rarely translate directly from AST to machine code. An intermediate representation (IR) is a language-independent, target-independent form that is easier to analyse and transform than either source code or machine code.

The most common IR form is three-address code: each instruction has at most one operator and three operands (two sources, one destination). For a = b + c * d:

t1 = c * d
t2 = b + t1
a  = t2

Modern compilers convert this into Static Single Assignment (SSA) form, where each variable is assigned exactly once. Where control flow merges, a special phi instruction selects between incoming values:

if cond goto L_true
x0 = 1
goto L_end
L_true:
x1 = 2
L_end:
x2 = phi(x0, x1)   ; x2 is 1 or 2 depending on path taken

SSA makes dataflow analysis - tracing how values move through the program - much simpler, which is why virtually every serious compiler uses it.

The IR is also represented as a control flow graph (CFG): a directed graph where each node is a basic block (a maximal straight-line sequence of instructions with no branches in or out except at the entry and exit), and edges represent possible control flow transfers.
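
Identifying basic blocks is mechanical: an instruction starts a new block if it is the first instruction, a branch target, or the instruction after a branch. A sketch over the toy instruction format used above (labels end in ':', branches start with 'goto' or 'if'):

def basic_blocks(instrs):
    """Split a list of instruction strings into basic blocks."""
    leaders = {0}                          # the first instruction leads a block
    for i, ins in enumerate(instrs):
        if ins.endswith(":"):              # a label is a branch target
            leaders.add(i)
        if ins.startswith(("goto", "if")) and i + 1 < len(instrs):
            leaders.add(i + 1)             # the instruction after a branch
    order = sorted(leaders)
    return [instrs[a:b] for a, b in zip(order, order[1:] + [len(instrs)])]

code = ["if cond goto L_true", "x0 = 1", "goto L_end",
        "L_true:", "x1 = 2", "L_end:", "x2 = phi(x0, x1)"]
for block in basic_blocks(code):
    print(block)
# ['if cond goto L_true']
# ['x0 = 1', 'goto L_end']
# ['L_true:', 'x1 = 2']
# ['L_end:', 'x2 = phi(x0, x1)']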

Optimisation

The middle-end applies a sequence of transformation passes to the IR, each preserving program semantics while improving performance or reducing size. Key passes include constant folding, dead code elimination, inlining, loop-invariant code motion, and common subexpression elimination. Because all passes operate on the same IR, they compose freely and are typically run repeatedly, since one pass often exposes opportunities for another - though the order in which passes run can affect the quality of the result (the phase-ordering problem).
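
For a flavour of what a pass looks like, here is a sketch of constant folding over straight-line three-address code. The tuple instruction format (dest, op, left, right) is invented for illustration:

def fold_constants(instrs):
    """One folding pass: evaluate operations whose operands are known
    constants, and propagate the results forward."""
    known = {}                             # variables proven constant so far
    out = []
    for dest, op, left, right in instrs:
        left = known.get(left, left)       # substitute known constants
        right = known.get(right, right)
        if isinstance(left, int) and isinstance(right, int):
            value = {"+": left + right, "*": left * right}[op]
            known[dest] = value
            out.append((dest, "const", value, None))   # folded away
        else:
            out.append((dest, op, left, right))
    return out

prog = [("t1", "*", 4, 2), ("t2", "+", "t1", 1), ("t3", "+", "x", "t2")]
print(fold_constants(prog))
# [('t1', 'const', 8, None), ('t2', 'const', 9, None), ('t3', '+', 'x', 9)]

A later dead code elimination pass could then delete the now-unused constant temporaries - an example of one pass exposing work for another.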

Code Generation

The backend translates optimised IR into machine instructions for a specific target. This involves three main tasks: instruction selection (choosing which machine instruction implements each IR operation), instruction scheduling (reordering instructions to hide latency and maximise pipeline utilisation), and register allocation (deciding which values live in registers and which must be spilled to memory).

Register allocation is modelled as a graph colouring problem: each live value is a node, edges connect simultaneously-live values, and the number of colours is the number of registers. When the graph needs more colours than registers, some values are spilled.
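
A greedy colouring sketch in Python, only to make the model concrete (production allocators use more sophisticated approaches, such as Chaitin-style simplification or linear scan):

def allocate(interference, k):
    """Assign each value one of k registers, or spill it.
    'interference' maps each value to the set of simultaneously-live values."""
    assignment, spills = {}, []
    # Colour the most-constrained values (most neighbours) first.
    for node in sorted(interference, key=lambda n: -len(interference[n])):
        taken = {assignment[n] for n in interference[node] if n in assignment}
        free = [r for r in range(k) if r not in taken]
        if free:
            assignment[node] = free[0]
        else:
            spills.append(node)            # no register left: spill to memory
    return assignment, spills

graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print(allocate(graph, 2))
# ({'a': 0, 'b': 1, 'c' spilled ...}, ['c']) - three mutually live values
# cannot share two registers, so one of them is spilled.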

Linking

The compiler produces one object file per source file. The linker combines them: it resolves symbol references (matching every use of a function or variable defined in another file to its definition), patches addresses (relocation), and produces the final executable. Linking is a separate phase, but link-time optimisation (LTO) blurs this boundary by deferring code generation to link time so the optimiser can see the whole program.
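
Symbol resolution can be sketched as a two-pass check over toy "object files" (plain dicts here; relocation and the actual patching of addresses are omitted):

def link(objects):
    """Check that every symbol reference resolves to exactly one definition."""
    defined = {}
    for obj in objects:                    # pass 1: collect definitions
        for sym in obj["defines"]:
            if sym in defined:
                raise ValueError(f"duplicate definition of {sym}")
            defined[sym] = obj["name"]
    for obj in objects:                    # pass 2: resolve references
        for sym in obj["references"]:
            if sym not in defined:
                raise ValueError(f"undefined reference to {sym}")
    return defined

main_o = {"name": "main.o", "defines": ["main"], "references": ["helper"]}
util_o = {"name": "util.o", "defines": ["helper"], "references": []}
print(link([main_o, util_o]))   # {'main': 'main.o', 'helper': 'util.o'}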

JIT vs AOT Compilation

Ahead-of-time (AOT) compilation is what GCC and Clang do: the full compilation pipeline runs before the program is distributed. The user receives a binary that starts immediately.

Just-in-time (JIT) compilation happens at runtime. CPython is a useful contrast: it compiles Python source to bytecode (a compact format for its interpreter, not machine code) and executes that bytecode without producing native code. PyPy's JIT goes further: it identifies hot loops at runtime, compiles them to native machine code on the fly, and applies speculative optimisations based on observed runtime types. The advantage is that a JIT can exploit information only available at runtime (concrete types, actual branch frequencies); the disadvantage is warmup time.
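
You can inspect CPython's bytecode directly with the standard dis module:

import dis

def f(y):
    return 3 + y * 2

dis.dis(f)
# Prints a stack-machine encoding of the expression: load the constant 3,
# load y, load 2, multiply, add, return. Exact opcode names vary between
# CPython versions (e.g. BINARY_MULTIPLY before 3.11, BINARY_OP after).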

Examples

Lexer for arithmetic expressions. Given the input 3 + y * 2, a hand-written lexer scans character by character, accumulating digits into NUM tokens and letters into IDENT tokens, and emitting single-character tokens for operators and punctuation.
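
A sketch of such a hand-written scanner in Python (token names follow the earlier examples; error handling is minimal):

def lex(src):
    """Scan character by character, accumulating digit runs into NUM
    tokens and letter runs into IDENT tokens."""
    tokens, i = [], 0
    single = {"+": "PLUS", "-": "MINUS", "*": "STAR", "/": "SLASH",
              "(": "LPAREN", ")": "RPAREN", ";": "SEMI", "=": "EQUALS"}
    while i < len(src):
        c = src[i]
        if c.isspace():
            i += 1                         # whitespace is discarded
        elif c.isdigit():
            j = i
            while j < len(src) and src[j].isdigit():
                j += 1
            tokens.append(("NUM", src[i:j])); i = j
        elif c.isalpha() or c == "_":
            j = i
            while j < len(src) and (src[j].isalnum() or src[j] == "_"):
                j += 1
            tokens.append(("IDENT", src[i:j])); i = j
        elif c in single:
            tokens.append((single[c], c)); i += 1
        else:
            raise SyntaxError(f"unexpected character {c!r}")
    return tokens

print(lex("3 + y * 2"))
# [('NUM', '3'), ('PLUS', '+'), ('IDENT', 'y'), ('STAR', '*'), ('NUM', '2')]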

Recursive descent parser. Parse 3 + y * 2 with functions parseExpr(), parseTerm(), parseFactor() mirroring the grammar non-terminals. parseFactor handles numbers and identifiers; parseTerm calls parseFactor and then loops consuming * and /; parseExpr calls parseTerm and loops consuming + and -. This naturally produces the right precedence tree.
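
A sketch in Python, reusing lex from the previous example and building the AST as nested tuples (the tuple representation is illustrative). Note how each loop replaces a left-recursive grammar rule:

def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]; pos += 1
        return tok

    def parse_factor():                    # Factor -> NUM | IDENT | ( Expr )
        kind, text = advance()
        if kind == "NUM":
            return int(text)
        if kind == "IDENT":
            return text
        if kind == "LPAREN":
            node = parse_expr()
            advance()                      # consume ')' (a real parser checks)
            return node
        raise SyntaxError(f"unexpected token {text!r}")

    def parse_term():                      # Term -> Factor (('*'|'/') Factor)*
        node = parse_factor()
        while peek() in ("STAR", "SLASH"):
            op = advance()[1]
            node = (op, node, parse_factor())
        return node

    def parse_expr():                      # Expr -> Term (('+'|'-') Term)*
        node = parse_term()
        while peek() in ("PLUS", "MINUS"):
            op = advance()[1]
            node = (op, node, parse_term())
        return node

    return parse_expr()

print(parse(lex("3 + y * 2")))   # ('+', 3, ('*', 'y', 2))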

Three-address code for a = b + c * d. The compiler first computes c * d into a temporary t1, then b + t1 into t2, then assigns t2 to a. The tree structure of the AST directly determines the order of the temporaries.
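
A sketch of that flattening in Python, walking the nested-tuple AST from the parser above in post-order so that operands are computed before the operations that use them:

def to_tac(ast):
    """Flatten a nested-tuple AST into three-address instructions,
    allocating one temporary per operator node."""
    instrs, n = [], 0

    def walk(node):
        nonlocal n
        if not isinstance(node, tuple):    # leaf: a constant or a variable
            return node
        op, left, right = node
        l, r = walk(left), walk(right)     # post-order: operands first
        n += 1
        temp = f"t{n}"
        instrs.append(f"{temp} = {l} {op} {r}")
        return temp

    return instrs, walk(ast)

ast = ("+", "b", ("*", "c", "d"))          # right-hand side of a = b + c * d
instrs, result = to_tac(ast)
print(instrs + [f"a = {result}"])
# ['t1 = c * d', 't2 = b + t1', 'a = t2']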

Every tool in the compiler ecosystem - sanitisers, static analysers, transpilers, language servers - is built on some portion of this same pipeline. Learning to read IR, understanding what the parser sees, and knowing where type information lives are skills that transfer to every language toolchain you will ever encounter.

