Context-Free Grammars & Parsing
Prerequisite: Automata Theory & FSMs
Context-Free Grammars
A context-free grammar (CFG) is a formal system for describing recursive syntactic structure. It is the backbone of programming language design: virtually every language reference manual defines syntax using a grammar.
Formally, a CFG is a 4-tuple $G = (V, \Sigma, R, S)$ where:
- $V$ is a finite set of variables (nonterminals)
- $\Sigma$ is a finite set of terminals (disjoint from $V$)
- $R \subseteq V \times (V \cup \Sigma)^\ast$ is a finite set of production rules, written $A \to \alpha$
- $S \in V$ is the start variable
A string $\alpha A \beta$ derives $\alpha \gamma \beta$ in one step (written $\alpha A \beta \Rightarrow \alpha \gamma \beta$) if $A \to \gamma$ is a production. The language of $G$ is:
$$L(G) = \{\, w \in \Sigma^\ast : S \xRightarrow{*} w \,\}$$
where $\xRightarrow{*}$ denotes zero or more derivation steps.
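The definition can be made concrete with a short sketch (in Python; the encoding and the names `GRAMMAR` and `language_up_to` are illustrative, not standard). It enumerates $L(G)$ for the toy grammar $S \to aSb \mid \varepsilon$ by breadth-first search over sentential forms, expanding the leftmost variable at each step:

```python
from collections import deque

# Toy grammar S -> aSb | ε.  Encoding convention (an assumption of this
# sketch): nonterminals are uppercase letters, terminals lowercase.
GRAMMAR = {'S': ['aSb', '']}

def language_up_to(max_len):
    """Enumerate the words of L(G) with at most max_len terminals by
    breadth-first search over sentential forms."""
    seen, words = {'S'}, set()
    queue = deque(['S'])
    while queue:
        form = queue.popleft()
        # locate the leftmost variable, if any
        i = next((k for k, c in enumerate(form) if c.isupper()), None)
        if i is None:                    # no variables left: a word of L(G)
            words.add(form)
            continue
        for rhs in GRAMMAR[form[i]]:
            new = form[:i] + rhs + form[i + 1:]
            # prune sentential forms already carrying too many terminals
            if sum(c.islower() for c in new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(words, key=len)
```

For this grammar the enumeration yields $\varepsilon, ab, aabb, \ldots$, i.e. exactly $\{a^n b^n : n \geq 0\}$ up to the length bound.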
Parse Trees and Derivations
A parse tree for a string $w \in L(G)$ is a rooted ordered tree where internal nodes are labeled by variables, leaves are labeled by terminals or $\varepsilon$, the root is labeled $S$, and for every internal node labeled $A$ whose children are labeled $X_1, \ldots, X_k$ (left to right), $A \to X_1 \cdots X_k$ is a production.
The yield of a parse tree is the concatenation of its leaves. A string $w$ may have multiple parse trees; such a grammar is called ambiguous. For example, the grammar $E \to E + E \mid E \times E \mid a$ is ambiguous: the string $a + a \times a$ has two parse trees, corresponding to $(a + a) \times a$ and $a + (a \times a)$.
Two standard derivation orders fix the order in which variables are expanded, though they do not remove ambiguity:
- Leftmost derivation: always expand the leftmost variable
- Rightmost derivation: always expand the rightmost variable
A grammar is unambiguous if every string in $L(G)$ has exactly one leftmost (equivalently, rightmost) derivation. Ambiguity is undecidable in general: no algorithm can determine whether an arbitrary CFG is ambiguous.
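A quick way to witness the ambiguity above is to count parse trees with a small dynamic program (a sketch; the token tuple and the name `tree_count` are illustrative). Under $E \to E + E \mid E \times E \mid a$, every parse of a span is determined by the position of its top-level operator:

```python
from functools import lru_cache

# Tokens of "a + a × a"; '*' stands in for × in the code.
TOKENS = ('a', '+', 'a', '*', 'a')

@lru_cache(maxsize=None)
def tree_count(i, j):
    """Number of distinct parse trees deriving TOKENS[i..j] from E."""
    if i == j:
        return 1 if TOKENS[i] == 'a' else 0
    total = 0
    for k in range(i + 1, j, 2):   # candidate top-level operator positions
        if TOKENS[k] in '+*':
            total += tree_count(i, k - 1) * tree_count(k + 1, j)
    return total
```

Here `tree_count(0, 4)` returns 2, one tree per bracketing; longer strings grow quickly (the counts are the Catalan numbers).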
Chomsky Normal Form
Any CFG can be transformed into Chomsky Normal Form (CNF), where every production is of the form:
$$A \to BC \quad \text{or} \quad A \to a$$
with $A, B, C \in V$, $B \neq S$, $C \neq S$, and $a \in \Sigma$; the single exception $S \to \varepsilon$ is permitted so that languages containing $\varepsilon$ can still be expressed.
The conversion algorithm proceeds in five steps:
- Add a fresh start variable. Introduce $S_0$ with the rule $S_0 \to S$, so that the start variable never appears on a right-hand side (this is what makes the $B \neq S$, $C \neq S$ condition achievable).
- Eliminate $\varepsilon$-productions. Find all nullable variables (those deriving $\varepsilon$), then for each production whose right-hand side contains nullable variables, add a copy for every way of deleting a subset of those nullable occurrences. Delete all $A \to \varepsilon$ rules (except possibly one for the start variable).
- Eliminate unit productions. For each unit rule $A \to B$, add $A \to \gamma$ for every $B \to \gamma$ that is not a unit rule. Remove all unit rules.
- Eliminate long right-hand sides. Replace $A \to X_1 X_2 \cdots X_k$ ($k \geq 3$) with $A \to X_1 A_1$, $A_1 \to X_2 A_2$, …, $A_{k-2} \to X_{k-1} X_k$ using fresh variables.
- Eliminate terminals mixed with variables. Replace any terminal $a$ appearing in a rule with more than one symbol by a new variable $N_a$ and add the rule $N_a \to a$.
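The last two steps are mechanical enough to sketch in code (an illustrative encoding: rules are `(head, body)` pairs, terminals are lowercase, and the fresh names `N_a`, `X0`, ... are assumed not to clash with existing variables):

```python
import itertools

def to_binary_rules(grammar):
    """Steps of the CNF conversion that lift terminals out of long
    right-hand sides and break bodies of length >= 3 into binary chains.
    Assumes ε- and unit-rules have already been eliminated."""
    fresh = (f'X{n}' for n in itertools.count())
    new_rules = []
    for head, body in grammar:
        if len(body) >= 2:
            # replace each terminal a in a long body by a variable N_a
            lifted = []
            for sym in body:
                if sym.islower():              # convention: terminal
                    var = f'N_{sym}'
                    new_rules.append((var, (sym,)))
                    lifted.append(var)
                else:
                    lifted.append(sym)
            # break A -> X1 X2 ... Xk into a chain of binary rules
            while len(lifted) > 2:
                var = next(fresh)
                new_rules.append((head, (lifted[0], var)))
                head, lifted = var, lifted[1:]
            new_rules.append((head, tuple(lifted)))
        else:
            new_rules.append((head, body))
    return sorted(set(new_rules))
```

For example, $S \to aSb \mid ab$ becomes $S \to N_a X_0$, $X_0 \to S N_b$, $S \to N_a N_b$, $N_a \to a$, $N_b \to b$.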
CNF is particularly convenient for algorithm design since every parse tree for a string of length $n \geq 1$ has exactly $2n - 1$ internal nodes: $n - 1$ applications of binary rules and $n$ applications of terminal rules.
Pushdown Automata
A pushdown automaton (PDA) is an NFA augmented with a stack. Formally, a PDA is a 6-tuple $P = (Q, \Sigma, \Gamma, \delta, q_0, F)$ where $\Gamma$ is the stack alphabet and $\delta: Q \times \Sigma_\varepsilon \times \Gamma_\varepsilon \to \mathcal{P}(Q \times \Gamma_\varepsilon)$.
A transition $(q', b) \in \delta(q, a, c)$ means: in state $q$, reading input symbol $a$ (or $\varepsilon$), popping $c$ from the stack (or $\varepsilon$), move to state $q'$ and push $b$ (or $\varepsilon$).
Theorem. A language is context-free if and only if some PDA recognizes it.
The key insight is that a PDA can push the right-hand side of a production onto the stack when it expands a variable, and pop a terminal when it matches input. This implements a nondeterministic top-down parse. Conversely, given a PDA, one can construct a CFG by introducing a variable $[pq]$ for each pair of states, generating strings that take the PDA from state $p$ with an empty stack to state $q$ with an empty stack.
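This direction can be simulated directly: keep a stack of grammar symbols, expand a variable on top by pushing a production body, and pop a terminal when it matches the next input symbol. The sketch below (hypothetical names; Python) explores the PDA's nondeterminism by breadth-first search over (position, stack) configurations, with a crude stack bound to cut off runaway $\varepsilon$-expansions, which is adequate for this grammar but not in general:

```python
from collections import deque

# Grammar S -> (S)S | ε for balanced parentheses.
PRODUCTIONS = {'S': ['(S)S', '']}

def pda_accepts(w):
    """Nondeterministic top-down parse via BFS over configurations.
    A configuration is (input position, stack), stack top first."""
    start = (0, ('S',))
    seen, queue = {start}, deque([start])
    while queue:
        pos, stack = queue.popleft()
        if not stack:
            if pos == len(w):            # input exhausted, stack empty
                return True
            continue
        top, rest = stack[0], stack[1:]
        if top in PRODUCTIONS:           # expand: push a production body
            nexts = [(pos, tuple(body) + rest) for body in PRODUCTIONS[top]]
        elif pos < len(w) and w[pos] == top:
            nexts = [(pos + 1, rest)]    # match: pop terminal, advance
        else:
            continue                     # dead configuration
        for cfg in nexts:
            # stack bound: a simplification that prevents infinite
            # expansion chains (sufficient for this grammar)
            if cfg not in seen and len(cfg[1]) <= len(w) + 2:
                seen.add(cfg)
                queue.append(cfg)
    return False
```

Each BFS branch corresponds to one nondeterministic choice of the PDA; a real PDA keeps no explicit set of configurations, but the reachable-configuration search computes the same accept/reject answer.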
The Pumping Lemma for CFLs
Theorem. If $L$ is a context-free language, there exists a pumping length $p$ such that any $w \in L$ with $|w| \geq p$ can be written $w = uvxyz$ satisfying:
- $|vy| \geq 1$
- $|vxy| \leq p$
- $uv^i xy^i z \in L$ for all $i \geq 0$
Proof sketch. Fix a CNF grammar for $L$. A CNF parse tree of height $h$ yields at most $2^{h-1}$ symbols, so any parse tree for a sufficiently long string contains a root-to-leaf path with more than $|V|$ variables; by the pigeonhole principle some variable $A$ repeats along the lowest part of that path. The subtree rooted at the upper $A$ generates $vxy$, and the subtree rooted at the lower $A$ generates $x$. Replacing the upper subtree with the lower gives $uxz$; inserting extra copies of the upper subtree gives $uv^i xy^i z$.
Example: $\{a^n b^n c^n : n \geq 0\}$ is not a CFL. Take $w = a^p b^p c^p$. Write $w = uvxyz$ with $|vxy| \leq p$ and $|vy| \geq 1$. Since $|vxy| \leq p$, the combined string $vxy$ cannot span all three blocks, so $v$ and $y$ together touch at most two of the three symbols. Pumping up ($i = 2$) increases the count of at most two symbols, violating the equal-count requirement. Contradiction.
Parsing Algorithms
CYK Algorithm
The Cocke-Younger-Kasami (CYK) algorithm decides membership in a CFL in $O(n^3 |G|)$ time using dynamic programming on a CNF grammar.
Let $w = w_1 \cdots w_n$. Define $T[i][j]$ = set of variables that derive $w_i \cdots w_j$.
- Base case. $A \in T[i][i]$ iff $A \to w_i$ is a production.
- Recursive case. $A \in T[i][j]$ iff there exist $k$ with $i \leq k < j$ and a production $A \to BC$ with $B \in T[i][k]$ and $C \in T[k+1][j]$.
$w \in L(G)$ iff $S \in T[1][n]$.
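The table translates directly into code (a sketch; the CNF grammar used below for nonempty balanced parentheses, $S \to L S_1 \mid L R \mid S S$, $S_1 \to S R$, $L \to ($, $R \to )$, is one of several possible encodings):

```python
def cyk(w, binary, terminal, start='S'):
    """CYK membership test for a CNF grammar.  `binary` maps A to a list
    of (B, C) bodies; `terminal` maps A to the set of terminals a with
    A -> a.  T[i][j] holds the variables deriving w[i..j] (0-indexed)."""
    n = len(w)
    if n == 0:
        return False                     # S -> ε must be checked separately
    T = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(w):           # base case: spans of length 1
        for A, terms in terminal.items():
            if ch in terms:
                T[i][i].add(A)
    for length in range(2, n + 1):       # spans of increasing length
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):        # split point between B and C
                for A, bodies in binary.items():
                    for B, C in bodies:
                        if B in T[i][k] and C in T[k + 1][j]:
                            T[i][j].add(A)
    return start in T[0][n - 1]

# CNF grammar for nonempty balanced parentheses.
BINARY = {'S': [('L', 'S1'), ('L', 'R'), ('S', 'S')], 'S1': [('S', 'R')]}
TERMINAL = {'L': {'('}, 'R': {')'}}
```

The five nested loops make the $O(n^3 |G|)$ bound visible: $O(n^2)$ cells, $O(n)$ split points each, and a grammar-sized scan per split.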
Earley Parser
The Earley parser handles arbitrary CFGs (including ambiguous ones) in $O(n^3)$ time and $O(n^2)$ space in general, and $O(n^2)$ time for unambiguous grammars. It maintains sets of Earley items of the form $[A \to \alpha \bullet \beta, j]$, meaning the parser has matched $\alpha$ of production $A \to \alpha\beta$ and the match started at position $j$. Three operations drive the algorithm: predict (add items for productions of the next expected variable), scan (advance items when the next input symbol matches), and complete (advance items when a variable has been fully matched).
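The three operations translate almost line-for-line into a recognizer. The sketch below assumes the grammar has no $\varepsilon$-productions (handling those requires extra care in the completer) and uses an illustrative encoding where a symbol is a variable iff it is a key of the grammar map:

```python
EXPR_GRAMMAR = {
    'E': [('E', '+', 'T'), ('T',)],
    'T': [('T', '*', 'F'), ('F',)],
    'F': [('(', 'E', ')'), ('a',)],
}

def earley(grammar, start, tokens):
    """Earley recognizer; an item (head, body, dot, origin) means the
    parser has matched body[:dot] of head -> body starting at origin."""
    n = len(tokens)
    S = [set() for _ in range(n + 1)]
    S[0] = {(start, body, 0, 0) for body in grammar[start]}
    for i in range(n + 1):
        work = list(S[i])
        while work:
            head, body, dot, origin = work.pop()
            if dot < len(body):
                sym = body[dot]
                if sym in grammar:                    # predict
                    for prod in grammar[sym]:
                        item = (sym, prod, 0, i)
                        if item not in S[i]:
                            S[i].add(item)
                            work.append(item)
                elif i < n and tokens[i] == sym:      # scan
                    S[i + 1].add((head, body, dot + 1, origin))
            else:                                     # complete
                for h2, b2, d2, o2 in list(S[origin]):
                    if d2 < len(b2) and b2[d2] == head:
                        item = (h2, b2, d2 + 1, o2)
                        if item not in S[i]:
                            S[i].add(item)
                            work.append(item)
    return any(h == start and d == len(b) and o == 0
               for h, b, d, o in S[n])
```

The input is accepted iff some item $[S \to \alpha \bullet, 0]$ for the start variable survives in the final set.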
LR Parsers
LR(1) and LALR(1) parsers are the practical workhorses behind tools like yacc and bison. They operate bottom-up, reducing the input by applying productions in reverse; the $(1)$ denotes one token of lookahead. Every LL($k$) grammar is also LR($k$), but not conversely, and LR parsers run in time linear in the input length. The parser uses a stack of states and a parsing table constructed from an automaton over items (dotted rules). A state/lookahead pair tells the parser whether to shift (push the next input token) or reduce (apply a production in reverse).
Examples
Example 1: Arithmetic expressions. The grammar $E \to E + T \mid T$, $T \to T \times F \mid F$, $F \to (E) \mid a$ is unambiguous and captures standard operator precedence. The left-recursive structure gives left-associativity.
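Because this grammar is unambiguous, a deterministic parser can recover the unique tree. The sketch below is a recursive-descent parser with the left recursion replaced by iteration (the standard transformation $E \to T\,(+\,T)^\ast$), returning a fully parenthesized string; `*` stands in for $\times$, and the function names are illustrative:

```python
def parse_expr(tokens):
    """Parse the grammar E -> E+T | T,  T -> T*F | F,  F -> (E) | a,
    returning a fully parenthesized rendering of the unique parse."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expr():                       # E -> T ('+' T)*
        nonlocal pos
        left = term()
        while peek() == '+':
            pos += 1
            left = f'({left}+{term()})'
        return left

    def term():                       # T -> F ('*' F)*
        nonlocal pos
        left = factor()
        while peek() == '*':
            pos += 1
            left = f'({left}*{factor()})'
        return left

    def factor():                     # F -> '(' E ')' | 'a'
        nonlocal pos
        if peek() == '(':
            pos += 1
            inner = expr()
            assert peek() == ')', 'expected )'   # crude error handling
            pos += 1
            return f'({inner})'
        assert peek() == 'a', 'expected a'
        pos += 1
        return 'a'

    result = expr()
    assert pos == len(tokens), 'trailing input'
    return result
```

The iteration in `expr` and `term` folds operands leftward, so `a+a*a` parses as `(a+(a*a))`: precedence comes from the grammar's layering and left-associativity from the fold direction.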
Example 2: Balanced parentheses.
$S \to SS \mid (S) \mid \varepsilon$ generates all strings of balanced parentheses. A PDA recognizes this by pushing on ( and popping on ), accepting when input is exhausted and stack is empty.
Example 3: XML/HTML validation. The nested structure of XML tags is context-free. A PDA pushes opening tags and checks that closing tags match. Full XML validation requires additional constraints (DTDs, schemas) that go beyond CFLs.
Applications
- Compiler front ends. Programming-language syntax is specified by a CFG in the language reference. Production compilers parse it with hand-written recursive-descent parsers or with LALR(1)/PEG parsers generated from the grammar.
- Natural language processing. Phrase-structure grammars model the syntax of natural languages as CFGs. The Earley and CYK algorithms underlie many NLP parsers.
- Protocol specification. Binary protocols (ASN.1, Protocol Buffers) are specified by grammars and parsed by generated code.
- Type checking. Type expressions in polymorphic type systems have context-free structure; type inference algorithms exploit this.
Read Next: Turing Machines