Context-Free Grammars & Parsing
Prerequisite: Automata Theory & FSMs
Context-Free Grammars
A context-free grammar (CFG) is a formal system for describing recursive syntactic structure. It is the backbone of programming language design: virtually every language reference manual defines syntax using a grammar.
Formally, a CFG is a 4-tuple $G = (V, \Sigma, R, S)$ where:
- $V$ is a finite set of variables (nonterminals)
- $\Sigma$ is a finite set of terminals (disjoint from $V$)
- $R \subseteq V \times (V \cup \Sigma)^\ast$ is a finite set of production rules, written $A \to \alpha$
- $S \in V$ is the start variable
A string $\alpha A \beta$ derives $\alpha \gamma \beta$ in one step (written $\alpha A \beta \Rightarrow \alpha \gamma \beta$) if $A \to \gamma$ is a production. The language of $G$ is:
$$L(G) = \{\, w \in \Sigma^\ast : S \xRightarrow{*} w \,\}$$
where $\xRightarrow{*}$ denotes zero or more derivation steps.
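The definition can be made concrete with a short sketch (in Python; the encoding and the names `GRAMMAR` and `language_up_to` are illustrative, not standard). It enumerates $L(G)$ for the toy grammar $S \to aSb \mid \varepsilon$ by breadth-first search over sentential forms, expanding the leftmost variable at each step:

```python
from collections import deque

# Toy grammar S -> aSb | ε.  Encoding convention (an assumption of this
# sketch): nonterminals are uppercase letters, terminals lowercase.
GRAMMAR = {'S': ['aSb', '']}

def language_up_to(max_len):
    """Enumerate the words of L(G) with at most max_len terminals by
    breadth-first search over sentential forms."""
    seen, words = {'S'}, set()
    queue = deque(['S'])
    while queue:
        form = queue.popleft()
        # locate the leftmost variable, if any
        i = next((k for k, c in enumerate(form) if c.isupper()), None)
        if i is None:                    # no variables left: a word of L(G)
            words.add(form)
            continue
        for rhs in GRAMMAR[form[i]]:
            new = form[:i] + rhs + form[i + 1:]
            # prune sentential forms already carrying too many terminals
            if sum(c.islower() for c in new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(words, key=len)
```

For this grammar the enumeration yields $\varepsilon, ab, aabb, \ldots$, i.e. exactly $\{a^n b^n : n \geq 0\}$ up to the length bound.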
Parse Trees and Derivations
A parse tree for a string $w \in L(G)$ is a rooted ordered tree where internal nodes are labeled by variables, leaves are labeled by terminals or $\varepsilon$, the root is labeled $S$, and for every internal node labeled $A$ whose children are labeled $X_1, \ldots, X_k$ (left to right), $A \to X_1 \cdots X_k$ is a production.
The yield of a parse tree is the concatenation of its leaves. A string $w$ may have multiple parse trees; such a grammar is called ambiguous. For example, the grammar $E \to E + E \mid E \times E \mid a$ is ambiguous: the string $a + a \times a$ has two parse trees, corresponding to $(a + a) \times a$ and $a + (a \times a)$.
Two standard derivation orders fix the order in which variables are expanded, though they do not remove ambiguity:
- Leftmost derivation: always expand the leftmost variable
- Rightmost derivation: always expand the rightmost variable
A grammar is unambiguous if every string in $L(G)$ has exactly one leftmost (equivalently, rightmost) derivation. Ambiguity is undecidable in general: no algorithm can determine whether an arbitrary CFG is ambiguous.
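A quick way to witness the ambiguity above is to count parse trees with a small dynamic program (a sketch; the token tuple and the name `tree_count` are illustrative). Under $E \to E + E \mid E \times E \mid a$, every parse of a span is determined by the position of its top-level operator:

```python
from functools import lru_cache

# Tokens of "a + a × a"; '*' stands in for × in the code.
TOKENS = ('a', '+', 'a', '*', 'a')

@lru_cache(maxsize=None)
def tree_count(i, j):
    """Number of distinct parse trees deriving TOKENS[i..j] from E."""
    if i == j:
        return 1 if TOKENS[i] == 'a' else 0
    total = 0
    for k in range(i + 1, j, 2):   # candidate top-level operator positions
        if TOKENS[k] in '+*':
            total += tree_count(i, k - 1) * tree_count(k + 1, j)
    return total
```

Here `tree_count(0, 4)` returns 2, one tree per bracketing; longer strings grow quickly (the counts are the Catalan numbers).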
Chomsky Normal Form
Any CFG can be transformed into Chomsky Normal Form (CNF), where every production is of the form:
$$A \to BC \quad \text{or} \quad A \to a$$
with $A, B, C \in V$, $B \neq S$, $C \neq S$, and $a \in \Sigma$; the single exception $S \to \varepsilon$ is permitted so that languages containing $\varepsilon$ can still be expressed.
The conversion algorithm proceeds in five steps:
- Add a fresh start variable. Introduce $S_0$ with the rule $S_0 \to S$, so that the start variable never appears on a right-hand side (this is what makes the $B \neq S$, $C \neq S$ condition achievable).
- Eliminate $\varepsilon$-productions. Find all nullable variables (those deriving $\varepsilon$), then for each production whose right-hand side contains nullable variables, add a copy for every way of deleting a subset of those nullable occurrences. Delete all $A \to \varepsilon$ rules (except possibly one for the start variable).
- Eliminate unit productions. For each unit rule $A \to B$, add $A \to \gamma$ for every $B \to \gamma$ that is not a unit rule. Remove all unit rules.
- Eliminate long right-hand sides. Replace $A \to X_1 X_2 \cdots X_k$ ($k \geq 3$) with $A \to X_1 A_1$, $A_1 \to X_2 A_2$, …, $A_{k-2} \to X_{k-1} X_k$ using fresh variables.
- Eliminate terminals mixed with variables. Replace any terminal $a$ appearing in a rule with more than one symbol by a new variable $N_a$ and add the rule $N_a \to a$.
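The last two steps are mechanical enough to sketch in code (an illustrative encoding: rules are `(head, body)` pairs, terminals are lowercase, and the fresh names `N_a`, `X0`, ... are assumed not to clash with existing variables):

```python
import itertools

def to_binary_rules(grammar):
    """Steps of the CNF conversion that lift terminals out of long
    right-hand sides and break bodies of length >= 3 into binary chains.
    Assumes ε- and unit-rules have already been eliminated."""
    fresh = (f'X{n}' for n in itertools.count())
    new_rules = []
    for head, body in grammar:
        if len(body) >= 2:
            # replace each terminal a in a long body by a variable N_a
            lifted = []
            for sym in body:
                if sym.islower():              # convention: terminal
                    var = f'N_{sym}'
                    new_rules.append((var, (sym,)))
                    lifted.append(var)
                else:
                    lifted.append(sym)
            # break A -> X1 X2 ... Xk into a chain of binary rules
            while len(lifted) > 2:
                var = next(fresh)
                new_rules.append((head, (lifted[0], var)))
                head, lifted = var, lifted[1:]
            new_rules.append((head, tuple(lifted)))
        else:
            new_rules.append((head, body))
    return sorted(set(new_rules))
```

For example, $S \to aSb \mid ab$ becomes $S \to N_a X_0$, $X_0 \to S N_b$, $S \to N_a N_b$, $N_a \to a$, $N_b \to b$.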
CNF is particularly convenient for algorithm design since every parse tree for a string of length $n \geq 1$ has exactly $2n - 1$ internal nodes: $n - 1$ applications of binary rules and $n$ applications of terminal rules.
Pushdown Automata
A pushdown automaton (PDA) is an NFA augmented with a stack. Formally, a PDA is a 6-tuple $P = (Q, \Sigma, \Gamma, \delta, q_0, F)$ where $\Gamma$ is the stack alphabet and $\delta: Q \times \Sigma_\varepsilon \times \Gamma_\varepsilon \to \mathcal{P}(Q \times \Gamma_\varepsilon)$.
A transition $(q', b) \in \delta(q, a, c)$ means: in state $q$, reading input symbol $a$ (or $\varepsilon$), popping $c$ from the stack (or $\varepsilon$), move to state $q'$ and push $b$ (or $\varepsilon$).
Theorem. A language is context-free if and only if some PDA recognizes it.
The key insight is that a PDA can push the right-hand side of a production onto the stack when it expands a variable, and pop a terminal when it matches input. This implements a nondeterministic top-down parse. Conversely, given a PDA, one can construct a CFG by introducing a variable $[pq]$ for each pair of states, generating strings that take the PDA from state $p$ with an empty stack to state $q$ with an empty stack.
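This direction can be simulated directly: keep a stack of grammar symbols, expand a variable on top by pushing a production body, and pop a terminal when it matches the next input symbol. The sketch below (hypothetical names; Python) explores the PDA's nondeterminism by breadth-first search over (position, stack) configurations, with a crude stack bound to cut off runaway $\varepsilon$-expansions, which is adequate for this grammar but not in general:

```python
from collections import deque

# Grammar S -> (S)S | ε for balanced parentheses.
PRODUCTIONS = {'S': ['(S)S', '']}

def pda_accepts(w):
    """Nondeterministic top-down parse via BFS over configurations.
    A configuration is (input position, stack), stack top first."""
    start = (0, ('S',))
    seen, queue = {start}, deque([start])
    while queue:
        pos, stack = queue.popleft()
        if not stack:
            if pos == len(w):            # input exhausted, stack empty
                return True
            continue
        top, rest = stack[0], stack[1:]
        if top in PRODUCTIONS:           # expand: push a production body
            nexts = [(pos, tuple(body) + rest) for body in PRODUCTIONS[top]]
        elif pos < len(w) and w[pos] == top:
            nexts = [(pos + 1, rest)]    # match: pop terminal, advance
        else:
            continue                     # dead configuration
        for cfg in nexts:
            # stack bound: a simplification that prevents infinite
            # expansion chains (sufficient for this grammar)
            if cfg not in seen and len(cfg[1]) <= len(w) + 2:
                seen.add(cfg)
                queue.append(cfg)
    return False
```

Each BFS branch corresponds to one nondeterministic choice of the PDA; a real PDA keeps no explicit set of configurations, but the reachable-configuration search computes the same accept/reject answer.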
The Pumping Lemma for CFLs
Theorem. If $L$ is a context-free language, there exists a pumping length $p$ such that any $w \in L$ with $|w| \geq p$ can be written $w = uvxyz$ satisfying:
- $|vy| \geq 1$
- $|vxy| \leq p$
- $uv^i xy^i z \in L$ for all $i \geq 0$
Proof sketch. Fix a CNF grammar for $L$. A CNF parse tree of height $h$ yields at most $2^{h-1}$ symbols, so any parse tree for a sufficiently long string contains a root-to-leaf path with more than $|V|$ variables; by the pigeonhole principle some variable $A$ repeats along the lowest part of that path. The subtree rooted at the upper $A$ generates $vxy$, and the subtree rooted at the lower $A$ generates $x$. Replacing the upper subtree with the lower gives $uxz$; inserting extra copies of the upper subtree gives $uv^i xy^i z$.
Example: $\{a^n b^n c^n : n \geq 0\}$ is not a CFL. Take $w = a^p b^p c^p$. Write $w = uvxyz$ with $|vxy| \leq p$ and $|vy| \geq 1$. Since $|vxy| \leq p$, the combined string $vxy$ cannot span all three blocks, so $v$ and $y$ together touch at most two of the three symbols. Pumping up ($i = 2$) increases the count of at most two symbols, violating the equal-count requirement. Contradiction.
Parsing Algorithms
CYK Algorithm
The Cocke-Younger-Kasami (CYK) algorithm decides membership in a CFL in $O(n^3 |G|)$ time using dynamic programming on a CNF grammar.
Let $w = w_1 \cdots w_n$. Define $T[i][j]$ = set of variables that derive $w_i \cdots w_j$.
- Base case. $A \in T[i][i]$ iff $A \to w_i$ is a production.
- Recursive case. $A \in T[i][j]$ iff there exist $k$ with $i \leq k < j$ and a production $A \to BC$ with $B \in T[i][k]$ and $C \in T[k+1][j]$.
$w \in L(G)$ iff $S \in T[1][n]$.
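The table translates directly into code (a sketch; the CNF grammar used below for nonempty balanced parentheses, $S \to L S_1 \mid L R \mid S S$, $S_1 \to S R$, $L \to ($, $R \to )$, is one of several possible encodings):

```python
def cyk(w, binary, terminal, start='S'):
    """CYK membership test for a CNF grammar.  `binary` maps A to a list
    of (B, C) bodies; `terminal` maps A to the set of terminals a with
    A -> a.  T[i][j] holds the variables deriving w[i..j] (0-indexed)."""
    n = len(w)
    if n == 0:
        return False                     # S -> ε must be checked separately
    T = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(w):           # base case: spans of length 1
        for A, terms in terminal.items():
            if ch in terms:
                T[i][i].add(A)
    for length in range(2, n + 1):       # spans of increasing length
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):        # split point between B and C
                for A, bodies in binary.items():
                    for B, C in bodies:
                        if B in T[i][k] and C in T[k + 1][j]:
                            T[i][j].add(A)
    return start in T[0][n - 1]

# CNF grammar for nonempty balanced parentheses.
BINARY = {'S': [('L', 'S1'), ('L', 'R'), ('S', 'S')], 'S1': [('S', 'R')]}
TERMINAL = {'L': {'('}, 'R': {')'}}
```

The five nested loops make the $O(n^3 |G|)$ bound visible: $O(n^2)$ cells, $O(n)$ split points each, and a grammar-sized scan per split.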
Earley Parser
The Earley parser handles arbitrary CFGs (including ambiguous ones) in $O(n^3)$ time and $O(n^2)$ space in general, and $O(n^2)$ time for unambiguous grammars. It maintains sets of Earley items of the form $[A \to \alpha \bullet \beta, j]$, meaning the parser has matched $\alpha$ of production $A \to \alpha\beta$ and the match started at position $j$. Three operations drive the algorithm: predict (add items for productions of the next expected variable), scan (advance items when the next input symbol matches), and complete (advance items when a variable has been fully matched).
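The three operations translate almost line-for-line into a recognizer. The sketch below assumes the grammar has no $\varepsilon$-productions (handling those requires extra care in the completer) and uses an illustrative encoding where a symbol is a variable iff it is a key of the grammar map:

```python
EXPR_GRAMMAR = {
    'E': [('E', '+', 'T'), ('T',)],
    'T': [('T', '*', 'F'), ('F',)],
    'F': [('(', 'E', ')'), ('a',)],
}

def earley(grammar, start, tokens):
    """Earley recognizer; an item (head, body, dot, origin) means the
    parser has matched body[:dot] of head -> body starting at origin."""
    n = len(tokens)
    S = [set() for _ in range(n + 1)]
    S[0] = {(start, body, 0, 0) for body in grammar[start]}
    for i in range(n + 1):
        work = list(S[i])
        while work:
            head, body, dot, origin = work.pop()
            if dot < len(body):
                sym = body[dot]
                if sym in grammar:                    # predict
                    for prod in grammar[sym]:
                        item = (sym, prod, 0, i)
                        if item not in S[i]:
                            S[i].add(item)
                            work.append(item)
                elif i < n and tokens[i] == sym:      # scan
                    S[i + 1].add((head, body, dot + 1, origin))
            else:                                     # complete
                for h2, b2, d2, o2 in list(S[origin]):
                    if d2 < len(b2) and b2[d2] == head:
                        item = (h2, b2, d2 + 1, o2)
                        if item not in S[i]:
                            S[i].add(item)
                            work.append(item)
    return any(h == start and d == len(b) and o == 0
               for h, b, d, o in S[n])
```

The input is accepted iff some item $[S \to \alpha \bullet, 0]$ for the start variable survives in the final set.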
LR Parsers
LR(1) and LALR(1) parsers are the practical workhorses behind tools like yacc and bison. They operate bottom-up, reducing the input by applying productions in reverse; the $(1)$ denotes one token of lookahead. Every LL($k$) grammar is also LR($k$), but not conversely, and LR parsers run in time linear in the input length. The parser uses a stack of states and a parsing table constructed from an automaton over items (dotted rules). A state/lookahead pair tells the parser whether to shift (push the next input token) or reduce (apply a production in reverse).
Examples
Example 1: Arithmetic expressions. The grammar $E \to E + T \mid T$, $T \to T \times F \mid F$, $F \to (E) \mid a$ is unambiguous and captures standard operator precedence. The left-recursive structure gives left-associativity.
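Because this grammar is unambiguous, a deterministic parser can recover the unique tree. The sketch below is a recursive-descent parser with the left recursion replaced by iteration (the standard transformation $E \to T\,(+\,T)^\ast$), returning a fully parenthesized string; `*` stands in for $\times$, and the function names are illustrative:

```python
def parse_expr(tokens):
    """Parse the grammar E -> E+T | T,  T -> T*F | F,  F -> (E) | a,
    returning a fully parenthesized rendering of the unique parse."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expr():                       # E -> T ('+' T)*
        nonlocal pos
        left = term()
        while peek() == '+':
            pos += 1
            left = f'({left}+{term()})'
        return left

    def term():                       # T -> F ('*' F)*
        nonlocal pos
        left = factor()
        while peek() == '*':
            pos += 1
            left = f'({left}*{factor()})'
        return left

    def factor():                     # F -> '(' E ')' | 'a'
        nonlocal pos
        if peek() == '(':
            pos += 1
            inner = expr()
            assert peek() == ')', 'expected )'   # crude error handling
            pos += 1
            return f'({inner})'
        assert peek() == 'a', 'expected a'
        pos += 1
        return 'a'

    result = expr()
    assert pos == len(tokens), 'trailing input'
    return result
```

The iteration in `expr` and `term` folds operands leftward, so `a+a*a` parses as `(a+(a*a))`: precedence comes from the grammar's layering and left-associativity from the fold direction.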
Example 2: Balanced parentheses.
$S \to SS \mid (S) \mid \varepsilon$ generates all strings of balanced parentheses. A PDA recognizes this by pushing on ( and popping on ), accepting when input is exhausted and stack is empty.
Example 3: XML/HTML validation. The nested structure of XML tags is context-free. A PDA pushes opening tags and checks that closing tags match. Full XML validation requires additional constraints (DTDs, schemas) that go beyond CFLs.
Applications
- Compiler front ends. Programming-language syntax is specified by a CFG in the language reference. Production compilers parse it with hand-written recursive-descent parsers or with LALR(1)/PEG parsers generated from the grammar.
- Natural language processing. Phrase-structure grammars model the syntax of natural languages as CFGs. The Earley and CYK algorithms underlie many NLP parsers.
- Protocol specification. Binary protocols (ASN.1, Protocol Buffers) are specified by grammars and parsed by generated code.
- Type checking. Type expressions in polymorphic type systems have context-free structure; type inference algorithms exploit this.
Read Next: Turing Machines