String Algorithms & Pattern Matching
Prerequisite: Searching Algorithms
String pattern matching is the problem of finding all occurrences of a pattern $P$ of length $m$ inside a text $T$ of length $n$. The naive approach works but is slow. A family of algorithms - KMP, Rabin-Karp, the Z-function, suffix arrays, suffix trees, and Aho-Corasick - each exploits different structural properties to do better. Understanding them also illuminates broader ideas: automata, hashing, and index data structures.
The Naive Algorithm
Slide the pattern over every position in the text and compare character by character:
NaiveSearch(T, P):
    for i = 0 to n - m:
        match = true
        for j = 0 to m - 1:
            if T[i + j] != P[j]:
                match = false
                break
        if match: report occurrence at i
Complexity. $O(nm)$ in the worst case (e.g. $T = \texttt{aaa…a}$, $P = \texttt{aaa…ab}$). On random text it is much closer to $O(n)$ in practice, but worst-case matters for adversarial inputs.
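The pseudocode above translates directly into Python; this is a minimal sketch (the function name is illustrative), returning all match positions rather than reporting them one by one:

```python
def naive_search(T, P):
    """Return every index i where pattern P occurs in text T. O(nm) worst case."""
    n, m = len(T), len(P)
    hits = []
    for i in range(n - m + 1):
        j = 0
        while j < m and T[i + j] == P[j]:  # compare character by character
            j += 1
        if j == m:                         # all m characters matched
            hits.append(i)
    return hits
```

Note that matches may overlap: `naive_search("aaaa", "aa")` reports positions 0, 1, and 2.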
Knuth-Morris-Pratt (KMP)
The key observation: when a mismatch occurs after matching $j$ characters of $P$, you already know what those $j$ characters were. The failure function $\pi[i]$ captures reusable information.
Definition. $\pi[i]$ is the length of the longest proper prefix of $P[0..i]$ that is also a suffix of $P[0..i]$.
Building $\pi$ in $O(m)$:
BuildFailure(P):
    pi[0] = 0
    k = 0
    for i = 1 to m - 1:
        while k > 0 and P[k] != P[i]:
            k = pi[k - 1]
        if P[k] == P[i]: k++
        pi[i] = k
    return pi
Searching in $O(n)$:
KMPSearch(T, P):
    pi = BuildFailure(P)
    q = 0 // characters matched so far
    for i = 0 to n - 1:
        while q > 0 and P[q] != T[i]:
            q = pi[q - 1]
        if P[q] == T[i]: q++
        if q == m:
            report occurrence at i - m + 1
            q = pi[q - 1]
Correctness. At every step, $q$ is the length of the longest prefix of $P$ that is also a suffix of $T[0..i]$. The failure function lets us restart matching from the longest prefix that could still extend, rather than from zero.
Complexity. $O(n + m)$ total. The amortised argument: $q$ can increment at most $n$ times (one per character of $T$), and each while iteration strictly decreases $q$, so the total number of while steps across the whole search is also bounded by $n$.
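Both routines carry over to Python almost verbatim; a sketch (function names illustrative):

```python
def build_failure(P):
    """pi[i] = length of the longest proper prefix of P[0..i] that is also a suffix."""
    m = len(P)
    pi = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and P[k] != P[i]:
            k = pi[k - 1]       # fall back to the next-longest border
        if P[k] == P[i]:
            k += 1
        pi[i] = k
    return pi

def kmp_search(T, P):
    """Return all occurrence positions of P in T in O(n + m) time."""
    pi = build_failure(P)
    m = len(P)
    hits, q = [], 0             # q = characters of P matched so far
    for i, c in enumerate(T):
        while q > 0 and P[q] != c:
            q = pi[q - 1]       # restart from the longest viable prefix
        if P[q] == c:
            q += 1
        if q == m:
            hits.append(i - m + 1)
            q = pi[q - 1]       # continue, allowing overlapping matches
    return hits
```

Because `q` resets to `pi[q - 1]` after a hit, overlapping occurrences such as those of `"aba"` in `"abababa"` are all reported.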
Rabin-Karp: Rolling Hash
Instead of comparing characters, compare hash values. Use a polynomial rolling hash:
$$h(s) = \left(\sum_{k=0}^{m-1} s[k] \cdot b^{m-1-k}\right) \bmod p$$
When you slide the window by one position, update in $O(1)$:
$$h_{\text{new}} = \left((h_{\text{old}} - s[0] \cdot b^{m-1}) \cdot b + s[m]\right) \bmod p$$
Only do a full character comparison when hashes match, to confirm and rule out spurious hits (hash collisions).
Complexity. $O(n + m)$ expected with a good hash. Worst case $O(nm)$ if every window collides (pathological for a fixed hash). Use double hashing (two independent hash functions) to reduce collision probability to $O(1/p^2)$.
The strength of Rabin-Karp is its simplicity and its extensibility to 2D pattern matching and multi-pattern search (hash the pattern set).
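A sketch in Python, following the hash and sliding-window formulas above. The base and modulus here are illustrative choices, and the verification step guards against spurious hits:

```python
def rabin_karp(T, P, b=256, p=1_000_000_007):
    """Find all occurrences of P in T with a polynomial rolling hash."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return []
    bm = pow(b, m - 1, p)                 # b^(m-1) mod p, to drop the leftmost char
    hp = ht = 0
    for k in range(m):                    # hash the pattern and the first window
        hp = (hp * b + ord(P[k])) % p
        ht = (ht * b + ord(T[k])) % p
    hits = []
    for i in range(n - m + 1):
        if hp == ht and T[i:i + m] == P:  # confirm: hashes can collide
            hits.append(i)
        if i + m < n:                     # slide the window one position right
            ht = ((ht - ord(T[i]) * bm) * b + ord(T[i + m])) % p
    return hits
```

Python's `%` always yields a non-negative result, so the subtraction in the rolling update needs no extra correction.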
Z-Algorithm
Define $Z[i]$ as the length of the longest substring of $S$ starting at $i$ that matches a prefix of $S$. Conventions for $Z[0]$ vary (some leave it undefined); here we set $Z[0] = n$, since the whole string trivially matches itself.
ZFunction(S):
    Z = array of length n, Z[0] = n
    l, r = 0, 0
    for i = 1 to n - 1:
        if i < r:
            Z[i] = min(r - i, Z[i - l])
        while i + Z[i] < n and S[Z[i]] == S[i + Z[i]]:
            Z[i]++
        if i + Z[i] > r:
            l, r = i, i + Z[i]
    return Z
To search for $P$ in $T$, form the concatenation $S = P + \$ + T$ (where $\$$ is a sentinel character that occurs nowhere in the alphabet). Any $i > m$ with $Z[i] = m$ indicates an occurrence of $P$ in $T$, starting at text position $i - m - 1$.
Complexity. $O(n + m)$. The invariant is that the window $[l, r)$ is the rightmost $Z$-box seen, so each character is extended at most once.
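A Python sketch of the Z-function and the sentinel-based search (names illustrative; the sentinel defaults to a NUL byte, assumed absent from both strings):

```python
def z_function(S):
    """Z[i] = length of the longest substring starting at i matching a prefix of S."""
    n = len(S)
    Z = [0] * n
    Z[0] = n
    l = r = 0
    for i in range(1, n):
        if i < r:
            Z[i] = min(r - i, Z[i - l])   # reuse the current Z-box [l, r)
        while i + Z[i] < n and S[Z[i]] == S[i + Z[i]]:
            Z[i] += 1                     # extend by direct comparison
        if i + Z[i] > r:
            l, r = i, i + Z[i]            # record the new rightmost Z-box
    return Z

def z_search(T, P, sentinel="\x00"):
    """Find all occurrences of P in T via the Z-function of P + sentinel + T."""
    Z = z_function(P + sentinel + T)
    m = len(P)
    return [i - m - 1 for i in range(m + 1, len(Z)) if Z[i] == m]
```

The sentinel guarantees no $Z$ value in the text part can exceed $m$, so the test $Z[i] = m$ is exact.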
Suffix Arrays and LCP
A suffix array $SA$ is an array of starting indices of all suffixes of $S$, sorted lexicographically. Combined with the LCP array (longest common prefix between adjacent suffixes in sorted order), it supports pattern search and many string queries in near-linear time.
Construction. The DC3/Skew algorithm builds $SA$ in $O(n)$; the simpler doubling (prefix-doubling) approach runs in $O(n \log n)$:
- Start with ranks based on single characters.
- At each round, sort by pairs of ranks (current character rank, rank $2^k$ positions ahead).
- Repeat for $k = 0, 1, 2, \ldots$ until all ranks are distinct.
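The doubling rounds above can be sketched compactly in Python. This variant re-sorts with a comparison sort each round, giving $O(n \log^2 n)$; replacing the sort with a radix sort per round recovers the $O(n \log n)$ bound. The function name is illustrative:

```python
def suffix_array(S):
    """Prefix-doubling suffix array construction, O(n log^2 n) in this form."""
    n = len(S)
    rank = list(map(ord, S))              # round 0: rank by single characters
    sa = sorted(range(n), key=lambda i: rank[i])
    k = 1
    tmp = [0] * n
    while k < n:
        # Sort by (rank of suffix, rank of suffix k positions ahead).
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        # Re-rank: equal pairs share a rank, strictly greater pairs increment it.
        tmp[sa[0]] = 0
        for j in range(1, n):
            tmp[sa[j]] = tmp[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = tmp[:]
        if rank[sa[-1]] == n - 1:         # all ranks distinct: order is final
            break
        k *= 2
    return sa
```

For example, `suffix_array("banana")` yields `[5, 3, 1, 0, 4, 2]`, the starting positions of `"a"`, `"ana"`, `"anana"`, `"banana"`, `"na"`, `"nana"` in lexicographic order.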
Pattern search. Binary search on $SA$: find the contiguous range $[lo, hi)$ of ranks whose suffixes have $P$ as a prefix; every $SA[i]$ in that range is an occurrence. Each comparison is $O(m)$, so search is $O(m \log n)$. With the LCP array and an RMQ structure you can reduce this to $O(m + \log n)$.
LCP array. Kasai’s algorithm computes $LCP$ in $O(n)$ from $SA$ using the phi array: if suffix at position $i$ is rank $r$ in $SA$, then $\phi[i]$ is the text position of the suffix at rank $r - 1$.
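The binary search over the suffix array can be sketched as two explicit bounds. For illustration the demo builds the suffix array naively by sorting suffixes (fine for small inputs; any construction works):

```python
def sa_search(S, sa, P):
    """All positions of P in S, via binary search on suffix array sa."""
    n, m = len(S), len(P)
    lo, hi = 0, n
    while lo < hi:                        # lower bound: first suffix >= P
        mid = (lo + hi) // 2
        if S[sa[mid]:sa[mid] + m] < P:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, n
    while lo < hi:                        # upper bound: first suffix whose m-prefix > P
        mid = (lo + hi) // 2
        if S[sa[mid]:sa[mid] + m] <= P:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

# Demo: suffix array built naively for illustration (O(n^2 log n)).
S = "banana"
sa = sorted(range(len(S)), key=lambda i: S[i:])
# sa == [5, 3, 1, 0, 4, 2]
```

Each of the $O(\log n)$ probes slices out an $m$-character prefix to compare, matching the $O(m \log n)$ bound.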
Aho-Corasick for Multi-Pattern Search
Given $k$ patterns $P_1, \ldots, P_k$ with total length $M$, find all occurrences in $T$ simultaneously.
- Build a trie of all patterns in $O(M)$.
- Add failure links (analogous to KMP failure function) via BFS in $O(M \cdot |\Sigma|)$.
- Add output links so each node points to the deepest proper suffix that is also a complete pattern.
- Search by feeding $T$ into the automaton in $O(n + \text{matches})$.
Total construction is $O(M \cdot |\Sigma|)$; search is $O(n + \text{matches})$.
The failure link construction mirrors KMP: $\text{fail}(u)$ is the longest proper suffix of the string labelling node $u$ that also appears in the trie.
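The four steps above fit in a short Python sketch using dictionary-based goto transitions (names illustrative; output links are folded into per-node match sets, inherited along failure links during the BFS):

```python
from collections import deque

def build_automaton(patterns):
    """Trie + failure links + output sets for Aho-Corasick, built via BFS."""
    goto, fail, out = [{}], [0], [set()]          # node 0 is the trie root
    for p in patterns:                            # insert every pattern
        u = 0
        for c in p:
            if c not in goto[u]:
                goto.append({}); fail.append(0); out.append(set())
                goto[u][c] = len(goto) - 1
            u = goto[u][c]
        out[u].add(p)
    q = deque(goto[0].values())                   # root's children fail to root
    while q:
        u = q.popleft()
        for c, v in goto[u].items():
            q.append(v)
            f = fail[u]
            while f and c not in goto[f]:         # mirror the KMP fallback
                f = fail[f]
            fail[v] = goto[f].get(c, 0)
            out[v] |= out[fail[v]]                # inherit matches via failure link
    return goto, fail, out

def aho_search(T, patterns):
    """All (position, pattern) matches of the pattern set in T."""
    goto, fail, out = build_automaton(patterns)
    hits, u = [], 0
    for i, c in enumerate(T):
        while u and c not in goto[u]:
            u = fail[u]
        u = goto[u].get(c, 0)
        for p in out[u]:
            hits.append((i - len(p) + 1, p))
    return sorted(hits)
```

On the classic example, `aho_search("ushers", ["he", "she", "his", "hers"])` reports `"she"` at 1 and both `"he"` and `"hers"` at 2, all in one pass over the text.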
Complexity Summary
| Algorithm | Preprocessing | Search | Space |
|---|---|---|---|
| Naive | - | $O(nm)$ | $O(1)$ |
| KMP | $O(m)$ | $O(n)$ | $O(m)$ |
| Rabin-Karp | $O(m)$ | $O(n)$ exp. | $O(1)$ |
| Z-algorithm | $O(n+m)$ | included | $O(n)$ |
| Suffix array | $O(n \log n)$ | $O(m \log n)$ | $O(n)$ |
| Aho-Corasick | $O(M \cdot \lvert\Sigma\rvert)$ | $O(n + \text{matches})$ | $O(M \cdot \lvert\Sigma\rvert)$ |
Examples
DNA sequence alignment. Short-read aligners like BWA-MEM build a suffix array (or FM-index, which is a compressed suffix array) of the human reference genome once. Each sequencing read is then located in $O(m \log n)$ time - far faster than re-running naive search for each of billions of reads.
grep implementation. GNU grep uses a combination of Boyer-Moore and Aho-Corasick. For a single fixed pattern it uses Boyer-Moore for fast skipping; for multiple patterns (-e flag) it falls back to an Aho-Corasick automaton. The -P flag uses PCRE, which is a generalisation beyond exact string matching.
Read Next: Automata Theory