String Algorithms & Pattern Matching - Finding Needles in Haystacks, Fast
Helpful context:
Ctrl+F. Every time you search a document, your computer scans millions of characters looking for a pattern. The naive algorithm is $O(nm)$ - for a 1 MB document and a 10-character query, that’s 10 million comparisons. The KMP algorithm does it in $O(n + m)$. The whole improvement comes down to one insight: when a match fails, don’t restart from scratch.
The Naive Algorithm and Why It’s Bad
Slide the pattern $P$ of length $m$ over every position in text $T$ of length $n$. At each position, compare character by character:
NaiveSearch(T, P):
for i = 0 to n - m:
if T[i..i+m-1] == P:
report match at i
In the worst case this is $O(nm)$. Consider searching for pattern aaab in text aaaaaaaaaa. At every position you match 3 characters before failing on the 4th, then shift by just 1. You do nearly $nm$ comparisons and find nothing. For adversarial inputs - like scanning source code for a pattern with repeated characters - this is genuinely slow.
KMP: Don’t Throw Away What You Already Know
When you match $j$ characters of $P$ against $T$ and then fail, you know exactly what those $j$ characters were - they equal $P[0..j-1]$. That knowledge tells you where you can safely restart without rechecking characters.
The failure function $\pi[i]$ captures this. It’s the length of the longest proper prefix of $P[0..i]$ that is also a suffix of $P[0..i]$.
Example: for pattern abcab, the failure function is:
| Index | Char | $\pi$ |
|---|---|---|
| 0 | a | 0 |
| 1 | b | 0 |
| 2 | c | 0 |
| 3 | a | 1 |
| 4 | b | 2 |
So $\pi[4] = 2$ means the last 2 characters of abcab (which are ab) also appear as the first 2 characters. If you matched 5 characters and then failed, you know the next potential match can start as if you’d already matched 2 characters.
Building the failure function in $O(m)$:
BuildFailure(P):
pi[0] = 0
k = 0
for i = 1 to m - 1:
while k > 0 and P[k] != P[i]:
k = pi[k - 1]
if P[k] == P[i]: k++
pi[i] = k
return pi
Searching in $O(n)$:
KMPSearch(T, P):
pi = BuildFailure(P)
q = 0
for i = 0 to n - 1:
while q > 0 and P[q] != T[i]:
q = pi[q - 1]
if P[q] == T[i]: q++
if q == m:
report match at i - m + 1
q = pi[q - 1]
Why is this $O(n)$? The variable q can only increase by 1 per character of $T$, so it increases at most $n$ times total. Each time through the while loop, q strictly decreases. So the total number of while iterations across the entire search is at most $n$. Combined with the outer loop: $O(n)$ total.
Rabin-Karp: Rolling Hash
A completely different approach: instead of comparing characters, compare hash values.
Hash the pattern once: $h(P) = \sum_{k=0}^{m-1} P[k] \cdot b^{m-1-k} \bmod p$ for some base $b$ and prime $p$.
Now slide a window of length $m$ over $T$. If the hash of the current window matches $h(P)$, verify with a full comparison. The key: updating the window hash takes $O(1)$ via the rolling hash trick:
$$h_\text{new} = (h_\text{old} - T[i] \cdot b^{m-1}) \cdot b + T[i+m] \pmod{p}$$
You subtract out the leftmost character and add in the new rightmost character. Each slide costs $O(1)$ instead of $O(m)$.
Expected time: $O(n + m)$. Worst case: $O(nm)$ if every window produces a hash collision. In practice this is rare, and using two independent hash functions (double hashing) reduces the collision probability to near zero.
Discomfort check. Rabin-Karp can have $O(nm)$ worst case due to hash collisions. But we still use it - why?
Because it generalizes beautifully to problems that KMP doesn’t handle cleanly. Multi-pattern search: hash all $k$ patterns into a hash set, slide one window, check membership. Time $O(n + m_\text{total})$ expected. Two-dimensional pattern matching: compute rolling hashes on rows, then column-hash the row-hashes. KMP doesn’t extend to 2D at all. Rabin-Karp is the right tool when the pattern structure is rich or when you’re searching for many patterns at once.
Suffix Arrays: Preprocess the Text Once
The previous algorithms answer “does $P$ appear in $T$?” freshly each time. If you’ll run many queries on the same text $T$, preprocess it.
A suffix array $SA$ lists all $n$ suffixes of $T$ in sorted lexicographic order (by storing their starting indices). Combined with the LCP array (longest common prefix between adjacent entries in $SA$), you can answer any substring query in $O(m + \log n)$ time via binary search.
Construction costs $O(n \log n)$ with the prefix-doubling algorithm, or $O(n)$ with more sophisticated constructions. The payoff: a genomics database built once on the human reference genome can answer thousands of short-read alignment queries per second, each in $O(m \log n)$ time.
Aho-Corasick: Many Patterns at Once
KMP handles one pattern. What if you have $k$ patterns $P_1, \ldots, P_k$ and want to find all occurrences of any of them in $T$?
The naive approach runs KMP separately for each pattern: $O(n \cdot k + m_\text{total})$. Aho-Corasick does better: preprocess all patterns into a trie with failure links (exactly like KMP’s failure function, but over the trie), then scan $T$ once. Total time: $O(m_\text{total} \cdot |\Sigma|)$ preprocessing, $O(n + \text{matches})$ search.
The construction mirrors KMP. Each node in the trie has a failure link pointing to the longest proper suffix of its string that also appears in the trie. During search, these links let you shift efficiently when a mismatch occurs - without reprocessing $T$.
Summary
| Algorithm | Preprocessing | Search | Space | Best for |
|---|---|---|---|---|
| Naive | - | $O(nm)$ | $O(1)$ | Short text, simple needs |
| KMP | $O(m)$ | $O(n)$ | $O(m)$ | Single pattern, streaming |
| Rabin-Karp | $O(m)$ | $O(n)$ expected | $O(1)$ | Multi-pattern, 2D search |
| Suffix array | $O(n \log n)$ | $O(m \log n)$ | $O(n)$ | Many queries, same text |
| Aho-Corasick | $O(m_\text{total} \cdot | \Sigma | )$ | $O(n + \text{hits})$ |
Read next: