String Algorithms & Pattern Matching - Finding Needles in Haystacks, Fast // Megha Bose

Helpful context:

Ctrl+F. Every time you search a document, your computer scans millions of characters looking for a pattern. The naive algorithm is $O(nm)$ - for a 1 MB document and a 10-character query, that’s 10 million comparisons. The KMP algorithm does it in $O(n + m)$. The whole improvement comes down to one insight: when a match fails, don’t restart from scratch.

The Naive Algorithm and Why It’s Bad

Slide the pattern $P$ of length $m$ over every position in text $T$ of length $n$. At each position, compare character by character:

NaiveSearch(T, P):
    for i = 0 to n - m:
        if T[i..i+m-1] == P:
            report match at i

In the worst case this is $O(nm)$. Consider searching for pattern aaab in text aaaaaaaaaa. At every position you match 3 characters before failing on the 4th, then shift by just 1. You do nearly $nm$ comparisons and find nothing. For adversarial inputs - like scanning source code for a pattern with repeated characters - this is genuinely slow.

KMP: Don’t Throw Away What You Already Know

When you match $j$ characters of $P$ against $T$ and then fail, you know exactly what those $j$ characters were - they equal $P[0..j-1]$. That knowledge tells you where you can safely restart without rechecking characters.

The failure function $\pi[i]$ captures this. It’s the length of the longest proper prefix of $P[0..i]$ that is also a suffix of $P[0..i]$.

Example: for pattern abcab, the failure function is:

Index	Char	$\pi$
0	a	0
1	b	0
2	c	0
3	a	1
4	b	2

So $\pi[4] = 2$ means the last 2 characters of abcab (which are ab) also appear as the first 2 characters. If you matched 5 characters and then failed, you know the next potential match can start as if you’d already matched 2 characters.

Building the failure function in $O(m)$:

BuildFailure(P):
    pi[0] = 0
    k = 0
    for i = 1 to m - 1:
        while k > 0 and P[k] != P[i]:
            k = pi[k - 1]
        if P[k] == P[i]: k++
        pi[i] = k
    return pi

Searching in $O(n)$:

KMPSearch(T, P):
    pi = BuildFailure(P)
    q = 0
    for i = 0 to n - 1:
        while q > 0 and P[q] != T[i]:
            q = pi[q - 1]
        if P[q] == T[i]: q++
        if q == m:
            report match at i - m + 1
            q = pi[q - 1]

Why is this $O(n)$? The variable q can only increase by 1 per character of $T$, so it increases at most $n$ times total. Each time through the while loop, q strictly decreases. So the total number of while iterations across the entire search is at most $n$. Combined with the outer loop: $O(n)$ total.

Rabin-Karp: Rolling Hash

A completely different approach: instead of comparing characters, compare hash values.

Hash the pattern once: $h(P) = \sum_{k=0}^{m-1} P[k] \cdot b^{m-1-k} \bmod p$ for some base $b$ and prime $p$.

Now slide a window of length $m$ over $T$. If the hash of the current window matches $h(P)$, verify with a full comparison. The key: updating the window hash takes $O(1)$ via the rolling hash trick:

$$h_\text{new} = (h_\text{old} - T[i] \cdot b^{m-1}) \cdot b + T[i+m] \pmod{p}$$

You subtract out the leftmost character and add in the new rightmost character. Each slide costs $O(1)$ instead of $O(m)$.

Expected time: $O(n + m)$. Worst case: $O(nm)$ if every window produces a hash collision. In practice this is rare, and using two independent hash functions (double hashing) reduces the collision probability to near zero.

Discomfort check. Rabin-Karp can have $O(nm)$ worst case due to hash collisions. But we still use it - why?

Because it generalizes beautifully to problems that KMP doesn’t handle cleanly. Multi-pattern search: hash all $k$ patterns into a hash set, slide one window, check membership. Time $O(n + m_\text{total})$ expected. Two-dimensional pattern matching: compute rolling hashes on rows, then column-hash the row-hashes. KMP doesn’t extend to 2D at all. Rabin-Karp is the right tool when the pattern structure is rich or when you’re searching for many patterns at once.

Suffix Arrays: Preprocess the Text Once

The previous algorithms answer “does $P$ appear in $T$?” freshly each time. If you’ll run many queries on the same text $T$, preprocess it.

A suffix array $SA$ lists all $n$ suffixes of $T$ in sorted lexicographic order (by storing their starting indices). Combined with the LCP array (longest common prefix between adjacent entries in $SA$), you can answer any substring query in $O(m + \log n)$ time via binary search.

Construction costs $O(n \log n)$ with the prefix-doubling algorithm, or $O(n)$ with more sophisticated constructions. The payoff: a genomics database built once on the human reference genome can answer thousands of short-read alignment queries per second, each in $O(m \log n)$ time.

Aho-Corasick: Many Patterns at Once

KMP handles one pattern. What if you have $k$ patterns $P_1, \ldots, P_k$ and want to find all occurrences of any of them in $T$?

The naive approach runs KMP separately for each pattern: $O(n \cdot k + m_\text{total})$. Aho-Corasick does better: preprocess all patterns into a trie with failure links (exactly like KMP’s failure function, but over the trie), then scan $T$ once. Total time: $O(m_\text{total} \cdot |\Sigma|)$ preprocessing, $O(n + \text{matches})$ search.

The construction mirrors KMP. Each node in the trie has a failure link pointing to the longest proper suffix of its string that also appears in the trie. During search, these links let you shift efficiently when a mismatch occurs - without reprocessing $T$.

Summary

Algorithm	Preprocessing	Search	Space	Best for
Naive	-	$O(nm)$	$O(1)$	Short text, simple needs
KMP	$O(m)$	$O(n)$	$O(m)$	Single pattern, streaming
Rabin-Karp	$O(m)$	$O(n)$ expected	$O(1)$	Multi-pattern, 2D search
Suffix array	$O(n \log n)$	$O(m \log n)$	$O(n)$	Many queries, same text
Aho-Corasick	$O(m_\text{total} \cdot	\Sigma	)$	$O(n + \text{hits})$

Read next:

Boyer-Moore - The String Matcher That Skips More Than It Reads