String Algorithms & Pattern Matching
Prerequisite: Searching Algorithms
String pattern matching is the problem of finding all occurrences of a pattern $P$ of length $m$ inside a text $T$ of length $n$. The naive approach works but is slow. A family of algorithms - KMP, Rabin-Karp, the Z-function, suffix arrays, suffix trees, and Aho-Corasick - each exploits different structural properties to do better. Understanding them also illuminates broader ideas: automata, hashing, and index data structures.
The Naive Algorithm
Slide the pattern over every position in the text and compare character by character:
NaiveSearch(T, P):
    for i = 0 to n - m:
        match = true
        for j = 0 to m - 1:
            if T[i + j] != P[j]:
                match = false
                break
        if match: report occurrence at i
Complexity. $O(nm)$ in the worst case (e.g. $T = \texttt{aaa…a}$, $P = \texttt{aaa…ab}$). On random text it is much closer to $O(n)$ in practice, but worst-case matters for adversarial inputs.
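The pseudocode above translates directly into Python; this is a minimal sketch (the function name is illustrative), returning all match positions rather than reporting them one by one:

```python
def naive_search(T, P):
    """Return every index i where pattern P occurs in text T. O(nm) worst case."""
    n, m = len(T), len(P)
    hits = []
    for i in range(n - m + 1):
        j = 0
        while j < m and T[i + j] == P[j]:  # compare character by character
            j += 1
        if j == m:                         # all m characters matched
            hits.append(i)
    return hits
```

Note that matches may overlap: `naive_search("aaaa", "aa")` reports positions 0, 1, and 2.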
Knuth-Morris-Pratt (KMP)
The key observation: when a mismatch occurs after matching $j$ characters of $P$, you already know what those $j$ characters were. The failure function $\pi[i]$ captures reusable information.
Definition. $\pi[i]$ is the length of the longest proper prefix of $P[0..i]$ that is also a suffix of $P[0..i]$.
Building $\pi$ in $O(m)$:
BuildFailure(P):
    pi[0] = 0
    k = 0
    for i = 1 to m - 1:
        while k > 0 and P[k] != P[i]:
            k = pi[k - 1]
        if P[k] == P[i]: k++
        pi[i] = k
    return pi
Searching in $O(n)$:
KMPSearch(T, P):
    pi = BuildFailure(P)
    q = 0 // characters matched so far
    for i = 0 to n - 1:
        while q > 0 and P[q] != T[i]:
            q = pi[q - 1]
        if P[q] == T[i]: q++
        if q == m:
            report occurrence at i - m + 1
            q = pi[q - 1]
Correctness. At every step, $q$ is the length of the longest prefix of $P$ that is also a suffix of $T[0..i]$. The failure function lets us restart matching from the longest prefix that could still extend, rather than from zero.
Complexity. $O(n + m)$ total. The amortised argument: $q$ can increment at most $n$ times (one per character of $T$), and each while iteration strictly decreases $q$, so the total number of while steps across the whole search is also bounded by $n$.
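Both routines carry over to Python almost verbatim; a sketch (function names illustrative):

```python
def build_failure(P):
    """pi[i] = length of the longest proper prefix of P[0..i] that is also a suffix."""
    m = len(P)
    pi = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and P[k] != P[i]:
            k = pi[k - 1]       # fall back to the next-longest border
        if P[k] == P[i]:
            k += 1
        pi[i] = k
    return pi

def kmp_search(T, P):
    """Return all occurrence positions of P in T in O(n + m) time."""
    pi = build_failure(P)
    m = len(P)
    hits, q = [], 0             # q = characters of P matched so far
    for i, c in enumerate(T):
        while q > 0 and P[q] != c:
            q = pi[q - 1]       # restart from the longest viable prefix
        if P[q] == c:
            q += 1
        if q == m:
            hits.append(i - m + 1)
            q = pi[q - 1]       # continue, allowing overlapping matches
    return hits
```

Because `q` resets to `pi[q - 1]` after a hit, overlapping occurrences such as those of `"aba"` in `"abababa"` are all reported.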
Rabin-Karp: Rolling Hash
Instead of comparing characters, compare hash values. Use a polynomial rolling hash:
$$h(s) = \left(\sum_{k=0}^{m-1} s[k] \cdot b^{m-1-k}\right) \bmod p$$
When you slide the window by one position, update in $O(1)$:
$$h_{\text{new}} = \left((h_{\text{old}} - s[0] \cdot b^{m-1}) \cdot b + s[m]\right) \bmod p$$
Only do a full character comparison when hashes match, to confirm and rule out spurious hits (hash collisions).
Complexity. $O(n + m)$ expected with a good hash. Worst case $O(nm)$ if every window collides (pathological for a fixed hash). Use double hashing (two independent hash functions) to reduce collision probability to $O(1/p^2)$.
The strength of Rabin-Karp is its simplicity and its extensibility to 2D pattern matching and multi-pattern search (hash the pattern set).
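A sketch in Python, following the hash and sliding-window formulas above. The base and modulus here are illustrative choices, and the verification step guards against spurious hits:

```python
def rabin_karp(T, P, b=256, p=1_000_000_007):
    """Find all occurrences of P in T with a polynomial rolling hash."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return []
    bm = pow(b, m - 1, p)                 # b^(m-1) mod p, to drop the leftmost char
    hp = ht = 0
    for k in range(m):                    # hash the pattern and the first window
        hp = (hp * b + ord(P[k])) % p
        ht = (ht * b + ord(T[k])) % p
    hits = []
    for i in range(n - m + 1):
        if hp == ht and T[i:i + m] == P:  # confirm: hashes can collide
            hits.append(i)
        if i + m < n:                     # slide the window one position right
            ht = ((ht - ord(T[i]) * bm) * b + ord(T[i + m])) % p
    return hits
```

Python's `%` always yields a non-negative result, so the subtraction in the rolling update needs no extra correction.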
Z-Algorithm
Define $Z[i]$ as the length of the longest substring of $S$ starting at $i$ that matches a prefix of $S$. Conventions for $Z[0]$ vary (some leave it undefined); here we set $Z[0] = n$, since the whole string trivially matches itself.
ZFunction(S):
    Z = array of length n, Z[0] = n
    l, r = 0, 0
    for i = 1 to n - 1:
        if i < r:
            Z[i] = min(r - i, Z[i - l])
        while i + Z[i] < n and S[Z[i]] == S[i + Z[i]]:
            Z[i]++
        if i + Z[i] > r:
            l, r = i, i + Z[i]
    return Z
To search for $P$ in $T$, form the concatenation $S = P + \$ + T$ (where $\$$ is a sentinel character that occurs nowhere in the alphabet). Any $i > m$ with $Z[i] = m$ indicates an occurrence of $P$ in $T$, starting at text position $i - m - 1$.
Complexity. $O(n + m)$. The invariant is that the window $[l, r)$ is the rightmost $Z$-box seen, so each character is extended at most once.
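A Python sketch of the Z-function and the sentinel-based search (names illustrative; the sentinel defaults to a NUL byte, assumed absent from both strings):

```python
def z_function(S):
    """Z[i] = length of the longest substring starting at i matching a prefix of S."""
    n = len(S)
    Z = [0] * n
    Z[0] = n
    l = r = 0
    for i in range(1, n):
        if i < r:
            Z[i] = min(r - i, Z[i - l])   # reuse the current Z-box [l, r)
        while i + Z[i] < n and S[Z[i]] == S[i + Z[i]]:
            Z[i] += 1                     # extend by direct comparison
        if i + Z[i] > r:
            l, r = i, i + Z[i]            # record the new rightmost Z-box
    return Z

def z_search(T, P, sentinel="\x00"):
    """Find all occurrences of P in T via the Z-function of P + sentinel + T."""
    Z = z_function(P + sentinel + T)
    m = len(P)
    return [i - m - 1 for i in range(m + 1, len(Z)) if Z[i] == m]
```

The sentinel guarantees no $Z$ value in the text part can exceed $m$, so the test $Z[i] = m$ is exact.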
Suffix Arrays and LCP
A suffix array $SA$ is an array of starting indices of all suffixes of $S$, sorted lexicographically. Combined with the LCP array (longest common prefix between adjacent suffixes in sorted order), it supports pattern search and many string queries in near-linear time.
Construction. The DC3/Skew algorithm builds $SA$ in $O(n)$; the simpler doubling (prefix-doubling) approach runs in $O(n \log n)$:
- Start with ranks based on single characters.
- At each round, sort by pairs of ranks (current character rank, rank $2^k$ positions ahead).
- Repeat for $k = 0, 1, 2, \ldots$ until all ranks are distinct.
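The doubling rounds above can be sketched compactly in Python. This variant re-sorts with a comparison sort each round, giving $O(n \log^2 n)$; replacing the sort with a radix sort per round recovers the $O(n \log n)$ bound. The function name is illustrative:

```python
def suffix_array(S):
    """Prefix-doubling suffix array construction, O(n log^2 n) in this form."""
    n = len(S)
    rank = list(map(ord, S))              # round 0: rank by single characters
    sa = sorted(range(n), key=lambda i: rank[i])
    k = 1
    tmp = [0] * n
    while k < n:
        # Sort by (rank of suffix, rank of suffix k positions ahead).
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        # Re-rank: equal pairs share a rank, strictly greater pairs increment it.
        tmp[sa[0]] = 0
        for j in range(1, n):
            tmp[sa[j]] = tmp[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = tmp[:]
        if rank[sa[-1]] == n - 1:         # all ranks distinct: order is final
            break
        k *= 2
    return sa
```

For example, `suffix_array("banana")` yields `[5, 3, 1, 0, 4, 2]`, the starting positions of `"a"`, `"ana"`, `"anana"`, `"banana"`, `"na"`, `"nana"` in lexicographic order.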
Pattern search. Binary search on $SA$: find the contiguous range $[lo, hi)$ of ranks whose suffixes have $P$ as a prefix; every $SA[i]$ in that range is an occurrence. Each comparison is $O(m)$, so search is $O(m \log n)$. With the LCP array and an RMQ structure you can reduce this to $O(m + \log n)$.
LCP array. Kasai’s algorithm computes $LCP$ in $O(n)$ from $SA$ using the phi array: if suffix at position $i$ is rank $r$ in $SA$, then $\phi[i]$ is the text position of the suffix at rank $r - 1$.
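The binary search over the suffix array can be sketched as two explicit bounds. For illustration the demo builds the suffix array naively by sorting suffixes (fine for small inputs; any construction works):

```python
def sa_search(S, sa, P):
    """All positions of P in S, via binary search on suffix array sa."""
    n, m = len(S), len(P)
    lo, hi = 0, n
    while lo < hi:                        # lower bound: first suffix >= P
        mid = (lo + hi) // 2
        if S[sa[mid]:sa[mid] + m] < P:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, n
    while lo < hi:                        # upper bound: first suffix whose m-prefix > P
        mid = (lo + hi) // 2
        if S[sa[mid]:sa[mid] + m] <= P:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

# Demo: suffix array built naively for illustration (O(n^2 log n)).
S = "banana"
sa = sorted(range(len(S)), key=lambda i: S[i:])
# sa == [5, 3, 1, 0, 4, 2]
```

Each of the $O(\log n)$ probes slices out an $m$-character prefix to compare, matching the $O(m \log n)$ bound.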
Aho-Corasick for Multi-Pattern Search
Given $k$ patterns $P_1, \ldots, P_k$ with total length $M$, find all occurrences in $T$ simultaneously.
- Build a trie of all patterns in $O(M)$.
- Add failure links (analogous to KMP failure function) via BFS in $O(M \cdot |\Sigma|)$.
- Add output links so each node points to the deepest proper suffix that is also a complete pattern.
- Search by feeding $T$ into the automaton in $O(n + \text{matches})$.
Total construction is $O(M \cdot |\Sigma|)$; search is $O(n + \text{matches})$.
The failure link construction mirrors KMP: $\text{fail}(u)$ is the longest proper suffix of the string labelling node $u$ that also appears in the trie.
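The four steps above fit in a short Python sketch using dictionary-based goto transitions (names illustrative; output links are folded into per-node match sets, inherited along failure links during the BFS):

```python
from collections import deque

def build_automaton(patterns):
    """Trie + failure links + output sets for Aho-Corasick, built via BFS."""
    goto, fail, out = [{}], [0], [set()]          # node 0 is the trie root
    for p in patterns:                            # insert every pattern
        u = 0
        for c in p:
            if c not in goto[u]:
                goto.append({}); fail.append(0); out.append(set())
                goto[u][c] = len(goto) - 1
            u = goto[u][c]
        out[u].add(p)
    q = deque(goto[0].values())                   # root's children fail to root
    while q:
        u = q.popleft()
        for c, v in goto[u].items():
            q.append(v)
            f = fail[u]
            while f and c not in goto[f]:         # mirror the KMP fallback
                f = fail[f]
            fail[v] = goto[f].get(c, 0)
            out[v] |= out[fail[v]]                # inherit matches via failure link
    return goto, fail, out

def aho_search(T, patterns):
    """All (position, pattern) matches of the pattern set in T."""
    goto, fail, out = build_automaton(patterns)
    hits, u = [], 0
    for i, c in enumerate(T):
        while u and c not in goto[u]:
            u = fail[u]
        u = goto[u].get(c, 0)
        for p in out[u]:
            hits.append((i - len(p) + 1, p))
    return sorted(hits)
```

On the classic example, `aho_search("ushers", ["he", "she", "his", "hers"])` reports `"she"` at 1 and both `"he"` and `"hers"` at 2, all in one pass over the text.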
Complexity Summary
| Algorithm | Preprocessing | Search | Space |
|---|---|---|---|
| Naive | - | $O(nm)$ | $O(1)$ |
| KMP | $O(m)$ | $O(n)$ | $O(m)$ |
| Rabin-Karp | $O(m)$ | $O(n)$ exp. | $O(1)$ |
| Z-algorithm | $O(n+m)$ | included | $O(n)$ |
| Suffix array | $O(n \log n)$ | $O(m \log n)$ | $O(n)$ |
| Aho-Corasick | $O(M \cdot \lvert\Sigma\rvert)$ | $O(n + \text{matches})$ | $O(M \cdot \lvert\Sigma\rvert)$ |
Examples
DNA sequence alignment. Short-read aligners like BWA-MEM build a suffix array (or FM-index, which is a compressed suffix array) of the human reference genome once. Each sequencing read is then located in $O(m \log n)$ time - far faster than re-running naive search for each of billions of reads.
grep implementation. GNU grep uses a combination of Boyer-Moore and Aho-Corasick. For a single fixed pattern it uses Boyer-Moore for fast skipping; for multiple patterns (-e flag) it falls back to an Aho-Corasick automaton. The -P flag uses PCRE, which is a generalisation beyond exact string matching.
Read Next: Automata Theory