Boyer-Moore - The String Matcher That Skips More Than It Reads // Megha Bose

Helpful context:

String Algorithms & Pattern Matching - Finding Needles in Haystacks, Fast

Most string search algorithms scan left to right. Boyer-Moore scans right to left within the pattern, which lets it skip large chunks of text. For English text searching typical English patterns, Boyer-Moore does far fewer than $n$ character comparisons - it’s sublinear in practice. When grep is fast, Boyer-Moore is often why.

The algorithm is surprisingly counterintuitive: it reads the pattern backwards against the text, and a mismatch is information to be exploited, not just a failure to recover from.

Two Heuristics, Take the Maximum

Boyer-Moore uses two independent rules for deciding how far to shift the pattern on each mismatch. It takes whichever shift is larger.

Bad Character Rule

When we compare the pattern right-to-left and find a mismatch at position $j$ in the pattern - text character $c$ doesn’t match $P[j]$ - we ask: does $c$ appear anywhere in $P$ to the left of position $j$?

If $c$ doesn’t appear in $P$ at all (to the left of $j$): shift the pattern completely past the mismatch. Skip all $j + 1$ characters.
If $c$ appears at position $k < j$: shift the pattern so $P[k]$ aligns with the text’s $c$.

Formally, let $\text{last}[c]$ be the rightmost index of character $c$ in $P$ (or $-1$ if absent). The bad character shift is:

$$\text{shift}_\text{bc} = j - \text{last}[c]$$

If this is negative or zero, clamp to 1 (always move forward).

Preprocessing the bad character table:

BadCharTable(P):
    for c in Sigma: last[c] = -1
    for k = 0 to m - 1:
        last[P[k]] = k
    return last

Good Suffix Rule

Suppose we matched a suffix of the pattern - positions $P[j+1..m-1]$ all matched the text - but then $P[j]$ failed. Call the matched portion $t = P[j+1..m-1]$.

We want to shift the pattern so a previous occurrence of $t$ aligns with what we just matched in the text. There are two sub-cases:

Sub-case 1: The string $t$ appears elsewhere in $P$, preceded by a character different from $P[j]$ (so we won’t immediately fail again at the same position). Shift to align that occurrence.

Sub-case 2: No complete copy of $t$ appears earlier in $P$. But some suffix of $t$ might match a prefix of $P$. Find the longest such suffix-prefix overlap and shift accordingly.

Discomfort check. The good suffix rule is notoriously tricky. The original 1977 Boyer-Moore paper contained a bug in the good suffix table construction. This is one of the most carefully tested pieces of code in algorithms: correctness proofs, published errata, multiple independent implementations. The two sub-cases above are genuinely distinct and both matter for correctness. If you’re implementing Boyer-Moore, test it on patterns like abab and aab where the sub-cases interact.

The fix for the bug, and the standard correct implementation, was published by Knuth, Morris, and Pratt in their own 1977 paper responding to Boyer and Moore.

Preprocessing the good suffix table in $O(m)$:

GoodSuffixTable(P):
    suf = SuffixLengths(P)    // suf[j] = length of longest suffix of P ending at j
                               // that matches a suffix of P
    for j = 0 to m - 1: gs[j] = m
    // Sub-case 1: matched suffix also occurs earlier
    for j = m - 1 downto 0:
        if suf[j] == j + 1:
            for k = 0 to m - 2 - j:
                if gs[k] == m: gs[k] = m - 1 - j
    // Sub-case 2: suffix of t matches prefix of P
    for j = 0 to m - 2:
        gs[m - 1 - suf[j]] = m - 1 - j
    return gs

The Search Loop

BoyerMoore(T, P):
    last = BadCharTable(P)
    gs   = GoodSuffixTable(P)
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and P[j] == T[i + j]:
            j--
        if j < 0:
            report match at i
            i += gs[0]
        else:
            i += max(j - last[T[i + j]], gs[j])

The inner loop compares right-to-left. A mismatch at $j$ gives information about both the mismatching character (bad character rule) and the already-matched suffix (good suffix rule). Take the larger shift.

Why Sublinear?

On English text with a large alphabet, consider a random character $c$ from the text. For a pattern of length $m$ over an alphabet of size $|\Sigma|$, the probability that $c$ doesn’t appear in $P$ at all is roughly $(1 - 1/|\Sigma|)^m$. When it doesn’t appear, we shift by $m$ - skipping the whole pattern.

The average shift from the bad character rule alone is approximately:

$$\text{avg shift} \approx m \cdot \left(1 - \frac{1}{|\Sigma|}\right)$$

For English ($|\Sigma| \approx 26$) and a 10-character pattern, that’s about 9.6 positions per step. We examine roughly $n/9.6 \approx 0.1 \cdot n$ characters - genuinely sublinear.

For long patterns on large alphabets, this effect is dramatic. Searching a 100-character pattern through English text, Boyer-Moore examines far less than 1% of the characters.

Boyer-Moore-Horspool: The Simplified Version

Horspool (1980) noticed that nearly all the practical speed of Boyer-Moore comes from the bad character rule applied to just one character: the text character aligned with the rightmost position of the pattern.

Drop the good suffix table entirely. Precompute, for each character $c$, how far to shift when $c$ aligns with $P[m-1]$:

Horspool(T, P):
    for c in Sigma: shift[c] = m
    for k = 0 to m - 2:
        shift[P[k]] = m - 1 - k
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and P[j] == T[i + j]: j--
        if j < 0: report match at i
        i += shift[T[i + m - 1]]

Horspool is much simpler to implement correctly. In practice on ASCII text it’s nearly as fast as full Boyer-Moore. The good suffix rule matters mostly for small alphabets (binary, DNA) where the bad character rule alone gives small shifts.

Summary

Algorithm	Preprocessing	Worst case	Average case	Notes
Naive	-	$O(nm)$	$O(n)$	Simple
KMP	$O(m)$	$O(n + m)$	$O(n + m)$	Guaranteed linear
Rabin-Karp	$O(m)$	$O(nm)$	$O(n + m)$	Best for multi-pattern
Boyer-Moore	$O(m +	\Sigma	)$	$O(nm)$
Boyer-Moore-Horspool	$O(m +	\Sigma	)$	$O(nm)$
Aho-Corasick	$O(m_\text{total} \cdot	\Sigma	)$	$O(n + \text{hits})$

When to use which: KMP for streaming or guaranteed linear time; Rabin-Karp for multi-pattern or 2D search; Boyer-Moore or Horspool for interactive text search with large alphabets; Aho-Corasick when you have dozens or thousands of patterns.

Read next:

The Viterbi Algorithm - Finding the Most Probable Hidden Path