Boyer-Moore
Prerequisite: String Algorithms & Pattern Matching
Boyer-Moore is one of the fastest practical string-search algorithms ever devised. Unlike KMP, which reads the text left to right and never backtracks, Boyer-Moore compares the pattern against the current text window from right to left and skips large portions of the text when a mismatch is detected. On random text over a large alphabet it achieves sublinear $O(n/m)$ average-case performance, meaning it examines fewer characters than the text contains.
Two orthogonal heuristics drive the skipping. They are computed independently during preprocessing, and the algorithm takes the maximum shift suggested by either one. Throughout, $n = |T|$ denotes the text length and $m = |P|$ the pattern length.
Bad Character Heuristic
When we compare $P[j]$ against text character $c = T[i + j]$ and find a mismatch, we ask: where does $c$ appear in the pattern?
- If $c$ does not appear anywhere in $P[0..j-1]$ (to the left of the mismatch), slide the pattern completely past position $i + j$.
- Otherwise, let $k < j$ be the rightmost position to the left of the mismatch where $c$ appears; shift the pattern right so $P[k]$ aligns with $T[i+j]$.
Formally, let $\text{last}[c]$ be the rightmost index of character $c$ in $P$, or $-1$ if absent. The bad character shift is:
$$\text{shift}_{\text{bc}} = j - \text{last}[c]$$
(This value can be $\leq 0$ when the rightmost occurrence of $c$ lies at or to the right of $j$; in that case clamp the shift to $1$ to guarantee forward progress.)
Preprocessing. Build a table $\text{last}[c]$ for every $c \in \Sigma$:
BadCharTable(P, Sigma):
    for c in Sigma: last[c] = -1
    for k = 0 to m - 1:
        last[P[k]] = k
    return last
Time $O(m + |\Sigma|)$, space $O(|\Sigma|)$.
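A minimal Python sketch of this preprocessing step (using a dict, so only characters that occur in $P$ need entries; names such as bad_char_table are illustrative, not canonical):

def bad_char_table(P):
    # last[c] = rightmost index of character c in P; characters absent
    # from P are handled at lookup time with last.get(c, -1)
    last = {}
    for k, ch in enumerate(P):      # later occurrences overwrite earlier ones
        last[ch] = k
    return last

# Example: for P = "ABCAB", last is {'A': 3, 'B': 4, 'C': 2}.
# A mismatch of P[1] against text character 'C' would suggest a shift of
# j - last['C'] = 1 - 2 = -1, which is clamped to 1.
assert bad_char_table("ABCAB") == {'A': 3, 'B': 4, 'C': 2}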
Good Suffix Heuristic
Suppose positions $P[j+1..m-1]$ matched but $P[j]$ did not. The matched suffix $t = P[j+1..m-1]$ is known. We look for the next rightmost occurrence of $t$ inside $P$ that is preceded by a different character than $P[j]$ (so the same mismatch won’t recur).
More precisely, define:
- $\text{shift1}[j]$: the shift that aligns the rightmost copy of $t = P[j+1..m-1]$ inside $P$ that has a different character to its left with the matched occurrence of $t$ in the text.
- $\text{shift2}[j]$: if no such copy exists, find the longest proper suffix of $t$ that is also a prefix of $P$ and align it.
Preprocessing. A failure-function-style computation on the reversed pattern (as in KMP) yields both cases in $O(m)$; in practice they are folded into a single table $\text{gs}[j]$ holding the final shift for a mismatch at position $j$:
GoodSuffixTables(P):
    // suf[j] = length of the longest substring of P ending at j
    //          that is also a suffix of P
    suf = SuffixLengths(P)
    for j = 0 to m - 1: gs[j] = m      // default: shift by full length
    // Case 2: no full copy of the matched suffix recurs; align the longest
    //         suffix of t that is also a prefix of P
    for j = m - 1 downto 0:
        if suf[j] == j + 1:            // P[0..j] is both a prefix and a suffix of P
            for k = 0 to m - 2 - j:
                if gs[k] == m: gs[k] = m - 1 - j
    // Case 1: the matched suffix also occurs elsewhere in P
    for j = 0 to m - 2:
        gs[m - 1 - suf[j]] = m - 1 - j
    return gs
Total preprocessing time $O(m)$, space $O(m)$.
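A Python sketch of the same preprocessing, following the classic suffix-length formulation (function names such as suffix_lengths and good_suffix_table are illustrative):

def suffix_lengths(P):
    # suf[j] = length of the longest substring ending at j that is also a suffix of P
    m = len(P)
    suf = [0] * m
    suf[m - 1] = m
    g, f = m - 1, 0
    for i in range(m - 2, -1, -1):
        if i > g and suf[i + m - 1 - f] < i - g:
            suf[i] = suf[i + m - 1 - f]       # reuse a previously computed value
        else:
            g = min(g, i)
            f = i
            while g >= 0 and P[g] == P[g + m - 1 - f]:
                g -= 1                        # extend the common suffix directly
            suf[i] = f - g
    return suf

def good_suffix_table(P):
    # gs[j] = shift to apply when the mismatch occurs at pattern index j
    m = len(P)
    suf = suffix_lengths(P)
    gs = [m] * m                              # default: shift by the full length
    k = 0
    for j in range(m - 1, -1, -1):            # case 2: a suffix of t is a prefix of P
        if suf[j] == j + 1:
            while k < m - 1 - j:
                if gs[k] == m:
                    gs[k] = m - 1 - j
                k += 1
    for j in range(m - 1):                    # case 1: t occurs elsewhere in P
        gs[m - 1 - suf[j]] = m - 1 - j
    return gs

# Example: for P = "abab", gs == [2, 2, 4, 1]
assert good_suffix_table("abab") == [2, 2, 4, 1]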
The Search Loop
BoyerMoore(T, P):
    last = BadCharTable(P, Sigma)
    gs = GoodSuffixTables(P)
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and P[j] == T[i + j]:
            j--
        if j < 0:
            report occurrence at i
            i += gs[0]                       // use good suffix shift after full match
        else:
            shift_bc = j - last[T[i + j]]
            shift_gs = gs[j]
            i += max(shift_bc, shift_gs)     // gs[j] >= 1, so always at least 1
The inner loop compares right-to-left. The key insight: a mismatch at position $j$ from the right gives us information about both the mismatching character and the already-matched suffix.
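Putting the pieces together, here is a runnable Python sketch of the search loop; it assumes the bad_char_table and good_suffix_table helpers sketched above are in scope:

def boyer_moore(T, P):
    # Report the starting index of every occurrence of P in T.
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return []
    last = bad_char_table(P)
    gs = good_suffix_table(P)
    occurrences = []
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and P[j] == T[i + j]:    # compare right to left
            j -= 1
        if j < 0:
            occurrences.append(i)
            i += gs[0]                        # good suffix shift after a full match
        else:
            bc = j - last.get(T[i + j], -1)   # bad character shift (may be <= 0)
            i += max(bc, gs[j])               # gs[j] >= 1, so we always advance
    return occurrences

print(boyer_moore("abracadabra", "abra"))     # [0, 7]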
Complexity Analysis
Worst case. $O(nm)$ without further care: consider $T = \texttt{aaa…a}$ and $P = \texttt{aaa…a}$. Every alignment is a full match costing $m$ comparisons, yet the shift after a match is only 1 (the period of the pattern), so roughly $nm$ character comparisons are performed. Adding the Galil rule restores a linear worst case: after a full match, shift by the pattern's period and remember that the leftmost overlapping characters of the next alignment are already known to match, so the right-to-left scan can stop early instead of re-comparing them. This yields $O(n + m)$ worst case.
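To make the Galil rule concrete, here is a hedged sketch of the modified loop (assuming the same bad_char_table and good_suffix_table helpers as above; only the skip bookkeeping differs from the plain search loop):

def boyer_moore_galil(T, P):
    # Boyer-Moore with the Galil rule: after a full match, positions below
    # `skip` are already known to match and are never re-compared.
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return []
    last = bad_char_table(P)
    gs = good_suffix_table(P)
    period = gs[0]                # shift after a full match = period of P
    occurrences = []
    i, skip = 0, 0
    while i <= n - m:
        j = m - 1
        while j >= skip and P[j] == T[i + j]:
            j -= 1
        if j < skip:              # the untested prefix is known from the overlap
            occurrences.append(i)
            skip = m - period     # this much of the next window overlaps this match
            i += period
        else:
            skip = 0              # a partial match gives no reusable overlap here
            bc = j - last.get(T[i + j], -1)
            i += max(bc, gs[j])
    return occurrences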
Best/average case. $O(n/m)$ on random text over a large alphabet: the bad character heuristic alone typically shifts by close to $m$ positions, so we examine only about $n/m$ text positions.
Boyer-Moore-Horspool Simplification
Horspool (1980) observed that using only the bad character heuristic applied to the text character aligned with the rightmost pattern position (not the mismatch position) yields nearly the same practical speed with far simpler code:
Horspool(T, P):
    // Preprocess: bad char shift for character under P[m-1]
    for c in Sigma: shift[c] = m
    for k = 0 to m - 2: shift[P[k]] = m - 1 - k
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and P[j] == T[i + j]: j--
        if j < 0: report occurrence at i
        i += shift[T[i + m - 1]]
Horspool is the variant typically taught first in courses, and is competitive with full Boyer-Moore on ASCII text (e.g. English prose). The additional good suffix table matters more when the alphabet is small (binary or DNA).
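A self-contained Python sketch of the Horspool variant (the function name horspool is illustrative):

def horspool(T, P):
    # Bad character heuristic only, keyed on the text character aligned
    # with the last pattern position.
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return []
    shift = {ch: m - 1 - k for k, ch in enumerate(P[:-1])}   # rightmost wins
    occurrences = []
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and P[j] == T[i + j]:
            j -= 1
        if j < 0:
            occurrences.append(i)
        i += shift.get(T[i + m - 1], m)   # chars absent from P[0..m-2] shift by m
    return occurrences

print(horspool("abracadabra", "abra"))    # [0, 7]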
Why Boyer-Moore Works So Well in Practice
On English text with $|\Sigma| = 26$, a random text character is absent from a random length-$m$ pattern with probability about $(25/26)^m$; when it is absent, the rule shifts by the full pattern length $m$. More generally, treating pattern characters as roughly uniform and independent, the expected bad character shift is about $|\Sigma|\,\bigl(1 - (1 - 1/|\Sigma|)^{m}\bigr)$, i.e. roughly $\min(m, |\Sigma|)$, so the search inspects only on the order of $n/\min(m, |\Sigma|)$ text characters. For long patterns over large alphabets this approaches $n/m$, dramatically fewer than the $n$ character reads a left-to-right scan must perform.
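As a rough sanity check under that uniformity assumption, a 20-character pattern over a 26-letter alphabet gives
$$\mathbb{E}[\text{shift}] \approx 26\left(1 - \left(\tfrac{25}{26}\right)^{20}\right) \approx 26 \times 0.54 \approx 14,$$
so each alignment skips roughly 14 text positions on average.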
Examples
Text editors. vim’s / search and VS Code’s find-in-file both use Boyer-Moore or Horspool as the fast path for fixed-string (non-regex) queries. When you search a 10 MB source file for a long identifier, BM skips most of the file entirely.
Virus signature scanning. Antivirus engines maintain databases of thousands of byte-sequence signatures. Aho-Corasick handles simultaneous multi-pattern matching, but for scanning a binary against a single known malware signature, Boyer-Moore’s right-to-left scanning and large skips are particularly effective on binary (often repetitive) data.
Read Next: String Algorithms & Pattern Matching