Prerequisite: Dynamic Programming


Given a sequence of noisy observations, what is the most probable sequence of hidden states that could have generated it? This is the decoding problem for Hidden Markov Models (HMMs), and the Viterbi algorithm solves it exactly in polynomial time using dynamic programming on a trellis graph.

Hidden Markov Models

An HMM is defined by five components:

  • $\mathcal{S} = \{s_1, \ldots, s_K\}$: a finite set of hidden states
  • $\mathcal{O} = \{o_1, \ldots, o_V\}$: a finite observation vocabulary
  • $A \in \mathbb{R}^{K \times K}$: the transition matrix, where $A_{ij} = P(q_t = s_j \mid q_{t-1} = s_i)$
  • $B \in \mathbb{R}^{K \times V}$: the emission matrix, where $B_{ij} = P(x_t = o_j \mid q_t = s_i)$
  • $\pi \in \mathbb{R}^K$: the initial state distribution, $\pi_i = P(q_1 = s_i)$

Both $A$ and $B$ have rows summing to 1. The joint probability of a hidden state sequence $\mathbf{q} = q_1, \ldots, q_T$ and an observation sequence $\mathbf{x} = x_1, \ldots, x_T$ is:

$$P(\mathbf{q}, \mathbf{x}) = \pi_{q_1} \prod_{t=2}^{T} A_{q_{t-1}, q_t} \prod_{t=1}^{T} B_{q_t, x_t}$$
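
To make the indexing concrete, here is a small Python sketch that evaluates this joint probability for a made-up two-state, two-symbol HMM (all numbers are invented, and states and observations are 0-indexed, as is usual in code):

import numpy as np

# Toy parameters (illustrative only).
pi = np.array([0.6, 0.4])            # pi[i]   = P(q1 = s_i)
A  = np.array([[0.7, 0.3],           # A[i, j] = P(q_t = s_j | q_{t-1} = s_i)
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],           # B[i, k] = P(x_t = o_k | q_t = s_i)
               [0.2, 0.8]])

def joint_prob(q, x):
    """P(q, x) = pi[q1] * prod_t A[q_{t-1}, q_t] * prod_t B[q_t, x_t]."""
    p = pi[q[0]] * B[q[0], x[0]]
    for t in range(1, len(q)):
        p *= A[q[t - 1], q[t]] * B[q[t], x[t]]
    return p

# The path (s1, s2, s2) emitting (o1, o2, o2):
print(joint_prob([0, 1, 1], [0, 1, 1]))   # 0.6*0.9 * 0.3*0.8 * 0.6*0.8 ≈ 0.0622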

There are three canonical problems for HMMs, writing $\lambda = (A, B, \pi)$ for the model parameters:

  1. Evaluation: $P(\mathbf{x} \mid \lambda)$ - solved by the Forward algorithm in $O(K^2 T)$.
  2. Decoding: $\arg\max_{\mathbf{q}} P(\mathbf{q} \mid \mathbf{x}, \lambda)$ - solved by Viterbi.
  3. Learning: $\arg\max_\lambda P(\mathbf{x} \mid \lambda)$ - solved by Baum-Welch (an EM algorithm).

The Viterbi Recurrence

Brute-force evaluation of all $K^T$ state sequences is intractable. Define the Viterbi variable:

$$\delta_t(j) = \max_{q_1, \ldots, q_{t-1}} P(q_1, \ldots, q_{t-1}, q_t = s_j, x_1, \ldots, x_t \mid \lambda)$$

This is the probability of the most probable path ending in state $s_j$ at time $t$, having emitted observations $x_1, \ldots, x_t$.

Initialisation ($t = 1$):

$$\delta_1(j) = \pi_j \cdot B_{j, x_1}, \quad \psi_1(j) = 0$$

Recursion ($t = 2, \ldots, T$):

$$\delta_t(j) = \max_{i=1}^{K} \left[ \delta_{t-1}(i) \cdot A_{ij} \right] \cdot B_{j, x_t}$$

$$\psi_t(j) = \arg\max_{i=1}^{K} \left[ \delta_{t-1}(i) \cdot A_{ij} \right]$$

The backpointer $\psi_t(j)$ records which previous state led to the best path ending at $(t, j)$.

Termination:

$$P^\ast = \max_{j} \delta_T(j), \qquad q_T^\ast = \arg\max_j \delta_T(j)$$

Backtracking ($t = T-1, \ldots, 1$):

$$q_t^\ast = \psi_{t+1}(q_{t+1}^\ast)$$

Pseudocode

Viterbi(x[1..T], pi, A, B):
  // Initialisation
  for j = 1 to K:
    delta[1][j] = pi[j] * B[j][x[1]]
    psi[1][j]   = 0
  // Recursion
  for t = 2 to T:
    for j = 1 to K:
      best_prob  = -infinity
      best_state = -1
      for i = 1 to K:
        val = delta[t-1][i] * A[i][j]
        if val > best_prob:
          best_prob  = val
          best_state = i
      delta[t][j] = best_prob * B[j][x[t]]
      psi[t][j]   = best_state
  // Termination
  q*[T] = argmax_j delta[T][j]
  // Backtrack
  for t = T-1 downto 1:
    q*[t] = psi[t+1][q*[t+1]]
  return q*
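
The pseudocode translates almost line for line into Python. Below is a minimal NumPy sketch in probability space; the function name, the 0-based indexing, and the vectorised inner loop are choices of this example rather than part of the algorithm itself:

import numpy as np

def viterbi(x, pi, A, B):
    """Most probable state path for observation indices x (probability space)."""
    K, T = len(pi), len(x)
    delta = np.zeros((T, K))               # delta[t, j]: best path probability ending in j at t
    psi   = np.zeros((T, K), dtype=int)    # psi[t, j]:  best predecessor of state j at time t

    delta[0] = pi * B[:, x[0]]             # initialisation
    for t in range(1, T):                  # recursion
        scores   = delta[t - 1, :, None] * A        # scores[i, j] = delta[t-1, i] * A[i, j]
        psi[t]   = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, x[t]]

    q = np.zeros(T, dtype=int)             # termination and backtrack
    q[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        q[t] = psi[t + 1, q[t + 1]]
    return q, delta[-1].max()

The broadcast delta[t - 1, :, None] * A builds all $K \times K$ candidate scores at once; this is the same $O(K^2)$ work as the explicit double loop, just expressed as one array operation.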

Complexity

Time. The recursion has $T$ steps, each requiring $K$ updates, each scanning $K$ previous states: $O(K^2 T)$.

Space. The full trellis $\delta$ and backpointer table $\psi$ each have $K \times T$ entries: $O(KT)$ space. If only the final probability is needed (not the path), a rolling array reduces space to $O(K)$.

If transitions are sparse (each state transitions to at most $d \ll K$ others), time reduces to $O(d \cdot K \cdot T)$.

Numerical Stability: Log-Domain Computation

Products of many small probabilities underflow to zero in floating-point. Work in log space instead. Since $\log$ is monotone, maximising $\delta_t(j)$ is equivalent to maximising $\log \delta_t(j)$:

$$\log \delta_t(j) = \max_{i} \left[ \log \delta_{t-1}(i) + \log A_{ij} \right] + \log B_{j, x_t}$$

Additions replace multiplications; the $\max$ operation is unchanged. This eliminates underflow for sequences of any practical length.
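
In code the change is mechanical: take logarithms of the parameters once and replace every product in the sketch above with a sum. A log-domain variant (same layout as before; the name viterbi_log is this example's choice):

import numpy as np

def viterbi_log(x, pi, A, B):
    """Log-domain Viterbi: returns the best state path and its log-probability."""
    # np.log turns zero probabilities into -inf, which max() handles correctly.
    log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    K, T = len(pi), len(x)
    log_delta = np.zeros((T, K))
    psi = np.zeros((T, K), dtype=int)

    log_delta[0] = log_pi + log_B[:, x[0]]
    for t in range(1, T):
        scores       = log_delta[t - 1, :, None] + log_A    # sums replace products
        psi[t]       = scores.argmax(axis=0)
        log_delta[t] = scores.max(axis=0) + log_B[:, x[t]]

    q = np.zeros(T, dtype=int)
    q[-1] = log_delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        q[t] = psi[t + 1, q[t + 1]]
    return q, log_delta[-1].max()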

Connection to DAG Shortest Path

The Viterbi trellis is a directed acyclic graph (DAG) with $K \times T$ nodes. Each node $(t, j)$ has edges from every $(t-1, i)$ with weight $\log A_{ij} + \log B_{j, x_t}$. The source nodes at $t=1$ have initial weight $\log \pi_j + \log B_{j, x_1}$.

Viterbi is exactly the longest path in this DAG, solved in $O(|V| + |E|) = O(K^2 T)$ time by the standard DAG DP. This connection makes it easy to generalise: add start/end states, allow weighted emissions, or extend to semi-Markov models.

Beam Search Variant

When $K$ is very large (e.g. thousands of states in speech recognition), maintaining the full trellis is impractical. Beam search keeps only the top $\beta$ states at each time step by probability and discards the rest. Expanding transitions only from the surviving states cuts the per-step work from $K^2$ to roughly $\beta K$ (or to $\beta^2$ if candidate successors are also restricted to the beam) and the space for surviving hypotheses to $O(\beta T)$, at the cost of approximate (not guaranteed optimal) decoding. Beam widths of 10-100 often recover near-optimal solutions in practice.
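
A minimal sketch of beam pruning layered on the log-domain recursion is shown below. The function name, the beam parameter, and the decision to keep a full K-wide score row per step (pruning only which states are expanded) are choices of this illustration, not a production decoder:

import numpy as np

def viterbi_beam(x, pi, A, B, beam=16):
    """Approximate log-domain decoding: expand only the `beam` best states per step."""
    log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    K, T = len(pi), len(x)
    psi       = np.zeros((T, K), dtype=int)
    log_delta = np.full((T, K), -np.inf)

    log_delta[0] = log_pi + log_B[:, x[0]]
    active = np.argsort(log_delta[0])[-beam:]              # surviving states at t = 1
    for t in range(1, T):
        # Expand transitions only out of the surviving states.
        scores = log_delta[t - 1, active][:, None] + log_A[active]   # shape (beam, K)
        psi[t] = active[scores.argmax(axis=0)]
        log_delta[t] = scores.max(axis=0) + log_B[:, x[t]]
        active = np.argsort(log_delta[t])[-beam:]          # prune to the top `beam` states

    q = np.zeros(T, dtype=int)
    q[-1] = log_delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        q[t] = psi[t + 1, q[t + 1]]
    return q

Real decoders add further heuristics on top of this (for example, pruning relative to the current best score), but the core idea is the same: do not expand hypotheses that have already fallen far behind.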

Examples

Part-of-speech tagging. States are POS tags (noun, verb, adjective, …), observations are words. The transition matrix captures grammatical regularities (nouns tend to follow determiners), and emissions capture lexical likelihoods (the word “run” is more likely to be tagged as a verb than a noun). Viterbi decodes the most probable tag sequence in $O(K^2 \cdot |\text{sentence}|)$.
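
As a toy illustration, the sketch below reuses the viterbi_log function from the log-domain section above on a two-tag model; the tag set, vocabulary, and all probabilities are invented for the example:

import numpy as np

tags  = ["NOUN", "VERB"]
vocab = {"dogs": 0, "run": 1, "home": 2}

pi = np.array([0.7, 0.3])              # sentences here tend to start with a noun
A  = np.array([[0.3, 0.7],             # NOUN -> {NOUN, VERB}
               [0.6, 0.4]])            # VERB -> {NOUN, VERB}
B  = np.array([[0.6, 0.2, 0.2],        # emissions from NOUN
               [0.1, 0.7, 0.2]])       # emissions from VERB

x = [vocab[w] for w in ["dogs", "run", "home"]]
path, log_p = viterbi_log(x, pi, A, B)
print([tags[j] for j in path])         # ['NOUN', 'VERB', 'NOUN'] with these invented numbers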

Speech recognition. In classical GMM-HMM systems, states represent phoneme subunits (triphone contexts). An utterance of 2 seconds at 100 frames/sec gives $T = 200$; with $K \sim 10{,}000$ states, the log-domain Viterbi with beam pruning is the standard decoder in both research and production systems (Kaldi, HTK).

Gene finding. Tools like GENSCAN model genomic DNA with an HMM whose states encode biological regions: exons, introns, promoters, splice sites. Viterbi decoding over a chromosome recovers the most probable gene structure.


Read Next: Sequence Labeling