The Viterbi Algorithm
Prerequisite: Dynamic Programming
Given a sequence of noisy observations, what is the most probable sequence of hidden states that could have generated it? This is the decoding problem for Hidden Markov Models (HMMs), and the Viterbi algorithm solves it exactly in polynomial time using dynamic programming on a trellis graph.
Hidden Markov Models
An HMM is defined by five components:
- $\mathcal{S} = \{s_1, \ldots, s_K\}$: a finite set of hidden states
- $\mathcal{O} = \{o_1, \ldots, o_V\}$: a finite observation vocabulary
- $A \in \mathbb{R}^{K \times K}$: the transition matrix, where $A_{ij} = P(q_t = s_j \mid q_{t-1} = s_i)$
- $B \in \mathbb{R}^{K \times V}$: the emission matrix, where $B_{ij} = P(x_t = o_j \mid q_t = s_i)$
- $\pi \in \mathbb{R}^K$: the initial state distribution, $\pi_i = P(q_1 = s_i)$
Both $A$ and $B$ have rows summing to 1. The joint probability of a hidden state sequence $\mathbf{q} = q_1, \ldots, q_T$ and an observation sequence $\mathbf{x} = x_1, \ldots, x_T$ is:
$$P(\mathbf{q}, \mathbf{x}) = \pi_{q_1} \prod_{t=2}^{T} A_{q_{t-1}, q_t} \prod_{t=1}^{T} B_{q_t, x_t}$$
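This factorisation can be checked numerically. The sketch below uses a toy 2-state, 2-symbol model; all parameter values are illustrative, not drawn from any real task:

```python
# Joint probability P(q, x) under a toy 2-state HMM (illustrative numbers).
pi = [0.6, 0.4]                       # initial distribution
A = [[0.7, 0.3], [0.4, 0.6]]          # transitions A[i][j]
B = [[0.9, 0.1], [0.2, 0.8]]          # emissions B[state][symbol]

def joint_prob(q, x):
    """P(q, x) = pi[q_1] * prod_t A[q_{t-1}][q_t] * prod_t B[q_t][x_t]."""
    p = pi[q[0]]
    for t in range(1, len(q)):
        p *= A[q[t - 1]][q[t]]        # transition factor
    for t in range(len(q)):
        p *= B[q[t]][x[t]]            # emission factor
    return p

# 0.6 * 0.7 * 0.3  (initial + transitions)  times  0.9 * 0.9 * 0.8 (emissions)
print(joint_prob([0, 0, 1], [0, 0, 1]))
```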
There are three canonical problems for HMMs, writing $\lambda = (A, B, \pi)$ for the model parameters:
- Evaluation: compute $P(\mathbf{x} \mid \lambda)$; solved by the Forward algorithm in $O(K^2 T)$.
- Decoding: compute $\arg\max_{\mathbf{q}} P(\mathbf{q} \mid \mathbf{x}, \lambda)$; solved by Viterbi.
- Learning: compute $\arg\max_\lambda P(\mathbf{x} \mid \lambda)$; solved by Baum-Welch (an EM algorithm).
The Viterbi Recurrence
Brute-force evaluation of all $K^T$ state sequences is intractable. Define the Viterbi variable:
$$\delta_t(j) = \max_{q_1, \ldots, q_{t-1}} P(q_1, \ldots, q_{t-1}, q_t = s_j, x_1, \ldots, x_t \mid \lambda)$$
This is the probability of the most probable path ending in state $s_j$ at time $t$, having emitted observations $x_1, \ldots, x_t$.
Initialisation ($t = 1$):
$$\delta_1(j) = \pi_j \cdot B_{j, x_1}, \quad \psi_1(j) = 0$$
Recursion ($t = 2, \ldots, T$):
$$\delta_t(j) = \max_{i=1}^{K} \left[ \delta_{t-1}(i) \cdot A_{ij} \right] \cdot B_{j, x_t}$$
$$\psi_t(j) = \arg\max_{i=1}^{K} \left[ \delta_{t-1}(i) \cdot A_{ij} \right]$$
The backpointer $\psi_t(j)$ records which previous state led to the best path ending at $(t, j)$.
Termination:
$$P^\ast = \max_{j} \delta_T(j), \qquad q_T^\ast = \arg\max_j \delta_T(j)$$
Backtracking ($t = T-1, \ldots, 1$):
$$q_t^\ast = \psi_{t+1}(q_{t+1}^\ast)$$
Pseudocode
Viterbi(x[1..T], pi, A, B):
    // Initialisation
    for j = 1 to K:
        delta[1][j] = pi[j] * B[j][x[1]]
        psi[1][j] = 0
    // Recursion
    for t = 2 to T:
        for j = 1 to K:
            best_prob = -infinity
            best_state = -1
            for i = 1 to K:
                val = delta[t-1][i] * A[i][j]
                if val > best_prob:
                    best_prob = val
                    best_state = i
            delta[t][j] = best_prob * B[j][x[t]]
            psi[t][j] = best_state
    // Termination
    P* = max_j delta[T][j]
    q*[T] = argmax_j delta[T][j]
    // Backtrack
    for t = T-1 downto 1:
        q*[t] = psi[t+1][q*[t+1]]
    return q*
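As a minimal sketch, the pseudocode translates directly to Python (0-based indexing; the toy parameters in the test below are illustrative):

```python
def viterbi(x, pi, A, B):
    """Most probable state path for observations x (list of symbol indices).

    Direct translation of the trellis recursion: delta holds best path
    probabilities, psi holds backpointers.
    """
    K, T = len(pi), len(x)
    delta = [[0.0] * K for _ in range(T)]
    psi = [[0] * K for _ in range(T)]
    for j in range(K):                                   # initialisation
        delta[0][j] = pi[j] * B[j][x[0]]
    for t in range(1, T):                                # recursion
        for j in range(K):
            best_i = max(range(K), key=lambda i: delta[t - 1][i] * A[i][j])
            delta[t][j] = delta[t - 1][best_i] * A[best_i][j] * B[j][x[t]]
            psi[t][j] = best_i
    q = [0] * T
    q[-1] = max(range(K), key=lambda j: delta[-1][j])    # termination
    for t in range(T - 2, -1, -1):                       # backtrack
        q[t] = psi[t + 1][q[t + 1]]
    return q
```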
Complexity
Time. The recursion has $T$ steps, each requiring $K$ updates, each scanning $K$ previous states: $O(K^2 T)$.
Space. The full trellis $\delta$ and backpointer table $\psi$ each have $K \times T$ entries: $O(KT)$ space. If only the final probability is needed (not the path), a rolling array reduces space to $O(K)$.
If transitions are sparse (each state transitions to at most $d \ll K$ others), time reduces to $O(d \cdot K \cdot T)$.
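The rolling-array reduction can be sketched as follows: only the previous column of the trellis is kept, so the path itself is lost but the best probability survives (toy parameters in the usage are illustrative):

```python
def viterbi_prob(x, pi, A, B):
    """Best path probability only, in O(K) space (no backpointers)."""
    K = len(pi)
    delta = [pi[j] * B[j][x[0]] for j in range(K)]       # column for t = 1
    for t in range(1, len(x)):
        # Build the new column from the old one, then discard the old.
        delta = [max(delta[i] * A[i][j] for i in range(K)) * B[j][x[t]]
                 for j in range(K)]
    return max(delta)
```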
Numerical Stability: Log-Domain Computation
Products of many small probabilities underflow to zero in floating-point. Work in log space instead. Since $\log$ is monotone, maximising $\delta_t(j)$ is equivalent to maximising $\log \delta_t(j)$:
$$\log \delta_t(j) = \max_{i} \left[ \log \delta_{t-1}(i) + \log A_{ij} \right] + \log B_{j, x_t}$$
Additions replace multiplications; the $\max$ operation is unchanged. This eliminates underflow for sequences of any practical length.
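The underflow problem is easy to demonstrate: a product of a few thousand per-step probabilities collapses to exactly zero in double precision, while the equivalent sum of logs stays finite (the step count and probability range below are arbitrary choices for illustration):

```python
import math
import random

random.seed(0)
# 2000 per-step probabilities around 0.1, as a stand-in for a long
# sequence of transition-times-emission factors.
probs = [random.uniform(0.05, 0.15) for _ in range(2000)]

direct = 1.0
for p in probs:
    direct *= p                  # underflows to 0.0 long before the end

log_total = sum(math.log(p) for p in probs)   # stays finite

print(direct, log_total)
```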
Connection to DAG Shortest Path
The Viterbi trellis is a directed acyclic graph (DAG) with $K \times T$ nodes. Each node $(t, j)$ has edges from every $(t-1, i)$ with weight $\log A_{ij} + \log B_{j, x_t}$. The source nodes at $t=1$ have initial weight $\log \pi_j + \log B_{j, x_1}$.
Viterbi decoding is exactly a longest-path computation in this DAG, solved in $O(|V| + |E|) = O(K^2 T)$ time by the standard DAG DP. This connection makes it easy to generalise: add start/end states, allow weighted emissions, or extend to semi-Markov models.
Beam Search Variant
When $K$ is very large (e.g. thousands of states in speech recognition), maintaining the full trellis is impractical. Beam search keeps only the top $\beta$ states at each time step by probability, discarding the rest. This reduces time to $O(\beta^2 T)$ and space to $O(\beta T)$, at the cost of approximate (not guaranteed optimal) decoding. Beam widths of 10-100 often recover near-optimal solutions in practice.
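A minimal beam-pruned sketch in log space is below. It keeps only the `beam` highest-scoring states per step, carrying each survivor's partial path forward; the decoding is approximate in general, though with `beam >= K` it reduces to exact Viterbi (toy parameters are illustrative):

```python
import math

def viterbi_beam(x, pi, A, B, beam=2):
    """Beam-pruned log-domain Viterbi. Approximate: may miss the optimum."""
    lp = lambda p: math.log(p) if p > 0 else float("-inf")  # guard zeros
    K = len(pi)

    def prune(scores):
        # Keep the `beam` entries with the highest log-probability.
        return dict(sorted(scores.items(), key=lambda kv: -kv[1][0])[:beam])

    # frontier maps surviving state -> (log-prob, path so far)
    frontier = prune({j: (lp(pi[j]) + lp(B[j][x[0]]), [j]) for j in range(K)})
    for t in range(1, len(x)):
        nxt = {}
        for j in range(K):
            # Best predecessor among the surviving states only.
            best_i = max(frontier, key=lambda i: frontier[i][0] + lp(A[i][j]))
            score = frontier[best_i][0] + lp(A[best_i][j]) + lp(B[j][x[t]])
            nxt[j] = (score, frontier[best_i][1] + [j])
        frontier = prune(nxt)
    best = max(frontier, key=lambda j: frontier[j][0])
    return frontier[best][1]
```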
Examples
Part-of-speech tagging. States are POS tags (noun, verb, adjective, …), observations are words. The transition matrix captures grammatical regularities (nouns tend to follow determiners), and emissions capture lexical likelihoods (the word “run” is more likely to be tagged as a verb than a noun). Viterbi decodes the most probable tag sequence in $O(K^2 \cdot |\text{sentence}|)$.
Speech recognition. In classical GMM-HMM systems, states represent phoneme subunits (triphone contexts). An utterance of 2 seconds at 100 frames/sec gives $T = 200$; with $K \sim 10{,}000$ states, the log-domain Viterbi with beam pruning is the standard decoder in both research and production systems (Kaldi, HTK).
Gene finding. Tools like GENSCAN model genomic DNA with an HMM whose states encode biological regions: exons, introns, promoters, splice sites. Viterbi decoding over a chromosome recovers the most probable gene structure.
Read Next: Sequence Labeling