Helpful context:


Your phone’s voice recognition doesn’t hear words - it hears pressure waves. To understand speech, it needs to find the most likely sequence of words given the sounds it receives. Hidden Markov Models are the statistical backbone. Viterbi is the algorithm that decodes them. It’s in your phone, GPS systems, computational biology (finding genes), and spell-checkers.

The problem it solves: given a sequence of observations you can see, find the most likely sequence of hidden states that produced them.


Hidden Markov Models

An HMM has two levels: what’s actually happening (hidden states), and what you can observe (emissions).

The classic example: a biased coin-flipper who secretly switches between a fair coin and a loaded coin. You see only the coin flips (H or T). You can’t see which coin they’re using. You want to infer the sequence of coins.

Formally, an HMM has:

  • A set of hidden states $S = \{s_1, \ldots, s_K\}$
  • An observation vocabulary $O = \{o_1, \ldots, o_V\}$
  • Transition probabilities $A_{ij} = P(\text{next state} = s_j \mid \text{current state} = s_i)$
  • Emission probabilities $B_{jv} = P(\text{observe } o_v \mid \text{state} = s_j)$
  • An initial distribution $\pi_j = P(\text{start in state } s_j)$

The joint probability of a hidden state sequence $\mathbf{q} = q_1, \ldots, q_T$ and observations $\mathbf{x} = x_1, \ldots, x_T$ is:

$$P(\mathbf{q}, \mathbf{x}) = \pi_{q_1} \prod_{t=2}^{T} A_{q_{t-1}, q_t} \prod_{t=1}^{T} B_{q_t, x_t}$$

There are three classic HMM problems:

  1. Evaluation: What is $P(\mathbf{x})$ - how probable is this observation sequence? Solved by the forward algorithm in $O(K^2 T)$.
  2. Decoding: What is the most likely state sequence $\mathbf{q}$ given $\mathbf{x}$? Solved by Viterbi.
  3. Learning: Given observations, estimate $A$, $B$, $\pi$. Solved by Baum-Welch (an EM algorithm).

Viterbi as Dynamic Programming on a Trellis

Brute force means checking all $K^T$ possible state sequences. For $K = 100$ states and $T = 100$ time steps, that’s $100^{100}$ - completely hopeless.

The key structure: the probability of the best path ending at state $j$ at time $t$ depends only on the best paths ending at time $t-1$. This is optimal substructure - DP applies.

Define the Viterbi variable:

$$\delta_t(j) = \max_{q_1, \ldots, q_{t-1}} P(q_1, \ldots, q_{t-1}, q_t = s_j, x_1, \ldots, x_t)$$

This is the probability of the most probable path that ends in state $j$ at time $t$, having produced observations $x_1, \ldots, x_t$.

Initialization ($t = 1$):

$$\delta_1(j) = \pi_j \cdot B_{j, x_1}$$

Recursion ($t = 2, \ldots, T$):

$$\delta_t(j) = \max_{i=1}^{K} \left[\delta_{t-1}(i) \cdot A_{ij}\right] \cdot B_{j, x_t}$$

$$\psi_t(j) = \arg\max_{i=1}^{K} \left[\delta_{t-1}(i) \cdot A_{ij}\right]$$

The backpointer $\psi_t(j)$ records which previous state led to the best path ending at $(t, j)$.

Termination:

$$P^* = \max_j \delta_T(j), \qquad q_T^* = \arg\max_j \delta_T(j)$$

Backtrack: follow the backpointers from $q_T^$ backwards: $q_t^ = \psi_{t+1}(q_{t+1}^*)$.


A Concrete Walk-Through

Three states: Rainy (R), Cloudy (C), Sunny (S). Two observations: umbrella (U), no umbrella (N).

Say the model is:

  • Initial: $\pi = [0.5, 0.3, 0.2]$ for R, C, S
  • Transitions: rainy days tend to follow rainy days, etc.
  • Emissions: you bring an umbrella on rainy and cloudy days

You observe the sequence U, U, N over three days. Build the trellis:

Time Rainy $\delta$ Cloudy $\delta$ Sunny $\delta$
1 $0.5 \cdot P(U \mid R)$ $0.3 \cdot P(U \mid C)$ $0.2 \cdot P(U \mid S)$
2 $\max(\ldots) \cdot P(U \mid R)$ $\max(\ldots) \cdot P(U \mid C)$ $\max(\ldots) \cdot P(U \mid S)$
3 $\max(\ldots) \cdot P(N \mid R)$ $\max(\ldots) \cdot P(N \mid C)$ $\max(\ldots) \cdot P(N \mid S)$

At each cell you record both the value and the backpointer. After filling the table, follow backpointers from the max value at $t=3$ to recover the most likely weather sequence.

In practice: never do this in regular floating point. Products of many small probabilities underflow to zero quickly. Work in log space instead:

$$\log \delta_t(j) = \max_{i} \left[\log \delta_{t-1}(i) + \log A_{ij}\right] + \log B_{j, x_t}$$

Additions replace multiplications. The $\max$ operation is unchanged. Underflow is eliminated.


Discomfort check. Viterbi maximizes $P(\mathbf{q} \mid \mathbf{x})$ - the most likely joint state sequence. Why not just pick the most likely state at each time step independently?

Because states are correlated. The most likely state at each step, chosen independently, might form a sequence with very low joint probability. Example: if transitions between certain states are rare, the most likely individual states might form a path that’s actually impossible or near-impossible when you account for the transitions between them. Viterbi finds the globally most likely path - optimizing over the whole sequence at once, not position by position. This is analogous to why shortest path algorithms need global coordination, not greedy local choices.


Complexity and Space

The recursion has $T$ steps. At each step we update $K$ states, each requiring a scan of $K$ previous states. Time: $O(K^2 T)$.

Space: the trellis $\delta$ and backpointer table $\psi$ together need $O(KT)$. If you only want the final probability (not the path itself), you only need the current column - $O(K)$ space.

When $K$ is very large (thousands of states in speech recognition), keeping the full trellis is impractical. Beam search keeps only the top $\beta$ states at each time step: $O(\beta^2 T)$ time and $O(\beta T)$ space, at the cost of approximate rather than exact decoding. Beam widths of 10 - 100 usually recover near-optimal solutions in practice.


Applications

Speech recognition. In classical GMM-HMM systems, states represent phoneme subunits. Viterbi decodes the most probable phoneme sequence from the acoustic observations. For decades this was the core of every production speech recognizer.

Gene prediction. Tools like GENSCAN model genomic DNA with an HMM whose states encode biological regions: exons, introns, promoters, splice sites. Viterbi decoding over a chromosome recovers the most probable gene structure. A single chromosome contains millions of base pairs - the $O(K^2 T)$ algorithm makes this tractable.

Error-correcting codes. The Viterbi decoder is the standard decoder for convolutional codes used in satellite communication, cellular networks, and deep-space probes. Every smartphone transmission uses Viterbi decoding.


Summary

Concept Detail
HMM Hidden states, observable emissions, transition + emission probs
Three HMM problems Evaluation (forward), Decoding (Viterbi), Learning (Baum-Welch)
Viterbi idea DP on a $K \times T$ trellis
Recurrence $\delta_t(j) = \max_i [\delta_{t-1}(i) \cdot A_{ij}] \cdot B_{j, x_t}$
Time $O(K^2 T)$
Space $O(KT)$ for full path, $O(K)$ if only probability needed
Numerical stability Work in log space: replace products with sums
Why not per-step max State correlations make joint path optimization necessary

Read next: