Prerequisite: Graph Algorithms & Traversal

Search with a Memory Budget

Exhaustive breadth-first search (BFS) over a tree of decisions is guaranteed to find the optimal solution - but the number of partial solutions it must keep grows exponentially with depth. Best-first search (like A*) is similar: it keeps all reached states in a priority queue, which can become enormous. For large structured prediction problems - machine translation, speech recognition, summarisation - the search space has millions or billions of states. Something must give.

Beam search makes the trade-off explicit: at each depth level, keep only the $k$ best partial solutions (the “beam”), discarding everything else. This constrains memory to $O(k)$ states at any depth and makes search tractable.

The parameter $k$ is called the beam width.

The Algorithm

Beam search builds solutions step by step (left to right in sequence generation). At each step $t$ (a code sketch follows the list below):

  1. For each hypothesis $h$ in the current beam of size $k$, expand $h$ by all possible next tokens/decisions. This generates up to $k \cdot V$ candidates where $V$ is the branching factor.
  2. Score each candidate using the model’s log-probability (or any additive scoring function).
  3. Retain only the top-$k$ candidates by score. These form the new beam.
  4. Terminate when all hypotheses have reached a stopping condition (e.g., end-of-sequence token).
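
A minimal sketch of this loop is below, assuming a hypothetical callback `log_prob_fn(prefix)` that returns a mapping from each possible next token to its log-probability; the function and its signature are placeholders standing in for a real model, not any particular library's API.

```python
import heapq

def beam_search(log_prob_fn, vocab, start, eos, beam_width=4, max_len=20):
    """Minimal beam search sketch.

    log_prob_fn(prefix) -> {token: log-probability of token given prefix}
    (a hypothetical scoring callback standing in for a real model).
    """
    beam = [(0.0, [start])]          # each hypothesis: (cumulative log-prob, tokens)
    finished = []

    for _ in range(max_len):
        candidates = []
        for score, prefix in beam:
            if prefix[-1] == eos:    # completed hypothesis: set it aside
                finished.append((score, prefix))
                continue
            log_probs = log_prob_fn(prefix)
            for token in vocab:      # expand by every possible next token
                candidates.append((score + log_probs[token], prefix + [token]))
        if not candidates:           # every hypothesis has ended
            break
        # Keep only the k best candidates; everything else is discarded.
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])

    finished.extend(beam)            # include hypotheses cut off at max_len
    return max(finished, key=lambda c: c[0])
```

Setting `beam_width=1` turns this into greedy search, and letting the width grow without bound recovers the full breadth-first expansion - the limiting cases discussed next.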

Relationship to limiting cases:

  • $k = 1$: beam search reduces to greedy search - always take the highest-scoring next step.
  • $k = \infty$ (unbounded): beam search becomes full BFS, exploring all partial solutions.

Scoring and Length Normalisation

Accumulated log-probabilities have a systematic bias: longer sequences accumulate more negative log-probability terms, so shorter sequences are unfairly preferred.

Length normalisation corrects for this. The score of a hypothesis $y = (y_1, \ldots, y_T)$ is adjusted as:

$$\text{score}(y) = \frac{\log P(y \mid x)}{T^\alpha}$$

where $\alpha \in [0, 1]$ controls the strength of normalisation. $\alpha = 0$ gives raw log-probability (no correction); $\alpha = 1$ gives full normalisation by length. Values around $\alpha = 0.6$–$0.7$ work well empirically for neural machine translation.
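
As a small illustration (the `alpha=0.6` default simply mirrors the empirical range above), the normalised score can be computed directly from a hypothesis' accumulated log-probability:

```python
def length_normalised_score(log_prob_sum, length, alpha=0.6):
    """Accumulated log-probability divided by length**alpha.

    alpha = 0 returns the raw log-probability; alpha = 1 averages per token.
    """
    return log_prob_sum / (length ** alpha)

# A short hypothesis (log P = -4.0 over 4 tokens) beats a longer one
# (log P = -6.0 over 10 tokens) on raw score, but loses after normalisation.
print(length_normalised_score(-4.0, 4))    # ≈ -1.74
print(length_normalised_score(-6.0, 10))   # ≈ -1.51
```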

Coverage penalty. In attention-based models, a beam hypothesis may attend to some source tokens repeatedly while leaving others nearly ignored. A coverage penalty discourages hypotheses that fail to cover the source:

$$\text{score}(y) = \frac{\log P(y \mid x)}{T^\alpha} + \beta \sum_i \log\left(\min\left(\sum_t a_{ti},\ 1\right)\right)$$

where $a_{ti}$ is the attention weight on source token $i$ at decoding step $t$.
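
A sketch of the combined score, assuming `attn` is a $T \times S$ NumPy array of attention weights with rows indexed by decoding step $t$ and columns by source token $i$ (the `beta=0.2` default is illustrative only):

```python
import numpy as np

def coverage_adjusted_score(log_prob_sum, attn, alpha=0.6, beta=0.2):
    """Length-normalised log-probability plus a coverage penalty.

    attn: array of shape (T, S) holding the attention weight a_{ti} on
    source token i at decoding step t. Source tokens whose total received
    attention is below 1 contribute a negative log term.
    """
    T = attn.shape[0]
    coverage = np.minimum(attn.sum(axis=0), 1.0)  # attention per source token, capped at 1
    penalty = beta * np.log(coverage).sum()
    return log_prob_sum / (T ** alpha) + penalty
```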

Why Beam Search Can Miss the Global Optimum

Consider a sequence model with two possible first tokens: $a$ (probability $0.6$) and $b$ (probability $0.4$). Beam width $k = 1$ picks $a$ first. But suppose the best completion of $b$ has a total log-probability of about $-1.0$, while the best completion of $a$ only reaches about $-2.1$. Greedy/beam-1 has irreversibly chosen the wrong first token.
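
The same effect can be reproduced with a toy two-step model; the probability tables below are invented purely to make greedy search fail:

```python
import math

# Toy two-step model; the conditional distributions are illustrative only.
first = {"a": 0.6, "b": 0.4}
second = {
    "a": {t: 0.2 for t in "vwxyz"},   # every continuation of "a" is mediocre
    "b": {"x": 0.92, "y": 0.08},      # "b" has one excellent continuation
}

def total_log_prob(t1, t2):
    return math.log(first[t1]) + math.log(second[t1][t2])

# Greedy / beam width 1: commit to the highest-probability first token.
g1 = max(first, key=first.get)                    # picks "a"
g2 = max(second[g1], key=second[g1].get)
greedy_score = total_log_prob(g1, g2)             # ≈ -2.12

# Exhaustive search over all complete sequences finds the true optimum.
optimal_score = max(total_log_prob(t1, t2)
                    for t1 in first for t2 in second[t1])   # ≈ -1.00 (via "b")

print(greedy_score, optimal_score)
```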

More formally: beam search does not have the optimality guarantee of exact Viterbi decoding (the best path in a lattice). It is a heuristic that trades optimality for tractability. This limitation is well understood, and in practice the beam is chosen large enough that the gap to the true optimum is small.

Diversity Extensions

Standard beam search is deterministic and produces $k$ highly similar hypotheses - they often share a long common prefix. Two extensions address this.

Stochastic beam search. Instead of taking the top-$k$ candidates, sample $k$ candidates (without replacement) from a softmax distribution over all candidates:

$$P(\text{select } h) \propto e^{\text{score}(h) / \tau}$$

Temperature $\tau$ controls diversity: $\tau \to 0$ recovers top-$k$; large $\tau$ approaches uniform sampling. This produces more varied outputs at the cost of some quality.
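
A sketch of the selection step, assuming `candidates` is the list of (score, hypothesis) pairs produced by the expansion step; the function name is illustrative:

```python
import numpy as np

def stochastic_select(candidates, k, tau=1.0, rng=None):
    """Sample k candidates without replacement, with probability
    proportional to exp(score / tau), instead of taking the top-k."""
    rng = rng or np.random.default_rng()
    scores = np.array([score for score, _ in candidates])
    logits = scores / tau
    probs = np.exp(logits - logits.max())      # subtract max for numerical stability
    probs /= probs.sum()
    idx = rng.choice(len(candidates), size=min(k, len(candidates)),
                     replace=False, p=probs)
    return [candidates[i] for i in idx]
```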

Diverse beam search (Vijayakumar et al., 2016). Partition the beam into $G$ groups of size $k/G$. Within each group, run standard beam search. Between groups, add a diversity penalty that discourages a group from selecting hypotheses already selected by earlier groups. This produces outputs that cover different semantic or structural options.
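
A sketch of the per-group score adjustment, using the Hamming form of the diversity penalty (penalise a candidate once for every earlier group that already chose its next token at this step); `lambda_div` and the function name are illustrative:

```python
from collections import Counter

def diversity_penalised(candidates, earlier_group_tokens, lambda_div=0.5):
    """Hamming diversity: subtract lambda_div for each earlier group that
    already chose the candidate's next token at this decoding step.

    candidates: list of (score, hypothesis) where hypothesis[-1] is the
    token proposed at the current step.
    """
    counts = Counter(earlier_group_tokens)
    return [(score - lambda_div * counts[hyp[-1]], hyp)
            for score, hyp in candidates]
```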

Beam Search vs. Sampling

For neural language models and seq2seq systems, the alternative to beam search is sampling - drawing the next token from the model's distribution. Common variants (a combined code sketch follows the list):

  • Temperature sampling: scale logits by $1/\tau$ before softmax. $\tau < 1$ sharpens the distribution (more deterministic); $\tau > 1$ flattens it (more random).
  • Top-$k$ sampling: sample only from the top-$k$ tokens by probability.
  • Top-$p$ (nucleus) sampling: sample from the smallest set of tokens whose cumulative probability exceeds $p$.
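
All three variants can be expressed as transformations of the logits before drawing a token; the sketch below uses plain NumPy and assumes nothing about a particular model API:

```python
import numpy as np

def sample_next_token(logits, tau=1.0, top_k=None, top_p=None, rng=None):
    """Draw one token id from `logits` (1-D array of unnormalised scores)
    after optional temperature scaling, top-k, and top-p filtering."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / tau       # temperature scaling

    if top_k is not None:
        # Keep only the k highest-scoring tokens.
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits >= cutoff, logits, -np.inf)

    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_p is not None:
        # Smallest set of tokens whose cumulative probability reaches top_p.
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]
        filtered = np.zeros_like(probs)
        filtered[keep] = probs[keep]
        probs = filtered / filtered.sum()

    return rng.choice(len(probs), p=probs)
```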

Beam search maximises probability and tends to produce fluent, safe, and often repetitive outputs. Sampling introduces diversity and is preferred for open-ended generation (stories, dialogue). For tasks with a right answer (translation, summarisation, ASR), beam search generally outperforms sampling.

Examples

Neural Machine Translation. Major NMT systems (e.g., Google Translate, DeepL, Meta's NLLB) typically use beam search at inference time with beam widths around 4–5. The computation cost scales as $O(k \cdot T \cdot V)$ with sequence length $T$ and vocabulary size $V$, which is manageable with GPU batching.

Automatic Speech Recognition. ASR systems combine acoustic model scores (frame-level phone probabilities) with language model scores in a beam search over word lattices. Beam search allows the acoustic and language models to jointly constrain the search.

Summarisation. Abstractive summarisation with Transformer models uses beam search with length normalisation and no-repeat-$n$gram constraints to prevent degenerate looping outputs. Beam width 4 is standard.
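
A simplified sketch of the $n$-gram blocking check (not any library's exact implementation): before expanding a hypothesis, tokens that would complete an $n$-gram already present in the prefix are excluded.

```python
def banned_next_tokens(prefix, n=3):
    """Return the set of next tokens that would complete an n-gram
    already present in `prefix` (a list of token ids)."""
    if len(prefix) < n - 1:
        return set()
    context = tuple(prefix[-(n - 1):])       # the last n-1 tokens
    banned = set()
    for i in range(len(prefix) - n + 1):     # every n-gram window in the prefix
        if tuple(prefix[i : i + n - 1]) == context:
            banned.add(prefix[i + n - 1])
    return banned
```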

Code Generation. Models like Codex use beam search (or sampling) to generate candidate programs. Multiple beam hypotheses can be executed and tested, combining beam search with program execution as an external oracle - a form of tree search in program space.


Read Next: Approximation Algorithms