Beam Search - The Best Path You Can Afford to Search For // Megha Bose

Helpful context:

GPT-4 generates text one token at a time. At each step it could choose from 50,000 possible tokens. If you greedily pick the most likely token, you often get repetitive, low-quality text - the model commits to an early choice and can’t recover. If you explore all paths (breadth-first search), the tree grows exponentially - 50,000 possibilities at step 1, $50,000^2$ at step 2, completely impractical.

Beam search is the compromise used in virtually every language model, machine translation system, and speech recognizer: keep the $k$ best partial sequences at each step, discard the rest. $k$ is called the beam width. It’s a heuristic - there’s no optimality guarantee - but it works empirically, and understanding why requires thinking carefully about what “best” means in sequential search.

Three Strategies for Sequential Search

To make the tradeoffs concrete, consider building a sequence of length $T$ from a vocabulary of size $V$.

Greedy decoding. Keep only the single best token at each step. At step $t$, pick the token maximizing the model’s log-probability conditioned on all previous tokens. Fast - $O(T \cdot V)$ - but locally myopic. The greedy choice at step 1 may foreclose globally good completions.

Exhaustive search. Maintain all $V^T$ complete sequences, score each, return the best. Optimal, but completely impractical for $T = 100$ and $V = 50,000$.

Beam search. Maintain exactly $k$ “beam hypotheses” at each step - the $k$ best partial sequences by cumulative log-probability. At each step, expand all $k$ hypotheses by all $V$ possible next tokens, generating $k \cdot V$ candidates. Keep only the top $k$ by score. Continue until all hypotheses have produced an end-of-sequence token.

Beam search at each step:

Expand: from $k$ partial sequences, generate $k \cdot V$ one-step extensions.
Score: rank all $k \cdot V$ candidates by cumulative log-probability.
Prune: keep only the top $k$.
Repeat.

Total computation: $O(k \cdot T \cdot V)$ - linear in $k$ instead of exponential.

The Tradeoff

Beam width 1 = greedy decoding. Fast, myopic.

Beam width $k$ = keep the $k$ best partial sequences at each step. More exploration, higher quality, $k$ times more computation than greedy.

Beam width $\to \infty$ = exhaustive BFS. Optimal (finds the highest-probability complete sequence), exponential.

In practice for language generation, the quality gain from increasing beam width diminishes quickly. Going from $k=1$ (greedy) to $k=4$ is a large gain. Going from $k=4$ to $k=10$ is modest. Going from $k=10$ to $k=100$ is often negligible. This is the empirical justification for using beam widths in the range 4 - 10.

Length Normalization

Longer sequences accumulate more log-probability terms - each one negative - so raw log-probability is biased toward short sequences. Beam search without length normalization produces unnaturally short outputs.

Length normalization adjusts the score of a hypothesis $y = (y_1, \ldots, y_T)$ as:

$$\text{score}(y) = \frac{\log P(y \mid x)}{T^\alpha},$$

where $\alpha \in [0, 1]$ controls normalization strength. With $\alpha = 0$ there’s no correction. With $\alpha = 1$ you normalize fully by length. Values around $\alpha = 0.6$ - $0.7$ work well empirically for neural machine translation.

The intuition: dividing by $T^\alpha$ penalizes very short sequences (small denominator, but short sequences had fewer penalty terms) and rewards longer sequences by making each additional token “cheaper” in the score. You’re trading off length against probability in a calibrated way.

Diverse Beam Search

Standard beam search produces $k$ hypotheses that are usually very similar - they often share a long common prefix, since the top $k$ sequences at step 1 all started with the same top token or two. You get $k$ nearly-identical outputs instead of $k$ diverse alternatives.

Diverse beam search (Vijayakumar et al., 2016) fixes this by partitioning the beam into $G$ groups of size $k/G$. Within each group, run standard beam search. Between groups, add a diversity penalty discouraging a group from selecting hypotheses already selected by earlier groups. The penalty rewards Hamming distance from previously selected hypotheses.

The result: $G$ groups of beam hypotheses, each exploring a different region of the output space. For tasks where you want multiple genuinely different outputs (summarization with multiple styles, candidate generation for downstream reranking), diverse beam search is the standard approach.

Discomfort check. Beam search with beam width $k$ is not guaranteed to find any solution within $k$ times optimal. It’s a pure heuristic. Why use it?

Because it works empirically in structured search spaces. Language generation, machine translation, and speech recognition all have the property that the model’s probability distribution over next tokens is well-calibrated and locally smooth. Good completions tend to follow from good prefixes. The beam search assumption - that the best completions of a sequence are the completions of the best partial sequences - is usually approximately correct.

The alternatives are either too slow (exact search) or too low quality (greedy). Beam search sits in the practical sweet spot. In machine translation specifically, BLEU scores are typically maximized around beam width 4 - 8; beyond that, quality often slightly decreases (a phenomenon called “beam search curse” - very high-probability sequences tend to be short and generic, so larger beams paradoxically find worse outputs in some settings).

Beam Search vs. Sampling

The alternative to beam search for language models is sampling - drawing the next token from the model’s distribution rather than taking the argmax.

Temperature sampling. Scale logits by $1/\tau$ before softmax. Low $\tau < 1$ sharpens the distribution (more deterministic). High $\tau > 1$ flattens it (more diverse, more random).

Top-$k$ sampling. Sample only from the $k$ most probable tokens. Prevents sampling from very low-probability tail tokens.

Nucleus sampling (top-$p$). Sample from the smallest set of tokens whose cumulative probability exceeds $p$. Adapts the effective vocabulary size to the local distribution shape.

Beam search maximizes probability - it finds the most likely sequence. Sampling draws from the distribution - it finds a typical sequence. For tasks with a definite correct answer (translation, transcription, structured prediction), beam search consistently wins. For open-ended creative generation (stories, dialogue, brainstorming), sampling wins - beam search produces repetitive, safe outputs.

Applications

Neural machine translation. Every major NMT system - Google Translate, DeepL, Meta’s NLLB - uses beam search at inference time with beam width 4 - 5. The computation cost $O(k \cdot T \cdot V)$ is easily parallelized on GPU with batched matrix multiplications.

Speech recognition. Acoustic models produce per-frame phone probabilities; beam search over word lattices combines acoustic model scores with language model scores. Both constrain the search jointly, and beam search allows this combination to operate efficiently.

Theorem proving. Neural theorem provers (like AlphaCode for competition math, or systems based on Lean) use beam search over proof step sequences, keeping multiple partial proofs alive simultaneously to avoid committing to one proof strategy too early.

Robot motion planning. Beam search over discretized action sequences finds near-optimal plans in structured environments faster than exhaustive search.

Summary

Strategy	Beam width	Computation	Optimality	Use when
Greedy decoding	1	$O(T \cdot V)$	No guarantee	Fastest inference, tolerant of quality loss
Beam search	$k$	$O(k \cdot T \cdot V)$	No guarantee (heuristic)	Standard NLP inference
Diverse beam search	$k$ with $G$ groups	$O(k \cdot T \cdot V)$	No guarantee	Need varied outputs
Sampling	-	$O(T \cdot V)$	None	Open-ended generation
Exhaustive BFS	$\infty$	$O(V^T)$	Optimal	Only for tiny $T, V$

Beam search is the practical answer to sequential combinatorial search. It can’t promise optimality - but it concentrates computation on the most promising partial solutions, and for structured prediction tasks with smooth probability landscapes, that’s usually enough. The beam width is a single knob trading off computation against quality, and empirically, a small $k$ captures most of the quality gain over greedy.

Read next:

BFGS - When First-Order Gradients Aren’t Fast Enough