Sequence Modeling & Language Models
Language Models: The Formal Setup
A language model assigns a probability to every sequence of words. By the chain rule of probability, any joint distribution over $(w_1, \ldots, w_T)$ factors as:
$$P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^T P(w_t \mid w_1, w_2, \ldots, w_{t-1})$$
This is not an approximation - it is an exact identity. The challenge is estimating each conditional $P(w_t \mid w_{<t})$ from finite data for arbitrarily long histories $w_{<t}$.
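For instance, a three-word sentence factors term by term as
$$P(\text{the}, \text{cat}, \text{sat}) = P(\text{the}) \, P(\text{cat} \mid \text{the}) \, P(\text{sat} \mid \text{the}, \text{cat})$$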
N-Gram Models
The simplest approximation is the Markov assumption: the next word depends only on the previous $n-1$ words.
$$P(w_t \mid w_{<t}) \approx P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$$
Bigram ($n=2$) and trigram ($n=3$) models estimate probabilities from corpus counts: $P(w_t \mid w_{t-1}) = C(w_{t-1}, w_t) / C(w_{t-1})$. These models have two fundamental problems. First, data sparsity: most $n$-gram sequences never appear in training, so raw count-based estimates are zero for most of the test distribution. Second, fixed context: trigrams cannot capture dependencies between words separated by more than 2 positions.
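As a concrete illustration of count-based estimation, here is a minimal sketch in Python (the toy corpus and the `<s>`/`</s>` boundary tokens are made up for the example):

```python
from collections import Counter, defaultdict

corpus = ["the dog chased the cat", "the cat sat", "a dog chased a ball"]

unigram_counts = Counter()
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, curr in zip(tokens, tokens[1:]):
        unigram_counts[prev] += 1
        bigram_counts[prev][curr] += 1

def bigram_prob(prev, curr):
    """Maximum-likelihood estimate P(curr | prev) = C(prev, curr) / C(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[prev][curr] / unigram_counts[prev]

print(bigram_prob("the", "dog"))   # 1/3: "the" occurs 3 times as context, "the dog" once
print(bigram_prob("the", "sat"))   # 0.0: unseen bigram -> the data sparsity problem
```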
Kneser-Ney smoothing addresses sparsity through interpolation. The key insight is the distinction between frequency and versatility - “Francisco” is common but almost exclusively follows “San,” so it should not be highly weighted as a context-independent completion. Kneser-Ney assigns probabilities based on the number of distinct contexts a word appears in, not just its raw frequency:
$$P_{\text{KN}}(w \mid h) = \frac{\max(C(h,w) - d, 0)}{C(h)} + \lambda(h) \, P_{\text{KN}}(w)$$
where $d \in (0,1)$ is a discount and $\lambda(h)$ ensures the distribution sums to 1. Kneser-Ney smoothing with $n=5$ was the dominant language modeling approach before neural networks.
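A minimal sketch of the bigram case, reusing the count dictionaries built in the example above (a full implementation recurses over higher orders and tunes the discount):

```python
def kneser_ney_bigram(prev, curr, unigram_counts, bigram_counts, d=0.75):
    """Interpolated Kneser-Ney, bigram case (sketch).

    bigram_counts maps context -> Counter of following words, as built above.
    """
    # Lower-order term measures versatility, not frequency: in how many distinct
    # contexts does `curr` appear, relative to all distinct bigram types?
    total_bigram_types = sum(len(followers) for followers in bigram_counts.values())
    continuation = sum(1 for followers in bigram_counts.values() if curr in followers)
    p_continuation = continuation / total_bigram_types

    c_prev = unigram_counts[prev]
    if c_prev == 0:                                   # unseen context: back off entirely
        return p_continuation
    discounted = max(bigram_counts[prev][curr] - d, 0) / c_prev
    lam = d * len(bigram_counts[prev]) / c_prev       # mass reserved for the backoff term
    return discounted + lam * p_continuation
```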
Neural Language Models
A neural language model uses a neural network to predict the conditional distribution over vocabulary given context. The standard architecture: embed the context tokens, process them with a neural network (RNN or Transformer), and project to a softmax distribution over vocabulary:
$$P(w_t \mid w_{<t}) = \text{softmax}(W_o h_t)_{w_t}$$
where $h_t$ is the final hidden representation at position $t$. Unlike $n$-gram models, neural LMs share statistical strength across similar contexts via learned embeddings - “the dog chased” and “a dog chased” produce similar $h_t$ because “a” and “the” have nearby embeddings.
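A minimal PyTorch sketch of this embed-process-project pipeline; the LSTM backbone, vocabulary size, and dimensions are illustrative choices, not a prescribed architecture:

```python
import torch
import torch.nn as nn

class NeuralLM(nn.Module):
    def __init__(self, vocab_size=50_000, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)        # token id -> vector
        self.backbone = nn.LSTM(d_model, d_model, batch_first=True)
        self.proj = nn.Linear(d_model, vocab_size)            # hidden state -> logits

    def forward(self, token_ids):                             # (batch, seq_len)
        h, _ = self.backbone(self.embed(token_ids))           # (batch, seq_len, d_model)
        return self.proj(h)                                    # logits over the vocabulary

model = NeuralLM()
logits = model(torch.randint(0, 50_000, (2, 16)))
probs = torch.softmax(logits, dim=-1)                          # P(w_t | w_{<t}) at each position
```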
Perplexity
Perplexity is the standard evaluation metric for language models. It is the exponentiated cross-entropy of the model on a held-out test sequence:
$$\text{PP}(w_{1:T}) = \exp\!\left(-\frac{1}{T}\sum_{t=1}^T \log P(w_t \mid w_{<t})\right)$$
The exponent is the cross-entropy (in nats) between the empirical distribution of the test data and the model. Perplexity equals the inverse of the geometric mean of the per-token probabilities - lower is better. A perplexity of $k$ means the model is, on average, as confused as if it had to choose uniformly among $k$ options at each step.
A uniform model over a vocabulary of 50,000 words has perplexity 50,000. A good 5-gram model achieves around 100-150 on Penn Treebank. Modern large language models achieve perplexities below 10 on standard benchmarks.
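Computing perplexity from per-token log-probabilities is straightforward; in this sketch, `log_probs` is assumed to hold the natural-log model probability of each held-out token:

```python
import math

def perplexity(log_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Uniform model over a 50,000-word vocabulary: every token has probability 1/50,000.
uniform = [math.log(1 / 50_000)] * 100
print(perplexity(uniform))   # ~50000.0, matching the uniform-model figure above
```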
Causal vs Masked Language Modeling
Causal (autoregressive) language modeling predicts $w_t$ from $w_{<t}$ only - the future is masked. This is the natural objective for generation: GPT and its successors use left-to-right causal LM. The model is trained to minimize $-\sum_t \log P(w_t \mid w_{<t})$ over the training corpus.
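In code, the causal objective is typically implemented by shifting the sequence one position and applying cross-entropy; a sketch, reusing the logits shape from the neural-LM example above:

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits, token_ids):
    """Next-token prediction: the logits at position t are scored against token t+1."""
    shift_logits = logits[:, :-1, :]             # predictions for positions 1..T-1
    shift_targets = token_ids[:, 1:]             # the tokens actually at those positions
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
    )

# Causal attention mask for a Transformer: position t may attend only to positions <= t.
causal_mask = torch.tril(torch.ones(8, 8)).bool()
```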
Masked language modeling (MLM), introduced in BERT, randomly masks 15% of tokens and trains the model to predict them from both left and right context:
$$\mathcal{L}_{\text{MLM}} = -\mathbb{E}\left[\sum_{t \in \mathcal{M}} \log P(w_t \mid w_{\setminus \mathcal{M}})\right]$$
where $\mathcal{M}$ is the set of masked positions. MLM produces bidirectional representations that are powerful for understanding tasks (NER, QA, classification), but the model cannot be directly used for left-to-right generation.
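A sketch of BERT-style input corruption; the 15% rate and the 80/10/10 split follow the original recipe, while `mask_id` and `vocab_size` are placeholders supplied by the tokenizer:

```python
import torch

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15):
    """Corrupt inputs for MLM: 80% [MASK], 10% random token, 10% unchanged."""
    labels = token_ids.clone()
    selected = torch.rand(token_ids.shape) < mask_prob       # positions the model must predict
    labels[~selected] = -100                                  # ignored by PyTorch cross-entropy

    corrupted = token_ids.clone()
    roll = torch.rand(token_ids.shape)
    corrupted[selected & (roll < 0.8)] = mask_id                        # 80%: replace with [MASK]
    random_positions = selected & (roll >= 0.8) & (roll < 0.9)          # 10%: replace with random token
    corrupted[random_positions] = torch.randint(0, vocab_size, (int(random_positions.sum()),))
    return corrupted, labels                                            # remaining 10%: left unchanged
```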
Autoregressive Generation Strategies
At inference, autoregressive generation samples one token at a time, conditioning each new token on all previously generated tokens.
Greedy decoding takes $\hat{w}_t = \arg\max_w P(w \mid w_{<t})$ at each step. It is fast but suboptimal - locally optimal choices can lead to globally poor sequences.
Beam search maintains a beam of $B$ partial hypotheses at each step, keeping the $B$ highest-probability continuations. At the end, the highest-scoring complete sequence is returned. Beam search with $B=4$ to $10$ is standard in neural machine translation. A known failure mode: it produces repetitive, overly generic text for open-ended generation tasks.
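A sketch of beam search over a generic scoring function; `next_log_probs(prefix)` is an assumed callable returning one log-probability per vocabulary item, and real implementations additionally batch hypotheses and apply length normalization:

```python
def beam_search(next_log_probs, bos, eos, beam_size=4, max_len=20):
    """Keep the beam_size highest-scoring hypotheses at each step."""
    beams = [([bos], 0.0)]                        # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = [b for b in beams if b[0][-1] == eos]   # finished hypotheses keep competing
        expanded = False
        for seq, score in beams:
            if seq[-1] == eos:
                continue
            expanded = True
            for token, lp in enumerate(next_log_probs(seq)):
                candidates.append((seq + [token], score + lp))
        if not expanded:                          # every hypothesis has emitted eos
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return max(beams, key=lambda c: c[1])[0]      # highest-scoring complete sequence
```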
Temperature sampling scales logits before softmax: $P_\tau(w) \propto \exp(z_w / \tau)$. At $\tau = 1$, sampling is from the model distribution. As $\tau \to 0$, it approaches greedy decoding. At $\tau > 1$, the distribution flattens toward uniform, increasing diversity at the cost of coherence.
Top-$k$ sampling restricts sampling to the $k$ most probable tokens, renormalizing over them. This avoids low-probability nonsense words, but a fixed $k$ is suboptimal because the vocabulary’s probability mass is spread differently in different contexts.
Top-$p$ (nucleus) sampling (Holtzman et al., 2020) instead finds the smallest set $V^{(p)}$ such that $\sum_{w \in V^{(p)}} P(w \mid w_{<t}) \ge p$, then samples from this nucleus. This adapts to context: when the model is confident, the nucleus is small; when uncertain, it is larger. Nucleus sampling with $p = 0.9$ or $0.95$ produces text that human raters prefer to both greedy decoding and beam search for open-ended generation.
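All three strategies are small transformations of the logits before sampling (greedy decoding is simply the argmax of the same logits). A NumPy sketch:

```python
import numpy as np

def sample(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample a token id with temperature scaling, top-k, and/or nucleus filtering."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_k is not None:                                    # keep only the k most probable tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)

    if top_p is not None:                                    # smallest set with cumulative mass >= p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        nucleus = np.zeros_like(probs)
        nucleus[keep] = probs[keep]
        probs = nucleus

    probs /= probs.sum()                                     # renormalize over surviving tokens
    return np.random.choice(len(probs), p=probs)
```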
Sequence-to-Sequence Models
Sequence-to-sequence (seq2seq) models transform one sequence into another of potentially different length and vocabulary. The encoder-decoder architecture encodes the source with an encoder RNN, then decodes with a separate decoder RNN conditioned on the encoder’s output.
Teacher forcing is the standard training procedure: at each decoder step, the ground-truth previous token $y_{t-1}$ is fed as input, even if the model’s own prediction $\hat{y}_{t-1}$ was wrong. This makes training stable and efficient because errors do not compound.
However, at inference the model receives its own predictions as input. This exposure bias creates a train-test mismatch: the model has never learned to recover from its own mistakes. Scheduled sampling (Bengio et al., 2015) addresses this by gradually replacing teacher-forced inputs with model predictions during training.
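A sketch of a decoder training loop covering both regimes; `decoder_step` and the tensor shapes are assumptions for illustration:

```python
import random
import torch

def decode_with_scheduled_sampling(decoder_step, targets, state, teacher_forcing_ratio=1.0):
    """Run the decoder over a target sequence.

    teacher_forcing_ratio=1.0 is pure teacher forcing (ground-truth inputs);
    lowering it over the course of training is scheduled sampling.
    """
    inputs = targets[:, 0]                        # start-of-sequence token for every example
    logits_per_step = []
    for t in range(1, targets.size(1)):
        logits, state = decoder_step(inputs, state)
        logits_per_step.append(logits)
        use_ground_truth = random.random() < teacher_forcing_ratio
        inputs = targets[:, t] if use_ground_truth else logits.argmax(dim=-1)
    return torch.stack(logits_per_step, dim=1)    # (batch, seq_len - 1, vocab)
```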
Examples
GPT as an autoregressive language model. GPT is a Transformer decoder trained on the causal language modeling objective. At each position, the self-attention is masked so that position $t$ can only attend to positions $\le t$ - preserving the autoregressive property. Despite its conceptual simplicity (just next-token prediction), scaling GPT to billions of parameters and trillions of tokens yields a model capable of in-context learning, translation, arithmetic, and code generation - all as instances of next-token prediction.
Perplexity on held-out text. A concrete evaluation: train a trigram model and a small GPT-2 (117M parameters) on WikiText-103 (103M training tokens). The trigram model achieves perplexity around 170; GPT-2 achieves around 18. To verify no data contamination, evaluate on a held-out Wikipedia article published after the training data cutoff - the gap should remain similar. Perplexity compresses model quality into a single number but does not measure factual accuracy, harmlessness, or fluency as humans perceive it. This gap between perplexity and human judgment motivates RLHF and other alignment-oriented training objectives.