
Language Models: The Formal Setup

A language model assigns a probability to every sequence of words. By the chain rule of probability, any joint distribution over $(w_1, \ldots, w_T)$ factors as:

$$P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^T P(w_t \mid w_1, w_2, \ldots, w_{t-1})$$

This is not an approximation - it is an exact identity. The challenge is estimating each conditional $P(w_t \mid w_{<t})$ from finite data for arbitrarily long histories $w_{<t}$.

N-Gram Models

The simplest approximation is the Markov assumption: the next word depends only on the previous $n-1$ words.

$$P(w_t \mid w_{<t}) \approx P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$$

Bigram ($n=2$) and trigram ($n=3$) models estimate probabilities from corpus counts: $P(w_t \mid w_{t-1}) = C(w_{t-1}, w_t) / C(w_{t-1})$. These models have two fundamental problems. First, data sparsity: most $n$-gram sequences never appear in training, so raw count-based estimates are zero for most of the test distribution. Second, fixed context: trigrams cannot capture dependencies between words separated by more than 2 positions.
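These count-based estimates can be computed directly from a corpus. A minimal bigram sketch in Python, using a toy corpus for illustration (a real model needs vastly more data, which is exactly the sparsity problem described above):

```python
from collections import Counter

# Toy corpus; real n-gram models are estimated from millions of tokens.
corpus = "the dog chased the cat and the cat ran".split()

# Count contexts and adjacent word pairs.
unigrams = Counter(corpus[:-1])              # context counts C(w_{t-1})
bigrams = Counter(zip(corpus, corpus[1:]))   # pair counts C(w_{t-1}, w_t)

def bigram_prob(prev, word):
    """MLE estimate P(word | prev) = C(prev, word) / C(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("the", "cat"))   # "the cat" occurs 2x, "the" 3x -> 2/3
print(bigram_prob("dog", "ran"))   # never observed -> 0.0 (sparsity)
```

The second query shows the sparsity failure in miniature: "dog ran" is perfectly plausible English, but the raw MLE assigns it zero probability.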

Kneser-Ney smoothing addresses sparsity through interpolation. The key insight is the distinction between frequency and versatility - “Francisco” is common but almost exclusively follows “San,” so it should not be highly weighted as a context-independent completion. Kneser-Ney assigns probabilities based on the number of distinct contexts a word appears in, not just its raw frequency:

$$P_{\text{KN}}(w \mid h) = \frac{\max(C(h,w) - d, 0)}{C(h)} + \lambda(h)\, P_{\text{KN}}(w)$$

where $d \in (0,1)$ is a discount and $\lambda(h)$ ensures the distribution sums to 1. Kneser-Ney smoothing with $n=5$ was the dominant language modeling approach before neural networks.
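The interpolated formula can be sketched for the bigram case. The discount $d = 0.75$ and the toy corpus below are illustrative choices, not prescriptions; the lower-order term is the continuation probability (distinct preceding contexts), which is the "Francisco" fix described above:

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, d=0.75):
    """Interpolated Kneser-Ney for bigrams: discounted counts plus
    a continuation-probability lower-order term."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])

    # In how many distinct contexts does each word appear?
    preceding = defaultdict(set)
    for (h, w) in bigrams:
        preceding[w].add(h)
    total_bigram_types = len(bigrams)

    def p_continuation(w):
        return len(preceding[w]) / total_bigram_types

    def p_kn(w, h):
        c_h = context_counts[h]
        if c_h == 0:
            return p_continuation(w)                  # back off entirely
        discounted = max(bigrams[(h, w)] - d, 0) / c_h
        # lambda(h) redistributes exactly the mass removed by the discount
        lam = d * len({w2 for (h2, w2) in bigrams if h2 == h}) / c_h
        return discounted + lam * p_continuation(w)

    return p_kn
```

A useful sanity check is that the resulting conditional distribution sums to 1 over the vocabulary, since $\lambda(h)$ hands back precisely the discounted mass.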

Neural Language Models

A neural language model uses a neural network to predict the conditional distribution over vocabulary given context. The standard architecture: embed the context tokens, process them with a neural network (RNN or Transformer), and project to a softmax distribution over vocabulary:

$$P(w_t \mid w_{<t}) = \text{softmax}(W_o h_t)_{w_t}$$

where $h_t$ is the final hidden representation at position $t$. Unlike $n$-gram models, neural LMs share statistical strength across similar contexts via learned embeddings - “the dog chased” and “a dog chased” produce similar $h_t$ because “dog” and “a/the” embed nearby.
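A toy NumPy sketch of this pipeline - randomly initialized embeddings, a one-layer RNN, and a softmax output projection. In practice every parameter here is learned by gradient descent; the random values only illustrate the shapes and dataflow:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 10, 4          # toy vocabulary size and hidden/embedding size

E = rng.normal(size=(V, D))        # token embeddings
W_h = rng.normal(size=(D, D))      # recurrent weights
W_o = rng.normal(size=(V, D))      # output projection

def softmax(z):
    z = z - z.max()                # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def next_token_dist(context_ids):
    """Run a minimal RNN over the context, then project the final
    hidden state: P(w_t | w_{<t}) = softmax(W_o h_t)."""
    h = np.zeros(D)
    for tok in context_ids:
        h = np.tanh(W_h @ h + E[tok])
    return softmax(W_o @ h)

p = next_token_dist([1, 5, 3])     # a distribution over all V tokens
```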

Perplexity

Perplexity is the standard evaluation metric for language models. It is the exponentiated cross-entropy of the model on a held-out test sequence:

$$\text{PP}(w_{1:T}) = \exp\!\left(-\frac{1}{T}\sum_{t=1}^T \log P(w_t \mid w_{<t})\right)$$

The exponent is the empirical cross-entropy of the model on the test sequence; if logarithms are taken base 2, perplexity equals $2^H$ with $H$ the cross-entropy in bits. Perplexity is the geometric mean of the inverse per-token probability - lower is better. A perplexity of $k$ means the model is, on average, as uncertain as if it had to choose uniformly among $k$ options at each step.

A uniform model over a vocabulary of 50,000 words has perplexity 50,000. A good 5-gram model achieves around 100-150 on Penn Treebank. Modern large language models achieve perplexities below 10 on standard benchmarks.
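Computing perplexity from per-token log-probabilities is a one-liner; the sanity check below reproduces the uniform-model figure of 50,000:

```python
import math

def perplexity(log_probs):
    """Perplexity from per-token log-probabilities (natural log):
    exp of the average negative log-likelihood."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A uniform model over 50,000 words assigns log(1/50000) to every token.
uniform_lp = [math.log(1 / 50_000)] * 100
print(perplexity(uniform_lp))    # ~50000, matching the text
```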

Causal vs Masked Language Modeling

Causal (autoregressive) language modeling predicts $w_t$ from $w_{<t}$ only - the future is masked. This is the natural objective for generation: GPT and its successors use left-to-right causal LM. The model is trained to minimize $-\sum_t \log P(w_t \mid w_{<t})$ over the training corpus.

Masked language modeling (MLM), introduced in BERT, randomly masks 15% of tokens and trains the model to predict them from both left and right context:

$$\mathcal{L}_{\text{MLM}} = -\mathbb{E}\left[\sum_{t \in \mathcal{M}} \log P(w_t \mid w_{\setminus \mathcal{M}})\right]$$

where $\mathcal{M}$ is the set of masked positions. MLM produces bidirectional representations that are powerful for understanding tasks (NER, QA, classification), but the model cannot be directly used for left-to-right generation.

Autoregressive Generation Strategies

At inference, autoregressive generation produces one token at a time, conditioning each new token on all previously generated tokens.

Greedy decoding takes $\hat{w}_t = \arg\max_w P(w \mid w_{<t})$ at each step. It is fast but suboptimal - locally optimal choices can lead to globally poor sequences.

Beam search maintains a beam of $B$ partial hypotheses at each step, keeping the $B$ highest-probability continuations. At the end, the highest-scoring complete sequence is returned. Beam search with $B=4$ to $10$ is standard in neural machine translation. A known failure mode: it produces repetitive, overly generic text for open-ended generation tasks.
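A generic beam search can be sketched as follows. Here `step_fn` is a hypothetical stand-in for a model that scores one-token continuations of a partial sequence; only the bookkeeping of the beam is shown:

```python
import math

def beam_search(step_fn, start, beam_size=4, max_len=10, eos=None):
    """Keep the `beam_size` highest log-probability partial hypotheses
    at every step. `step_fn(seq)` returns (token, log_prob) pairs."""
    beams = [(0.0, [start])]                         # (log-prob, sequence)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if eos is not None and seq[-1] == eos:
                candidates.append((score, seq))      # finished hypothesis
                continue
            for tok, lp in step_fn(seq):
                candidates.append((score + lp, seq + [tok]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return max(beams, key=lambda c: c[0])[1]
```

With $B=1$ this reduces to greedy decoding; larger beams trade compute for a wider search.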

Temperature sampling scales logits before softmax: $P_\tau(w) \propto \exp(z_w / \tau)$. At $\tau = 1$, sampling is from the model distribution. As $\tau \to 0$, it approaches greedy decoding. At $\tau > 1$, the distribution flattens toward uniform, increasing diversity at the cost of coherence.
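A minimal implementation of temperature scaling over a logit vector, showing the sharpening and flattening behavior described above:

```python
import numpy as np

def temperature_probs(logits, tau):
    """Scale logits by 1/tau before softmax: P(w) proportional to exp(z_w / tau)."""
    z = np.asarray(logits, dtype=float) / tau
    z -= z.max()                     # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.1]
print(temperature_probs(logits, 1.0))   # the model's own distribution
print(temperature_probs(logits, 0.2))   # sharpened toward the argmax
print(temperature_probs(logits, 5.0))   # flattened toward uniform
```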

Top-$k$ sampling restricts sampling to the $k$ most probable tokens, renormalizing over them. This avoids low-probability nonsense words, but a fixed $k$ is suboptimal because the vocabulary’s probability mass is spread differently in different contexts.

Top-$p$ (nucleus) sampling (Holtzman et al., 2020) instead finds the smallest set $V^{(p)}$ such that $\sum_{w \in V^{(p)}} P(w \mid w_{<t}) \ge p$, then samples from this nucleus. This adapts to context: when the model is confident, the nucleus is small; when uncertain, it is larger. Nucleus sampling with $p = 0.9$ or $0.95$ produces text that human raters prefer to both greedy decoding and beam search for open-ended generation.
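Both truncation schemes amount to zeroing out part of the distribution and renormalizing; a NumPy sketch of each, with an illustrative probability vector:

```python
import numpy as np

def top_k_probs(probs, k):
    """Keep the k most probable tokens and renormalize."""
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[-k:]              # indices of the top-k tokens
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p_probs(probs, p):
    """Nucleus sampling: smallest set of tokens whose total mass >= p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]            # descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1       # size of the nucleus
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

probs = [0.5, 0.3, 0.1, 0.06, 0.04]
print(top_k_probs(probs, 2))     # mass only on the top-2 tokens
print(top_p_probs(probs, 0.85))  # nucleus = top-3 tokens (mass 0.9 >= 0.85)
```

Note how the nucleus size falls out of the cumulative sum rather than being fixed in advance, which is exactly the context-adaptivity argument for top-$p$ over top-$k$.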

Sequence-to-Sequence Models

Sequence-to-sequence (seq2seq) models transform one sequence into another of potentially different length and vocabulary. The encoder-decoder architecture encodes the source with an encoder RNN, then decodes with a separate decoder RNN conditioned on the encoder’s output.

Teacher forcing is the standard training procedure: at each decoder step, the ground-truth previous token $y_{t-1}$ is fed as input, even if the model’s own prediction $\hat{y}_{t-1}$ was wrong. This makes training stable and efficient because errors do not compound.

However, at inference the model receives its own predictions as input. This exposure bias creates a train-test mismatch: the model has never learned to recover from its own mistakes. Scheduled sampling (Bengio et al., 2015) addresses this by gradually replacing teacher-forced inputs with model predictions during training.
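The per-step coin flip at the heart of scheduled sampling can be sketched as follows; `model_predict` is a hypothetical stand-in for a full decoder step, and only the input-selection logic is shown:

```python
import random

def decoder_inputs(gold, model_predict, sampling_prob):
    """Build decoder inputs for one training sequence. At each step,
    feed the model's own prediction with probability `sampling_prob`,
    otherwise the ground-truth token (teacher forcing). The input at
    step t is used to predict gold[t + 1]."""
    inputs = [gold[0]]                         # start token is always given
    for t in range(1, len(gold) - 1):
        if random.random() < sampling_prob:
            inputs.append(model_predict(inputs[-1]))  # model's own output
        else:
            inputs.append(gold[t])                    # teacher forcing
    return inputs

gold = ["<s>", "the", "dog", "ran"]
# sampling_prob=0.0 recovers pure teacher forcing: inputs are gold[:-1].
print(decoder_inputs(gold, lambda w: "<unk>", 0.0))
```

In Bengio et al.'s schedule, `sampling_prob` starts near 0 and is annealed upward over training, so the model is gradually exposed to its own predictions.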

Examples

GPT as an autoregressive language model. GPT is a Transformer decoder trained on the causal language modeling objective. At each position, the self-attention is masked so that position $t$ can only attend to positions $\le t$ - preserving the autoregressive property. Despite its conceptual simplicity (just next-token prediction), scaling GPT to billions of parameters and trillions of tokens yields a model capable of in-context learning, translation, arithmetic, and code generation - all as instances of next-token prediction.
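The causal mask itself is just a lower-triangular matrix applied to the attention scores before the softmax. A NumPy sketch with random scores standing in for the query-key products:

```python
import numpy as np

T = 5
# Causal mask: position t may attend only to positions <= t.
mask = np.tril(np.ones((T, T), dtype=bool))

scores = np.random.default_rng(0).normal(size=(T, T))   # stand-in q.k scores
scores = np.where(mask, scores, -np.inf)    # block attention to the future

# Row-wise softmax: masked positions get exp(-inf) = 0 attention weight.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
```

Each row of `weights` is a valid distribution over past positions only, which is what preserves the autoregressive factorization during training.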

Perplexity on held-out text. A concrete evaluation: train a trigram model and a small GPT-2 (117M parameters) on WikiText-103 (103M training tokens). The trigram model achieves perplexity around 170; GPT-2 achieves around 18. To verify no data contamination, evaluate on a held-out Wikipedia article published after the training data cutoff - the gap should remain similar. Perplexity compresses model quality into a single number but does not measure factual accuracy, harmlessness, or fluency as humans perceive it. This gap between perplexity and human judgment motivates RLHF and other alignment-oriented training objectives.
