The original transformer had two halves: an encoder and a decoder. The encoder reads the input; the decoder generates the output. This made intuitive sense for translation - the encoder reads French, the decoder writes English. For five years, variations on this design dominated NLP. Then, quietly, the encoder disappeared from most state-of-the-art language models. GPT-3, LLaMA, Gemma, Mistral, Falcon, Qwen - all decoder-only. The question worth understanding is why.


The Three Architectures

Modern transformer variants fall into three families, distinguished by how they use attention masking.

Encoder-only (BERT, RoBERTa, DeBERTa): Every token attends to every other token - bidirectional attention. The encoder reads the full input and produces contextual representations of each token. No token generation: these models output representations, not sequences. Suited for classification, named entity recognition, extractive question answering - tasks where you need rich understanding of a fixed input.

Encoder-decoder (T5, BART, original transformer): An encoder with bidirectional attention processes the input; a decoder with causal (left-to-right) attention generates the output, cross-attending to the encoder’s representations at each layer. The encoder sees the full source; the decoder generates the target one token at a time. Suited for tasks with distinct input and output sequences: translation, summarization, document-to-answer.

Decoder-only (GPT, LLaMA, Gemma, Mistral): A single transformer stack where every token attends only to past tokens - a causal mask prevents attending to future positions. There is no encoder; input and output are in the same sequence. The model generates text by continuing whatever has been given to it. Suited for language modeling, instruction following, generation in general.


The Causal Mask

The mechanism that makes decoder-only work is the causal attention mask. In standard self-attention, the score matrix $A \in \mathbb{R}^{n \times n}$ has $A_{ij}$ measuring how strongly token $i$ attends to token $j$ before normalization. In causal attention, entries where $j > i$ are set to $-\infty$ before the softmax, so those positions receive exactly zero attention weight (since $e^{-\infty} = 0$):

$$A_{ij} = \begin{cases} \frac{q_i \cdot k_j}{\sqrt{d_k}} & j \leq i \\ -\infty & j > i \end{cases}$$

This lower-triangular mask enforces the property that each token’s representation depends only on tokens to its left. This is what allows a decoder-only model to be trained efficiently: you feed in the full sequence and compute all positions in parallel during training, but each position only sees its causal context. The causal mask makes training equivalent to processing all prefixes simultaneously.
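To make the masking concrete, here is a minimal sketch in plain Python (no dependencies) that computes causal attention weights for a toy sequence. The function name `causal_attention_weights` and the toy `Q` and `K` values are illustrative assumptions, and the learned projections are assumed to have been applied already.

```python
import math

def softmax(row):
    # Numerically stable softmax; math.exp(-inf) evaluates to 0.0.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(Q, K):
    """Return the n x n matrix of attention weights under a causal mask."""
    n, d_k = len(Q), len(Q[0])
    weights = []
    for i in range(n):
        scores = []
        for j in range(n):
            if j <= i:
                dot = sum(q * k for q, k in zip(Q[i], K[j]))
                scores.append(dot / math.sqrt(d_k))
            else:
                scores.append(-math.inf)  # future position: masked out
        weights.append(softmax(scores))
    return weights

# Toy queries and keys for a 3-token sequence (illustrative values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = causal_attention_weights(Q, K)
for row in W:
    print([round(w, 3) for w in row])
```

Every row sums to 1 and every entry above the diagonal is exactly zero, so row $i$ is a distribution over positions $0, \dots, i$ only.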

Discomfort check. If the decoder can only look left, how does it understand the full context of a long prompt? It doesn’t need to - by the time it is generating the first output token, all prompt tokens are in its causal context. The decoder sees the full prompt (leftward) when generating the first output token, then sees the full prompt plus generated token 1 when generating token 2, and so on. The constraint is one-directional: the model cannot “look ahead” to future output tokens, which is what we want for autoregressive generation.


Cross-Attention: What the Encoder Adds

In an encoder-decoder model, the decoder has two types of attention at each layer:

  1. Causal self-attention over previously generated tokens
  2. Cross-attention to the encoder output:

$$\text{CrossAttention}(Q, K_{enc}, V_{enc}) = \text{softmax}\left(\frac{Q W_Q \cdot (K_{enc} W_K)^\top}{\sqrt{d_k}}\right) V_{enc} W_V$$

The queries $Q$ come from the decoder’s current hidden state; keys and values come from the encoder output. Every decoder layer can directly access any encoder position via cross-attention.
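As a rough illustration of the data flow, the sketch below implements single-head cross-attention in plain Python. It assumes the projections $W_Q$, $W_K$, $W_V$ have already been applied, so the inputs are the projected queries, keys and values; the function name `cross_attention` and the toy numbers are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(dec_queries, enc_keys, enc_values):
    """Each decoder position attends over ALL encoder positions (no mask)."""
    d_k = len(enc_keys[0])
    d_v = len(enc_values[0])
    outputs = []
    for q in dec_queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in enc_keys]
        w = softmax(scores)
        # Weighted average of encoder values.
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, enc_values))
                        for c in range(d_v)])
    return outputs

# 2 decoder positions attending over 3 encoder positions (toy values).
dec_queries = [[1.0, 0.0], [0.0, 1.0]]
enc_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
enc_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
out = cross_attention(dec_queries, enc_keys, enc_values)
```

The output has one row per decoder position, and its width comes from the encoder values. Note that the source length and target length are independent of each other.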

This explicit separation of “reading” (encoder) and “writing” (decoder) is the core architectural distinction. The encoder builds rich bidirectional representations of the source; the decoder uses these via cross-attention while generating the target.

Discomfort check. If cross-attention lets the decoder attend to the encoder at every layer, doesn’t that put the decoder-only model at a structural disadvantage? Yes, in a strict sense. A decoder-only model must reconstruct the “reading” phase implicitly within its causal self-attention layers: the prompt tokens in the context serve as a de facto encoder, but their representations are themselves computed under the causal mask, so an early prompt token never sees the prompt tokens that follow it. For tasks where the input needs deep bidirectional analysis (morphologically complex languages, tasks requiring very long-range dependencies across the input), encoder-decoder has a theoretical advantage. In practice, scaling has largely erased this gap.


Why Decoder-Only Won

Several converging factors explain the dominance of decoder-only architectures.

Unified pretraining objective. Encoder-only models (BERT) are pretrained with masked language modeling (MLM): randomly mask 15% of tokens and predict them. This exploits bidirectional attention, but the resulting model cannot directly generate text. Encoder-decoder models (T5) use a span corruption objective: mask contiguous spans and predict them with the decoder. Decoder-only models use next-token prediction on the entire sequence - the simplest possible pretraining objective, applicable to any text. The objective used in pretraining is the same one used at inference: predict the next token. There is no mismatch between training and use.
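The next-token objective is easy to state in code. The sketch below, with hypothetical helper names and a toy word-level sequence, shows that the training targets are just the inputs shifted by one position, which is equivalent to one training example per prefix.

```python
def next_token_pairs(tokens):
    """Pair each input token with the token that follows it."""
    inputs = tokens[:-1]
    targets = tokens[1:]
    return list(zip(inputs, targets))

def prefix_examples(tokens):
    """The same supervision, viewed as (prefix, next token) examples."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = ["the", "cat", "sat", "down"]  # toy word-level tokens
pairs = next_token_pairs(tokens)
examples = prefix_examples(tokens)
for prefix, target in examples:
    print(prefix, "->", target)
```

A sequence of $n$ tokens therefore yields $n - 1$ supervised predictions from a single forward pass, with no masking scheme or corruption procedure to design.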

Scaling efficiency. At scale, the encoder-decoder architecture roughly doubles the parameter count for a given model depth: you have encoder parameters and decoder parameters, and each decoder layer also carries cross-attention weights. A decoder-only model with the same parameter budget puts all capacity into a single stack. Empirically, single-stack models at the same parameter count have generally been found to perform at least as well as encoder-decoder models on generation tasks.
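A back-of-the-envelope count illustrates the parameter overhead. This sketch assumes a standard layer with four $d \times d$ attention matrices and an MLP with 4x expansion, and it ignores embeddings, biases and normalization parameters; the width and depth are hypothetical.

```python
def layer_params(d_model, cross_attention=False, mlp_ratio=4):
    attn = 4 * d_model ** 2               # W_Q, W_K, W_V, W_O
    mlp = 2 * mlp_ratio * d_model ** 2    # up- and down-projection
    cross = 4 * d_model ** 2 if cross_attention else 0
    return attn + mlp + cross

d_model, depth = 1024, 24  # hypothetical width and depth

decoder_only = depth * layer_params(d_model)
encoder_decoder = (depth * layer_params(d_model)
                   + depth * layer_params(d_model, cross_attention=True))
ratio = encoder_decoder / decoder_only
print(decoder_only, encoder_decoder, round(ratio, 2))
```

Under these assumptions an encoder-decoder with `depth` layers in each stack carries 28/12, or slightly more than double, the layer parameters of a decoder-only stack of the same depth.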

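KV cache. At inference, decoder-only models cache the key and value matrices from previous tokens (the KV cache), avoiding recomputation. Each new token requires a forward pass over that single token only: per layer, the projections and MLP cost $O(d^2)$ and attention over the $n$ cached positions costs $O(n \cdot d)$, rather than the $O(n \cdot d^2 + n^2 \cdot d)$ needed to reprocess the whole prefix. This is crucial for long-context inference. Encoder-decoder models can cache as well (the cross-attention keys and values are computed once from the encoder output), but they must maintain both a self-attention cache and a cross-attention cache, and any change to the input, such as a new conversational turn, forces the bidirectional encoder to be re-run, whereas a decoder-only model simply appends to its cache.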
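The caching idea can be sketched for a single attention head in plain Python. The `KVCache` class and the toy vectors are illustrative assumptions; a real implementation keeps one such cache per layer and per head, and computes the query, key and value from the new token's hidden state.

```python
import math

def attend(q, keys, values):
    """Softmax attention of one query over a list of keys and values."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e / z * v[c] for e, v in zip(exps, values))
            for c in range(len(values[0]))]

class KVCache:
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        # Append the new token's key and value, then attend over the cache.
        # Earlier keys and values are reused, never recomputed.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

# Toy (query, key, value) triples, one per generated token.
steps = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
    ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0]),
]
cache = KVCache()
outputs = [cache.step(q, k, v) for q, k, v in steps]

# Reference: recompute attention for the last token from scratch.
full = attend(steps[-1][0], [k for _, k, _ in steps], [v for _, _, v in steps])
```

The cached result for the last token matches the from-scratch computation, because under a causal mask the keys and values of earlier tokens never change once computed.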

In-context learning. Decoder-only models with next-token pretraining naturally develop in-context learning: the ability to learn from examples placed in the context window without parameter updates. This emergent capability - demonstrated most prominently by GPT-3 at 175B parameters - made decoder-only models extraordinarily flexible. Any task can be framed as: “Here are some examples. Continue this pattern.” The unified sequence format makes this natural; encoder-decoder models require explicit input-output separation.
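The "continue this pattern" framing amounts to string formatting. Here is a minimal sketch; the `Input:`/`Output:` template and the function name `few_shot_prompt` are arbitrary assumptions, not a format any particular model requires.

```python
def few_shot_prompt(examples, query):
    """Concatenate solved examples and an unsolved query into one sequence."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

examples = [("cheval", "horse"), ("chat", "cat")]
prompt = few_shot_prompt(examples, "chien")
print(prompt)
```

The model sees one flat sequence and is simply asked to continue it; the examples and the query need no separate encoder input.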

Instruction tuning generalizes better. Instruction-tuned decoder-only models (ChatGPT, LLaMA-chat, Gemma-IT) can handle diverse tasks within a single model via prompting. Encoder-decoder models typically need task-specific fine-tuning or at minimum careful prompt design that respects the encoder-decoder interface. The flexibility of the unified sequence format is a practical advantage.


When Encoder-Decoder Still Makes Sense

Decoder-only is not universally superior. Encoder-decoder models retain advantages in specific settings.

Fixed-length input, structured output. For translation, the source sentence is fixed and finite; you want to generate a complete, accurate translation. Bidirectional understanding of the source is valuable. T5 and mT5 remain competitive on translation benchmarks against larger decoder-only models.

Document understanding. When the task is to extract specific information from a long document (NER, relation extraction, document QA), bidirectional context over the document helps. Encoder-only models like DeBERTa still lead on many extractive benchmarks.

Constrained generation. Tasks with strict structural constraints on the output (SQL generation, structured data-to-text) benefit from the encoder’s ability to build a rich understanding of constraints before generation begins.

Efficiency at small scale. For resource-constrained deployments where a small model must handle a narrow task, an encoder-decoder trained for that task (e.g., T5-small for summarization) can outperform a larger decoder-only model in both accuracy and compute.


Modern Decoder-Only Innovations

The architectural evolution of decoder-only models has focused on attention efficiency and positional encoding.

Grouped Query Attention (GQA) (used in LLaMA 2/3, Gemma, Mistral): instead of one key-value head per query head, share key-value heads across groups of query heads. If there are $h$ query heads and $g$ KV head groups, the KV cache shrinks by a factor of $h/g$ relative to standard multi-head attention. GQA interpolates between Multi-Head Attention ($g = h$) and Multi-Query Attention ($g = 1$), trading a small accuracy loss for substantial memory reduction.
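The memory saving is simple arithmetic. The sketch below uses a hypothetical configuration (32 layers, 32 query heads, head dimension 128, 4096-token context, 2 bytes per value) chosen only for illustration, not taken from any specific model card.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Leading factor of 2: one tensor for keys, one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical configuration for a single sequence.
n_layers, n_query_heads, head_dim, seq_len = 32, 32, 128, 4096

mha = kv_cache_bytes(n_layers, n_query_heads, head_dim, seq_len)  # g = h
gqa = kv_cache_bytes(n_layers, 8, head_dim, seq_len)              # g = 8
mqa = kv_cache_bytes(n_layers, 1, head_dim, seq_len)              # g = 1

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(name, size / 2**20, "MiB")
```

With these numbers the cache drops from 2 GiB under multi-head attention to 512 MiB with 8 KV groups, which is the $h/g = 4$ reduction, and to 64 MiB under multi-query attention.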

Multi-head Latent Attention (MLA) (DeepSeek): compresses key-value matrices into a low-rank latent representation before projecting back to full KV. Reduces KV cache memory by a large factor while maintaining quality.

RoPE (Rotary Position Embedding): the dominant positional encoding for decoder-only models. It encodes position by rotating query and key vectors, so that their dot product depends only on relative position. With position interpolation and related scaling techniques, it can be extended to contexts longer than those seen in training. Used in LLaMA, Mistral, Falcon, Qwen, and most modern decoder-only models.
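The relative-position property can be checked numerically. This is a minimal sketch that rotates consecutive pairs of dimensions, with the conventional base of 10000; some implementations pair dimensions differently (first half with second half), and the vectors here are arbitrary toy values.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate pairs (x0, x1), (x2, x3), ... by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same relative offset (2), very different absolute positions.
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 105), rope(k, 103))
print(near, far)
```

The two scores agree up to floating-point error, because rotating both vectors by the same extra angle leaves their dot product unchanged.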

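Sliding Window / Sparse Attention (Mistral): each token attends only to a fixed window of recent tokens, reducing attention complexity from $O(n^2)$ to $O(n \cdot w)$ where $w$ is the window size. Other sparse schemes, such as Longformer, add a small number of global attention tokens on top of the window. Because information still propagates across windows from layer to layer, this enables long contexts without quadratic memory growth.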
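A minimal sketch of the windowed causal mask alone (no global tokens) shows where the $O(n \cdot w)$ count comes from. The function name and the sizes are illustrative.

```python
def sliding_window_mask(n, w):
    """mask[i][j] is True when token i may attend to token j."""
    return [[(j <= i) and (i - j < w) for j in range(n)] for i in range(n)]

n, w = 6, 3
mask = sliding_window_mask(n, w)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

windowed = sum(sum(row) for row in mask)   # allowed pairs with the window
full_causal = n * (n + 1) // 2             # allowed pairs without it
```

Each row contains at most `w` allowed positions, so the number of attended pairs grows linearly in `n` rather than quadratically (15 versus 21 in this toy case).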


The Convergence

The practical conclusion is clean: for language generation at scale, decoder-only is the right architecture. The unified pretraining objective, KV cache efficiency, in-context learning flexibility, and strong scaling properties make it the dominant design. Encoder-decoder remains relevant for narrow, high-accuracy generation tasks and for resource-constrained task-specific deployment. Encoder-only remains the best choice for classification and understanding tasks where generation is not required.

The fact that nearly every frontier model (GPT-4, Claude, LLaMA, Gemma, Mistral, Qwen, Falcon) is decoder-only is not coincidence. It is the result of empirical evidence accumulating over years of scaling experiments.


Summary

| Architecture | Attention | Pretraining | Best for |
| --- | --- | --- | --- |
| Encoder-only (BERT) | Bidirectional | Masked LM | Classification, NER, extractive QA |
| Encoder-decoder (T5) | Enc: bidirectional; Dec: causal + cross | Span corruption | Translation, structured generation |
| Decoder-only (GPT, LLaMA) | Causal | Next-token prediction | Language modeling, instruction following, generation |

Key reasons decoder-only dominates at scale: unified pretraining objective, KV cache efficiency, in-context learning emergence, parameter efficiency at large scale.

