Transformers From First Principles - Why Attention Changed Everything // Megha Bose


In 2017, a paper titled “Attention Is All You Need” proposed replacing the dominant sequence modeling architecture - RNNs with attention - with something that used only attention and feedforward layers. The idea was radical: throw out the recurrence entirely.

Seven years later, this architecture - the transformer - runs GPT-4, Gemini, Claude, Llama, and nearly every other major language model. It has also invaded image processing, protein folding, code generation, speech recognition, and reinforcement learning. No single deep learning architecture has ever spread so fast or so far.

Why did it win? Not because of a single clever trick, but because several design decisions happened to compose beautifully - each one making the others more effective. This post assembles the pieces we have built and shows how they fit together.

The Pieces We Have Assembled

Working backward from the final architecture, here is what we need:

Attention computes a weighted average over all positions in a sequence. Each token can look at every other token, with learned weights that determine how much to attend to each. Attention is $O(n^2)$ in sequence length but fully parallelizable across positions.

Positional encodings inject order information into the otherwise order-agnostic attention mechanism. Without them, a transformer cannot distinguish “dog bites man” from “man bites dog.”

LayerNorm stabilizes training by normalizing each token’s representation to zero mean and unit variance, then rescaling with learned parameters. It prevents the activations from growing or shrinking uncontrollably during the forward pass.

Feedforward layers apply a nonlinear transformation to each token’s representation independently. They are the “thinking” layers - where learned knowledge is stored and applied.

Residual connections are the structural skeleton that makes deep networks trainable. We will build these up from scratch.
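Before moving on, the claim that attention is order-agnostic without positional encodings can be checked numerically. The sketch below is a minimal single-head self-attention in NumPy with random, hypothetical embeddings and weights (none of these names come from a real model). Reordering the input tokens only reorders the outputs, so "dog bites man" and "man bites dog" produce the same set of vectors.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with no positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
n, d = 3, 4                          # three tokens, e.g. "dog", "bites", "man"
x = rng.normal(size=(n, d))          # hypothetical token embeddings
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = np.array([2, 1, 0])           # reorder to "man", "bites", "dog"
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)

# Each token receives the same output vector; only the row order differs.
same_up_to_order = np.allclose(out[perm], out_perm)
print(same_up_to_order)  # True
```

Adding a position-dependent vector to each row of `x` before the projections is what breaks this symmetry.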

Residual Connections

Before looking at the full architecture, we need to understand residual connections, because they are everywhere.

The idea: instead of computing $x_{\text{out}} = f(x_{\text{in}})$, compute:

$$x_{\text{out}} = x_{\text{in}} + f(x_{\text{in}}).$$

The function $f$ learns a residual - a correction to the input, not a full transformation.

Three reasons this works:

Gradient flow. The gradient of the loss with respect to $x_{\text{in}}$ is:

$$\frac{\partial \mathcal{L}}{\partial x_{\text{in}}} = \frac{\partial \mathcal{L}}{\partial x_{\text{out}}} \cdot \left(1 + \frac{\partial f}{\partial x_{\text{in}}}\right).$$

The additive 1 means the gradient always has a direct path backward through the identity branch. Even if $\partial f / \partial x_{\text{in}} \approx 0$ (vanishing gradients in $f$), the gradient $\partial \mathcal{L} / \partial x_{\text{out}}$ passes through unchanged. Deep networks with residual connections do not suffer from vanishing gradients in the same catastrophic way as plain deep networks.

Identity initialization. If you initialize $f$ so that $f(x) \approx 0$ at the start of training, then $x_{\text{out}} \approx x_{\text{in}}$ - the network starts as the identity. Training then gradually introduces corrections. This is a much more stable initialization than trying to learn a complete transformation from random weights.

Depth. Residual connections are what make very deep networks trainable. ResNet proved this for vision (up to 1000 layers); transformers apply the same principle across dozens to hundreds of attention and feedforward layers.
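The gradient-flow argument can be made concrete with a toy stack of scalar layers. This is a sketch under a strong simplifying assumption: every layer is linear, $f_i(x) = a_i x$, with a hypothetical tiny slope $a_i = 0.01$. The end-to-end derivative is then a product of per-layer factors, $a_i$ for a plain stack and $1 + a_i$ for a residual stack.

```python
def end_to_end_gradient(layer_slopes, residual):
    """Derivative of the stack's output w.r.t. its input for scalar layers f_i(x) = a_i * x."""
    grad = 1.0
    for a in layer_slopes:
        # Plain layer: d(out)/d(in) = a.  Residual layer: d(out)/d(in) = 1 + a.
        grad *= (1.0 + a) if residual else a
    return grad

slopes = [0.01] * 50   # hypothetical: 50 layers whose local derivative is tiny

plain_grad = end_to_end_gradient(slopes, residual=False)
residual_grad = end_to_end_gradient(slopes, residual=True)

print(f"plain:    {plain_grad:.3e}")     # 0.01**50, effectively zero
print(f"residual: {residual_grad:.3f}")  # 1.01**50, of order one
```

The plain stack's gradient vanishes after a few dozen layers, while the residual stack's stays near 1 because of the identity term in every factor.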

A Single Transformer Block

The fundamental unit of a transformer is the transformer block - a pair of residual sublayers.

Given an input $x \in \mathbb{R}^{n \times d}$ (a sequence of $n$ token representations, each of dimension $d$):

$$x' = x + \text{MultiHeadAttention}(\text{LayerNorm}(x))$$

$$x'' = x' + \text{FFN}(\text{LayerNorm}(x'))$$

These two lines are the entire transformer block. The final output $x''$ has the same shape as the input $x$.

This is the pre-norm variant (LayerNorm before the sublayer), which has become standard in modern transformers. The original Vaswani et al. paper used post-norm (LayerNorm after the residual addition); pre-norm trains more stably for very deep networks.

A transformer is nothing more than a stack of $N$ such blocks. Each block refines the token representations using attention (to gather context from other positions) and feedforward computation (to transform each representation independently).
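As a rough illustration, here is a minimal NumPy sketch of the pre-norm block and a small stack of them. It is not a reference implementation: the parameter names, the small-random initialization, and the toy sizes are all assumptions made for the example, there is no attention mask, and the weights use a row-vector convention (`x @ W`), so their shapes are transposed relative to the formulas in this post.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    # Normalize each token (row) to zero mean and unit variance, then rescale.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, p, n_heads):
    n, d = x.shape
    d_k = d // n_heads

    def split(t):
        # (n, d) -> (n_heads, n, d_k)
        return t.reshape(n, n_heads, d_k).transpose(1, 0, 2)

    q, k, v = split(x @ p["w_q"]), split(x @ p["w_k"]), split(x @ p["w_v"])
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_k))  # (n_heads, n, n)
    heads = weights @ v                                          # (n_heads, n, d_k)
    merged = heads.transpose(1, 0, 2).reshape(n, d)              # concatenate heads
    return merged @ p["w_o"]

def ffn(x, p):
    return np.maximum(0.0, x @ p["w_1"] + p["b_1"]) @ p["w_2"] + p["b_2"]

def transformer_block(x, p, n_heads):
    # The two residual sublayers of the pre-norm block.
    x = x + multi_head_attention(layer_norm(x, p["ln1_gain"], p["ln1_bias"]), p, n_heads)
    x = x + ffn(layer_norm(x, p["ln2_gain"], p["ln2_bias"]), p)
    return x

def init_params(d, d_ff, rng, scale=0.02):
    """Hypothetical small-random initialization, for illustration only."""
    mat = lambda rows, cols: rng.normal(scale=scale, size=(rows, cols))
    return {
        "w_q": mat(d, d), "w_k": mat(d, d), "w_v": mat(d, d), "w_o": mat(d, d),
        "w_1": mat(d, d_ff), "b_1": np.zeros(d_ff),
        "w_2": mat(d_ff, d), "b_2": np.zeros(d),
        "ln1_gain": np.ones(d), "ln1_bias": np.zeros(d),
        "ln2_gain": np.ones(d), "ln2_bias": np.zeros(d),
    }

rng = np.random.default_rng(0)
n, d, d_ff, n_heads, n_blocks = 5, 8, 32, 2, 3   # toy sizes
x = rng.normal(size=(n, d))
blocks = [init_params(d, d_ff, rng) for _ in range(n_blocks)]

y = x
for p in blocks:          # a transformer is a stack of such blocks
    y = transformer_block(y, p, n_heads)

print(x.shape, y.shape)   # (5, 8) (5, 8)
```

Because every block maps an $n \times d$ array to an $n \times d$ array, blocks can be stacked to any depth without shape bookkeeping.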

The FFN: What It Does

The feedforward network in each transformer block is applied identically and independently to every token position:

$$\text{FFN}(x) = W_2 \cdot \text{ReLU}(W_1 x + b_1) + b_2.$$

The weight matrices $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d}$ and $W_2 \in \mathbb{R}^{d \times d_{\text{ff}}}$ expand and then contract the representation, where $d = d_{\text{model}}$ is the model dimension. The standard choice is $d_{\text{ff}} = 4 \cdot d_{\text{model}}$: if the model dimension is 512, the FFN expands to 2048, applies ReLU, and projects back to 512.

Note what the FFN does not do: it does not mix information across positions. Each token’s representation is transformed in isolation. This is in direct contrast to attention, which computes a mixture across all positions.
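A quick way to see this isolation is to change one token and watch the others. The sketch below uses random, hypothetical weights and a row-vector convention (`x @ w_1`); it illustrates the property and is not any particular model's FFN.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff, n = 4, 16, 3                      # toy sizes with d_ff = 4 * d
w_1, b_1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
w_2, b_2 = rng.normal(size=(d_ff, d)), np.zeros(d)

def ffn(x):
    # Expand to d_ff, apply ReLU, project back to d. Rows never interact.
    return np.maximum(0.0, x @ w_1 + b_1) @ w_2 + b_2

x = rng.normal(size=(n, d))
x_edited = x.copy()
x_edited[2] = rng.normal(size=d)           # change only the last token

y, y_edited = ffn(x), ffn(x_edited)

# The first two tokens' outputs are unaffected by the edit to the third.
unchanged_rows = np.allclose(y[:2], y_edited[:2])
print(unchanged_rows)  # True
```

Running the same experiment through an attention layer would change every row, since each output is a mixture over all positions.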

The functional division is:

Attention: decides which positions to collect information from. Mixes across the sequence.
FFN: transforms what was collected. Operates token-by-token.

Evidence from mechanistic interpretability studies supports this division. Attention heads track syntactic and semantic relationships; FFN layers appear to store factual associations (the kind of knowledge tested by “Paris is the capital of ___”). Editing factual knowledge in LLMs typically targets FFN weights, not attention weights.

Encoder, Decoder, Encoder-Decoder

Not all transformers are the same. There are three architectural variants, each suited to different tasks.

Encoder-Only (BERT)

An encoder-only transformer processes the entire input sequence with bidirectional attention: every token can attend to every other token, both left and right.

After $N$ transformer blocks, you have a sequence of context-aware representations, one per input token. These representations are rich: each token’s vector contains information from the entire surrounding context.

Encoder-only models are excellent for:

Text classification (use the [CLS] token representation)
Named entity recognition (use the per-token representations)
Sentence embeddings (pool the token representations)
Anything where you want to “understand” a text, not generate one

BERT (Devlin et al. 2018) is the canonical encoder-only transformer. It is pre-trained with masked language modeling: randomly mask 15% of tokens and predict the masked tokens from context. This forces bidirectional contextual representations.

Decoder-Only (GPT and Most LLMs)

A decoder-only transformer uses causal (masked) attention: each token can only attend to previous tokens and itself. The attention matrix is lower-triangular.

This constraint is crucial: it means the model can be trained and evaluated on next-token prediction without cheating. At position $t$, the model predicts $w_{t+1}$ using only $w_1, \ldots, w_t$. During training, you compute the loss at all $T$ positions in parallel - no sequential processing needed - because the mask ensures each position only sees valid context.

Decoder-only models are the dominant architecture for language generation. GPT, GPT-2, GPT-3, LLaMA, Mistral, Qwen, Gemma - all decoder-only. They are pre-trained with next-token prediction on vast corpora and then fine-tuned (via RLHF or instruction tuning) to follow instructions.

The causal mask is what makes this architecture suitable for generation: at inference, you autoregressively extend the sequence one token at a time.
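The lower-triangular structure is easy to verify in a few lines. This is a minimal single-head sketch with random, hypothetical queries and keys. Masked scores are set to negative infinity before the softmax so that future positions receive exactly zero weight.

```python
import numpy as np

def causal_attention_weights(q, k):
    """Softmax attention weights where position t may attend only to positions <= t."""
    n, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)           # block attention to later tokens
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)                                   # exp(-inf) = 0 for masked entries
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d_k = 4, 8
q, k = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k))

weights = causal_attention_weights(q, k)
is_lower_triangular = np.allclose(weights, np.tril(weights))

print(is_lower_triangular)                      # True
print(np.allclose(weights.sum(axis=-1), 1.0))   # True: each row is still a distribution
```

The first row puts all its weight on position 1, the second row splits its weight between positions 1 and 2, and so on.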

Encoder-Decoder (T5, Original Transformer)

The original “Attention Is All You Need” transformer, and T5, use an encoder-decoder architecture with three types of attention:

Self-attention in the encoder (bidirectional): each source token attends to all other source tokens.
Self-attention in the decoder (causal): each target token attends only to previous target tokens.
Cross-attention in the decoder: each decoder position attends to all encoder hidden states.

The encoder processes the source sequence fully. The decoder generates the output, consulting the encoder via cross-attention at each layer.
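Cross-attention differs from self-attention only in where the inputs come from: queries from the decoder, keys and values from the encoder. The sketch below is a single-head version with random, hypothetical weights and toy sizes. Its main purpose is to show the shapes, with one row of weights per target position and one column per source position.

```python
import numpy as np

def cross_attention(dec_x, enc_h, w_q, w_k, w_v):
    q = dec_x @ w_q          # queries from the decoder states: (n_tgt, d)
    k = enc_h @ w_k          # keys from the encoder states:    (n_src, d)
    v = enc_h @ w_v          # values from the encoder states:  (n_src, d)
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
n_src, n_tgt, d = 7, 3, 4                    # toy source and target lengths
enc_h = rng.normal(size=(n_src, d))          # hypothetical encoder hidden states
dec_x = rng.normal(size=(n_tgt, d))          # hypothetical decoder states
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out, weights = cross_attention(dec_x, enc_h, w_q, w_k, w_v)
print(out.shape, weights.shape)   # (3, 4) (3, 7)
```

No causal mask is needed here, because the whole source sequence is available before decoding starts.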

This architecture is natural for tasks with a distinct source and target: machine translation, summarization with explicit extractive/abstractive separation, document-to-table conversion. The encoder has full bidirectional context over the source; the decoder has access to the source through cross-attention.

Key Dimensions and How They Scale

A transformer is parameterized by a handful of hyperparameters.

$d_{\text{model}}$: the embedding dimension. The width of every vector throughout the model. Larger $d_{\text{model}}$ means more representational capacity per token.

$n_{\text{heads}}$: the number of attention heads. Each head uses $d_k = d_{\text{model}} / n_{\text{heads}}$ dimensions. Typical values: 8 (original), 12 (BERT-base), 32 (LLaMA-7B), 96 (GPT-3).

$n_{\text{layers}}$: the number of transformer blocks. Depth.

$d_{\text{ff}}$: the FFN intermediate dimension, typically $4 \times d_{\text{model}}$.

Representative configurations:

Model	$d_{\text{model}}$	$n_{\text{heads}}$	$n_{\text{layers}}$	Parameters
Original transformer	512	8	6 + 6	~65M
BERT-base	768	12	12	110M
GPT-2 large	1280	20	36	774M
LLaMA-7B	4096	32	32	7B
GPT-3	12288	96	96	175B
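The relation $d_k = d_{\text{model}} / n_{\text{heads}}$ can be checked against the table. The snippet below copies the $d_{\text{model}}$ and $n_{\text{heads}}$ columns and computes the per-head dimension for each row.

```python
configs = {
    # name: (d_model, n_heads), copied from the table above
    "Original transformer": (512, 8),
    "BERT-base": (768, 12),
    "GPT-2 large": (1280, 20),
    "LLaMA-7B": (4096, 32),
    "GPT-3": (12288, 96),
}

head_dims = {}
for name, (d_model, n_heads) in configs.items():
    assert d_model % n_heads == 0, "d_model must split evenly across heads"
    head_dims[name] = d_model // n_heads

for name, d_k in head_dims.items():
    print(f"{name}: d_k = {d_k}")
```

The head dimension comes out as 64 for the three smaller models and 128 for the two larger ones. Width grows mostly by adding heads, not by making each head much wider.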

Parameter Counting

Where do the parameters come from in a decoder-only transformer? Let us count, per layer.

Attention sublayer: four weight matrices $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$ and $W_O \in \mathbb{R}^{d \times d}$.

$$\text{Attention parameters per layer} = 4 d^2.$$

FFN sublayer: $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d}$ and $W_2 \in \mathbb{R}^{d \times d_{\text{ff}}}$ with $d_{\text{ff}} = 4d$.

$$\text{FFN parameters per layer} = 2 \times d \times 4d = 8d^2.$$

Total per layer: $\approx 12 d^2$ (ignoring biases and LayerNorm, which are small).

Across $L$ layers: $12 d^2 L$.

For GPT-3 ($d = 12288$, $L = 96$):

$$12 \times 12288^2 \times 96 \approx 12 \times 1.51 \times 10^8 \times 96 \approx 1.74 \times 10^{11} \approx 174 \text{ billion.}$$

This is within about 1% of the reported 175B parameter count. The remainder comes mainly from the token embedding matrix and output projection, plus the biases and LayerNorm parameters ignored above.
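The same count takes a few lines of Python. The helper name `approx_block_params` is made up for this sketch. It covers only the attention and FFN weight matrices, under the $d_{\text{ff}} = 4d$ assumption, and ignores embeddings, biases, and LayerNorm.

```python
def approx_block_params(d_model, n_layers, ff_mult=4):
    """Approximate parameter count of the transformer blocks only."""
    attention = 4 * d_model * d_model           # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * (ff_mult * d_model)     # W_1 and W_2
    return n_layers * (attention + ffn)         # 12 * d^2 * L when ff_mult = 4

gpt3 = approx_block_params(d_model=12288, n_layers=96)
gpt2_large = approx_block_params(d_model=1280, n_layers=36)

print(f"GPT-3 blocks:       {gpt3 / 1e9:.1f}B")        # 173.9B
print(f"GPT-2 large blocks: {gpt2_large / 1e6:.0f}M")  # 708M
```

For GPT-2 large the block-only estimate of about 708M falls further short of the reported 774M than the GPT-3 estimate does in relative terms, because the embedding matrix is a larger share of a smaller model.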

Training: Next-Token Prediction

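A decoder-only transformer is trained on next-token prediction. Take a training sequence $w_1, \ldots, w_{T+1}$: the first $T$ tokens are the inputs, and each input position $t$ has the target $w_{t+1}$. The forward pass is: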

1. Embed: $x_t = E[w_t]$ where $E \in \mathbb{R}^{V \times d}$ is the token embedding matrix.
2. Add positional encoding: $\tilde{x}_t = x_t + \text{PE}(t)$.
3. Pass through $N$ transformer blocks (with causal mask).
4. At each position $t$, compute logits: $\ell_t = \tilde{x}_t^{(N)} W_{\text{vocab}}$ where $W_{\text{vocab}} \in \mathbb{R}^{d \times V}$.
5. Compute cross-entropy loss at all positions simultaneously:

$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log P(w_{t+1} \mid w_1, \ldots, w_t).$$

The critical point: steps 1 through 5 are computed at all $T$ positions in parallel, in one forward pass. No sequential dependency. This is the fundamental reason transformers replaced RNNs for language modeling: you can use the full parallel power of modern hardware (GPUs with thousands of cores) to process an entire sequence at once.
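The loss itself can be sketched in NumPy. The function below assumes the logits for all $T$ positions are already computed and that the token array holds one extra token, so every position has a target. The function name and the toy sizes are hypothetical. Note that there is no loop over positions: the whole sequence is scored with array operations.

```python
import numpy as np

def next_token_loss(logits, tokens):
    """Mean cross-entropy, where logits[t] scores the token that follows tokens[t]."""
    targets = tokens[1:]                                  # shift by one: w_2, ..., w_{T+1}
    if logits.shape[0] != len(targets):
        raise ValueError("need exactly one target per logit row")
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))  # log-softmax
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
T, V = 5, 11                                  # toy sequence length and vocabulary size
tokens = rng.integers(0, V, size=T + 1)       # hypothetical token ids

# A model that knows nothing (uniform logits) pays log(V) per token.
loss_uniform = next_token_loss(np.zeros((T, V)), tokens)
print(np.isclose(loss_uniform, np.log(V)))    # True

# A model that puts nearly all its mass on the right token pays almost nothing.
confident = np.full((T, V), -20.0)
confident[np.arange(T), tokens[1:]] = 20.0
loss_confident = next_token_loss(confident, tokens)
print(loss_confident < 1e-6)                  # True
```

Training moves the model from the first regime toward the second, one gradient step at a time.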

The gradient computation (backpropagation through all $T$ positions, all $N$ layers) is similarly parallel. Training a transformer on a corpus of a trillion tokens is feasible because of this parallelism; an RNN, which must step through each sequence one token at a time, would use the same hardware far less efficiently and train much more slowly.

Discomfort check. “Attention is $O(n^2)$ - how do GPT-4 and Claude handle 100,000 or 200,000 tokens?” The honest answer involves several layers. First, the quadratic cost is real: a full $n \times n$ attention matrix for $n = 100{,}000$ positions, over 96 layers, 96 heads, at 4 bytes per float, is $100{,}000^2 \times 96 \times 96 \times 4 \approx 3.7 \times 10^{14}$ bytes, roughly 370 terabytes - obviously impossible to materialize. In practice: (1) FlashAttention computes attention in tiled blocks that fit in GPU SRAM, never materializing the full matrix. The computation is still $O(n^2)$ but the memory is $O(n)$, and the hardware efficiency is dramatically better. (2) At very long contexts, some models use sparse attention patterns (sliding window, global tokens) that are $O(n \cdot w)$ where $w$ is the window size. (3) KV-cache reduces inference cost by reusing previously computed keys and values, so each newly generated token attends over the cached sequence instead of recomputing it. The quadratic cost is the central engineering constraint of modern LLMs - it is why “extending context” is a research problem, not a default configuration.
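The back-of-envelope memory figure can be checked directly. The layer count, head count, and float size below are the ones used in the estimate above; the sliding-window size `w` is a hypothetical value chosen only to illustrate the $O(n \cdot w)$ comparison.

```python
n = 100_000
n_layers, n_heads, bytes_per_float = 96, 96, 4

one_matrix_bytes = n * n * bytes_per_float                  # one head in one layer
all_matrices_bytes = one_matrix_bytes * n_layers * n_heads  # every head in every layer

print(f"one head, one layer:  {one_matrix_bytes / 1e9:.0f} GB")     # 40 GB
print(f"all layers and heads: {all_matrices_bytes / 1e12:.2f} TB")  # 368.64 TB

# Sliding-window attention scores n * w pairs instead of n * n.
w = 4_096                                                   # hypothetical window size
dense_pairs, windowed_pairs = n * n, n * w
print(f"dense / windowed score count: {dense_pairs / windowed_pairs:.1f}x")  # 24.4x
```

Even a single head's matrix, at 40 GB, is a large share of one accelerator's memory, which is why tiled computation matters long before the full total does.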

Why Transformers Beat RNNs

Transformers won not because of one advantage, but because of a combination that compounded.

Parallelism. RNNs require sequential computation: $h_t$ depends on $h_{t-1}$. On a GPU with 10,000 cores, processing a sequence of 10,000 tokens takes 10,000 sequential steps - the cores sit mostly idle. A transformer processes all 10,000 positions in parallel. This makes training orders of magnitude faster in wall-clock time on the same hardware.

Long-range dependencies. In an RNN, the path between token at position 1 and token at position $T$ passes through $T-1$ hidden state transitions. Gradients must flow through all of them, and information must be preserved through all of them. In a transformer, the path between any two positions is always length 1 - a direct attention weight. There is no fundamental barrier to learning long-range dependencies.

Scale. Empirically, transformers benefit more from scale (more parameters, more data, more compute) than RNNs ever did. Scaling laws (Hoffmann et al., Kaplan et al.) show smooth power-law improvements in loss as you increase model size and training tokens. RNNs showed diminishing returns much sooner. The reason is not fully understood, but the empirical fact is consistent across dozens of studies.

Summary

Component	Role	Formula
Residual connection	Gradient flow + stable initialization	$x_{\text{out}} = x + f(x)$
Transformer block	Attend, then transform	$x' = x + \text{MHA}(\text{LN}(x))$; $x'' = x' + \text{FFN}(\text{LN}(x'))$
Attention	Mix information across positions	$\text{softmax}(QK^\top/\sqrt{d_k})V$
FFN	Transform each position independently	$W_2 \text{ReLU}(W_1 x)$
Causal mask	Prevent future look-ahead in decoders	Lower-triangular attention matrix
Encoder-only	Bidirectional; best for understanding	BERT, RoBERTa
Decoder-only	Causal; best for generation	GPT, LLaMA, Mistral
Encoder-decoder	Explicit source-target; best for translation	T5, original transformer
Parameters per layer	Dominated by attention and FFN	$\approx 12 d^2$
Training objective	Next-token prediction	$-\frac{1}{T}\sum_t \log P(w_{t+1} \mid w_{1:t})$

The transformer won because it trades a sequential bottleneck (the RNN hidden state) for a parallel, direct mechanism (attention). Every position can attend to every other position, the computation over the entire sequence is simultaneous, and the architecture scales cleanly with more parameters and data.

The pieces are now all in place: probability factorization gives the training objective, attention gives the mechanism for context, positional encodings give the notion of order, LayerNorm and residual connections give the structural scaffolding, and the FFN gives each token a place to integrate and store. Stack them, train on next-token prediction at scale, and the result is a language model.
