Architecture Overview

The Transformer (Vaswani et al., 2017) discards recurrence entirely and builds sequence-to-sequence processing from three primitives: multi-head self-attention for global information mixing, position-wise feed-forward networks for token-level transformation, and residual connections with layer normalisation for stable training of deep stacks. The original model has an encoder-decoder structure, but the architecture naturally gives rise to encoder-only (BERT) and decoder-only (GPT) variants by selectively enabling or disabling components.

The Encoder

Each encoder layer applies the following operations to an input $X \in \mathbb{R}^{n \times d_{\text{model}}}$ (a sequence of $n$ token representations of dimension $d_{\text{model}}$):

$$X' = \text{LayerNorm}(X + \text{MultiHead}(X, X, X))$$

$$X_{\text{out}} = \text{LayerNorm}(X' + \text{FFN}(X'))$$

The encoder stack applies $N$ such layers sequentially. Before the first layer, token embeddings are added to positional encodings: $X_0 = E_{\text{token}} + E_{\text{pos}}$.
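
As a concrete sketch, a single post-norm encoder layer might look like the following in PyTorch (an illustrative reimplementation, not the reference code; the class name and the use of `nn.MultiheadAttention` are conveniences):

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One post-norm Transformer encoder layer (illustrative sketch)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (batch, n, d_model)
        attn_out, _ = self.attn(x, x, x)     # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)         # residual + LayerNorm (post-norm)
        x = self.norm2(x + self.ffn(x))      # residual + LayerNorm (post-norm)
        return x

x = torch.randn(2, 10, 512)                  # 2 sequences, 10 tokens, d_model = 512
y = EncoderLayer()(x)                        # same shape out: (2, 10, 512)
```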

Post-norm vs. pre-norm. The formulation above (LayerNorm outside the residual, i.e., applied to the sum) is the original post-norm scheme. Modern large models predominantly use pre-norm, where LayerNorm is applied to the residual branch input before the sublayer:

$$X_{\text{out}} = X + \text{FFN}(\text{LayerNorm}(X))$$

Pre-norm stabilises training at depth by ensuring that the main signal path through the residual connections is never rescaled, making gradient norms more predictable across layers.
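
The change is purely in where the normalisation sits; a pre-norm version of the same layer (same illustrative conventions as the sketch above) keeps the residual path untouched:

```python
import torch.nn as nn

class PreNormEncoderLayer(nn.Module):
    """Pre-norm variant: LayerNorm on the branch input, residual path untouched."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm1(x)                 # normalise only the sublayer input ...
        x = x + self.attn(h, h, h)[0]     # ... so the residual path is never rescaled
        x = x + self.ffn(self.norm2(x))
        return x
```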

The Decoder

The decoder has three sublayers per block. First, masked multi-head self-attention over the decoder’s own outputs so far, with a causal mask $M$ that sets $e_{ij} = -\infty$ for $j > i$ - preventing position $i$ from attending to future positions. Second, encoder-decoder cross-attention where queries come from the decoder and keys and values from the encoder output. Third, a position-wise feed-forward network, identical in structure to the encoder FFN.

The causal mask is essential for autoregressive generation: at training time, teacher forcing supplies all target tokens simultaneously, and the mask enforces the constraint that predicting $y_i$ may only depend on $y_1, \ldots, y_{i-1}$.
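
In implementations the mask is typically an additive matrix of $-\infty$ above the diagonal, added to the attention scores before the softmax; a minimal sketch (the shape conventions are assumptions):

```python
import torch

def causal_mask(n: int) -> torch.Tensor:
    """(n, n) additive mask: 0 where attention is allowed, -inf where j > i."""
    return torch.triu(torch.full((n, n), float("-inf")), diagonal=1)

scores = torch.randn(1, 4, 4)              # (batch, query position i, key position j)
masked = scores + causal_mask(4)           # future positions become -inf ...
weights = torch.softmax(masked, dim=-1)    # ... and therefore get zero attention weight
```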

Feed-Forward Networks

In each layer, the FFN is applied identically and independently to every token position:

$$\text{FFN}(x) = \max(0,\, x W_1 + b_1)\, W_2 + b_2$$

where $W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{ff}}$, $W_2 \in \mathbb{R}^{d_{ff} \times d_{\text{model}}}$, and $d_{ff} = 4 d_{\text{model}}$ in the original paper. The first linear layer expands to a high-dimensional space, the ReLU introduces a nonlinearity, and the second linear layer projects back. The FFN provides a position-wise nonlinear transformation that complements the global information mixing of attention. Without FFNs, a Transformer reduces to a stack of linear operations on attention-weighted sums - insufficient for representing complex functions. Modern variants often replace ReLU with GELU or SwiGLU (which uses a gating mechanism and no bias).
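
As a sketch, the FFN amounts to two linear layers around a ReLU with the original $4\times$ expansion (illustrative PyTorch, not the reference implementation):

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently."""
    def __init__(self, d_model=512):
        super().__init__()
        self.w1 = nn.Linear(d_model, 4 * d_model)   # expand to d_ff = 4 * d_model
        self.w2 = nn.Linear(4 * d_model, d_model)   # project back to d_model

    def forward(self, x):                           # x: (batch, n, d_model)
        return self.w2(torch.relu(self.w1(x)))
```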

Residual Connections

Every sublayer is wrapped in a residual (skip) connection: the output is $x + \text{Sublayer}(x)$ (or $x + \text{Sublayer}(\text{LayerNorm}(x))$ in pre-norm). Residual connections serve two purposes. First, they provide a gradient highway: gradients flow directly from the loss through the skip path without passing through the sublayer’s weights, mitigating vanishing gradients in deep networks. Second, the network is initialised near the identity function ($x + 0$), which is a stable starting point - the sublayer need only learn a residual correction rather than the full transformation.

Layer Normalisation

LayerNorm normalises across the feature dimension for each token independently:

$$\text{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta$$

where $\mu = \frac{1}{d}\sum_i x_i$, $\sigma^2 = \frac{1}{d}\sum_i (x_i - \mu)^2$, and $\gamma, \beta \in \mathbb{R}^d$ are learned scale and shift parameters. Unlike BatchNorm, LayerNorm computes statistics over the feature dimension rather than the batch dimension, making it appropriate for variable-length sequences and small-batch or online settings where batch statistics are noisy.
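
A from-scratch sketch matching these definitions (in practice one would simply use `torch.nn.LayerNorm`):

```python
import torch

def layer_norm(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
               eps: float = 1e-5) -> torch.Tensor:
    """Normalise over the last (feature) dimension of x, one token at a time."""
    mu = x.mean(dim=-1, keepdim=True)                    # per-token mean over features
    var = x.var(dim=-1, unbiased=False, keepdim=True)    # per-token variance over features
    return (x - mu) / torch.sqrt(var + eps) * gamma + beta

d = 512
x = torch.randn(2, 10, d)
out = layer_norm(x, gamma=torch.ones(d), beta=torch.zeros(d))
```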

Training: Loss, Optimiser, and Schedule

Transformers are trained with cross-entropy loss over next-token predictions. Label smoothing replaces the one-hot target with a smoothed distribution $(1-\epsilon)\,\delta_y + \epsilon/V$, typically $\epsilon = 0.1$. This discourages the model from assigning probability approaching 1 to any single token, improving calibration and slightly regularising training.
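
In PyTorch, label smoothing is available directly on the cross-entropy loss; a minimal sketch with placeholder shapes and an assumed vocabulary size:

```python
import torch
import torch.nn as nn

V = 32000                                  # assumed vocabulary size
logits = torch.randn(8, V)                 # (tokens, vocab) model outputs
targets = torch.randint(0, V, (8,))        # gold next-token indices

# (1 - eps) of the mass on the gold token, eps spread over the vocabulary
loss = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, targets)
```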

The original paper uses the Adam optimiser with $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-9}$, and a custom learning rate schedule:

$$\text{lrate} = d_{\text{model}}^{-0.5} \cdot \min\!\left(\text{step}^{-0.5},\; \text{step} \cdot \text{warmup\_steps}^{-1.5}\right)$$

During the warmup phase ($\text{step} < \text{warmup\_steps}$), the learning rate increases linearly. After warmup it decays as $\text{step}^{-0.5}$. This schedule is critical: without warmup, Adam’s adaptive estimates are poorly initialised and large early gradient steps destabilise training, particularly with post-norm architectures.
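
The schedule is a one-line function of the step count; a sketch (the helper name `noam_lr` and the default hyperparameters are assumptions):

```python
def noam_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate from the original paper: linear warmup, then inverse-sqrt decay."""
    step = max(step, 1)   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# With a base learning rate of 1.0, torch.optim.lr_scheduler.LambdaLR(optimizer,
# lr_lambda=noam_lr) applies this schedule when stepped once per optimiser update.
```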

Parameter Count

For a Transformer with $n_{\text{layers}}$ layers, model dimension $d_{\text{model}}$, $h$ heads, and $d_{ff} = 4 d_{\text{model}}$, the parameter count of a single layer is:

  • Multi-head attention: $4 d_{\text{model}}^2$ (three input projections plus output projection, each $d_{\text{model}} \times d_{\text{model}}$)
  • Feed-forward network: $2 \times d_{\text{model}} \times 4 d_{\text{model}} = 8 d_{\text{model}}^2$

Total per layer: $12 d_{\text{model}}^2$. Summing over all layers gives the rough estimate $12\, d_{\text{model}}^2\, n_{\text{layers}}$ for the non-embedding parameters. For GPT-3 ($d_{\text{model}} = 12288$, $n_{\text{layers}} = 96$), this yields approximately $175\text{B}$ parameters - consistent with the reported figure.
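
The estimate is easy to check numerically; a short sketch using the GPT-3 dimensions quoted above:

```python
def transformer_params(d_model: int, n_layers: int) -> int:
    """Rough non-embedding count: 4*d^2 (attention) + 8*d^2 (FFN) per layer."""
    return 12 * d_model ** 2 * n_layers

print(transformer_params(12288, 96))   # ~1.74e11, i.e. roughly 175B parameters
```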

Encoder-Only, Decoder-Only, Encoder-Decoder

Encoder-only (BERT-style): Uses only the encoder stack. All tokens attend to all other tokens (bidirectional attention). Suitable for discriminative tasks - classification, named entity recognition, question answering - where the full context is available at inference time. Trained with masked language modelling (predict randomly masked tokens) and next-sentence prediction.

Decoder-only (GPT-style): Uses only the decoder’s masked self-attention stack (no cross-attention, since there is no encoder). Tokens attend only to preceding tokens. Suitable for generative tasks and language modelling. Scaled to very large sizes (GPT-3, LLaMA, Mistral), these models acquire strong few-shot generalisation.

Encoder-decoder (T5-style): Retains both components. The encoder builds a full bidirectional representation of the input; the decoder generates output autoregressively, attending to the encoder via cross-attention. Natural for conditional generation tasks: translation, summarisation, question generation.

Examples

BERT for sentence classification. An input sentence is prepended with a [CLS] token. After $N$ encoder layers of bidirectional self-attention, the [CLS] representation aggregates global context. A linear classifier $W \in \mathbb{R}^{d_{\text{model}} \times C}$ maps this representation to class logits. Fine-tuning updates all parameters jointly. BERT-base ($d_{\text{model}} = 768$, $n_{\text{layers}} = 12$) achieved state-of-the-art results on the GLUE benchmark with this approach at the time of its release.
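
A sketch of this setup, assuming a pretrained encoder module that returns per-token hidden states with the [CLS] token at position 0 (the names here are placeholders, not BERT’s actual code):

```python
import torch.nn as nn

class ClsClassifier(nn.Module):
    """Linear classifier over the [CLS] token's final hidden state."""
    def __init__(self, encoder: nn.Module, d_model: int = 768, num_classes: int = 2):
        super().__init__()
        self.encoder = encoder                      # assumed to return (batch, n, d_model)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, token_ids):
        hidden = self.encoder(token_ids)            # (batch, n, d_model)
        cls = hidden[:, 0]                          # position 0 holds the [CLS] token
        return self.head(cls)                       # class logits
```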

GPT for autoregressive text generation. A decoder-only model with $N$ layers of causally masked self-attention is pre-trained on next-token prediction over a large text corpus. At inference, the model generates token by token: at each step, it takes all previously generated tokens as context, computes the full forward pass, and samples the next token from the output distribution. The $12\, d_{\text{model}}^2\, n_{\text{layers}}$ parameter estimate makes clear why scaling model dimension and depth are the two primary levers for increasing capacity.
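
A minimal greedy decoding loop, assuming a decoder-only model that maps a prefix of token ids to next-token logits (no KV caching, purely illustrative):

```python
import torch

@torch.no_grad()
def generate(model, prompt_ids: torch.Tensor, max_new_tokens: int = 20) -> torch.Tensor:
    """Append one token at a time; each step re-runs the full causal forward pass."""
    ids = prompt_ids                             # (1, prompt_length)
    for _ in range(max_new_tokens):
        logits = model(ids)                      # assumed shape: (1, seq_len, vocab)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)   # greedy pick (or sample)
        ids = torch.cat([ids, next_id], dim=1)   # extend the context with the new token
    return ids
```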

T5 for translation. T5 frames translation as a text-to-text problem: the input is “translate English to French: [source sentence]” and the target is the French sentence. The encoder produces a contextualised representation of the source; the decoder generates the target autoregressively using cross-attention to attend to the encoder’s output at each step. The same architecture, training objective, and fine-tuning procedure apply to summarisation, question answering, and any other task expressible as text-in, text-out.
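
A sketch of this workflow, assuming the Hugging Face `transformers` library and its public `t5-small` checkpoint:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("translate English to French: The house is wonderful.",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)   # decoder runs autoregressively
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```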

