Why Attention Has No Sense of Order

Self-attention computes each output token’s representation as a weighted sum of all value vectors, where weights depend only on query-key dot products. There is nothing in that computation that depends on the order in which tokens appear. Formally, self-attention is permutation equivariant: if you permute the input tokens by any permutation $\pi$, the output representations are permuted by the same $\pi$. The model has no way to distinguish “the cat sat on the mat” from “mat the on sat cat the” without an explicit positional signal injected before or during attention. Adding positional encodings breaks this symmetry and gives the model information about the absolute or relative position of each token.
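
To see this concretely, the sketch below runs a single attention head (in NumPy, with made-up toy dimensions) on an input and on a permuted copy of it; the two outputs match up to the same permutation.

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Single-head self-attention with no positional signal."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
T, d = 6, 8                                   # toy sequence length and model dim
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(T)                     # an arbitrary reordering of the tokens
out_of_permuted_input = attention(X[perm], Wq, Wk, Wv)
permuted_output = attention(X, Wq, Wk, Wv)[perm]

# Permuting the input permutes the output identically: attention is order-blind.
assert np.allclose(out_of_permuted_input, permuted_output)
```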

Sinusoidal Positional Encodings

The original Transformer (Vaswani et al., 2017) uses fixed, deterministic encodings defined by:

$$\text{PE}(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

$$\text{PE}(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

where $pos$ is the token position and $i$ indexes the embedding dimension. These encodings are added element-wise to the token embeddings before the first attention layer. The frequencies form a geometric progression from $1$ (dimension 0) down to $1/10000$ (dimension $d_{\text{model}}-1$), analogous to a Fourier decomposition across scales.
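
A direct NumPy transcription of the two formulas (the function name and shapes are just illustrative):

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]                    # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2) pair index
    angles = pos / (10000 ** (2 * i / d_model))          # (max_len, d_model/2)
    pe = np.empty((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions: cosine
    return pe

# Added element-wise to the token embeddings before the first layer:
# x = token_embeddings + sinusoidal_encoding(T_max, d_model)[:seq_len]
```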

Key property: relative positions as linear combinations. For any fixed offset $k$, there exists a linear transformation $M_k$ (a rotation matrix) such that $\text{PE}(pos + k) = M_k \, \text{PE}(pos)$. This follows from the angle-addition formulas for sine and cosine:

$$\sin(pos + k) = \sin(pos)\cos(k) + \cos(pos)\sin(k), \qquad \cos(pos + k) = \cos(pos)\cos(k) - \sin(pos)\sin(k)$$

So the model can, in principle, compute relative positional differences via linear operations on the encoding vectors. In practice, sinusoidal encodings extrapolate poorly beyond the training sequence length: the model has never seen positions beyond $T_{\max}$ during training, and the higher-frequency dimensions change so rapidly that their signal becomes uninformative far outside the training range.
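
The linear-map property can be checked numerically for a single dimension pair; in the small sketch below, the 2x2 matrix is built only from the offset $k$ (the chosen frequency and values are arbitrary):

```python
import numpy as np

freq = 1 / 10000 ** (2 * 3 / 64)       # frequency of an arbitrary dimension pair
pos, k = 17.0, 5.0                     # any position and any fixed offset

def pe_pair(p):
    """The (sin, cos) pair for one dimension pair at position p."""
    return np.array([np.sin(p * freq), np.cos(p * freq)])

# Rotation by angle k*freq, derived from the angle-addition formulas above.
c, s = np.cos(k * freq), np.sin(k * freq)
M_k = np.array([[c, s],
                [-s, c]])

# M_k depends only on the offset k, not on pos.
assert np.allclose(M_k @ pe_pair(pos), pe_pair(pos + k))
```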

Learned Absolute Positional Embeddings

BERT and GPT-2 replace sinusoidal encodings with a learned embedding table $E_{\text{pos}} \in \mathbb{R}^{T_{\max} \times d_{\text{model}}}$, where each row is a trainable vector. At training time, position $pos$ contributes $E_{\text{pos}}[pos]$ to the input representation. This is simpler and slightly more expressive - the model can learn whatever positional features are most useful for the task - but it imposes a hard upper bound: the model cannot process sequences longer than $T_{\max}$ positions seen during training. Extrapolating to longer sequences requires workarounds such as interpolating between learned embeddings, which degrades quality.
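
A sketch of how the lookup is wired in at the input, with NumPy arrays standing in for trainable tables (all names and sizes here are illustrative):

```python
import numpy as np

T_max, d_model, vocab = 512, 768, 30_000
rng = np.random.default_rng(0)

# Both tables are trainable parameters in a real model.
tok_emb = rng.normal(scale=0.02, size=(vocab, d_model))
pos_emb = rng.normal(scale=0.02, size=(T_max, d_model))   # one row per absolute position

token_ids = np.array([101, 2054, 2003, 102])              # some input sequence
positions = np.arange(len(token_ids))                      # 0, 1, 2, ...

# Input to the first attention layer: token embedding + learned position embedding.
x = tok_emb[token_ids] + pos_emb[positions]

# Any sequence longer than T_max would index past the table -- the hard length limit.
```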

Relative Positional Encodings (Shaw et al.)

Rather than encoding the absolute position of each token, relative encodings represent the offset between pairs of tokens. Shaw et al. (2018) modify the key computation in self-attention to incorporate a relative position bias:

$$e_{ij} = \frac{(x_i W^Q)(x_j W^K + r_{ij})^\top}{\sqrt{d_k}}$$

where $r_{ij} \in \mathbb{R}^{d_k}$ is a learned embedding for the relative distance $i - j$, clipped to a maximum range. A parallel modification adds a second set of relative embeddings to the value aggregation. Because the bias depends only on $i - j$ and not on absolute positions, the model generalises more gracefully to sequence lengths not seen during training: offsets beyond the clipping distance simply reuse the boundary embedding, so every relative distance encountered at inference lies within the trained range.
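
A sketch of the modified logit computation for one head, with offsets clipped to a hypothetical maximum distance (the value-side modification is omitted):

```python
import numpy as np

def relative_logits(Q, K, rel_emb, max_rel):
    """e_ij = Q_i (K_j + r_{i-j})^T / sqrt(d_k), with clipped relative offsets."""
    T, d_k = Q.shape
    # Offset matrix: entry (i, j) is i - j, clipped to [-max_rel, max_rel].
    offsets = np.clip(np.arange(T)[:, None] - np.arange(T)[None, :], -max_rel, max_rel)
    r = rel_emb[offsets + max_rel]                   # (T, T, d_k) lookup by offset
    content = Q @ K.T                                # usual content-content term
    position = np.einsum('id,ijd->ij', Q, r)         # query-to-relative-position term
    return (content + position) / np.sqrt(d_k)

rng = np.random.default_rng(0)
T, d_k, max_rel = 5, 16, 3
Q, K = rng.normal(size=(T, d_k)), rng.normal(size=(T, d_k))
rel_emb = rng.normal(size=(2 * max_rel + 1, d_k))    # learned in a real model
print(relative_logits(Q, K, rel_emb, max_rel).shape) # (5, 5)
```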

RoPE: Rotary Position Embedding

RoPE (Su et al., 2022) achieves relative-position sensitivity without modifying the attention computation structure. The idea is to rotate the query and key vectors by a position-dependent angle before computing dot products. Define a rotation matrix $R_\theta^{pos} \in \mathbb{R}^{d_k \times d_k}$ that rotates each consecutive pair of dimensions by an angle $pos \cdot \theta_i$, where $\theta_i = 10000^{-2i/d_k}$ mirrors the sinusoidal frequency schedule. Then:

$$\text{score}_{ij} = \left(R_\theta^{i} q_i\right)^\top \left(R_\theta^{j} k_j\right) = q_i^\top R_\theta^{j-i} k_j$$

The last equality uses the orthogonality of rotation matrices: $(R_\theta^{i})^\top R_\theta^{j} = R_\theta^{j-i}$. Consequently, the attention score depends only on the relative position $j - i$, not on the absolute positions $i$ or $j$ individually. No positional information is added to the value vectors, and the encoding introduces no learned parameters: the frequency schedule is fixed.
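
A minimal sketch of applying the rotation to queries and keys before the dot product, using NumPy and even/odd dimension pairing (function names are illustrative; real implementations often use a half-split layout instead):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate each consecutive (even, odd) dimension pair of x by pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)         # theta_i = base^(-2i/d)
    angles = positions[:, None] * theta[None, :]      # (T, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin          # 2-D rotation of each pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
T, d_k = 8, 16
q, k = rng.normal(size=(T, d_k)), rng.normal(size=(T, d_k))
pos = np.arange(T, dtype=float)

scores = rope(q, pos) @ rope(k, pos).T / np.sqrt(d_k)

# Shifting every position by a constant leaves the scores unchanged:
# only the relative offsets j - i matter.
assert np.allclose(scores, rope(q, pos + 100) @ rope(k, pos + 100).T / np.sqrt(d_k))
```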

RoPE has become the dominant position encoding in recent large language models (LLaMA, Mistral, Gemma, GPT-NeoX) for several reasons: it is parameter-free, it integrates seamlessly with existing attention implementations (multiply $Q$ and $K$ by rotation matrices before the dot product), and it exhibits better length extrapolation than learned absolute embeddings when combined with context-window extension techniques such as position interpolation or YaRN.

ALiBi: Attention with Linear Biases

ALiBi (Press et al., 2021) takes a different approach: rather than modifying the query or key vectors, it subtracts a scalar penalty from each attention logit proportional to the distance between the query and key positions:

$$e_{ij} = \frac{q_i k_j^\top}{\sqrt{d_k}} - m \cdot |i - j|$$

where $m$ is a head-specific slope, fixed (not learned) according to a geometric schedule. There are no positional vectors added to the input embeddings at all. The linear bias acts as a soft locality prior: nearby tokens receive higher logits, and distant tokens are penalised proportionally to distance. ALiBi models (MPT, BLOOM) generalise robustly to sequences much longer than those seen during training, often maintaining low perplexity at two to four times the training length.
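
A sketch of the geometric slope schedule and the causal bias matrix (helper names are illustrative; the bidirectional form in the equation above uses $|i - j|$ instead of the causal distance):

```python
import numpy as np

def alibi_slopes(n_heads):
    """Head-specific slopes on a geometric schedule, e.g. 1/2, 1/4, ..., 1/256 for 8 heads."""
    start = 2.0 ** (-8.0 / n_heads)
    return np.array([start ** (h + 1) for h in range(n_heads)])

def alibi_bias(T, n_heads):
    """Penalty m_h * (j - i) added to causal attention logits (j <= i, so it is <= 0)."""
    dist = np.arange(T)[None, :] - np.arange(T)[:, None]   # entry (i, j) is j - i
    dist = np.minimum(dist, 0)                              # keep only the causal part
    return alibi_slopes(n_heads)[:, None, None] * dist      # (n_heads, T, T)

# logits = Q @ K.T / sqrt(d_k) + alibi_bias(T, n_heads), then causal mask and softmax.
print(alibi_bias(5, 8)[0])
```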

RoPE vs. ALiBi: A Live Trade-off

Both RoPE and ALiBi provide relative-position sensitivity and outperform absolute encodings on length extrapolation. RoPE preserves the full expressive power of the attention dot product and integrates naturally with position interpolation schemes, making it easier to extend context windows post-training. ALiBi imposes a structural locality bias - beneficial for tasks where proximity strongly predicts relevance, but potentially limiting for tasks requiring long-range retrieval. The choice between them remains an active design decision in foundation model development, with RoPE currently more prevalent in openly released models.

Examples

Context window extension. LLaMA 2 models are trained with a 4,096-token context window using RoPE. To extend this to 32,768 tokens, one approach is position interpolation: rescale the position indices so that the extended range $[0, 32768]$ maps back into the trained range $[0, 4096]$, dividing each position by $8$ before applying the RoPE rotations. With a short fine-tuning phase on long documents, models recover near-full performance at the extended context length. This works because the rotation angles stay within the range seen during training.
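
Assuming the `rope` helper sketched earlier, position interpolation reduces to dividing the position indices by the extension factor before applying the rotation (a sketch; only the 8x ratio comes from the text above):

```python
import numpy as np

train_ctx, target_ctx = 4096, 32768
scale = target_ctx / train_ctx                  # 8x extension

def interpolated_positions(seq_len):
    """Map positions [0, seq_len) back into the trained range [0, train_ctx)."""
    return np.arange(seq_len, dtype=float) / scale

pos = interpolated_positions(32768)
# q_rot = rope(q, pos); k_rot = rope(k, pos)    # angles stay within the trained range
print(pos[-1])                                   # 4095.875 -- never exceeds the trained maximum
```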

Why LLMs struggle with positional extrapolation. A model with learned absolute embeddings trained to $T_{\max} = 512$ sees position embedding $E[511]$ only a handful of times during training. Positions $512, 513, \ldots$ map to rows of $E_{\text{pos}}$ that were never updated - they remain at their random initialisation. The model’s attention patterns at those positions are effectively random, causing coherent generation to collapse. This is not a problem with RoPE or ALiBi, where the positional signal is computed from the position index at inference time and degrades more gracefully than a lookup into an untrained embedding table.

