Positional Encodings - Teaching Attention Where Things Are in a Sequence
Helpful context:
- Attention Mechanisms - Not All Tokens Are Created Equal
- Fourier Analysis - Every Signal Is a Sum of Sines
Attention doesn’t care about order.
Give a self-attention layer the tokens of “dog bites man” or “man bites dog” - it sees the same set of three vectors and produces the same output, just with rows permuted. There is nothing in the attention computation that distinguishes position 1 from position 2 from position 3. Swap any two tokens and the mechanism adapts automatically, producing a correspondingly swapped output.
For a bag-of-words classifier, this might be acceptable. For a language model, it is catastrophic. “The man bit the dog” and “the dog bit the man” have opposite meanings. “She gave him the book” and “She gave the book him” have different grammatical status. Subject, verb, object - these are defined by position in the sentence, not by the words alone.
The fix is positional encodings: inject some representation of position into each token’s vector before attention runs. The prescription is simple, but executing it well is not: the design of positional encodings has become one of the most active areas in large language model research.
The Problem, Precisely
Self-attention computes:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right) V$$
where $Q = XW_Q$, $K = XW_K$, $V = XW_V$.
If you apply a permutation matrix $P$ to the rows of $X$ - rearranging the tokens - then $Q$ becomes $PQ$, $K$ becomes $PK$, and $V$ becomes $PV$. The attention matrix becomes:
$$\text{softmax}\left(\frac{(PQ)(PK)^\top}{\sqrt{d}}\right) = \text{softmax}\left(\frac{PQK^\top P^\top}{\sqrt{d}}\right) = P \cdot \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right) \cdot P^\top.$$
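The last step holds because the softmax is applied row by row, so permuting the rows and columns of its argument permutes its result in the same way. Multiplying by the permuted values $PV$, the output is $P \cdot \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right) \cdot P^\top \cdot PV = P \cdot \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right) V = P \cdot \text{Attention}(Q, K, V)$, using $P^\top P = I$. The output rows are just the original output rows permuted the same way as the input. Self-attention is permutation-equivariant: permute the input, get a correspondingly permuted output, with no memory of which position was which.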
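The equivariance argument can be checked numerically. The following is a minimal NumPy sketch, not a reference implementation; the helper names (`softmax`, `self_attention`) and the random weights are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # Single-head self-attention with no positional information.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))                      # n token vectors
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

P = np.eye(n)[rng.permutation(n)]                # random permutation matrix

out = self_attention(X, W_Q, W_K, W_V)
out_perm = self_attention(P @ X, W_Q, W_K, W_V)

# Permuting the input rows permutes the output rows in the same way.
equivariant = np.allclose(out_perm, P @ out)
print(equivariant)
```

If the layer had any built-in notion of position, `out_perm` and `P @ out` would differ.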
The solution: before computing attention, add a position-dependent vector to each token embedding so that the same token at different positions has a different representation.
Absolute Sinusoidal Encodings
The original transformer (Vaswani et al. 2017) uses a fixed, deterministic positional encoding. For a sequence position $\text{pos} \in \{0, 1, 2, \ldots\}$ and an embedding dimension $d$, the positional vector $\text{PE}(\text{pos}) \in \mathbb{R}^d$ has entries:
$$\text{PE}(\text{pos}, 2i) = \sin\left(\frac{\text{pos}}{10000^{2i/d}}\right), \qquad \text{PE}(\text{pos}, 2i+1) = \cos\left(\frac{\text{pos}}{10000^{2i/d}}\right)$$
for $i = 0, 1, \ldots, d/2 - 1$. This vector is added to the token embedding: $\tilde{x}_{\text{pos}} = x_{\text{pos}} + \text{PE}(\text{pos})$.
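As an illustration, the formula above can be written in a few lines of NumPy. This is a sketch under the assumption that $d$ is even; the function name `sinusoidal_pe` is hypothetical.

```python
import numpy as np

def sinusoidal_pe(num_positions, d, base=10000.0):
    """Return a (num_positions, d) array whose row `pos` is PE(pos)."""
    assert d % 2 == 0, "this sketch assumes an even embedding dimension"
    pos = np.arange(num_positions)[:, None]       # shape (num_positions, 1)
    i = np.arange(d // 2)[None, :]                # shape (1, d/2)
    angles = pos / base ** (2 * i / d)            # pos / 10000^(2i/d)
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions 2i + 1
    return pe

pe = sinusoidal_pe(50, 16)
# Adding to token embeddings of the same shape would be: x_tilde = x + pe
```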
What This Encodes
Think of the position encoding as a clock with many hands, each ticking at a different frequency.
Dimension pair $(2i, 2i+1)$ encodes position using frequency $\omega_i = 1 / 10000^{2i/d}$.
- At $i = 0$: $\omega_0 = 1$. The sine and cosine complete a full cycle every $2\pi \approx 6.28$ positions. This is a high-frequency “fast hand.”
- At $i = d/2 - 1$: $\omega_{d/2-1} = 10000^{-(d-2)/d} \approx 10000^{-1} = 0.0001$ for large $d$. The sine and cosine complete a full cycle roughly every $2\pi / 0.0001 \approx 62{,}832$ positions. This is a low-frequency “slow hand.”
Low-dimension pairs vary rapidly across positions; high-dimension pairs vary slowly. Together they tile the position axis across many scales - just as Fourier analysis decomposes a signal into many frequencies.
The choice of $10000$ as the base is somewhat arbitrary; what matters is that the range of wavelengths (from $2\pi$ to $2\pi \times 10000$) covers the practical range of sequence lengths.
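To make the range of scales concrete, the wavelengths can be computed directly. This is a small sketch with an assumed $d = 512$; note that the longest wavelength comes out slightly below $2\pi \times 10000$ because the largest exponent is $(d-2)/d$ rather than exactly 1.

```python
import numpy as np

d = 512                                   # assumed embedding dimension
i = np.arange(d // 2)
omega = 1.0 / 10000 ** (2 * i / d)        # frequency of dimension pair i
wavelength = 2 * np.pi / omega            # positions per full cycle

print(wavelength[0])                      # fastest hand: 2*pi positions
print(wavelength[-1])                     # slowest hand: just under 2*pi*10000

# Consecutive wavelengths differ by a constant factor (geometric progression).
ratios = wavelength[1:] / wavelength[:-1]
```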
The Relative Position Property
A key algebraic property: for any fixed offset $k$, there exists a rotation matrix $R_k$ such that:
$$\text{PE}(\text{pos} + k) = R_k \cdot \text{PE}(\text{pos}).$$
This follows because for a single frequency pair, $(\sin((\text{pos}+k)\omega), \cos((\text{pos}+k)\omega))$ is a 2D rotation of $(\sin(\text{pos} \cdot \omega), \cos(\text{pos} \cdot \omega))$ by angle $k\omega$.
The implication: the dot product $\text{PE}(\text{pos}) \cdot \text{PE}(\text{pos}')$ depends only on the relative offset $\text{pos}' - \text{pos}$, not on the absolute positions. The model can, in principle, compute relative position relationships directly from the dot products of positional encodings, without needing to decode absolute positions first.
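Both claims can be verified numerically. The sketch below builds the block-diagonal matrix $R_k$ explicitly from the angle-addition formulas; the helper names `sinusoidal_pe` and `shift_matrix` are illustrative, and the specific positions are arbitrary.

```python
import numpy as np

def sinusoidal_pe(positions, d, base=10000.0):
    # Rows are PE(pos) with (sin, cos) interleaved per frequency pair.
    positions = np.asarray(positions, dtype=float)[:, None]
    omega = 1.0 / base ** (2 * np.arange(d // 2) / d)
    angles = positions * omega
    pe = np.empty((positions.shape[0], d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def shift_matrix(k, d, base=10000.0):
    # Block-diagonal R_k such that PE(pos + k) = R_k @ PE(pos).
    omega = 1.0 / base ** (2 * np.arange(d // 2) / d)
    R = np.zeros((d, d))
    for i, w in enumerate(omega):
        c, s = np.cos(k * w), np.sin(k * w)
        # sin(a+b) = sin(a)cos(b) + cos(a)sin(b); cos(a+b) = cos(a)cos(b) - sin(a)sin(b)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return R

d, k = 16, 7
pe = sinusoidal_pe([3, 3 + k, 40, 40 + k], d)
R_k = shift_matrix(k, d)

# The same R_k shifts any position by k.
shift_ok = np.allclose(R_k @ pe[0], pe[1]) and np.allclose(R_k @ pe[2], pe[3])

# Dot products depend only on the offset k, not on the absolute positions.
dot_a = pe[0] @ pe[1]    # positions 3 and 10
dot_b = pe[2] @ pe[3]    # positions 40 and 47
```

The two dot products agree because each frequency pair contributes $\cos(k\omega_i)$ regardless of where the pair of positions sits.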
Properties: What You Get, What You Don’t
The sinusoidal encoding has no learnable parameters - it is fixed by the formula. This means:
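- It is defined for sequence lengths longer than those seen during training. There is no embedding table to look up; the formula produces a vector for any position. Whether a trained model actually performs well at those unseen positions is a separate empirical question, and Press et al. (2022) report that sinusoidal models extrapolate only modestly.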
- It provides a smooth, continuous representation of position.
- It encodes relative position implicitly via the rotation matrix property.
What it does not provide: any learned adaptation to the specific task or data distribution.
Discomfort check. Why can’t you just use one-hot position vectors - a different unit vector for each position? You could, in principle. A one-hot vector for position $t$ in a sequence of length 10,000 would be a 10,000-dimensional vector with a 1 in slot $t$. Three problems: (1) Size - these vectors would be 10,000-dimensional, needing to be projected down, adding parameters. (2) No inductive bias - the model learns, from scratch, that position 3 is “close” to position 4. Sinusoidal encodings build this in: nearby positions have similar encodings. (3) No length generalization - if you train with sequences up to length 512 and then see a sequence of length 1000, you encounter position vectors 513, 514, … that the model has never seen. The formula-based approach has no such failure mode.
Learned Absolute Positional Encodings
The simplest alternative: treat positions like tokens. Learn a separate vector $e_t$ for each position $t$ in the training sequence length, stored in an embedding table. Add $e_t$ to the token embedding at position $t$.
This is what BERT uses. It is also used in GPT-2 and many other early transformer language models.
Advantages: the model freely adapts the position embeddings to whatever representation is most useful for the task. In practice, learned encodings often outperform sinusoidal on standard benchmarks.
Disadvantages: the model cannot extrapolate to lengths longer than it was trained on. If BERT is trained on sequences up to 512 tokens, position 513 has no embedding - you are outside the table. This is a hard limit.
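The hard limit is easy to see in code. In this sketch the table is filled with random numbers as a stand-in for trained parameters, and `add_learned_positions` is a hypothetical helper, not any library's API.

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, d = 512, 16
# Stand-in for a trained position embedding table: one row per position.
pos_table = rng.normal(scale=0.02, size=(max_len, d))

def add_learned_positions(token_emb, pos_table):
    # token_emb: (seq_len, d). Position t simply looks up row t of the table.
    seq_len = token_emb.shape[0]
    if seq_len > pos_table.shape[0]:
        raise ValueError(
            f"sequence length {seq_len} exceeds the table size {pos_table.shape[0]}"
        )
    return token_emb + pos_table[:seq_len]

out = add_learned_positions(rng.normal(size=(512, d)), pos_table)

try:
    add_learned_positions(rng.normal(size=(513, d)), pos_table)
    overflowed = False
except ValueError:
    overflowed = True   # position 512 (the 513th token) has no row in the table
```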
Relative Positional Encodings
Absolute encodings (whether fixed or learned) embed a specific position index. But what a model really needs, most of the time, is to know the relative distance between two positions: is word A near word B or far away?
Shaw et al. (2018) introduced relative positional encodings for self-attention. Instead of adding position information to token embeddings, they add a position-dependent bias directly to the attention score:
$$e_{ij} = \frac{(x_i W_Q)(x_j W_K + a_{i-j})^\top}{\sqrt{d}}$$
where $a_{i-j} \in \mathbb{R}^d$ is a learned vector depending on the clipped relative distance $i - j$. The bias makes attention aware of relative position without requiring the model to decode absolute positions.
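A minimal single-head sketch of this score follows, assuming a clipping distance `max_dist` and a table `rel_table` with one learned vector per clipped offset (both names are illustrative; random values stand in for learned ones).

```python
import numpy as np

def shaw_scores(X, W_Q, W_K, rel_table, max_dist):
    """Scores e_ij = q_i . (k_j + a_{i-j}) / sqrt(d) with clipped offsets.

    rel_table has shape (2 * max_dist + 1, d); row r holds the vector for
    clipped offset r - max_dist.
    """
    n, d = X.shape[0], W_Q.shape[1]
    Q, K = X @ W_Q, X @ W_K
    idx = np.arange(n)
    rel = np.clip(idx[:, None] - idx[None, :], -max_dist, max_dist)  # i - j
    A = rel_table[rel + max_dist]                  # (n, n, d) lookup
    content = Q @ K.T                              # q_i . k_j
    position = np.einsum("id,ijd->ij", Q, A)       # q_i . a_{i-j}
    return (content + position) / np.sqrt(d)

rng = np.random.default_rng(0)
n, d, max_dist = 6, 4, 2
X = rng.normal(size=(n, d))
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))
rel_table = rng.normal(size=(2 * max_dist + 1, d))
scores = shaw_scores(X, W_Q, W_K, rel_table, max_dist)
```

Because offsets are clipped, every pair further apart than `max_dist` shares the same vector, which keeps the table small.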
T5 (Raffel et al. 2020) uses a simpler version: the attention score at relative distance $i - j$ receives an additive scalar bias $r(i - j)$ drawn from a small learned table indexed by distance bucket. The table is shared across layers, but each attention head learns its own bias values.
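The scalar-bias idea can be sketched as a table lookup. This simplified version clips distances instead of using T5's logarithmic bucketing, and it keeps one row of biases per head; both are assumptions of the sketch, and `scalar_relative_bias` is a hypothetical name.

```python
import numpy as np

def scalar_relative_bias(n, bias_table, max_dist):
    """Build a (num_heads, n, n) additive bias from per-head scalar tables.

    bias_table has shape (num_heads, 2 * max_dist + 1); column r holds the
    scalar for clipped offset r - max_dist.
    """
    idx = np.arange(n)
    rel = np.clip(idx[:, None] - idx[None, :], -max_dist, max_dist)  # i - j
    return bias_table[:, rel + max_dist]

rng = np.random.default_rng(0)
num_heads, n, max_dist = 4, 6, 3
bias_table = rng.normal(size=(num_heads, 2 * max_dist + 1))  # stand-in for learned values
bias = scalar_relative_bias(n, bias_table, max_dist)

# Usage: scores[h] = Q[h] @ K[h].T / sqrt(d) + bias[h], followed by softmax.
```

The resulting bias is constant along each diagonal, since it depends only on $i - j$.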
Relative encodings generalize better to unseen lengths because relative distances within training range remain in-distribution even when absolute positions are new.
ALiBi: Linear Attention Biases
ALiBi (Press et al. 2022) takes the relative bias idea further, eliminating learned parameters entirely.
For each attention head $h$, ALiBi adds a fixed linear penalty to the attention score based on distance:
$$e_{ij}^{(h)} = \frac{q_i \cdot k_j}{\sqrt{d}} - m_h \cdot (i - j)$$
where $m_h$ is a fixed per-head slope taken from a geometric sequence (for 8 heads: $1/2, 1/4, 1/8, \ldots, 1/256$) and $(i - j) \geq 0$ is the distance (in causal attention the query position $i$ is never to the left of the key position $j$).
The effect: attending to distant positions is penalized. The penalty grows linearly with distance. Different heads apply different slopes: some heads are “short-sighted” (steep penalty), others are “long-sighted” (gentle penalty).
ALiBi has no positional parameters to learn, and the bias adds negligible overhead during computation. Empirically, Press et al. report that models trained with ALiBi generalize well to sequences longer than those seen during training - the linear bias extends naturally to new distances.
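The bias matrix is simple enough to build by hand. The sketch below assumes the number of heads is a power of two, in which case the slopes form a geometric sequence starting at $2^{-8/H}$ (giving $1/2, \ldots, 1/256$ for 8 heads); the function names are illustrative.

```python
import numpy as np

def alibi_slopes(num_heads):
    # Geometric sequence; assumes num_heads is a power of two.
    start = 2.0 ** (-8.0 / num_heads)
    return start ** np.arange(1, num_heads + 1)

def alibi_bias(n, slopes):
    """Return a (num_heads, n, n) bias: -m_h * (i - j) for j <= i, -inf otherwise."""
    idx = np.arange(n)
    dist = idx[:, None] - idx[None, :]            # i - j
    bias = -slopes[:, None, None] * dist          # linear penalty per head
    causal = dist >= 0                            # keys at or before the query
    return np.where(causal, bias, -np.inf)

slopes = alibi_slopes(8)
bias = alibi_bias(5, slopes)

# Usage: scores[h] = Q[h] @ K[h].T / sqrt(d) + bias[h], followed by softmax.
```

Head 0 (slope 1/2) penalizes distance most steeply; the last head (slope 1/256) is nearly position-blind.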
RoPE: Rotary Position Embedding
The dominant positional encoding in modern open-source LLMs is RoPE (Rotary Position Embedding), introduced by Su et al. (2021) and used in LLaMA, Falcon, Mistral, Qwen, GPT-NeoX, and most recent models.
The key idea is to encode position by rotating the query and key vectors.
For a 2D slice $(q_{2i}, q_{2i+1})$ of a query vector at position $m$:
$$\begin{pmatrix} \tilde{q}_{2i} \\ \tilde{q}_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix} \begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix}$$
where $\theta_i = 10000^{-2i/d}$. The same rotation (with position $n$ instead of $m$) is applied to the key vector. The dot product then satisfies:
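$$\tilde{q}_m \cdot \tilde{k}_n = q_m^\top R_\theta^{(n-m)} k_n$$
where $q_m$ and $k_n$ denote the full query and key vectors at positions $m$ and $n$, and $R_\theta^{(n-m)}$ is the block-diagonal rotation whose $i$-th block rotates by angle $(n - m)\theta_i$. The score therefore depends on the positions only through the relative offset $m - n$.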
This is the elegant property of RoPE: each vector is rotated according to its absolute position, yet the dot product between a query and a key depends only on their relative offset. You get an absolute encoding and relative attention scores from the same operation.
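A compact NumPy sketch of the rotation, using the interleaved pairing $(2i, 2i+1)$ from the formula above, shows the relative property directly. The function name `rope` is illustrative; production implementations often pair dimensions differently for efficiency.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate each (2i, 2i+1) slice of x by angle position * theta_i.

    x: (n, d) float array of query or key vectors; positions: length-n sequence.
    """
    n, d = x.shape
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    angles = np.asarray(positions, dtype=float)[:, None] * theta   # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 16))
k = rng.normal(size=(1, 16))

# Same vectors, same offset (3), very different absolute positions.
score_near = (rope(q, [5]) @ rope(k, [2]).T).item()
score_far = (rope(q, [105]) @ rope(k, [102]).T).item()
```

The two scores match, and since rotations preserve length, the norms of `q` and `k` are unchanged.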
Why RoPE Won
Compared to sinusoidal: RoPE is applied to $Q$ and $K$ directly, not just added to embeddings. The rotation affects the dot-product computation itself, not just the input representations. This is a richer and more direct injection of positional information.
Compared to learned absolute: RoPE generalizes to unseen lengths and requires no position embedding table.
Compared to ALiBi: RoPE has richer per-head position information (full rotations in multiple subspaces, not just a scalar bias). In practice, RoPE is generally reported to perform better on standard in-distribution benchmarks.
Context Length Extension With RoPE
One of the most practically important developments in LLM engineering is extending context windows. A model trained with RoPE at context length 4096 has only seen rotation angles up to $4096 \cdot \theta_i$ for each frequency $\theta_i$. At positions beyond 4096, it encounters rotation angles it has never seen during training, and performance degrades.
RoPE interpolation (Chen et al. 2023): instead of rotating by $m \cdot \theta_i$, rotate by $(m / s) \cdot \theta_i$ where $s$ is a scaling factor $s = L_{\text{new}} / L_{\text{train}}$. This compresses new positions into the range of trained positions. Combined with fine-tuning on longer sequences, this can extend context from, for example, 4096 to 32768 tokens ($s = 8$).
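The effect of the scaling factor on the rotation angles can be sketched as follows. The lengths and head dimension are example values, and `rope_angles` is a hypothetical helper.

```python
import numpy as np

def rope_angles(positions, d, base=10000.0, scale=1.0):
    # Angle for position m and frequency i is (m / scale) * theta_i.
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    return (np.asarray(positions, dtype=float)[:, None] / scale) * theta

L_train, L_new, d = 4096, 32768, 64
s = L_new / L_train                               # scaling factor, here 8

# Largest angle at the last position (always the i = 0 pair, where theta = 1).
extrapolated_max = rope_angles([L_new - 1], d).max()           # far beyond training
interpolated_max = rope_angles([L_new - 1], d, scale=s).max()  # back inside it
```

With interpolation the largest angle stays below `L_train`, at the cost of packing neighbouring positions `s` times closer together, which is why fine-tuning is still needed.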
YaRN (Peng et al. 2023) and LongRoPE (Ding et al. 2024) refine this further, using different scaling for different frequency components - high-frequency dimensions, which complete many full rotations within the training window, can be left largely unscaled, while low-frequency dimensions, whose wavelength exceeds the training length, need interpolation. These techniques have extended context windows from 4K to 128K or even 1M tokens in production models.
Extending RoPE to Long Contexts
Models are typically pretrained with RoPE at context lengths of a few thousand tokens. At longer contexts, the rotation angles grow large and the model sees angle combinations it never encountered during training, causing degradation.
YaRN (Yet Another RoPE Extension) addresses this by combining three ideas:
- Frequency interpolation. Scale the position indices by $\text{scale} = L_{\text{train}} / L_{\text{target}}$ so the angles stay within the trained range. Interpolating all frequencies equally is suboptimal, though: high-frequency components (fast rotation, local information) already complete many full rotations during training and need little or no scaling, while low-frequency components (slow rotation, long-range information) need the full interpolation. YaRN applies different interpolation factors per frequency band.
- Attention temperature scaling. The effective temperature of the softmax (controlled by $\sqrt{d_k}$) needs adjustment as context grows, so YaRN multiplies attention logits by a length-dependent temperature factor.
- NTK-aware interpolation. Rather than compressing all positions uniformly, change the RoPE base so that the stretch is spread unevenly across dimensions, leaving the highest frequencies almost untouched. The name refers to Neural Tangent Kernel theory, which motivates preserving high-frequency detail.
In practice, YaRN can extend a model trained to 4k tokens to handle 128k tokens with fine-tuning on a small amount of long-context data. It is used in gpt-oss-120b (extending to 131k tokens), Kimi K2, and a number of other long-context models.
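The per-frequency blending and the temperature factor can be sketched as below. This follows the formulas as given in the YaRN paper to the best of my reading: a linear ramp on the number of rotations each dimension completes within the training window, and $\sqrt{1/t} = 0.1 \ln s + 1$ with $s = L_{\text{target}} / L_{\text{train}}$. The thresholds `alpha=1` and `beta=32` and all function names are assumptions of this sketch, not values taken from the text above.

```python
import numpy as np

def yarn_frequencies(d, L_train, s, base=10000.0, alpha=1.0, beta=32.0):
    """Blend original and interpolated RoPE frequencies per dimension pair."""
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    wavelength = 2 * np.pi / theta
    rotations = L_train / wavelength       # full turns within the training window
    # gamma = 1: many rotations seen in training, keep theta (high frequency).
    # gamma = 0: less than alpha rotations seen, fully interpolate (low frequency).
    gamma = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    return (1.0 - gamma) * theta / s + gamma * theta

def yarn_logit_scale(s):
    # Attention logits are multiplied by 1/t, where sqrt(1/t) = 0.1 * ln(s) + 1.
    return (0.1 * np.log(s) + 1.0) ** 2

d, L_train, s = 128, 4096, 32.0            # example: 4k -> 128k
theta = 10000.0 ** (-2.0 * np.arange(d // 2) / d)
theta_yarn = yarn_frequencies(d, L_train, s)
logit_scale = yarn_logit_scale(s)
```

The fastest dimension pair is left unchanged, the slowest is divided by `s`, and the pairs in between are blended.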
NoPE, RNoPE, and Document Masking
NoPE (No Positional Embedding). Some transformer layers can learn useful representations without any positional encoding - relying purely on causal masking and attention patterns to infer relative order. Tokens can still infer position from the causal attention mask (future tokens are masked, so attending to a specific number of past tokens implies a position). NoPE layers have no extrapolation problem because they never saw position-encoded input, but they underperform on short-context reasoning tasks where absolute position matters.
RNoPE (Rotary NoPE). Alternate between RoPE layers (which handle local context and benefit from position information) and NoPE layers (which handle long-range retrieval without extrapolation issues). SmolLM3 uses RNoPE with one NoPE layer every 4 layers, achieving similar short-context performance to pure RoPE while extending better to long contexts. The intuition: local syntactic patterns need precise position information (RoPE layers handle this), while long-range semantic retrieval benefits from position-agnostic attention (NoPE layers handle this).
Document masking (intra-document attention masking). When packing multiple short documents into a single training sequence (to avoid padding waste), standard causal masking lets tokens in document $B$ attend to tokens in document $A$, even though they are unrelated. This cross-document attention degrades performance and, more importantly, makes long-context extension harder - the model learns to attend across arbitrary document boundaries rather than within coherent contexts.
Document masking modifies the attention mask so that token $i$ in document $B$ can only attend to previous tokens in document $B$, not to any token in document $A$. This adds implementation complexity (the attention mask is no longer a simple lower-triangular matrix) and requires an attention kernel that supports it, such as the variable-length (packed-sequence) variants of FlashAttention, rather than the plain causal path. But the benefit is substantial: models with document masking extend more cleanly from 4k to 64k+ tokens because they learn coherent long-range attention within documents rather than spurious cross-document patterns.
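The mask itself is the intersection of a causal mask and a same-document mask. A minimal dense sketch follows, assuming each token carries an integer document id (`doc_ids` is an illustrative name); real training code would normally pass sequence boundaries to a fused kernel instead of materializing the full matrix.

```python
import numpy as np

def document_causal_mask(doc_ids):
    """Boolean (n, n) mask: True where query i may attend to key j."""
    doc_ids = np.asarray(doc_ids)
    n = doc_ids.shape[0]
    same_doc = doc_ids[:, None] == doc_ids[None, :]    # block-diagonal structure
    causal = np.tril(np.ones((n, n), dtype=bool))      # j <= i
    return same_doc & causal

# Two packed documents: tokens 0-2 belong to document A, tokens 3-4 to document B.
mask = document_causal_mask([0, 0, 0, 1, 1])
print(mask.astype(int))

# Usage: scores = np.where(mask, scores, -np.inf) before the softmax.
```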
ALiBi vs. RoPE: A Direct Comparison
| Property | ALiBi | RoPE |
|---|---|---|
| Parameters | None | None |
| Computation overhead | Negligible scalar bias | Rotation of Q, K |
| Length generalization | Strong (linear extrapolation) | Good with interpolation |
| In-distribution performance | Competitive | State of the art |
| Adoption | Several open models | LLaMA, Falcon, Mistral, most modern LLMs |
Both approaches eliminate learned position parameters. Both offer some degree of length generalization. RoPE dominates in performance and adoption; ALiBi is simpler to implement and arguably more principled for extrapolation.
Summary
| Encoding | Mechanism | Parameters | Extrapolation |
|---|---|---|---|
| Sinusoidal (absolute) | Fixed $\sin/\cos$ at multiple frequencies | None | Defined at any length; limited in practice |
| Learned absolute | Embedding table per position | $L \times d$ | None (hard limit) |
| Relative (Shaw/T5) | Learned bias on attention score | Small table | Moderate |
| ALiBi | Fixed linear bias per head slope | None | Strong |
| RoPE | Rotation of Q and K vectors | None | Good + extendable |
| YaRN | Per-frequency RoPE interpolation + temperature scaling | None | Very strong (4k to 128k+) |
| NoPE / RNoPE | No position encoding on select layers; alternate with RoPE layers | None | Strong on long range |
| Document masking | Intra-document attention mask during training | None | Improves long-context coherence |
Positional encodings are the glue between the order-agnostic attention mechanism and the order-dependent structure of language. Learned absolute encodings are simple but hit a hard length limit. Relative encodings match what attention usually needs, the distance between tokens, and stay in-distribution at new lengths. RoPE is the current consensus - it encodes absolute positions in rotations while making dot products relative, and it extends gracefully with frequency interpolation.
The choice of positional encoding determines the context length ceiling of an LLM. It is not a minor implementation detail; it is architecture.