Attention Mechanisms - Not All Tokens Are Created Equal // Megha Bose

Read this sentence: “The trophy doesn’t fit in the suitcase because it is too big.”

You immediately know that “it” refers to the trophy - the trophy is what’s too big to fit. You didn’t run through every possible referent and weigh them consciously. Your eye slid back to “trophy” and the meaning snapped into place. You attended selectively to the relevant earlier word.

For a long time, neural sequence models had no analog of this ability. Recurrent networks processed sequences left to right, compressing everything seen so far into a fixed-size hidden state. By the time the model generated an output, the beginning of the input had been smeared into a bottleneck vector and partially lost. Attention mechanisms change this - they give neural networks a structured, differentiable way to look back at any part of the input and decide what to focus on.

What Attention Actually Does

Before attention, sequence-to-sequence models routed all information through a single fixed-dimensional hidden state - the final hidden state of the encoder had to compress the entire input sequence. This is an information bottleneck: a 1000-word document compressed into a 512-dimensional vector. Long-range dependencies were theoretically possible (an RNN can in principle carry information across many steps) but practically difficult (gradients vanish over long distances, and the hidden state has to serve too many purposes at once).

Attention does not solve a problem that was previously unsolvable. It changes where the solution lives. Instead of learning to route information implicitly through hidden states over many timesteps, attention makes routing an explicit, differentiable operation at every layer. The model computes, at each step, a weighted mixture over all positions - and the weights are learned and input-dependent. They receive no supervision of their own: they are trained end-to-end, through gradients of the downstream loss.

This explicitness has two consequences. First, gradients flow directly from any output position to any input position, regardless of sequence length - within one attention layer, the gradient path length is O(1), not O(n). This is a large part of why attention handles long-range dependencies better than RNNs. Second, the attention weights can be inspected: you can look at which positions the model attends to and understand (to some degree) what information it’s using. The implicit routing in RNN hidden states had no such inspectable structure.

The quadratic cost - $O(n^2)$ in sequence length - is the price of this directness. Every position attends to every other position. For short sequences, this is fast. For sequences of length 10,000 (long documents, high-resolution images), it becomes the bottleneck. Much “efficient attention” research is about approximating the full attention matrix, or avoiding storing it in full, while preserving its key properties.

The Original Attention: Bahdanau et al. 2015

The first widely adopted attention mechanism appeared in the context of neural machine translation. The problem was exactly the information bottleneck we identified in seq2seq models: the encoder’s final hidden state cannot hold everything relevant about a long source sentence.

Bahdanau, Cho, and Bengio’s insight was simple: instead of using only the encoder’s final state, let the decoder look at all encoder hidden states $h_1, h_2, \ldots, h_T$ - one per source token - and take a weighted average.

More precisely, when generating the $j$-th output token, the decoder has a hidden state $s_{j-1}$ from the previous output step. The mechanism:

Step 1 - alignment score. Compute a score $e_{ij}$ measuring how relevant encoder state $h_i$ is when the decoder is at step $j$:

$$e_{ij} = a(s_{j-1}, h_i).$$

The function $a$ is a small feedforward network (or a dot product - various choices work). It asks: given what the decoder knows at step $j-1$, how much should it care about position $i$ in the source?

Step 2 - attention weights. Normalize the scores into a probability distribution using softmax:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{kj})}.$$

The weights $\alpha_{ij}$ sum to 1 over the source positions $i$. They form an attention distribution over the source sentence.

Step 3 - context vector. Compute a weighted average of encoder states:

$$c_j = \sum_i \alpha_{ij} h_i.$$

This context vector $c_j$ is a soft, differentiable summary of the source: it focuses on the positions that are most relevant to generating output token $j$.

The decoder then uses $c_j$ along with $s_{j-1}$ to produce the $j$-th output.
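The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the additive scoring form $a(s, h) = v^\top \tanh(W_s s + W_h h)$ as one choice for the feedforward network, and the parameter names (`W_s`, `W_h`, `v`) and all sizes are hypothetical.

```python
import numpy as np

def additive_attention(s_prev, H, W_s, W_h, v):
    """One decoder step of additive attention.

    s_prev: (d_s,)   previous decoder state s_{j-1}
    H:      (T, d_h) encoder states h_1..h_T, one row per source token
    W_s, W_h, v: parameters of the scoring network a (hypothetical names)
    """
    # Step 1: alignment scores e_i = v^T tanh(W_s s_prev + W_h h_i), shape (T,)
    scores = np.tanh(s_prev @ W_s + H @ W_h) @ v
    # Step 2: softmax over source positions (subtract max for numerical stability)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Step 3: context vector = weighted average of encoder states, shape (d_h,)
    context = weights @ H
    return context, weights

# Toy sizes, chosen only for illustration
rng = np.random.default_rng(0)
T, d_h, d_s, d_a = 5, 4, 3, 6
H = rng.normal(size=(T, d_h))
s_prev = rng.normal(size=d_s)
W_s = rng.normal(size=(d_s, d_a))
W_h = rng.normal(size=(d_h, d_a))
v = rng.normal(size=d_a)

context, weights = additive_attention(s_prev, H, W_s, W_h, v)
```

Here `weights` plays the role of the column $\alpha_{\cdot j}$ for a single decoder step $j$, and `context` is $c_j$.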

What the Model Actually Learns

The remarkable thing about this mechanism is what emerges from training it. For a French-to-English translation model, you can visualize the attention matrix $\alpha_{ij}$ - which source positions the decoder attended to when generating each output word.

The result is an alignment table, learned automatically without any alignment labels. When generating “cat” in English, the model attends strongly to “chat” in French. When generating “not”, it attends to “ne” and “pas”. The model has discovered correspondence between source and target tokens purely from observing parallel text.

Scaled Dot-Product Attention

Bahdanau’s attention used a learned neural network for the alignment score. A cleaner formulation - the one used in transformers - replaces this with a simple dot product. This version, from Vaswani et al. (2017), also brings the now-standard Query, Key, Value framework.

You have three matrices:

$Q \in \mathbb{R}^{n \times d}$: Queries - $n$ row vectors, each asking “what do I need?”
$K \in \mathbb{R}^{m \times d}$: Keys - $m$ row vectors, each advertising “here is what I contain.”
$V \in \mathbb{R}^{m \times d_v}$: Values - $m$ row vectors, each holding the actual content.

The attention computation:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right) V.$$

Step by step:

$QK^\top$: a matrix of dot products. Entry $(i, j)$ is the inner product of query $i$ with key $j$ - a scalar score measuring compatibility. The result is $n \times m$.

Divide by $\sqrt{d}$: scale down the scores. Without this, the dot products grow with $d$ - each query-key score is a sum of $d$ terms, and if the components are independent with zero mean and unit variance (a reasonable model of random initialization), the variance of the sum is $d$, so the standard deviation is $\sqrt{d}$. Large values push the softmax into saturation, where gradients vanish. Dividing by $\sqrt{d}$ restores variance to 1 and keeps training stable.

$\text{softmax}(\cdot)$: applied row-wise. Each row of $QK^\top/\sqrt{d}$ becomes a probability distribution over the $m$ key positions. This is the $n \times m$ attention matrix.

Multiply by $V$: the output at position $i$ is a weighted average of all value rows, where the weights come from row $i$ of the attention matrix. The result is $n \times d_v$.

Each output position gets a customized mixture of all value vectors, shaped by how strongly its query matched each key.
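The four steps can be written almost verbatim in NumPy. The sketch below is a minimal single-head version with no masking or batching; the function names and toy shapes are illustrative assumptions.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d), K: (m, d), V: (m, d_v) -> output (n, d_v), weights (n, m)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)    # (n, m) scaled compatibility scores
    weights = softmax_rows(scores)   # each row is a distribution over the m keys
    return weights @ V, weights      # weighted average of value rows

# Toy example: n = 3 queries, m = 5 key/value pairs
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))
out, A = scaled_dot_product_attention(Q, K, V)
```

One sanity check worth noting: if every query is the zero vector, all scores are equal, the weights are uniform, and each output row is simply the mean of the value rows.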

Discomfort check. The scaling by $\sqrt{d}$ sounds arbitrary. Here is the calculation behind it. Suppose the entries of $Q$ and $K$ are independently sampled from $\mathcal{N}(0, 1)$. Then a single dot product $q \cdot k = \sum_{i=1}^d q_i k_i$ has mean 0 and variance $d$ (each of the $d$ terms $q_i k_i$ has variance 1, and they are independent). So the standard deviation is $\sqrt{d}$. When $d = 512$, a typical dot product has magnitude around $\sqrt{512} \approx 22.6$ - not 1. Softmax of values this large is nearly one-hot: almost all probability mass goes to the maximum. Dividing by $\sqrt{d}$ brings the typical magnitude back to 1, and softmax stays spread out and informative.
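The variance argument is easy to check empirically. The following sketch samples query-key pairs under the same assumption as the calculation above (independent standard normal entries) and compares the spread of raw and scaled dot products; the sample count and the choice of 10 keys are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
num_pairs = 10_000

# Independent N(0, 1) entries, matching the assumption in the text
q = rng.standard_normal((num_pairs, d))
k = rng.standard_normal((num_pairs, d))
dots = (q * k).sum(axis=1)                 # one dot product per pair

raw_std = dots.std()                       # should land near sqrt(512)
scaled_std = (dots / np.sqrt(d)).std()     # should land near 1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Treat the first 10 dot products as one query's scores against 10 keys
scores = dots[:10]
raw_peak = softmax(scores).max()                  # largest weight without scaling
scaled_peak = softmax(scores / np.sqrt(d)).max()  # largest weight with scaling

print(f"std raw: {raw_std:.2f}, std scaled: {scaled_std:.3f}")
print(f"max softmax weight raw: {raw_peak:.3f}, scaled: {scaled_peak:.3f}")
```

Shrinking all scores by a common factor can only flatten the softmax, so the scaled distribution is never more peaked than the raw one.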

Attention as a Differentiable Database

There is a useful analogy that clarifies the Q, K, V design.

Imagine a key-value database. You have entries $\{(k_1, v_1), (k_2, v_2), \ldots, (k_m, v_m)\}$. You query the database with $q$. In a hard lookup, you return $v_j$ where $k_j = q$ exactly. In a soft lookup, you compute similarity scores between $q$ and all keys, normalize them, and return a weighted average of values.

Attention is a soft, differentiable database lookup:

The keys determine what each entry “is.”
The values determine what each entry “returns.”
The query specifies what you’re looking for.

The difference from a real database: every operation is differentiable with respect to $Q$, $K$, and $V$. You can backpropagate through the lookup. This means $Q$, $K$, $V$ are not fixed - they are linear projections of learned representations, trained end-to-end to make the lookup useful.
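To make the analogy concrete, here is a small sketch of a soft lookup over three entries. The `sharpness` multiplier is a hypothetical knob added only for illustration (it is not part of standard attention): turning it up shows the soft lookup approaching the hard one.

```python
import numpy as np

def soft_lookup(q, keys, values, sharpness=1.0):
    """Return a similarity-weighted average of values.

    sharpness is an illustrative parameter: larger values concentrate
    the weights on the best-matching key.
    """
    scores = sharpness * (keys @ q)          # similarity of q to every key
    w = np.exp(scores - scores.max())
    w /= w.sum()                             # normalize to a distribution
    return w @ values

keys = np.eye(3)                             # three orthogonal keys
values = np.array([[10.0, 0.0],
                   [0.0, 20.0],
                   [5.0, 5.0]])
q = keys[1]                                  # query matches the second key exactly

blended = soft_lookup(q, keys, values)                      # mixes all three values
nearly_hard = soft_lookup(q, keys, values, sharpness=50.0)  # effectively values[1]
```

With `sharpness=1.0` the result is a blend dominated by the second entry; with a large `sharpness` it is numerically indistinguishable from the exact-match answer `values[1]`.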

Self-Attention

In the attention we have described so far, queries come from the decoder and keys/values come from the encoder. But there is another flavor, equally important: self-attention, where $Q$, $K$, and $V$ all come from the same sequence.

Given a sequence of token representations $X \in \mathbb{R}^{n \times d}$:

$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V.$$

The matrices $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$ are learned projections. Self-attention then computes:

$$\text{Attention}(X W_Q, X W_K, X W_V) = \text{softmax}\left(\frac{X W_Q W_K^\top X^\top}{\sqrt{d}}\right) X W_V.$$

Every token attends to every other token in the same sequence - including itself.

This gives the model a direct mechanism for the coreference problem from our opening example. In “The trophy doesn’t fit in the suitcase because it is too big”, the token “it” can attend strongly to “trophy” (and weakly to “suitcase”). In a well-trained model, we would expect the attention weight $\alpha_{\text{it}, \text{trophy}}$ to be large and $\alpha_{\text{it}, \text{suitcase}}$ to be small. The model learns this from data, without being given coreference labels.
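A minimal sketch of the self-attention formula above, assuming a single head and randomly initialized (untrained) projections, so the weights it produces are not meaningful, only the shapes and mechanics are:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """X: (n, d) token representations. Returns output (n, d) and weights (n, n)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # all three come from the same X
    scores = Q @ K.T / np.sqrt(Q.shape[-1])      # (n, n): every token vs every token
    scores -= scores.max(axis=-1, keepdims=True) # stabilize the softmax
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)           # row i: where token i attends
    return A @ V, A

rng = np.random.default_rng(0)
n, d = 6, 8                                      # toy sizes
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
out, A = self_attention(X, W_Q, W_K, W_V)
```

Row `A[i]` is the attention distribution of token $i$ over all $n$ tokens, and the diagonal entry `A[i, i]` is the weight a token places on itself.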

Multi-Head Attention

A single attention head produces one weighted average of values per position. But a sentence has many simultaneous structure types: syntactic dependencies, semantic relationships, coreference chains, named entities. A single head may not capture all of them at once.

Multi-head attention runs $h$ attention heads in parallel, each with its own learned projections:

$$\text{head}_i = \text{Attention}(Q W_{Q}^i, K W_{K}^i, V W_{V}^i)$$

where $W_Q^i, W_K^i \in \mathbb{R}^{d \times d_k}$ and $W_V^i \in \mathbb{R}^{d \times d_v}$ are head-specific projection matrices. The outputs are concatenated and projected back to the model dimension by $W_O \in \mathbb{R}^{h d_v \times d}$:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W_O.$$

In the original transformer, $h = 8$ heads, $d = 512$, $d_k = d_v = d/h = 64$. Total computation is roughly the same as a single head of dimension 512.
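The sketch below shows one common way to organize the computation, under stated assumptions: it is the self-attention case (queries, keys, and values all come from `X`), $d_k = d_v = d/h$, and the $h$ per-head matrices $W_Q^i$ are stored as column blocks of a single $d \times d$ matrix `W_Q` (likewise for `W_K`, `W_V`). Projecting once and then splitting is equivalent to applying each head's matrix separately.

```python
import numpy as np

def softmax_rows(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """X: (n, d); W_Q, W_K, W_V, W_O: (d, d); h must divide d."""
    n, d = X.shape
    d_k = d // h

    def split(M):
        # (n, d) -> (h, n, d_k): head i gets columns [i*d_k, (i+1)*d_k)
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n), one matrix per head
    heads = softmax_rows(scores) @ V                   # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d)    # Concat(head_1, ..., head_h)
    return concat @ W_O                                # final output projection

rng = np.random.default_rng(0)
n, d, h = 5, 8, 2                                      # toy sizes
X = rng.normal(size=(n, d))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d, d)) for _ in range(4))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
```

The parameter count makes the "roughly the same computation" remark concrete: with $h = 8$ and $d_k = 64$, the eight $512 \times 64$ query projections together hold exactly as many weights as one $512 \times 512$ matrix.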

The heads do not divide responsibilities by fiat - they learn to specialize through training. Empirically, different heads attend to different patterns: one head may track subject-verb dependencies, another may track modifier-noun relationships, another may focus on coreference. Each head operates in a different learned subspace of the representation, and different subspaces capture different relationship types.

Why heads specialize. Running $h$ attention heads in parallel gives $h$ independently learned routing patterns, whose outputs are concatenated and projected back to the model dimension. The specialization is partly explained by the projection matrices. Each head projects queries, keys, and values into its own lower-dimensional subspace of dimension $d_k$ before computing attention (and, within a head, the scores are scaled by $\sqrt{d_k}$, the dimension of that subspace). Different heads project into different subspaces, so they “see” different aspects of the representation, and heads whose subspaces emphasize different features learn to attend to different patterns.

Specifically, in trained transformers: some heads attend to adjacent tokens (local syntax), some to the corresponding position in the other sequence (alignment), some to specific token types (e.g., “[CLS]” or punctuation), and some to semantically related tokens across long distances. This is not engineered - it emerges from gradient descent on the language modeling objective. The fact that it emerges consistently across models and languages suggests it reflects real structure in language, not random variation.

Cross-Attention

When you have two sequences - a source and a target - you can use cross-attention:

$$Q = X_{\text{target}} W_Q, \quad K = X_{\text{source}} W_K, \quad V = X_{\text{source}} W_V.$$

The queries come from the target sequence (e.g., the decoder), but the keys and values come from the source sequence (e.g., the encoder). Each target position attends to source positions.

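This is the same idea as the Bahdanau attention described earlier, reformulated in the Q, K, V framework: the decoder state plays the role of the query, the encoder states serve as both keys and values, and the learned scoring network is replaced by a scaled dot product. It is how a transformer decoder “reads” the encoder when doing translation or summarization: the decoder queries the encoder’s representations using cross-attention at every layer.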

Cross-attention differs from self-attention only in the origin of the queries versus keys and values. The mechanics are identical.
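In code, the only change from self-attention is which sequence feeds which projection. A minimal single-head sketch, with hypothetical toy shapes and untrained random projections:

```python
import numpy as np

def cross_attention(X_target, X_source, W_Q, W_K, W_V):
    """X_target: (n, d), X_source: (m, d). Returns output (n, d), weights (n, m)."""
    Q = X_target @ W_Q                 # queries from the target sequence
    K = X_source @ W_K                 # keys from the source sequence
    V = X_source @ W_V                 # values from the source sequence
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True) # row i: target position i over source positions
    return A @ V, A

rng = np.random.default_rng(0)
n, m, d = 3, 7, 8                      # 3 target tokens, 7 source tokens
X_target = rng.normal(size=(n, d))
X_source = rng.normal(size=(m, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
out, A = cross_attention(X_target, X_source, W_Q, W_K, W_V)
```

The attention matrix is now $n \times m$ rather than square: one row per target position, one column per source position.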

The Quadratic Cost

Scaled dot-product attention has a cost: $O(n^2)$ in sequence length, where $n$ is the number of tokens.

The attention matrix $\text{softmax}(QK^\top / \sqrt{d})$ is $n \times n$. Computing it requires $n^2$ dot products. Storing it requires $O(n^2)$ memory. For $n = 1000$, that’s $10^6$ entries per head per layer - manageable. For $n = 100{,}000$, it’s $10^{10}$ entries per head per layer - not manageable without specialized algorithms.

This quadratic scaling is the central computational bottleneck of transformer models. It is why efficient attention - FlashAttention, sparse attention, linear attention - is an active research area. And it is why the context length of a language model is a meaningful constraint, not just an arbitrary limit.

The quadratic cost is not merely a theoretical concern. At sequence length 512 (typical for BERT), the attention matrix has 262,144 entries per head per layer - manageable. At sequence length 32,768 (a long document), it has over 1 billion entries per head per layer - about 2 GiB at 16-bit precision, which quickly exceeds GPU memory once multiplied across heads and layers if the matrix is stored in full. This is why document-length tasks often rely on chunking (attend within windows), sparse attention (attend to a fixed set of positions), or linearized attention (approximate the attention matrix with a lower-rank structure). None of these fully preserves the properties of full attention. The choice of approximation is a central engineering decision in deploying attention to long-context tasks.
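The arithmetic behind these numbers is short enough to script. The helper below is a back-of-the-envelope sketch: it assumes the full matrix is stored and defaults to 2 bytes per entry (16-bit floats), both of which are assumptions rather than properties of any particular implementation.

```python
def attention_matrix_bytes(n, bytes_per_entry=2, heads=1, layers=1):
    """Memory to store full n x n attention matrices (assumes 16-bit entries by default)."""
    return n * n * bytes_per_entry * heads * layers

for n in (512, 1_000, 32_768, 100_000):
    entries = n * n
    gib = attention_matrix_bytes(n) / 2**30
    print(f"n={n:>7,}: {entries:>14,} entries, {gib:8.3f} GiB per head per layer")
```

Passing `heads` and `layers` scales the estimate to a whole model, which is where the totals become prohibitive.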

Hierarchical Attention

Flat attention attends over all tokens in a sequence simultaneously. But some structures are hierarchically organized: a document is made of paragraphs, each paragraph is made of sentences, each sentence is made of words. Attending over thousands of words at once is expensive and ignores this structure.

Hierarchical attention networks (HAN) - introduced by Yang et al. (2016) for document classification - apply two levels of attention in sequence.

Word-level attention: For each sentence $s$ containing words $w_1, \ldots, w_T$, encode each word with a bidirectional RNN to get annotations $h_1, \ldots, h_T$. Then apply attention:

$$u_t = \tanh(W_w h_t + b_w), \quad \alpha_t = \frac{\exp(u_t^\top u_w)}{\sum_{t'} \exp(u_{t'}^\top u_w)}, \quad s = \sum_t \alpha_t h_t$$

The word-level context vector $u_w$ is a learned query - it represents “which words are informative.” The sentence vector $s$ is the weighted sum of word annotations.

Sentence-level attention: Encode each sentence vector with another bidirectional RNN to get sentence annotations $h_1^s, \ldots, h_L^s$. Apply attention again:

$$u_i^s = \tanh(W_s h_i^s + b_s), \quad \alpha_i^s = \frac{\exp({u_i^s}^\top u_s)}{\sum_{i'} \exp({u_{i'}^s}^\top u_s)}, \quad v = \sum_i \alpha_i^s h_i^s$$

The document vector $v$ is the weighted sum of sentence vectors. This is fed to a classifier.

The two context vectors $u_w$ and $u_s$ are the key design: they are learned end-to-end from labels. A document about sports will have high attention weights on sentences mentioning games and scores; within those sentences, high weights on nouns and verbs that carry semantic content.
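Both levels use the same pooling operation, so a single helper covers them. The sketch below is deliberately simplified: it skips the bidirectional RNN encoders and treats the given arrays as the annotations directly, so it illustrates the two attention steps only, not the full HAN model. All sizes and parameter values are made up for illustration.

```python
import numpy as np

def attention_pool(H, W, b, u):
    """Pool annotations H (T, d) into one vector using a learned context vector u.

    Implements u_t = tanh(W h_t + b), alpha = softmax(u_t^T u), output = sum_t alpha_t h_t.
    """
    U = np.tanh(H @ W.T + b)              # (T, d_a) hidden representations u_t
    scores = U @ u                        # (T,) similarity to the context vector
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                  # attention weights over the T items
    return alpha @ H, alpha

rng = np.random.default_rng(0)
d, d_a = 6, 4
# A toy "document": three sentences with 3, 5, and 2 word annotations each
sentences = [rng.normal(size=(T, d)) for T in (3, 5, 2)]

W_w, b_w, u_w = rng.normal(size=(d_a, d)), np.zeros(d_a), rng.normal(size=d_a)
W_s, b_s, u_s = rng.normal(size=(d_a, d)), np.zeros(d_a), rng.normal(size=d_a)

# Word level: one vector per sentence
sentence_vecs = np.stack([attention_pool(H, W_w, b_w, u_w)[0] for H in sentences])
# Sentence level: one vector for the whole document
doc_vec, sentence_weights = attention_pool(sentence_vecs, W_s, b_s, u_s)
```

In the full model, `sentence_vecs` would pass through the sentence-level encoder before the second pooling step, and `doc_vec` would feed the classifier.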

Why hierarchy helps. Not all words matter equally, and not all sentences matter equally. A flat attention mechanism attending over all words in a 500-word document is less interpretable and less efficient than a two-level mechanism that first identifies relevant sentences and then attends within them. The hierarchical structure also matches how humans read: skim paragraphs, then read the important ones carefully.

Attention over images. The same idea applies to vision. An image can be divided into a grid of patches. Attention can be computed over patches to identify which spatial regions are relevant for a given query. This is the mechanism used in Vision Transformers (ViT): the image is split into fixed-size patches, each patch is linearly embedded, and a standard transformer with self-attention processes the sequence of patch embeddings. Cross-attention over image patches is used in captioning models and visual question answering: the text decoder attends over spatial positions in the image at each generation step, producing spatially grounded captions. The attention weights in this setting are directly interpretable as a heat map over the image - regions that are linguistically relevant receive high weight.
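The patch-splitting step can be sketched with a reshape. The sizes here (a 224 x 224 RGB image, 16 x 16 patches, embedding width 64) and the name `W_E` are assumptions chosen for illustration; the resulting sequence is what a standard self-attention layer would then consume.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C), in row-major patch order.
    """
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image size must be divisible by patch size"
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)             # (H/p, W/p, p, p, C): group pixels by patch
    return x.reshape(-1, patch * patch * C)    # one row per patch

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patches = patchify(image, 16)                  # (196, 768): 14 x 14 patches of 16*16*3 values

W_E = rng.normal(size=(patches.shape[1], 64))  # hypothetical linear patch embedding
tokens = patches @ W_E                         # (196, 64) sequence of patch embeddings
```

From the attention layer's point of view, `tokens` is just a sequence of length 196, so the quadratic cost applies to the number of patches.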

Summary

Concept	Formula
Alignment score	$e_{ij} = a(s_{j-1}, h_i)$
Attention weight	$\alpha_{ij} = \exp(e_{ij}) / \sum_k \exp(e_{kj})$
Context vector	$c_j = \sum_i \alpha_{ij} h_i$
Scaled dot-product attention	$\text{softmax}(QK^\top / \sqrt{d}) V$
Self-attention	$Q, K, V$ all come from the same sequence
Cross-attention	$Q$ from target; $K, V$ from source
Multi-head attention	$h$ heads in parallel; concatenate and project
Hierarchical attention	Word-level then sentence-level; context vectors learned end-to-end
Attention over images	Self-attention over patch embeddings (ViT); cross-attention for grounded captioning
Complexity	$O(n^2 d)$ in time and $O(n^2)$ in memory

The core idea is elegant: instead of compressing context into a fixed-size vector, attend to all positions and let the model learn which ones matter. The alignment score measures compatibility, softmax normalizes it to a distribution, and the context vector is the resulting weighted average. Everything is differentiable. Everything is learned.

What attention does not provide, on its own, is any notion of order. Feed self-attention the tokens of a sentence in any permutation and it produces the same output vectors, permuted in the same way: nothing in the computation depends on where a token sits. The next piece - positional encodings - is what restores order to this otherwise orderless mechanism.
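This order-blindness can be checked numerically. The sketch below uses a single self-attention head with random, untrained projections (all sizes are arbitrary) and compares the output on a shuffled sequence with the shuffled output on the original sequence.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention without any positional information."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n)                      # an arbitrary reordering of the tokens
out = self_attention(X, W_Q, W_K, W_V)
out_shuffled = self_attention(X[perm], W_Q, W_K, W_V)

# Shuffling the input rows shuffles the output rows identically and changes nothing else
same_up_to_order = np.allclose(out_shuffled, out[perm])
# Any order-insensitive summary (here, the mean over tokens) is therefore unchanged
same_pooled = np.allclose(out_shuffled.mean(axis=0), out.mean(axis=0))
```

Both checks come out true for any permutation, which is exactly why position has to be injected separately.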

Read next:

Positional Encodings - Teaching Attention Where Things Are in a Sequence