The Bottleneck Problem in Seq2Seq Models

Classical encoder-decoder models for sequence transduction compress an entire input sequence into a single fixed-length vector - the final hidden state of the encoder. The decoder then conditions every output step on this single representation. This creates a fundamental information bottleneck: as input sequences grow longer, the encoder must pack ever more information into the same fixed-capacity vector. Empirically, translation quality for RNN-based seq2seq models degrades sharply beyond roughly 20–30 source tokens.

Formally, given an input sequence $x_1, \ldots, x_T$, the encoder computes hidden states $h_1, \ldots, h_T$ and passes only $h_T$ to the decoder. The decoder generates output $y_t$ conditioned on $h_T$ and its own previous outputs. The decoder has no direct access to intermediate encoder states, so information from early positions is effectively forced through the last hidden state - a lossy compression of the full input.

Bahdanau Additive Attention

Bahdanau et al. (2015) proposed a direct solution: at each decoding step $i$, allow the decoder to selectively attend to every encoder hidden state. Define an alignment energy:

$$e_{ij} = a(s_{i-1}, h_j)$$

where $s_{i-1}$ is the decoder hidden state from the previous step and $h_j$ is the $j$-th encoder hidden state. The alignment function $a$ is a small feed-forward network learned jointly with the rest of the model:

$$a(s, h) = v_a^\top \tanh(W_a s + U_a h)$$

Normalising over all encoder positions yields the attention weights:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}$$

The context vector for decoding step $i$ is the attention-weighted sum of encoder states:

$$c_i = \sum_{j=1}^{T} \alpha_{ij}\, h_j$$

The decoder hidden state update becomes $s_i = f(s_{i-1}, y_{i-1}, c_i)$. Because the weights $\alpha_{ij}$ are a differentiable function of the model parameters, the alignment function is learned end-to-end via ordinary backpropagation. Each decoding step receives a context vector tailored to what it needs, rather than the same compressed representation - eliminating the bottleneck.
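
The mechanism is compact enough to write out directly. Below is a minimal NumPy sketch of one decoding step; the function name, toy dimensions, and randomly initialised weights are ours for illustration, not from the paper:

```python
import numpy as np

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """One Bahdanau attention step: returns context c_i and weights alpha_i.

    s_prev : (d_dec,)     previous decoder state s_{i-1}
    H      : (T, d_enc)   encoder hidden states h_1..h_T
    W_a    : (d_a, d_dec); U_a : (d_a, d_enc); v_a : (d_a,)
    """
    # Alignment energies e_{ij} = v_a^T tanh(W_a s_{i-1} + U_a h_j), all j at once
    e = np.tanh(W_a @ s_prev + H @ U_a.T) @ v_a        # (T,)
    # Softmax over encoder positions gives the attention weights alpha_{ij}
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # Context vector c_i = sum_j alpha_{ij} h_j
    return alpha @ H, alpha                            # (d_enc,), (T,)

rng = np.random.default_rng(0)
T, d_enc, d_dec, d_a = 5, 8, 8, 16                     # toy sizes
c, alpha = additive_attention(
    rng.normal(size=d_dec), rng.normal(size=(T, d_enc)),
    rng.normal(size=(d_a, d_dec)), rng.normal(size=(d_a, d_enc)),
    rng.normal(size=d_a))
print(alpha.round(3), alpha.sum())                     # weights sum to 1
```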

Scaled Dot-Product Attention

The Transformer (Vaswani et al., 2017) replaced additive attention with a more parallelisable dot-product form. Pack queries, keys, and values into matrices $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, $V \in \mathbb{R}^{m \times d_v}$. The attention function is then:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

The product $QK^\top \in \mathbb{R}^{n \times m}$ computes all pairwise query-key compatibilities in a single batched matrix multiply, making the entire operation highly parallelisable on GPU/TPU hardware.
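
In code, the whole operation is a few lines of matrix arithmetic. A self-contained NumPy sketch (the function name is ours):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n, d_k), K: (m, d_k), V: (m, d_v) -> output (n, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, m) pairwise compatibilities
    scores -= scores.max(axis=-1, keepdims=True)    # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                              # (n, d_v)
```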

Why the $\sqrt{d_k}$ scaling? For randomly initialised $q, k \in \mathbb{R}^{d_k}$ with independent zero-mean unit-variance components, the dot product $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ has mean $0$ and variance $d_k$. Without scaling, dot products grow in magnitude as $d_k$ increases, pushing the softmax into regions where its gradient is near zero - impeding learning. Dividing by $\sqrt{d_k}$ restores unit variance to the pre-softmax logits, keeping gradients healthy regardless of the key dimension.
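
This claim is easy to check empirically. The snippet below (a quick sanity check, not from the paper) estimates the variance of raw and scaled dot products as $d_k$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 256, 1024):
    q = rng.normal(size=(10_000, d_k))              # zero-mean, unit-variance components
    k = rng.normal(size=(10_000, d_k))
    dots = (q * k).sum(axis=1)                      # 10,000 sample dot products
    print(d_k, round(dots.var()), round((dots / np.sqrt(d_k)).var(), 2))
# Raw variance grows like d_k (~16, ~256, ~1024); scaled variance stays ~1.
```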

Multi-Head Attention

A single attention function is limited to attending along one representational subspace at a time. Multi-head attention addresses this by running $h$ independent attention heads in parallel, each projecting into a lower-dimensional space:

$$\text{head}_i = \text{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)$$

where $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$, with $d_k = d_v = d_{\text{model}} / h$.

The head outputs are concatenated and projected back to the model dimension:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$$

with $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$. Because $d_k = d_{\text{model}} / h$, the total number of floating-point operations is the same as a single full-dimensional attention, but the model can simultaneously attend to multiple distinct aspects of the input - for instance, one head may learn syntactic dependencies while another tracks coreference chains.
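
Continuing the NumPy sketch and reusing scaled_dot_product_attention from above (shapes and names are ours), a minimal multi-head layer looks like this:

```python
import numpy as np

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """Project into h subspaces, attend in each, concatenate, project back.

    Q: (n, d_model); K, V: (m, d_model);
    W_q, W_k, W_v: lists of h matrices (d_model, d_k); W_o: (h*d_k, d_model).
    """
    heads = [scaled_dot_product_attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o     # (n, d_model)

rng = np.random.default_rng(0)
n, m, d_model, h = 4, 6, 32, 4
d_k = d_model // h                                  # 8 dimensions per head
W_q, W_k, W_v = ([rng.normal(size=(d_model, d_k)) for _ in range(h)]
                 for _ in range(3))
W_o = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(rng.normal(size=(n, d_model)),
                           rng.normal(size=(m, d_model)),
                           rng.normal(size=(m, d_model)),
                           W_q, W_k, W_v, W_o)
print(out.shape)                                    # (4, 32): back to d_model
```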

Computational Complexity

The dominant cost of self-attention is computing $QK^\top$: for a sequence of length $n$ with model dimension $d$, this is $O(n^2 d)$. The quadratic dependence on sequence length is the primary obstacle to scaling vanilla Transformers to very long contexts. By contrast, an RNN has $O(d^2)$ cost per step and $O(nd^2)$ total, but the sequential dependency prevents parallelisation. Self-attention thus buys parallel computation at the price of a quadratic memory and compute footprint in sequence length, which motivates the large body of work on efficient attention variants (sparse, linear, block-sparse).
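
To make the quadratic footprint concrete, consider the memory for the $n \times n$ attention-weight matrix alone (one head, one layer, float32); the figures below are straightforward arithmetic:

```python
# n x n float32 attention matrix, per head, per layer
for n in (1_000, 10_000, 100_000):
    print(f"n = {n:>7,}: {n * n * 4 / 2**30:10.3f} GiB")
# n =   1,000:      0.004 GiB
# n =  10,000:      0.373 GiB
# n = 100,000:     37.253 GiB
```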

Self-Attention vs. Cross-Attention

Self-attention derives queries, keys, and values from the same sequence - before the learned projections, $Q = K = V$. Every token attends to every other token in the same sequence, building contextual representations that mix information across positions freely. This is the mechanism in Transformer encoders and in the masked self-attention layers of decoders.

Cross-attention derives queries from one sequence and keys and values from another. In a Transformer decoder, the queries come from the decoder’s own hidden states and the keys and values come from the encoder’s output. This replaces Bahdanau’s soft alignment, giving each decoding step selective access to the full encoder representation.
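
The distinction is purely in where the inputs come from. Reusing the scaled_dot_product_attention sketch from above (variable names ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 4, 6, 8

# Self-attention: queries, keys, and values all from the same sequence.
x = rng.normal(size=(n, d))
self_out = scaled_dot_product_attention(x, x, x)    # (n, d)

# Cross-attention: queries from decoder states, keys/values from encoder output.
dec_states = rng.normal(size=(n, d))
enc_output = rng.normal(size=(m, d))
cross_out = scaled_dot_product_attention(dec_states, enc_output, enc_output)  # (n, d)
```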

Attention Pattern Visualisation

The attention weight matrix $A = \text{softmax}(QK^\top / \sqrt{d_k}) \in \mathbb{R}^{n \times m}$ is interpretable as a soft alignment between query positions (rows) and key positions (columns). Plotting $A$ as a heatmap reveals learned structure: in translation, roughly diagonal patterns correspond to monotone alignment; in language modelling, off-diagonal entries track long-range dependencies such as subject-verb agreement. A word of caution: high attention weight does not straightforwardly imply causal importance for the output. Gradient-based attribution methods frequently disagree with attention-weight heatmaps, and attention should be interpreted as a routing mechanism rather than an explanation of model reasoning.
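
Plotting such a heatmap takes only a few lines of matplotlib; in this sketch the matrix A is random placeholder data standing in for weights extracted from a real model:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder attention matrix: 6 query rows, 8 key columns, rows sum to 1.
A = np.random.default_rng(0).dirichlet(np.ones(8), size=6)

fig, ax = plt.subplots()
im = ax.imshow(A, cmap="viridis", aspect="auto")
ax.set_xlabel("key (source) position")
ax.set_ylabel("query (target) position")
fig.colorbar(im, label="attention weight")
plt.show()
```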

Examples

Encoder-decoder cross-attention in machine translation. When a Transformer translates “The agreement on the European Economic Area was signed in August 1992,” the decoder generates the French equivalent token by token. At the step producing “économique,” the cross-attention weights concentrate strongly on the source tokens “Economic” and “Area.” This selective weighting means the decoder draws a context vector dominated by those encoder representations, providing the correct semantic signal at that generation step - regardless of the distance between source and target positions.

Self-attention in sequence classification. In a BERT-style encoder fine-tuned for sentiment analysis, the [CLS] token accumulates information from the entire input via self-attention across all layers. In later layers, attention heads often show high weight from [CLS] toward sentiment-bearing adjectives and adverbs, routing the relevant signal into the token whose representation is passed to the classification head. This global, constant-depth information routing - bypassing the vanishing-gradient bottleneck of RNNs - is a central reason attention-based encoders outperform LSTMs on tasks with long-range dependencies.
