Helpful context:


“The animal didn’t cross the street because it was too tired.”

What does “it” refer to? The animal. Not the street. Your brain figured this out in milliseconds by holding the subject of the clause in memory while processing the rest of the sentence. This is effortless for you, but it is something a standard feedforward neural network genuinely cannot do. A feedforward network processes one input vector - fixed-size, no sequence, no memory. Each forward pass is independent. There is no mechanism for the computation at token 10 to know what happened at token 1.

Recurrent neural networks add memory. They maintain a hidden state that accumulates information from each step of the sequence. But they come with a fundamental flaw that took a decade to partially solve, and which ultimately drove the field to a completely different architecture.


The Core Idea

An RNN processes a sequence $x_1, x_2, \ldots, x_T$ one element at a time. At each step $t$, it receives the current input $x_t$ and the previous hidden state $h_{t-1}$, and produces a new hidden state:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h).$$

This is the vanilla RNN update. The weight matrices $W_{hh} \in \mathbb{R}^{d \times d}$ and $W_{xh} \in \mathbb{R}^{d \times d_x}$ are shared across all time steps - the same transformation is applied at every position. The hidden state $h_t \in \mathbb{R}^d$ is a summary of the entire history $x_1, \ldots, x_t$.

If you need an output at each step (as in sequence labeling), add an output layer:

$$\hat{y}_t = g(W_{hy} h_t + b_y),$$

where $g$ is an appropriate activation (softmax for classification, linear for regression). If you need a single output at the end (as in sentiment classification), use only $\hat{y}_T$.

The initial hidden state $h_0$ is typically the zero vector.

Signal processing perspective. In digital signal processing, a system with no feedback (output depends only on current and past inputs, not past outputs) is called a Finite Impulse Response (FIR) filter - its impulse response has finite length. A system with feedback (output depends on past outputs too) is called an Infinite Impulse Response (IIR) filter. The vanilla RNN is the neural analog of an IIR filter: the hidden state $h_{t-1}$ is a past output fed back into the current computation, giving the system theoretically infinite memory. A 1D convolutional network with a context window of size $k$ is the neural analog of an FIR filter: it uses only the last $k$ inputs and has no feedback. FIR systems are always stable; IIR systems (and RNNs) can become unstable - this is the geometric meaning of vanishing and exploding gradients.


Unrolling Through Time

The key conceptual move is to “unroll” the RNN: view it as a very deep feedforward network where each layer corresponds to one time step, and all layers share the same weights.

For a sequence of length $T$, the unrolled network has $T$ layers. Each layer takes the hidden state from the previous layer and the input at that time step. The hidden state flows from left to right, accumulating information.

This viewpoint makes training clear: you can apply standard backpropagation to the unrolled network. The algorithm is called backpropagation through time (BPTT). Gradients flow backward from the loss through each time step, all the way to the beginning of the sequence.

The gradient of the loss with respect to the initial hidden state is:

$$\frac{\partial L}{\partial h_0} = \frac{\partial L}{\partial h_T} \cdot \frac{\partial h_T}{\partial h_{T-1}} \cdot \frac{\partial h_{T-1}}{\partial h_{T-2}} \cdots \frac{\partial h_1}{\partial h_0}.$$

Each $\partial h_t / \partial h_{t-1}$ is a Jacobian matrix - the derivative of each element of $h_t$ with respect to each element of $h_{t-1}$. For the vanilla RNN update with tanh nonlinearity:

$$\frac{\partial h_t}{\partial h_{t-1}} = \text{diag}(1 - h_t^2) \cdot W_{hh}.$$

The full gradient involves a product of $T$ such matrices.


The Vanishing Gradient Problem

This is where the fundamental difficulty lives.

If you multiply any matrix by itself many times, two things can happen depending on the eigenvalues:

  • If the largest eigenvalue of $W_{hh}$ is less than 1, the product $W_{hh}^T$ shrinks exponentially toward zero as $T$ grows. Gradients vanish.
  • If the largest eigenvalue is greater than 1, the product $W_{hh}^T$ grows exponentially. Gradients explode.

For vanilla RNNs, the gradient involves a product of $T$ terms each involving $W_{hh}$. With $T = 100$ steps and a largest eigenvalue of $0.9$:

$$0.9^{100} \approx 2.7 \times 10^{-5}.$$

The gradient signal from 100 steps ago has essentially vanished. The network receives almost no information about how to update its weights to better process long-range dependencies. In practice: vanilla RNNs cannot learn to use information from more than roughly 10-20 steps back.

This is not a pathological edge case - it is the typical behavior. The vanishing gradient problem was identified and formalized by Hochreiter (1991) and Bengio et al. (1994). It explains why, throughout the 1990s and early 2000s, RNNs failed to learn many tasks that required long-range memory.

Gradient clipping partially addresses exploding gradients: if the gradient norm exceeds a threshold $c$, rescale the gradient:

$$g \leftarrow g \cdot \frac{c}{|g|} \quad \text{if } |g| > c.$$

This prevents catastrophic updates from gradient explosions. But it does not solve vanishing gradients - those are simply absent, and there is nothing to clip.

Discomfort check. RNNs were the state of the art for NLP until approximately 2018. Many tutorials, textbooks, and courses still teach them as the primary model for sequence processing. But in practice, if you are doing NLP in the mid-2020s, you are using a transformer. RNNs are important to understand not because you will use them, but because they show exactly what problem attention was designed to solve. Every limitation of RNNs listed in this post is directly addressed by the transformer architecture. Understanding RNNs makes transformers legible.


Long Short-Term Memory (LSTM)

Hochreiter and Schmidhuber (1997) introduced the Long Short-Term Memory network to solve the vanishing gradient problem. The insight: the vanilla RNN’s hidden state is passed through a nonlinearity at every step, which is why gradients vanish. Replace this with a cell state that flows through time with only additive (not multiplicative) interactions, and gradients can flow without vanishing.

The LSTM has two memory structures: a cell state $C_t$ and a hidden state $h_t$. The cell state is the “long-term memory”; the hidden state is the “working memory” that is exposed to the output.

Four learned gates control the flow of information:

Forget gate - how much of the previous cell state to discard:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f).$$

$f_t \in (0,1)^d$. When $f_t \approx 1$: keep everything. When $f_t \approx 0$: forget everything.

Input gate - what new information to write to the cell state:

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i).$$

Candidate cell state - what content to potentially write:

$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C).$$

Cell state update - combine forgetting old information with adding new:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t.$$

This is the key equation. The update to $C_t$ is additive: $f_t \odot C_{t-1}$ (old information, gated) plus $i_t \odot \tilde{C}_t$ (new information, gated). No matrix multiplication through time. No squashing nonlinearity on the cell state. Gradients can flow through $C_t$ largely unobstructed.

Output gate - what to expose from the cell state:

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o).$$

Hidden state - the actual output:

$$h_t = o_t \odot \tanh(C_t).$$

The $\tanh(C_t)$ squashes the cell state to $(-1, 1)$; the output gate selects which parts to expose.

Why does this solve vanishing gradients? The gradient of $C_t$ with respect to $C_{t-1}$ is simply $f_t$. If the forget gate is close to 1 (the network decides to remember everything), the gradient passes through unchanged. The forget gate can learn to stay near 1 for a long stretch, creating a gradient highway over hundreds of steps. The cell state acts like a conveyor belt: information is added and removed, but the belt itself has low friction.


Gated Recurrent Units (GRU)

Cho et al. (2014) introduced the Gated Recurrent Unit as a simpler alternative to the LSTM. Instead of separate cell and hidden states, the GRU has a single hidden state and two gates:

Update gate - how much to update the hidden state:

$$z_t = \sigma(W_z [h_{t-1}, x_t] + b_z).$$

Reset gate - how much of the previous hidden state to use when computing the candidate:

$$r_t = \sigma(W_r [h_{t-1}, x_t] + b_r).$$

Candidate hidden state:

$$\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t] + b).$$

Update:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t.$$

The GRU has fewer parameters than the LSTM (no separate cell state, no output gate). In practice, GRUs often perform comparably to LSTMs, especially on smaller datasets. The choice between them is often a hyperparameter decision rather than a principled one.


Bidirectional RNNs

A standard RNN reads the sequence left to right. For tasks where future context matters - named entity recognition, POS tagging, machine translation encoding - this is a limitation. The word “Washington” might refer to a person or a place; you often need to read ahead to disambiguate.

A bidirectional RNN runs two RNNs over the sequence: one left to right (producing $\overrightarrow{h}_t$) and one right to left (producing $\overleftarrow{h}_t$). At each position, concatenate the two hidden states:

$$h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t] \in \mathbb{R}^{2d}.$$

This doubles the parameter count but gives the model access to the full context at each position. Bidirectional LSTMs were the standard for sequence labeling and encoding tasks throughout the pre-transformer era. BERT (2018) can be seen as a version of this idea taken to its limit: a bidirectional encoder trained on massive data using the attention mechanism.

Note: bidirectional RNNs cannot be used for generation. Generating the next word requires processing left to right only - you cannot look at future tokens you haven’t generated yet.


Sequence-to-Sequence Models

Many important tasks map one sequence to another: machine translation (English sentence → French sentence), summarization (document → summary), question answering (question + context → answer). These tasks have variable-length inputs and outputs, which feedforward networks cannot handle.

The encoder-decoder architecture (Sutskever et al. 2014; Cho et al. 2014) uses two RNNs:

The encoder reads the input sequence and compresses it into a fixed-size context vector: the final hidden state $h_T$ (or cell state $C_T$) of the encoder LSTM.

The decoder generates the output sequence autoregressively: at each step, it takes the previous token and the current hidden state, produces the next token, and uses its own hidden state as input to the next step. The decoder is initialized with the encoder’s final state.

This architecture enabled the first practical neural machine translation systems. Its critical limitation: the entire input sequence is compressed into a single fixed-size vector. For long sequences (paragraphs, documents), this bottleneck causes performance to degrade. The longer the input, the more information must be jammed into the same vector size.

This limitation motivated the attention mechanism: instead of compressing to a single vector, allow the decoder to attend to all encoder states at each decoding step. Attention solves the bottleneck problem and is the key innovation that eventually led to transformers - which eliminated the recurrence entirely.


Limitations That Led to Transformers

By 2017, LSTMs and GRUs had brought sequence modeling a long way. But two fundamental problems remained:

Sequential computation. An RNN of length $T$ requires $T$ sequential computations. You cannot compute $h_5$ until you have $h_4$, which requires $h_3$, and so on. This means you cannot parallelize over the sequence during training. For a sequence of 10,000 tokens, you need 10,000 sequential steps. GPUs are designed for parallel computation; this sequential dependency severely limits how efficiently they can be used.

Long-range dependencies. Even with LSTMs, dependencies spanning hundreds or thousands of tokens are difficult to learn. The cell state helps, but it is still a fixed-size vector that accumulates everything - it is not designed to selectively retrieve specific pieces of information from far back.

The transformer, introduced in 2017, addresses both problems. Self-attention allows any position to directly attend to any other position with a constant number of operations, independent of the sequence length. And attention is fully parallelizable - all positions in the sequence can attend to all others simultaneously.

Understanding why transformers replaced RNNs requires understanding these two limitations precisely. Both lead directly to design choices in the attention mechanism.


Summary

Model Long-Range Memory Parallelizable Parameters vs. LSTM Main Use
Vanilla RNN Poor (vanishing gradients) No Fewer Simple sequences, legacy
LSTM Good (cell state highway) No Baseline Long sequences (pre-2018)
GRU Good No Fewer Often equivalent to LSTM
Bidirectional LSTM Good (full context) Partially $2\times$ Encoding, NER, tagging
Encoder-Decoder Limited (fixed bottleneck) Partially $2\times$ Translation (pre-attention)

RNNs were the solution to a real problem: how do you process sequences with a fixed-architecture neural network? The solution - maintain a hidden state, update it recurrently, train with backpropagation through time - was elegant and worked well enough to enable major advances in NLP from 2013 to 2018. The LSTM’s gating mechanism was a genuine intellectual achievement that solved vanishing gradients and enabled learning over hundreds of time steps.

But the sequential computation bottleneck and the long-range dependency problem were not fully solved. The transformer did not improve on RNNs - it replaced them. Learning RNNs is learning the problem. Learning transformers is learning the solution.


Read next: