Recurrent Neural Networks
Motivation: Sequential Data
Feedforward networks assume fixed-size inputs and outputs with no temporal ordering. Language, audio, time series, and video are fundamentally sequential: the meaning of a word depends on those before it; the next price in a financial series depends on recent history. Recurrent Neural Networks (RNNs) address this by maintaining a hidden state $h_t$ that summarizes the history of inputs seen so far, allowing variable-length inputs and outputs.
The Vanilla RNN
At each time step $t$, the RNN takes input $x_t$ and previous hidden state $h_{t-1}$, producing a new hidden state and output:
$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$
$$y_t = W_{hy} h_t + b_y$$
where $W_{hh} \in \mathbb{R}^{d \times d}$, $W_{xh} \in \mathbb{R}^{d \times m}$, and $W_{hy} \in \mathbb{R}^{k \times d}$. The key property is parameter sharing across time: the same weight matrices are applied at every step, making the model applicable to sequences of arbitrary length.
The initial hidden state $h_0$ is typically set to zero or learned as a parameter. The output $y_t$ may be produced at every step (many-to-many), only at the final step (many-to-one, for classification), or at multiple steps given a single input (one-to-many, for generation).
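A minimal NumPy sketch of this recurrence; the dimensions and random weight initialization here are illustrative, not prescribed:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 8, 5, 3                       # hidden, input, output sizes (illustrative)
W_hh = rng.normal(0, 0.1, (d, d))
W_xh = rng.normal(0, 0.1, (d, m))
W_hy = rng.normal(0, 0.1, (k, d))
b_h, b_y = np.zeros(d), np.zeros(k)

def rnn_step(h_prev, x_t):
    """One step: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h), y_t = W_hy h_t + b_y."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    return h_t, W_hy @ h_t + b_y

# Unroll over a length-T sequence; the same weights are reused at every step.
T = 6
h = np.zeros(d)                         # h_0 = 0
for x_t in rng.normal(size=(T, m)):
    h, y = rnn_step(h, x_t)
```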
Backpropagation Through Time
Training uses BPTT: unroll the RNN for $T$ steps, treating it as a $T$-layer feedforward network with shared weights, and apply standard backpropagation. The gradient of the loss $L = \sum_t L_t$ with respect to $h_s$ requires chaining through time:
$$\frac{\partial L_t}{\partial h_s} = \frac{\partial L_t}{\partial h_t} \prod_{k=s}^{t-1} \frac{\partial h_{k+1}}{\partial h_k}$$
Each factor $\frac{\partial h_{k+1}}{\partial h_k} = \text{diag}(\tanh'(\cdot)) \cdot W_{hh}$ involves the recurrent weight matrix $W_{hh}$.
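A short NumPy sketch of this chain, using $\tanh'(a) = 1 - \tanh(a)^2$ so each factor is $\text{diag}(1 - h_{k+1}^2)\,W_{hh}$; the weights are illustrative, and the printed spectral norm of the running product previews the exponential behavior discussed next:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, T = 8, 5, 20
W_hh = rng.normal(0, 0.1, (d, d))        # illustrative recurrent weights
W_xh = rng.normal(0, 0.1, (d, m))

h = np.zeros(d)
J = np.eye(d)                            # running product: dh_t / dh_0
for x_t in rng.normal(size=(T, m)):
    h = np.tanh(W_hh @ h + W_xh @ x_t)
    # One factor of the chain: diag(tanh'(a)) @ W_hh = diag(1 - h^2) @ W_hh.
    J = (np.diag(1.0 - h**2) @ W_hh) @ J
    print(np.linalg.norm(J, 2))          # largest singular value of the product
```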
Vanishing and Exploding Gradients
For long sequences, the gradient involves a product of roughly $T$ Jacobians, each containing a factor of $W_{hh}$, so its norm grows or shrinks exponentially in $T$. Since $|\tanh'(\cdot)| \le 1$, the nonlinearity can only attenuate this product further; the behavior is governed by the singular values of $W_{hh}$:
- If the largest singular value $\sigma_1 < 1$: gradients $\to 0$ exponentially fast. Gradient contributions from distant time steps are nearly zero, so the model cannot learn long-range dependencies.
- If $\sigma_1 > 1$: gradients can grow exponentially, causing exploding gradients and numerical instability.
This singular-value analysis explains why vanilla RNNs struggle with dependencies spanning more than roughly 20 steps. The vanishing gradient problem is the primary motivation for LSTM and GRU.
Gradient clipping addresses exploding gradients: if $\|\nabla J\|_2 > \tau$, rescale the gradient to $\tau \cdot \nabla J / \|\nabla J\|_2$. This simple heuristic stabilizes training without solving the underlying vanishing gradient issue.
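A sketch of the clipping rule (the threshold $\tau = 5$ is a common but illustrative choice); in PyTorch the equivalent utility is `torch.nn.utils.clip_grad_norm_`:

```python
import numpy as np

def clip_gradient(grad, tau=5.0):
    """If ||grad||_2 > tau, rescale grad to have norm exactly tau."""
    norm = np.linalg.norm(grad)
    return grad if norm <= tau else tau * grad / norm
```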
Long Short-Term Memory (LSTM)
The LSTM (Hochreiter and Schmidhuber, 1997) introduces a separate cell state $c_t$ and gating mechanisms to control information flow. All gates use sigmoid activations, producing values in $(0,1)$ that act as soft switches.
$$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f) \qquad \text{(forget gate)}$$
$$i_t = \sigma(W_i[h_{t-1}, x_t] + b_i) \qquad \text{(input gate)}$$
$$\tilde{c}_t = \tanh(W_c[h_{t-1}, x_t] + b_c) \qquad \text{(candidate cell)}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \qquad \text{(cell update)}$$
$$o_t = \sigma(W_o[h_{t-1}, x_t] + b_o) \qquad \text{(output gate)}$$
$$h_t = o_t \odot \tanh(c_t) \qquad \text{(hidden state)}$$
The forget gate $f_t$ decides what fraction of the previous cell state to retain. When $f_t \approx 1$, information flows unchanged through time - this is the LSTM’s solution to vanishing gradients: the cell state $c_t$ has an approximately linear gradient path, analogous to the residual connections in ResNets. The input gate $i_t$ scales how much new information $\tilde{c}_t$ is written to the cell. The output gate $o_t$ controls how much of the cell state is exposed through the hidden state, which is passed to the next time step and, in stacked RNNs, to the layer above.
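One LSTM step in NumPy, following the equations above; $[h_{t-1}, x_t]$ is concatenation, so each gate's weight matrix has shape $(d, d+m)$ (sizes and initialization are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 5                                    # hidden and input sizes (illustrative)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
W_f, W_i, W_c, W_o = (rng.normal(0, 0.1, (d, d + m)) for _ in range(4))
b_f, b_i, b_c, b_o = (np.zeros(d) for _ in range(4))
# A common (optional) trick: initialize b_f to 1 so the cell starts out remembering.

def lstm_step(h_prev, c_prev, x_t):
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f = sigmoid(W_f @ z + b_f)                 # forget gate
    i = sigmoid(W_i @ z + b_i)                 # input gate
    c_tilde = np.tanh(W_c @ z + b_c)           # candidate cell
    c = f * c_prev + i * c_tilde               # cell update (elementwise products)
    o = sigmoid(W_o @ z + b_o)                 # output gate
    h = o * np.tanh(c)                         # hidden state
    return h, c
```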
Gated Recurrent Unit (GRU)
The GRU (Cho et al., 2014) simplifies the LSTM by merging the cell state and hidden state, using two gates instead of three:
$$z_t = \sigma(W_z[h_{t-1}, x_t]) \qquad \text{(update gate)}$$
$$r_t = \sigma(W_r[h_{t-1}, x_t]) \qquad \text{(reset gate)}$$
$$\tilde{h}_t = \tanh(W[r_t \odot h_{t-1}, x_t]) \qquad \text{(candidate)}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
The update gate $z_t$ interpolates between the old hidden state and the new candidate - when $z_t \approx 0$, the hidden state is unchanged (similar to the forget+input gate combination in LSTM). The reset gate $r_t$ controls how much previous state influences the candidate. GRUs have fewer parameters than LSTMs and train faster, with comparable performance on many tasks.
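The corresponding GRU step in NumPy (same illustrative conventions as the LSTM sketch; biases are omitted to match the equations above):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 5
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
W_z, W_r, W = (rng.normal(0, 0.1, (d, d + m)) for _ in range(3))

def gru_step(h_prev, x_t):
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx)                                     # update gate
    r_t = sigmoid(W_r @ hx)                                     # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))  # candidate
    return (1.0 - z_t) * h_prev + z_t * h_tilde                 # gated interpolation
```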
Bidirectional RNN
A unidirectional RNN can only use past context to predict at time $t$. A bidirectional RNN runs two independent RNNs - one forward, one backward - and concatenates their hidden states:
$$\overrightarrow{h}_t = \text{RNN}_{\text{fwd}}(x_t, \overrightarrow{h}_{t-1}), \qquad \overleftarrow{h}_t = \text{RNN}_{\text{bwd}}(x_t, \overleftarrow{h}_{t+1})$$
$$h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$$
This is effective for tasks where the full sequence is available at inference time (e.g., text classification, named entity recognition, machine translation encoding). It cannot be used for autoregressive generation, where future tokens are not available.
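A sketch of the bidirectional wiring with two independent vanilla-RNN passes (weights and sizes illustrative); note the backward outputs must be re-aligned before concatenation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, T = 8, 5, 6
W_hh_f, W_hh_b = rng.normal(0, 0.1, (2, d, d))    # separate weights per direction
W_xh_f, W_xh_b = rng.normal(0, 0.1, (2, d, m))
xs = rng.normal(size=(T, m))

def run(W_hh, W_xh, seq):
    h, states = np.zeros(d), []
    for x_t in seq:
        h = np.tanh(W_hh @ h + W_xh @ x_t)
        states.append(h)
    return states

fwd = run(W_hh_f, W_xh_f, xs)                     # left-to-right
bwd = run(W_hh_b, W_xh_b, xs[::-1])[::-1]         # right-to-left, re-aligned to t
h_bi = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]  # each h_t has size 2d
```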
Encoder-Decoder for Sequence-to-Sequence
For tasks that transform one sequence into another (translation, summarization, speech recognition), the encoder-decoder (seq2seq) architecture uses one RNN to encode the source sequence into a fixed-size context vector $c$, and a second RNN to decode it:
$$c = h_T^{\text{enc}} \qquad \text{(final encoder hidden state)}$$
$$h_t^{\text{dec}} = \text{RNN}_{\text{dec}}(y_{t-1}, h_{t-1}^{\text{dec}}, c)$$
The bottleneck of compressing the entire source into a single vector $c$ limits performance on long sequences - this limitation motivated the development of attention mechanisms.
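A minimal seq2seq sketch with vanilla-RNN cells and a greedy decoding loop; the one-hot token embedding, the way $c$ is injected at every decoder step, and all sizes are illustrative simplifications (real systems use learned embeddings and gated cells):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 8, 5, 10                         # hidden size, source dim, target vocab
T_src, T_max = 6, 4                        # source length, max decode length
W_enc_h, W_dec_h = rng.normal(0, 0.1, (2, d, d))
W_enc_x = rng.normal(0, 0.1, (d, m))
W_dec_y = rng.normal(0, 0.1, (d, k))       # maps previous (one-hot) output token
W_dec_c = rng.normal(0, 0.1, (d, d))       # injects the context vector each step
W_out = rng.normal(0, 0.1, (k, d))

# Encoder: compress the whole source sequence into c.
h = np.zeros(d)
for x_t in rng.normal(size=(T_src, m)):
    h = np.tanh(W_enc_h @ h + W_enc_x @ x_t)
c = h                                      # c = h_T^enc, the fixed-size bottleneck

# Decoder: each step conditions on y_{t-1}, the previous decoder state, and c.
h_dec, y_prev = np.zeros(d), np.zeros(k)   # y_0: start-of-sequence placeholder
for _ in range(T_max):
    h_dec = np.tanh(W_dec_h @ h_dec + W_dec_y @ y_prev + W_dec_c @ c)
    y_prev = np.eye(k)[np.argmax(W_out @ h_dec)]   # greedy: one-hot of best token
```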
Examples
Language modeling perplexity. A character-level RNN trained on Shakespeare produces coherent prose after 100K training steps. Perplexity on a held-out character sequence measures how surprised the model is: $\text{PP} = \exp\left(-\frac{1}{T}\sum_t \log P(x_t | x_{<t})\right)$. A uniform model over a 65-character vocabulary has perplexity 65. A well-trained character RNN achieves perplexity around 4-8, meaning that on average it is effectively choosing among 4-8 equally likely characters at each step.
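Perplexity is a one-liner given per-step log-probabilities; the uniform-model check below confirms the perplexity-65 claim:

```python
import numpy as np

def perplexity(log_probs):
    """PP = exp(-(1/T) sum_t log P(x_t | x_<t)), given per-step log-probs."""
    return float(np.exp(-np.mean(log_probs)))

# A uniform model over 65 characters assigns log(1/65) at every step:
print(perplexity(np.full(100, np.log(1 / 65))))   # 65.0 (up to float precision)
```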
Character-level RNN and emergent structure. Karpathy’s visualization of LSTM cell activations during character-level language model inference reveals individual neurons tracking high-level structure: a neuron that fires strongly when the model is inside a quoted string, another that tracks line position, another that detects whether code is inside an if block. These features emerge without supervision - they are learned because they are predictive of the next character. This interpretability result illustrates that RNNs can learn genuinely structured representations despite operating on raw character sequences.