Architectures · Module 11 · 10 min read

Recurrent Neural Networks

Before the transformer, this was how a model read a sentence: one word at a time, updating a running memory after each one. RNNs are mostly retired now, but understanding what they did — and where they broke — explains why attention took over.

The five-bullet version

  • An RNN reads tokens sequentially, maintaining a single hidden vector that summarizes everything seen so far.
  • At each step, the new token and old hidden state combine into a new hidden state via shared weights — “recurrence.”
  • Long sequences cause vanishing gradients: signal from early tokens decays exponentially through repeated multiplication.
  • LSTM and GRU added gates to preserve information across longer spans — useful but never solved the issue completely.
  • Attention sidestepped the whole problem by letting any position look at any other position directly. Hence the transformer’s dominance.

§ 00 · READING ONE WORD AT A TIME · Why sequence networks existed

Before 2017, the way you handled text in deep learning was almost always a recurrent neural network — an RNN: a network that processes a sequence one element at a time, maintaining a hidden state vector that's updated after each element. The 'recurrence' is the loop that feeds the previous hidden state back as input alongside the new token. The architecture had been around since the 1980s. The model reads the sequence token by token, left to right, updating a single hidden vector after each one. That hidden vector is supposed to summarize everything seen so far.

The mental model is a person reading a sentence with their finger. They get to each word, take a quick mental note, and move on. By the end, they have a single mental picture of the sentence — formed by layering the words one at a time.

For its era, this was a real improvement over the alternative (n-grams, bag-of-words). An RNN could in principle remember anything, because the hidden state could encode anything. In practice, “in principle” and “in practice” turned out to be very different.

§ 01 · HIDDEN STATE — THE RUNNING MEMORY · The one-line update rule

The whole RNN fits in one update rule. At step t, you have the current token's embedding xₜ and the previous hidden state hₜ₋₁.

Produce the new hidden state:

hₜ = tanh(W · xₜ + U · hₜ₋₁ + b)

Two weight matrices, W and U, plus a bias b. The same matrices are used at every step — this is the parameter sharing that gives the architecture its name and its main efficiency. You don't need different weights for token 1 vs token 1000; one set of weights handles every position.

The tanh squashes the output to (-1, 1) so values stay bounded as the chain extends. (This is also where most of the trouble starts — we’ll come back to it.)
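The update rule is small enough to write out directly. A minimal NumPy sketch — toy sizes and random weights are stand-ins of mine, not trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 8  # toy sizes: 4-dim token embeddings, 8-dim hidden state

# The three learned parameters of a vanilla RNN cell (random stand-ins here).
W = rng.normal(scale=0.5, size=(d_h, d_in))  # input-to-hidden
U = rng.normal(scale=0.5, size=(d_h, d_h))   # hidden-to-hidden (the recurrence)
b = np.zeros(d_h)                            # bias

def rnn_step(x_t, h_prev):
    """One application of the update rule: h_t = tanh(W·x_t + U·h_prev + b)."""
    return np.tanh(W @ x_t + U @ h_prev + b)

h = np.zeros(d_h)          # h_0: empty memory before reading anything
x = rng.normal(size=d_in)  # stand-in for one token's embedding
h = rnn_step(x, h)
# tanh keeps every component strictly inside (-1, 1)
assert h.shape == (d_h,) and np.all(np.abs(h) < 1)
```

Note that `rnn_step` has no per-position parameters at all — the same W, U, b are reapplied at every step.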

§ 02 · UNROLLING THE RECURRENCE · Seeing the chain

It helps to draw the RNN unrolled — same weights, drawn out across time. Each box is the same RNN cell, applied at successive positions, with the hidden state flowing left to right.
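Unrolling in code is literally a loop reusing the same weights. This sketch (random stand-in embeddings and toy dimensions assumed, as above) collects the hidden state at each position:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, seq_len = 4, 8, 8  # e.g. an 8-token sentence
W = rng.normal(scale=0.5, size=(d_h, d_in))
U = rng.normal(scale=0.5, size=(d_h, d_h))
b = np.zeros(d_h)

tokens = rng.normal(size=(seq_len, d_in))  # stand-ins for embedded words

# "Unrolling" is nothing more than this loop: the SAME W, U, b at every
# position, with the hidden state threaded from one step to the next.
h = np.zeros(d_h)
states = []
for x_t in tokens:
    h = np.tanh(W @ x_t + U @ h + b)
    states.append(h)

assert len(states) == seq_len  # one hidden-state snapshot per token
```

Each entry of `states` is one column of the kind of heatmap the lab below shows.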

Lab · unrolled RNN · Watch the hidden state evolve token by token
[Interactive widget: the sentence "The cat that we saw yesterday was happy" is fed in one token at a time (x_1 through x_8); a slider steps through positions 1–8 while a heatmap shows the 8-dimensional hidden state after each step. E.g. h_1 after reading "The": 0.08, -0.02, -0.12, -0.22, -0.31, -0.40, -0.48, 0.45.]

Each cell is one dimension of the 8-dimensional hidden state. As the RNN reads more tokens, the contribution of earlier tokens decays through repeated multiplication by values smaller than 1 — the cell that reflected “The” fades by step 8. That fade is the forgetting problem.

The slider walks one token at a time through “The cat that we saw yesterday was happy.” A grammatically tricky sentence: the subject (cat) is separated from its verb (was) by four intervening words. For the model to know the verb agrees with cat, that information has to survive the chain.

Watch the hidden-state heatmap as you scrub. Notice that the cells which lit up for “The” and “cat” fade by the time you reach “was.” The early signal is still there, in principle — but it’s mostly been overwritten by the four words in between.

§ 03 · WHY RNNS FORGET · Vanishing gradients, in one paragraph

The forgetting isn’t a bug. It’s baked into the math.

At each step, the hidden state is multiplied by U and passed through tanh. After k further steps, the contribution from token t has been multiplied by U k times and squashed by tanh each time.

If U has eigenvalues with magnitude less than 1, the contribution shrinks exponentially with k. By step 30 you’ve multiplied by something like 0.9³⁰ ≈ 0.04 — only 4% of the original signal remains. By step 100, effectively zero.

If U has eigenvalues with magnitude greater than 1, the contribution explodes instead. Even worse — the gradient overflows and training diverges.

This is the vanishing gradient problem: when the signal (or gradient) flowing through a deep or long chain is repeatedly multiplied by values less than 1, it decays exponentially. The deeper the chain, the smaller the signal at the top — and the harder it is to learn anything that depends on distant information. In effect, a vanilla RNN cannot reliably learn dependencies more than ~20–30 tokens apart. For paragraphs, articles, or any long-range structure, vanilla RNNs are simply blind.
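You can watch the decay numerically. The sketch below builds a symmetric matrix with spectral radius 0.9 as a stand-in for U (tanh omitted, which would only shrink things further) and tracks how much of an early signal survives repeated multiplication:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# A recurrence matrix with spectral radius 0.9. Made symmetric so the norm
# of any vector shrinks by at least a factor of ~0.9 per multiplication.
A = rng.normal(size=(d, d))
U = (A + A.T) / 2
U *= 0.9 / np.max(np.abs(np.linalg.eigvalsh(U)))

signal = np.ones(d)  # stand-in for an early token's contribution
norms = [np.linalg.norm(signal)]
for _ in range(100):
    signal = U @ signal
    norms.append(np.linalg.norm(signal))

# Exponential decay: at most 0.9^k of the signal survives k steps.
print(f"start {norms[0]:.3f}, step 30: {norms[30]:.4f}, step 100: {norms[100]:.2e}")
```

By step 30 only a few percent of the starting norm can remain; by step 100 it is numerically negligible — exactly the 0.9³⁰ ≈ 0.04 arithmetic above.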

§ 04 · LSTM, GRU, AND THE MOVE TO ATTENTION · Three rounds of patches, then a replacement

The first response was architectural. The LSTM (Long Short-Term Memory), proposed in 1997, is a gated RNN variant that maintains a separate 'cell state' alongside the hidden state, with three learned gates — input, forget, and output — small networks that decide, at each step, how much new input to write in, how much of the previous state to keep, and how much to output. The gates use a sigmoid (output 0..1), and critically the cell state can pass through unchanged (multiplied by 1.0) when the forget gate decides to keep it. This preserves long-range information.
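The gate equations fit in a few lines. A minimal sketch of one LSTM step — untrained random weights, and names like `Wf`, `Uf` are my own labels, not from this lesson:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_h = 4, 8

def make(shape): return rng.normal(scale=0.5, size=shape)
Wf, Uf, bf = make((d_h, d_in)), make((d_h, d_h)), np.zeros(d_h)  # forget gate
Wi, Ui, bi = make((d_h, d_in)), make((d_h, d_h)), np.zeros(d_h)  # input gate
Wo, Uo, bo = make((d_h, d_in)), make((d_h, d_h)), np.zeros(d_h)  # output gate
Wc, Uc, bc = make((d_h, d_in)), make((d_h, d_h)), np.zeros(d_h)  # candidate

def lstm_step(x, h_prev, c_prev):
    f = sigmoid(Wf @ x + Uf @ h_prev + bf)   # how much old cell state to keep
    i = sigmoid(Wi @ x + Ui @ h_prev + bi)   # how much new input to write in
    o = sigmoid(Wo @ x + Uo @ h_prev + bo)   # how much cell state to expose
    c_tilde = np.tanh(Wc @ x + Uc @ h_prev + bc)
    c = f * c_prev + i * c_tilde             # when f ≈ 1, c_prev passes unchanged
    h = o * np.tanh(c)
    return h, c

h = c = np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c)
assert h.shape == c.shape == (d_h,)
```

The key line is `c = f * c_prev + i * c_tilde`: when the forget gate saturates at 1, the cell state is an identity path that vanishing gradients cannot erode.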

LSTMs worked. They were the workhorse of NLP from 2014 to 2017, powering Google Translate, speech recognition, summarization, sentiment analysis — everything. They could handle longer dependencies than vanilla RNNs, often into the hundreds of tokens.

The GRU (Gated Recurrent Unit, 2014) was a streamlined LSTM with two gates instead of three. Often comparable performance, fewer parameters. The two became interchangeable defaults — LSTM more common in big production systems, GRU more common in research papers.
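For comparison, one GRU step, with the same caveats — untrained stand-in weights and toy sizes of my choosing:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_h = 4, 8

def make(shape): return rng.normal(scale=0.5, size=shape)
Wz, Uz = make((d_h, d_in)), make((d_h, d_h))  # update gate
Wr, Ur = make((d_h, d_in)), make((d_h, d_h))  # reset gate
Wh, Uh = make((d_h, d_in)), make((d_h, d_h))  # candidate

def gru_step(x, h_prev):
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate: old vs new
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate: how much history
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate state
    return (1 - z) * h_prev + z * h_tilde          # interpolate; no separate cell

h = gru_step(rng.normal(size=d_in), np.zeros(d_h))
assert h.shape == (d_h,)
```

Two gates, no separate cell state — the streamlining over the LSTM is visible line by line.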

Fig 1 · Vanilla RNN (a single tanh cell), LSTM (input, forget, and output gates), GRU (reset and update gates); each takes x_t and h_(t-1) and produces h_t. Each step's mechanics is a little more elaborate than the last. The trend was: more machinery to preserve information across the chain, then ultimately abandon the chain entirely.

Then in 2017, “Attention Is All You Need” pointed out that you don’t need the chain at all. Attention lets every position look directly at every other position in one step. No hidden state to maintain, no gates to learn, no information passing through 30 intermediate cells. Direct lookup, content-addressable.
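The "direct lookup" is one matrix multiply. A bare-bones sketch of single-head attention without learned projections — just the core score, softmax, mix; sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 8, 16
X = rng.normal(size=(seq_len, d))  # stand-in token representations

# Every position scores every other position in ONE matrix multiply —
# no hidden state threaded through intermediate steps.
scores = X @ X.T / np.sqrt(d)                       # (seq_len, seq_len)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)       # softmax over positions
out = weights @ X                                   # each row mixes ALL rows

# Position 7 reads position 0 in one hop; nothing passes through steps 1..6.
assert out.shape == (seq_len, d)
assert np.allclose(weights.sum(axis=1), 1.0)
```

Contrast with the unrolled loop above: the RNN needs seq_len sequential steps, while here every pairwise interaction happens at once.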

Three things attention bought that RNNs couldn’t:

  • Parallel training: every position is processed at once, with no sequential chain to wait on.
  • Constant path length: any position reaches any other in one step, so nothing decays through intermediate cells.
  • No bottleneck: information never has to be squeezed through a single fixed-size hidden vector.

CHECK · A vanilla RNN trained on language modeling is asked to predict the next word given a 50-token sentence. It does well on sentences of length 5–15 but degrades sharply past 30 tokens. What's the most likely cause?

§ 05 · TAKING THIS FORWARD · Why this lesson is here

Two reasons RNNs are worth understanding even though you’ll probably never train one:

  • The contrast is what makes attention make sense. “Every position looks at every other position” only registers as a design win once you’ve seen the alternative: one fragile chain of hidden states, decaying step by step.
  • Recurrence is coming back. State-space models like Mamba are, at heart, modern learnable RNNs — a fixed-size state updated step by step — so the ideas here are current again.

For more on what replaced the RNN, read the Attention and Transformer lessons. They’re where the story goes from here.

§ · GOING DEEPER · Why LSTMs survived, why they didn't, and Mamba

Vanilla RNNs suffer from vanishing gradients: when you backpropagate through many time steps, the signal shrinks exponentially. Hochreiter & Schmidhuber’s 1997 LSTM introduced a gated cell with explicit memory; the gradient could flow through the cell state without decay. LSTMs dominated sequence modeling from roughly 2014–2017 and made modern speech recognition, machine translation, and image captioning possible.

Transformers replaced RNNs for most sequence tasks because attention is parallelizable across time and recurrence isn’t. But the limitations of attention — quadratic cost, no compressed state — opened a door back. State Space Models (Mamba, Gu & Dao 2023) are essentially modern, learnable RNNs that compete with transformers at long context, with linear inference cost. The dust hasn’t settled — hybrid architectures combining attention and SSM layers are the current frontier.

§ · FURTHER READING · References & deeper sources

  1. Hochreiter, Schmidhuber (1997). Long Short-Term Memory · Neural Computation
  2. Cho et al. (2014). On the Properties of Neural Machine Translation: Encoder–Decoder Approaches (GRU) · SSST
  3. Bengio, Simard, Frasconi (1994). Learning Long-Term Dependencies with Gradient Descent is Difficult · IEEE TNN
  4. Vaswani et al. (2017). Attention Is All You Need · NeurIPS
  5. Gu, Dao (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces · COLM 2024

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.