Recurrent Neural Networks
Before the transformer, this was how a model read a sentence: one word at a time, updating a running memory after each one. RNNs are mostly retired now, but understanding what they did — and where they broke — explains why attention took over.
The five-bullet version
- An RNN reads tokens sequentially, maintaining a single hidden vector that summarizes everything seen so far.
- At each step, the new token and old hidden state combine into a new hidden state via shared weights — “recurrence.”
- Long sequences cause vanishing gradients: signal from early tokens decays exponentially through repeated multiplication.
- LSTM and GRU added gates to preserve information across longer spans — useful but never solved the issue completely.
- Attention sidestepped the whole problem by letting any position look at any other position directly. Hence the transformer’s dominance.
§ 00 · READING ONE WORD AT A TIME
Why sequence networks existed
Before 2017, the way you handled text in deep learning was almost always a recurrent neural network (RNN): a network that processes a sequence one element at a time, maintaining a hidden state vector that's updated after each element. The 'recurrence' is the loop that feeds the previous hidden state back as input alongside the new token. The architecture had been around since the 1980s. The model reads the sequence token by token, left to right, updating a single hidden vector after each one. That hidden vector is supposed to summarize everything seen so far.
The mental model is a person reading a sentence with their finger. They get to each word, take a quick mental note, and move on. By the end, they have a single mental picture of the sentence — formed by layering the words one at a time.
For its era, this was a real improvement over the alternative (n-grams, bag-of-words). An RNN could in principle remember anything, because the hidden state could encode anything. In practice, “in principle” and “in practice” turned out to be very different.
§ 01 · HIDDEN STATE — THE RUNNING MEMORY
The one-line update rule
The whole RNN fits in one update rule. At step t, you have:
- xₜ — the current input token's vector.
- hₜ₋₁ — the hidden state from the previous step.
Produce the new hidden state:
hₜ = tanh(W · xₜ + U · hₜ₋₁ + b)
Two weight matrices, W and U, plus a bias b. The same matrices are used at every step — this is the parameter-sharing that gives the architecture its name and its main efficiency. You don't need different weights for token 1 vs token 1000; one set of weights handles every position.
The tanh squashes the output to (-1, 1) so values stay bounded as the chain extends. (This is also where most of the trouble starts — we’ll come back to it.)
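To make that concrete, here is the whole cell as runnable NumPy: a minimal sketch with made-up sizes (an 8-dimensional hidden state, 16-dimensional token vectors) and small random weights standing in for trained ones.

```python
import numpy as np

# Assumed sizes for this sketch: 8-dim hidden state, 16-dim token vectors.
H, D = 8, 16
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, size=(H, D))  # input-to-hidden weights
U = rng.normal(0, 0.1, size=(H, H))  # hidden-to-hidden weights (the recurrence)
b = np.zeros(H)                      # bias

def rnn_step(x_t, h_prev):
    """One application of h_t = tanh(W x_t + U h_{t-1} + b)."""
    return np.tanh(W @ x_t + U @ h_prev + b)
```

The same W, U, and b are applied at every position; nothing in the cell knows where it is in the sequence.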
§ 02 · UNROLLING THE RECURRENCE
Seeing the chain
It helps to draw the RNN unrolled — same weights, drawn out across time. Each box is the same RNN cell, applied at successive positions, with the hidden state flowing left to right.
Each cell in the heatmap is one dimension of the 8-dimensional hidden state. As the RNN reads more tokens, the contribution of earlier tokens decays through repeated multiplication by values less than 1 — the cell that reflected “The” fades by step 8. That fade is the forgetting problem.
The slider walks one token at a time through “The cat that we saw yesterday was happy.” A grammatically tricky sentence: the subject (cat) is separated from its verb (was) by four intervening words. For the model to know the verb agrees with cat, that information has to survive the chain.
Watch the hidden-state heatmap as you scrub. Notice that the cells which lit up for “The” and “cat” fade by the time you reach “was.” The early signal is still there, in principle — but it’s mostly been overwritten by the four words in between.
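If you'd rather probe this in code than with the slider, the sketch below (reusing rnn_step and the weights from § 01) unrolls the cell over stand-in random embeddings for the eight tokens, then asks how much a nudge to the first token still changes the final state. With weights this small, the printed number is typically tiny.

```python
def run_rnn(tokens, h0=None):
    """Unroll the same cell across a sequence; the weights never change."""
    h = np.zeros(H) if h0 is None else h0
    states = []
    for x_t in tokens:            # strictly sequential: step t needs step t-1
        h = rnn_step(x_t, h)
        states.append(h)
    return states

# Stand-in random embeddings for "The cat that we saw yesterday was happy."
sentence = [rng.normal(size=D) for _ in range(8)]
base = run_rnn(sentence)

# Crude probe: nudge the FIRST token and see how much the LAST state moves.
nudged = [sentence[0] + 0.1] + sentence[1:]
moved = run_rnn(nudged)
print(np.linalg.norm(moved[-1] - base[-1]))  # tiny => "The" has mostly faded
```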
§ 03 · WHY RNNS FORGET
Vanishing gradients, in one paragraph
The forgetting isn’t a bug. It’s baked into the math.
At each step, the hidden state is multiplied by U and passed through tanh. After k steps, the contribution from token t has been multiplied by U k times and squashed by tanh each time.
If U has eigenvalues less than 1, the contribution shrinks exponentially with k. By step 30 you’ve multiplied by something like 0.9³⁰ ≈ 0.04 — only 4% of the original signal remains. By step 100, effectively zero.
If U has eigenvalues greater than 1, the contribution explodes instead. Even worse in practice: the gradient overflows during training and the loss turns to NaN.
This is the vanishing gradient problem: when the signal (or gradient) flowing through a deep or long chain is repeatedly multiplied by values less than 1, it decays exponentially. The deeper the chain, the smaller the signal at the top — and the harder it is to learn anything that depends on distant information. In effect, a vanilla RNN cannot reliably learn dependencies more than ~20–30 tokens apart. For paragraphs, articles, or any long-range structure, vanilla RNNs are simply blind.
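The arithmetic is simple enough to check in two lines. The snippet below raises a stand-in eigenvalue to the power of the step count, once just below 1 and once just above:

```python
# The whole vanishing/exploding story as arithmetic: a signal scaled
# by an eigenvalue-like factor once per step.
for lam in (0.9, 1.1):
    print(lam, [round(lam ** k, 4) for k in (10, 30, 100)])
# 0.9 -> [0.3487, 0.0424, 0.0]            vanishes
# 1.1 -> [2.5937, 17.4494, 13780.6123]    explodes
```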
§ 04 · LSTM, GRU, AND THE MOVE TO ATTENTION
Two rounds of patches, then a replacement
The first response was architectural. The LSTM (Long Short-Term Memory), proposed in 1997, added gates — small networks that decide, at each step, how much of the previous state to keep, how much new input to write in, and how much to output. It maintains a separate 'cell state' alongside the hidden state, with three learned gates (input, forget, output) controlling what flows in, what stays, and what comes out. The gates use a sigmoid (output 0..1), and critically the cell state can pass through unchanged (multiplied by 1.0) when the forget gate decides to keep it. This preserves long-range information.
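In code, the gating looks like the sketch below. The parameter dict p and its key names are made up for illustration (real implementations fuse the four matmuls into one or two), but the update equations are the standard LSTM ones.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step. `p` holds one (W, U, b) triple per gate."""
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])  # forget: keep how much of c_prev?
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])  # input: write how much new content?
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])  # output: expose how much of the cell?
    c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])  # candidate content
    c_t = f * c_prev + i * c_tilde  # when f ~ 1 and i ~ 0, c_t ~ c_prev: signal survives untouched
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```

The critical line is the cell update: it is additive, not a repeated matrix multiply, which is what lets gradients flow through many steps without decaying.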
LSTMs worked. They were the workhorse of NLP from 2014 to 2017, powering Google Translate, speech recognition, summarization, sentiment analysis — everything. They could handle longer dependencies than vanilla RNNs, often into the hundreds of tokens.
The GRU (Gated Recurrent Unit, 2014) was a streamlined LSTM with two gates instead of three. Often comparable performance, fewer parameters. The two became interchangeable defaults — LSTM more common in big production systems, GRU more common in research papers.
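A matching sketch of the GRU update, reusing the sigmoid helper and the same made-up parameter-dict convention as the LSTM sketch above:

```python
def gru_step(x_t, h_prev, p):
    """One GRU step: two gates (update z, reset r), no separate cell state."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])  # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])  # reset gate
    h_tilde = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h_prev) + p["bh"])  # candidate
    return (1 - z) * h_prev + z * h_tilde  # interpolate between old and new state
```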
Then in 2017, “Attention Is All You Need” pointed out that you don’t need the chain at all. Attention lets every position look directly at every other position in one step. No hidden state to maintain, no gates to learn, no information passing through 30 intermediate cells. Direct lookup, content-addressable.
Three things attention bought that RNNs couldn’t:
- Constant path length between any two positions. A token at position 1 and a token at position 1,000 are one attention step apart, not 1,000 sequential steps. Gradients (and information) flow directly between them.
- Embarrassingly parallel. Computing the next state in an RNN requires the previous state — strictly sequential. Attention can be computed for all positions simultaneously. GPUs eat this for breakfast.
- Better scaling. Once researchers had a parallel architecture and unlimited training data, model size could grow. The transformer scaled to billions of parameters; RNNs effectively peaked at hundreds of millions.
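Here is what that 'direct lookup' means in code: a minimal single-head, unmasked scaled dot-product attention sketch (no multi-head, no positional encoding; the weight matrices are assumed inputs). Note the absence of any loop over time.

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over a whole sequence.

    X: (seq_len, d_model). Every position is computed at once, and any
    two positions are one weighted sum apart, no matter how far apart
    they sit in the sequence.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # all-pairs similarity
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)             # softmax over positions
    return w @ V                                      # content-addressable lookup
```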
§ 05 · TAKING THIS FORWARD
Why this lesson is here
Two reasons RNNs are worth understanding even though you’ll probably never train one:
- They make the transformer’s design choices legible. Every transformer feature — attention, residual connections, layer norm, the lack of recurrence — is in dialogue with RNN limitations. Knowing what RNNs couldn’t do explains what the transformer was built to fix.
- The recurrent paradigm isn’t over. Mamba, RWKV, and other state-space variants are betting that the right form of recurrence — with structured state transitions instead of single matrix multiplies — can compete with attention at scale while keeping the O(1)-per-token property. The historical baggage of vanilla RNNs is exactly what those teams are trying to un-inherit.
For more on what replaced the RNN, read the Attention and Transformer lessons. They’re where the story goes from here.
§ · GOING DEEPER
Why LSTMs survived, why they didn’t, and Mamba
Vanilla RNNs suffer from vanishing gradients: when you backpropagate through many time steps, the signal shrinks exponentially. Hochreiter & Schmidhuber’s 1997 LSTM introduced a gated cell with explicit memory; the gradient could flow through the cell state without decay. LSTMs dominated sequence modeling from roughly 2014–2017 and made modern speech recognition, machine translation, and image captioning possible.
Transformers replaced RNNs for most sequence tasks because attention is parallelizable across time and recurrence isn’t. But the limitations of attention — quadratic cost, no compressed state — opened a door back. State Space Models (Mamba, Gu & Dao 2023) are essentially modern, learnable RNNs that compete with transformers at long context, with linear inference cost. The dust hasn’t settled — hybrid architectures combining attention and SSM layers are the current frontier.
§ · FURTHER READING
References & deeper sources
- Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory · Neural Computation
- Cho, K., van Merriënboer, B., Bahdanau, D. & Bengio, Y. (2014). On the Properties of Neural Machine Translation: Encoder–Decoder Approaches (GRU) · SSST-8
- Bengio, Y., Simard, P. & Frasconi, P. (1994). Learning Long-Term Dependencies with Gradient Descent is Difficult · IEEE Transactions on Neural Networks
- Vaswani, A. et al. (2017). Attention Is All You Need · NeurIPS
- Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces · COLM 2024
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.