Architectures · Module 18 · 9 min read

Seq2Seq

The architecture that put neural machine translation on the map. Two RNNs back-to-back — encoder reads the source language, decoder emits the target. Then attention turned its weakness into the seed of the transformer.

The five-bullet version

  • Seq2seq is two RNNs glued together — an encoder that reads the input sequence and a decoder that emits the output sequence.
  • The handoff is a single fixed-size vector: the encoder’s final hidden state.
  • That bottleneck is the architecture’s biggest flaw — everything about a long source has to fit in one vector.
  • Attention (added in 2014) lets the decoder look back at every encoder hidden state, not just the final one.
  • Once you have attention, the encoder/decoder split becomes optional — and the transformer arrives.

§ 00 · TWO RNNS, GLUED · The encoder–decoder template

Before 2014, neural machine translation was rough. The architecture that changed that is seq2seq, short for sequence-to-sequence: a pair of RNNs (an encoder and a decoder) joined by a single fixed-size vector. The encoder ingests the input sequence; the decoder generates the output sequence. It has a one-line description: two RNNs glued together.

The encoder reads the source language token by token, producing a hidden state at each step. The final hidden state summarizes the whole input.

The decoder starts from that final hidden state and generates the target language token by token. At each step it takes the previous decoder state plus the previously emitted target token and produces the next.
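To make the shape concrete, here is a minimal sketch of the two halves in PyTorch. The class names, layer sizes, and the choice of GRUs are illustrative, not the setup from any particular paper (the 2014 systems used LSTMs).

```python
# Minimal sketch of the seq2seq template (PyTorch; names and sizes are illustrative).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len) token ids
        embedded = self.embed(src)               # (batch, src_len, hidden)
        outputs, h_final = self.rnn(embedded)    # outputs: one hidden state per source token
        return outputs, h_final                  # h_final: the single summary vector

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, prev_token, hidden):       # one step: previous target token + decoder state
        embedded = self.embed(prev_token)        # prev_token: (batch, 1) -> (batch, 1, hidden)
        output, hidden = self.rnn(embedded, hidden)
        logits = self.out(output.squeeze(1))     # scores over the target vocabulary
        return logits, hidden
```

Generation is then a loop: start the decoder from the encoder's final hidden state and a start-of-sequence token, feed each emitted token back in as the next input, and stop at end-of-sequence.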

Lab · encoder–decoder · English → French, one token at a time
[Interactive demo: step through the encoder reading the source "The cat sat" and building hidden states, then the decoder emitting the French target token by token.]

§ 01 · THE BOTTLENECK · One vector to summarize a sentence

The whole architecture has one obvious flaw. The encoder’s final hidden state is the only thing the decoder sees of the source. For a five-word sentence, it might be enough. For a fifty-word sentence, that single vector is being asked to hold a great deal of information.

Empirically, plain seq2seq performance degraded sharply as source sentences got longer. Translation accuracy on 30-word sentences was much worse than on 10-word sentences, even when the model had plenty of capacity overall. The bottleneck was real and it was the architecture, not the data.
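One way to see the bottleneck directly, reusing the hypothetical Encoder sketch above: the summary handed to the decoder has the same shape whether the source is five tokens or fifty.

```python
# The handoff is one fixed-size vector, regardless of source length (illustrative sizes).
encoder = Encoder(vocab_size=10_000)
short_src = torch.randint(0, 10_000, (1, 5))    # a 5-token source
long_src = torch.randint(0, 10_000, (1, 50))    # a 50-token source
_, h_short = encoder(short_src)
_, h_long = encoder(long_src)
print(h_short.shape, h_long.shape)              # both torch.Size([1, 1, 256])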

§ 02 · ATTENTION RESCUES IT · Let the decoder peek at every encoder state

Bahdanau, Cho, and Bengio added an attention mechanism on top of seq2seq in 2014. The fix is simple: instead of forcing the decoder to start from a single vector, give it access to all of the encoder's hidden states (one per source token). At each decoder step, the decoder computes a learned weighting over those encoder states, typically concentrating on the small number of source tokens relevant to the current target token.

The effect is dramatic. Attention-equipped seq2seq doesn’t degrade as sentences get longer. The decoder no longer has to remember everything — it can look back. Translation quality on long sentences caught up to short ones.

The mechanism: for decoder step t, compute a similarity score between the decoder state and every encoder state. Softmax to get attention weights. Compute a weighted average of the encoder states (the “context vector”) and feed it to the decoder alongside the previous state. Now the decoder has direct access to whichever source tokens it needs.
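A sketch of that step, using dot-product scores for the similarity (in the spirit of Luong et al.; Bahdanau's original scoring function was a small additive network, but the softmax-and-weighted-average structure is the same). Function and variable names here are illustrative.

```python
# One decoder step of attention (dot-product scoring; names and shapes are illustrative).
import torch
import torch.nn.functional as F

def attention_step(decoder_state, encoder_states):
    # decoder_state:  (batch, hidden)           - the decoder's state at step t
    # encoder_states: (batch, src_len, hidden)  - one hidden state per source token
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2)).squeeze(2)  # (batch, src_len)
    weights = F.softmax(scores, dim=1)           # attention weights, sum to 1 over the source
    context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)       # (batch, hidden)
    return context, weights
```

The returned context vector is fed to the decoder alongside the previous state, exactly as the paragraph above describes; the weights are the numbers visualized in Fig 1.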

Fig 1 · Attention weights (%) between the English source "The cat sat" and the French target "Le chat s'est assis":

            The   cat   sat
  Le         70    15    15
  chat       10    75    15
  s'est      10    20    70
  assis      10    15    75

What learned attention looks like: each target token puts most of its weight on its aligned source token. The model has discovered word alignments without anyone explicitly teaching them.

§ 03 · WHAT SEQ2SEQ MADE POSSIBLE · Translation, summarization, and the rest

The encoder–decoder template generalizes beyond translation. Any sequence-in / sequence-out task fits the shape:

  • Translation — source-language tokens in, target-language tokens out.
  • Summarization — a long document in, a short summary out.
  • Speech recognition — audio frames in, a transcript out.
  • Dialogue — a conversation history in, a reply out.

Every one of these became dramatically better with attention-equipped seq2seq, between roughly 2014 and 2017.

§ 04 · LEGACY AND SUCCESSORS · Where the architecture went

In 2017, “Attention Is All You Need” pushed the logic to its conclusion. If attention is doing the heavy lifting, do you even need the RNNs? It turned out no. The transformer kept the encoder–decoder shape but replaced every RNN cell with stacks of attention layers. Parallelizable, much better at long sequences, much easier to scale.

Modern systems still have encoder–decoder descendants:

  • Machine translation models remain encoder–decoder transformers.
  • T5 and BART keep the encoder–decoder shape for text-to-text tasks like summarization.
  • Whisper uses an encoder–decoder transformer to map audio to text.

CHECK · A team is building a Python-to-JavaScript code translator. Which architecture should they start with?

§ 05 · TAKING THIS FORWARD · Where to read next

Seq2seq is the architectural ancestor of the transformer, and reading it gives you the mental model that makes attention feel obvious. From here: the Transformer lesson covers what happened when attention ate the rest of the architecture. The RNN lesson covers what seq2seq inherited from its building blocks. The lineage runs in both directions.

§ · GOING DEEPER · From encoder-decoder LSTMs to transformer translation

Sutskever et al.’s 2014 paper introduced the encoder-decoder seq2seq framework: an LSTM encoder reads the source sentence into a single fixed-size context vector, an LSTM decoder generates the target sentence conditioned on it. It worked, but the fixed bottleneck couldn’t hold long source sentences. Bahdanau et al. (2014) fixed this by adding attention between encoder and decoder — the decoder could selectively look at any source position, not just the final hidden state.

Attention turned out to be more important than the LSTM scaffolding. Vaswani et al.’s 2017 transformer replaced the recurrence entirely and made encoder-decoder attention plus self-attention the only mechanism. Google’s 2016 production NMT (Wu et al.) bridged the two eras at scale — eight-layer LSTM encoder-decoder with attention, served at Google Translate’s volumes. Today every translation system uses transformers, but the encoder-decoder shape Sutskever introduced is still the template.

§ · FURTHER READING · References & deeper sources

  1. Sutskever, Vinyals, Le (2014). Sequence to Sequence Learning with Neural Networks · NeurIPS
  2. Cho et al. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation · EMNLP
  3. Bahdanau, Cho, Bengio (2014). Neural Machine Translation by Jointly Learning to Align and Translate · ICLR
  4. Luong, Pham, Manning (2015). Effective Approaches to Attention-based Neural Machine Translation · EMNLP
  5. Wu et al. (2016). Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation · arXiv

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.