Architectures · Module 05 · 12 min read

Attention

One operation, used over and over: every token decides what to look at, then takes a weighted average of what it found. From this one idea, every modern language model is built.

The five-bullet version

  • Attention answers one question per token: which other tokens should I look at, and how much?
  • Each token produces three vectors: a query (what I want), a key (what I have), a value (what I’ll hand over).
  • The weights come from query·key dot products, scaled and softmaxed.
  • Multi-head splits the operation into parallel “asks” — different heads learn different relationships.
  • This single operation, stacked, is the entire transformer. No recurrence, no convolution.

§ 00 · THE LOOKUP PROBLEM · Why a model needs to ask questions of its own input

Read this sentence: The cat sat on the warm stone. When you get to warm, you know it’s modifying stone. Not cat. Not sat. Your eyes and brain don’t do anything special — you just know. A model can’t. The word warm in isolation is just a vector pointing at “a property somewhere on the heat dimension.” To know what it attaches to, the model has to look at the other words.

Before attention, models did this by passing information left-to-right: one word at a time, accumulating state. That’s a recurrent network — it works, but it forgets, and it’s sequential, so it can’t use a GPU efficiently. Attention is a different idea: at every position, look at every other position in parallel and decide, based on the content, which ones matter.

§ 01 · QUERIES, KEYS, VALUES · Three projections of the same thing

Start with a sequence of token vectors x₁, x₂, …, xₙ. For each position i, learn three linear projections of the vector:

  qᵢ = Wq·xᵢ (the query: what I want) · kᵢ = Wk·xᵢ (the key: what I have) · vᵢ = Wv·xᵢ (the value: what I’ll hand over)

Three matrices Wq, Wk, Wv — all learned, all the same shape. The key idea that takes a moment to accept: the same input xᵢ produces all three. Every token is simultaneously asking, advertising, and providing.
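A minimal NumPy sketch of the three projections. The sizes (6 tokens, dimension 8) and the random matrices are illustrative assumptions, not values from the lesson:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 6, 8                        # toy sizes: 6 tokens, embedding dim 8
X = rng.normal(size=(n, d))        # one row per token vector x_i

# Three learned matrices of identical shape; random stand-ins for Wq, Wk, Wv
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q = X @ Wq   # queries: what each token wants
K = X @ Wk   # keys: what each token advertises
V = X @ Wv   # values: what each token hands over

print(Q.shape, K.shape, V.shape)   # all (6, 8): same input, three projections
```

The same X feeds all three multiplications — that is the “same input produces all three” point in code.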

§ 02 · SCALED DOT-PRODUCT · The whole formula on one line

Compute the attention output for position i in three steps:

  1. Score every other position. The score from i looking at j is the dot product sᵢⱼ = qᵢ · kⱼ. High dot product means the query and key vectors point in similar directions — the question matches the advertised label.
  2. Normalize. Divide by √dₖ (the dimension of the key vectors), then take a softmax along the row: αᵢⱼ = softmax(sᵢⱼ / √dₖ). Now the weights for position i sum to 1.
  3. Weighted sum of values. outᵢ = Σⱼ αᵢⱼ · vⱼ. The output is a mixture of the values, weighted by how much position i attended to each.

Written compactly: Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V. One operation. The whole transformer is this, stacked.
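The three steps can be sketched in a few lines of NumPy. The shapes and random inputs are toy assumptions; the math is the formula above:

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed row by row."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # steps 1-2: scaled pairwise dot products
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V                            # step 3: weighted sum of values

rng = np.random.default_rng(0)
n, d = 5, 16
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)   # (5, 16): one mixed value vector per query position
```

Row i of the output is exactly outᵢ = Σⱼ αᵢⱼ · vⱼ from step 3.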

Lab 02 · Live

Attention, Visualized

Click any word. The colored bars show how strongly it attends to other words. Different “heads” specialize in different patterns.

With it selected in “The cat sat on the mat because it was tired”: cat 76% · sat 15% · mat 10% · all other tokens ≈ 0%.
it attends most strongly to cat (76% of its weight).

Hover the heatmap rows — each row is a single query, asking the sequence what to look at. The dark cells are the values that flowed through to the output for that position.

§ 03 · WHY MULTIPLE HEADS · Different questions, asked in parallel

A single attention layer learns one set of Wq, Wk, Wv. It can only ask one kind of question per token. But a token has many things it might want to know: which word is my subject? Which adjective modifies me? Which noun did the pronoun refer to? Lumping all of these into one query muddles them.

Multi-head attention runs h independent attention operations in parallel — typically h = 8, 16, or 32 — each with its own learned Q/K/V projections, each operating on a smaller slice of the embedding dimension. Concatenate the outputs, and each head is free to specialize in a different relationship.
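A hedged NumPy sketch of the split-attend-concatenate pattern. The sizes (6 tokens, d = 64, h = 8) and the output projection Wo are illustrative assumptions:

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """Split d into h slices, attend per head, concatenate, project back."""
    n, d = X.shape
    d_h = d // h                                     # each head works in d/h dims
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Reshape to (h, n, d_h): each head sees its own contiguous slice
    split = lambda M: M.reshape(n, h, d_h).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_h)  # (h, n, n) per-head scores
    scores -= scores.max(-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(-1, keepdims=True)                    # per-head softmax
    heads = w @ Vh                                   # (h, n, d_h) per-head outputs
    concat = heads.transpose(1, 0, 2).reshape(n, d)  # concatenate head outputs
    return concat @ Wo                               # final output projection

rng = np.random.default_rng(0)
n, d, h = 6, 64, 8   # 8 heads of 64/8 = 8 dims each
X = rng.normal(size=(n, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, h).shape)  # (6, 64)
```

In practice the h heads run as one big batched matrix multiplication, which is exactly what the hardware wants.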

Lab · attention patterns · One head, one sentence — pick what kind of head this is

Subject↔verb — the verb 'sat' attends back to its subject 'cat'.

Heatmap over the tokens: The · cat · sat · on · the · warm · stone · .

Rows = query (the token asking the question). Columns = key (who provides the answer). Darker cells = more attention weight. Real LLM heads aren’t this clean — but many are surprisingly close.

When researchers probe trained transformers, they find heads that specialize: one head attends from each pronoun to its antecedent; another from verbs to their subjects; another almost always copies from the previous token. Not every head is interpretable, but enough are to confirm: the parallelism is buying real diversity, not just redundancy.

§ 04 · WHAT CHANGED BECAUSE OF THIS · Why “Attention Is All You Need” was the right title

Attention does three things that recurrence and convolution don’t do as well: it connects any two positions in a single step, so long-range dependencies don’t decay with distance; it processes every position in parallel, so the whole sequence runs as one batch of matrix multiplications; and its weights are computed from content, not fixed by position.

Fig 1 · One transformer layer: input embeddings + positional encoding → LayerNorm → multi-head self-attention (softmax(QKᵀ/√d)·V, h heads in parallel) + residual → LayerNorm → feed-forward network + residual, repeated × N layers. Attention is the new operation, but it ships embedded in a residual block with LayerNorm and an MLP. Stack the block; you have the model.

There’s a downside, and you’ll meet it everywhere in production AI: attention is quadratic in sequence length. It computes a score between every pair of tokens, so a sequence of length n costs n² operations and n² memory — which gets uncomfortable past 4k tokens, painful past 32k, and motivates a whole subfield of approximate attention variants. Most of the engineering since 2017 has been about making this manageable: KV caching, FlashAttention, sliding-window attention, sparse attention, and now hybrid recurrent-attention architectures.

CHECK · In multi-head attention with 8 heads and embedding dimension 512, what’s the dimension of each head’s Q/K/V projection?

§ 05 · TAKING THIS FORWARD · What comes after attention

Attention is the centerpiece, but it’s embedded in a layer that also has a residual connection, LayerNorm, and an MLP. Stack 32–96 of those layers and you have a modern LLM. The transformer lesson breaks the full block apart; FlashAttention and KV cache lessons cover the engineering that makes attention fast enough to serve at scale.

One mental model worth keeping: attention is a content-addressable memory operation. Queries ask, keys advertise, values transfer. Anywhere you need information to flow between elements based on what they contain (not where they sit), attention is the tool. That’s why it appears in vision (ViT), in retrieval (RAG, next two lessons), in protein folding (AlphaFold), and in graph networks. The mechanism generalizes.
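The content-addressable-memory framing fits in a few lines. A dict is a hard lookup: the query must match a key exactly. Attention is the soft version: every value contributes, weighted by query-key similarity. The key vectors and query below are made-up illustrative numbers:

```python
import numpy as np

# Hard lookup: a dict returns the value whose key matches the query exactly.
store = {"cat": 1.0, "stone": 2.0}
print(store["cat"])                            # exact match or nothing

# Soft lookup: attention returns a similarity-weighted blend of ALL values.
keys   = np.array([[1.0, 0.0], [0.0, 1.0]])    # hypothetical key vectors for "cat", "stone"
values = np.array([1.0, 2.0])
query  = np.array([0.9, 0.1])                  # mostly asking for "cat"

scores = keys @ query                          # dot-product similarity to each key
w = np.exp(scores)
w /= w.sum()                                   # softmax over the two keys
print(w @ values)                              # near "cat"'s value, nudged toward "stone"'s
```

Because the lookup is soft and differentiable, gradients flow through it — which is what lets the projections be learned at all.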

§ · GOING DEEPER · Multi-head, FlashAttention, and the long-context arc

Multi-head attention is the practical move that made the original transformer work at scale. Instead of one query per token, you compute h queries, keys, and values in parallel, each at d/h dimensions. Different heads learn different relationships — Voita et al. (2019) showed heads specialize for positional, syntactic, and rare-word lookups in trained models. Concatenating their outputs and projecting back is mathematically a sum of low-rank approximations; computationally, it’s a single big matrix multiplication that hardware loves.

Modern attention engineering is mostly about avoiding the O(n²) materialization of the attention matrix. FlashAttention (Dao et al. 2022) tiles the computation and never writes the attention matrix to HBM — it’s the same math, 2–4× wall-clock faster on long sequences. Sparse attention (Sparse Transformers, Longformer, BigBird) restricts which positions each token attends to, trading exactness for linear scaling. These are the two paths to long context; almost every long-context model uses one or both.
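A small sketch of the sliding-window idea (Longformer-style, simplified): restrict each token to a band of nearby positions, so the number of scores grows linearly in n instead of quadratically. The window size is an illustrative assumption:

```python
import numpy as np

def sliding_window_mask(n, w):
    """Boolean mask: position i may attend only to positions within w of i."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

n, w = 1024, 8
mask = sliding_window_mask(n, w)

# Allowed pairs grow roughly as n*(2w+1), far below the full n² grid.
print(mask.sum(), "of", n * n, "pairs kept")
```

Applying the mask means setting disallowed scores to −∞ before the softmax; FlashAttention, by contrast, keeps the full n² math but tiles it so the score matrix never hits slow memory.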

§ · FURTHER READINGReferences & deeper sources

  1. Vaswani et al. (2017). Attention Is All You Need · NeurIPS
  2. Bahdanau, Cho, Bengio (2014). Neural Machine Translation by Jointly Learning to Align and Translate · ICLR
  3. Dao, Fu, Ermon, Rudra, Ré (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness · NeurIPS
  4. Voita, Talbot, Moiseev, Sennrich, Titov (2019). Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting · ACL
  5. Beltagy, Peters, Cohan (2020). Longformer: The Long-Document Transformer · arXiv

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.