Attention
One operation, used over and over: every token decides what to look at, then takes a weighted average of what it found. From this one idea, every modern language model is built.
The five-bullet version
- Attention answers one question per token: which other tokens should I look at, and how much?
- Each token produces three vectors: a query (what I want), a key (what I have), a value (what I’ll hand over).
- The weights come from query·key dot products, scaled and softmaxed.
- Multi-head splits the operation into parallel “asks” — different heads learn different relationships.
- This single operation, stacked, is the entire transformer. No recurrence, no convolution.
§ 00 · THE LOOKUP PROBLEM
Why a model needs to ask questions of its own input
Read this sentence: The cat sat on the warm stone. When you get to warm, you know it’s modifying stone. Not cat. Not sat. Your eyes and brain don’t do anything special — you just know. A model can’t. The word warm in isolation is just a vector pointing at “a property somewhere on the heat dimension.” To know what it attaches to, the model has to look at the other words.
Before attention, models did this by passing information left-to-right: one word at a time, accumulating state. That’s a recurrent network — it works, but it forgets, and it’s sequential, so it can’t use a GPU efficiently. Attention is a different idea: at every position, look at every other position in parallel and decide, based on the content, which ones matter.
§ 01 · QUERIES, KEYS, VALUES
Three projections of the same thing
Start with a sequence of token vectors x₁, x₂, …, xₙ. For each position i, learn three linear projections of the vector:
- qᵢ = Wq · xᵢ — the query: what this position is looking for in other positions. It encodes the question “who is relevant to me?”
- kᵢ = Wk · xᵢ — the key: what this position offers as an identifier. Other positions compare their queries against keys to decide relevance.
- vᵢ = Wv · xᵢ — the value: the actual content this position will hand over if attended to. Keys are labels for matching; values are the payload itself.
Three matrices Wq, Wk, Wv — all learned, all the same shape. The key idea that takes a moment to accept: the same input xᵢ produces all three. Every token is simultaneously asking, advertising, and providing.
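To make the three roles concrete, here is a minimal NumPy sketch of the projections (the sizes, variable names, and random stand-in weights are illustrative assumptions, not values from this lesson):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4          # toy sizes for illustration
n = 5                        # five tokens in the sequence

X = rng.normal(size=(n, d_model))    # one row per token vector x_i

# Three learned projection matrices, all the same shape (random stand-ins here)
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q   # queries: what each position is looking for
K = X @ W_k   # keys:    what each position advertises
V = X @ W_v   # values:  what each position hands over if attended to

print(Q.shape, K.shape, V.shape)   # (5, 4) each: same input, three roles
```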
§ 02 · SCALED DOT-PRODUCT
The whole formula on one line
Compute the attention output for position i in three steps:
- Score every other position. The score from i looking at j is the dot product sᵢⱼ = qᵢ · kⱼ. A high dot product means the query and key vectors point in similar directions — the question matches the advertised label.
- Normalize. Scale by √dₖ (the dimension of the key vectors), then take a softmax along the row: αᵢⱼ = softmax(sᵢⱼ / √dₖ). Now the weights for position i sum to 1.
- Weighted sum of values. outᵢ = Σⱼ αᵢⱼ · vⱼ. The output is a mixture of the values, weighted by how much position i attended to each.
Written compactly: Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V. One operation. The whole transformer is this, stacked.
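As a sanity check, the whole formula fits in a few lines of NumPy. This is an illustrative sketch with toy sizes, not any library’s implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) · V, computed row by row."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # s_ij = q_i · k_j, scaled
    scores -= scores.max(axis=-1, keepdims=True)     # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over j: each row sums to 1
    return weights @ V, weights                      # out_i = sum_j alpha_ij * v_j

rng = np.random.default_rng(0)
n, d_k = 5, 4                                        # toy sizes, as in the sketch above
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
out, alpha = scaled_dot_product_attention(Q, K, V)
print(out.shape, alpha.sum(axis=-1))                 # (5, 4) and rows of 1.0
```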
[Interactive demo: Attention, Visualized. Click any word and the colored bars show how strongly it attends to the other words; different heads specialize in different patterns. Each heatmap row is a single query asking the sequence what to look at, and the dark cells are the values that flowed into that position’s output.]
§ 03 · WHY MULTIPLE HEADS
Different questions, asked in parallel
A single attention layer learns one set of Wq, Wk, Wv. It can only ask one kind of question per token. But a token has many things it might want to know: which word is my subject? Which adjective modifies me? Which noun did the pronoun refer to? Lumping all of these into one query muddles them.
Multi-head attention runs h independent attention operations in parallel — typically h = 8, 16, or 32 — each with its own learned Q/K/V projections, each operating on a smaller slice of the embedding dimension. Then concatenate the outputs; each head can specialize in a different relationship.
[Heatmap: a subject↔verb head — the verb ‘sat’ attends back to its subject ‘cat’. Rows = query (the token asking the question); columns = key (who provides the answer); darker cells = more attention weight. Real LLM heads aren’t this clean, but many are surprisingly close.]
When researchers probe trained transformers, they find heads that specialize: one head attends from each pronoun to its antecedent; another from verbs to their subjects; another almost always copies from the previous token. Not every head is interpretable, but enough are to confirm: the parallelism is buying real diversity, not just redundancy.
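The multi-head operation described above, as a loop-based NumPy sketch. Toy sizes and random weights are assumptions for illustration; real implementations fuse the per-head work, as the Going Deeper section discusses:

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """h independent attention ops on d/h-wide slices, concatenated, then projected."""
    n, d = X.shape
    d_head = d // h
    heads = []
    for i in range(h):                                   # one independent "ask" per head
        sl = slice(i * d_head, (i + 1) * d_head)
        Q, K, V = X @ W_q[:, sl], X @ W_k[:, sl], X @ W_v[:, sl]
        scores = Q @ K.T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        heads.append(weights @ V)                        # (n, d_head) output per head
    return np.concatenate(heads, axis=-1) @ W_o          # concat back to (n, d), project

rng = np.random.default_rng(0)
n, d, h = 6, 16, 4                                       # toy sizes for illustration
X = rng.normal(size=(n, d))
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, h).shape)   # (6, 16)
```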
§ 04 · WHAT CHANGED BECAUSE OF THIS
Why “Attention Is All You Need” was the right title
Three things attention does that recurrence and convolution don’t do as well:
- Direct long-range dependencies. A token at position 1,000 can attend to position 1 in one step. No information has to be passed through 999 intermediate steps, losing detail along the way.
- Embarrassingly parallel. Computing attention for all positions doesn’t require any sequential dependency. GPUs eat this for breakfast. Training throughput improved by an order of magnitude over LSTMs at the same accuracy.
- Compositional. Stack L layers of attention and each token’s representation has been informed by a complex mixture of every other token’s representations, mediated by learned, content-dependent weights. The depth of composition is what gives modern LLMs their abstraction.
There’s a downside, and you’ll meet it everywhere in production AI: attention is quadratic in sequence length. Computing scores between every pair of tokens is n² operations and n² memory — which gets uncomfortable past 4k tokens, painful past 32k, and motivates a whole subfield of approximate attention variants. Most of the engineering since 2017 has been about making this manageable: KV caching, FlashAttention, sliding-window attention, sparse attention, and now hybrid recurrent-attention architectures.
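A back-of-the-envelope sketch of why the quadratic term hurts, assuming fp16 scores and 32 heads (both assumptions chosen for illustration):

```python
def attn_matrix_bytes(n_tokens, n_heads=32, bytes_per_score=2):
    """Memory to materialize one layer's full attention matrix: n^2 scores per head."""
    return n_tokens ** 2 * n_heads * bytes_per_score

for n in (1_024, 4_096, 32_768, 131_072):
    gib = attn_matrix_bytes(n) / 2**30
    print(f"n = {n:>7,} tokens: ~{gib:,.1f} GiB per layer")
```

At 4k tokens the matrix is about a gibibyte per layer under these assumptions; at 128k it is roughly a tebibyte, which is why long-context kernels avoid ever writing it out.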
§ 05 · TAKING THIS FORWARD
What comes after attention
Attention is the centerpiece, but it’s embedded in a layer that also has a residual connection, LayerNorm, and an MLP. Stack 32–96 of those layers and you have a modern LLM. The transformer lesson breaks the full block apart; FlashAttention and KV cache lessons cover the engineering that makes attention fast enough to serve at scale.
One mental model worth keeping: attention is a content-addressable memory operation. Queries ask, keys advertise, values transfer. Anywhere you need information to flow between elements based on what they contain (not where they sit), attention is the tool. That’s why it appears in vision (ViT), in retrieval (RAG, next two lessons), in protein folding (AlphaFold), and in graph networks. The mechanism generalizes.
§ · GOING DEEPER
Multi-head, FlashAttention, and the long-context arc
Multi-head attention is the practical move that made the original transformer work at scale. Instead of one query per token, you compute h queries, keys, and values in parallel, each at d/h dimensions. Different heads learn different relationships — Voita et al. (2019) showed heads specialize for positional, syntactic, and rare-word lookups in trained models. Concatenating their outputs and projecting back is mathematically a sum of low-rank approximations; computationally, it’s a single big matrix multiplication that hardware loves.
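A sketch of that fused view: compute each projection once at full width, split the last axis into heads, and let one batched matrix multiplication score every head at once (toy sizes assumed; the three projections could be fused further into a single matmul, omitted here for clarity):

```python
import numpy as np

def fused_multi_head(X, W_q, W_k, W_v, W_o, h):
    """All h heads at once: one projection per matrix, then a batched (h, n, n) attention."""
    n, d = X.shape
    d_head = d // h

    def split_heads(W):
        # one big matmul, then reshape the last axis into h separate heads
        return (X @ W).reshape(n, h, d_head).transpose(1, 0, 2)   # (h, n, d_head)

    Q, K, V = split_heads(W_q), split_heads(W_k), split_heads(W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)           # (h, n, n), every head at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ V                                             # (h, n, d_head)
    out = out.transpose(1, 0, 2).reshape(n, d)                    # concatenate the heads
    return out @ W_o

rng = np.random.default_rng(0)
n, d, h = 6, 16, 4
X = rng.normal(size=(n, d))
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) for _ in range(4))
print(fused_multi_head(X, W_q, W_k, W_v, W_o, h).shape)          # (6, 16)
```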
Modern attention engineering is mostly about avoiding the O(n²) materialization of the attention matrix. FlashAttention (Dao et al. 2022) tiles the computation and never writes the attention matrix to HBM — it’s the same math, 2–4× wall-clock faster on long sequences. Sparse attention (Sparse Transformers, Longformer, BigBird) restricts which positions each token attends to, trading exactness for linear scaling. These are the two paths to long context; almost every long-context model uses one or both.
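A toy, single-query sketch of the online-softmax idea that FlashAttention builds on. It is NumPy rather than a GPU kernel, and the block size is an arbitrary assumption, but it shows how the full row of scores never needs to exist at once:

```python
import numpy as np

def streaming_attention(q, K, V, block=128):
    """One query's attention output, scanning keys/values block by block.

    Same math as softmax(q·Kᵀ/√d_k)·V, but only a running max, running
    denominator, and running weighted sum are kept, rescaled as each block arrives.
    """
    d_k = q.shape[-1]
    m, denom = -np.inf, 0.0
    acc = np.zeros(V.shape[-1])
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q / np.sqrt(d_k)   # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                       # exp(-inf) = 0 on the first block
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum()
        acc = acc * scale + p @ V[start:start + block]
        m = m_new
    return acc / denom

rng = np.random.default_rng(0)
n, d_k = 1_000, 4
q, K, V = rng.normal(size=d_k), rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k))
scores = K @ q / np.sqrt(d_k)                           # reference: full, non-streamed version
w = np.exp(scores - scores.max()); w /= w.sum()
print(np.allclose(streaming_attention(q, K, V), w @ V))  # True: identical result
```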
§ · FURTHER READING
References & deeper sources
- Vaswani et al. (2017). Attention Is All You Need · NeurIPS
- Bahdanau et al. (2014). Neural Machine Translation by Jointly Learning to Align and Translate · ICLR
- Dao et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness · NeurIPS
- Voita et al. (2019). Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting · ACL
- Beltagy et al. (2020). Longformer: The Long-Document Transformer · arXiv
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.