Attention
One operation, used over and over: every token decides what to look at, then takes a weighted average of what it found. From this one idea, every modern language model is built.
The five-bullet version
- Attention answers one question per token: which other tokens should I look at, and how much?
- Each token produces three vectors: a query (what I want), a key (what I have), a value (what I’ll hand over).
- The weights come from query·key dot products, scaled and softmaxed.
- Multi-head splits the operation into parallel “asks” — different heads learn different relationships.
- This single operation, stacked, is the entire transformer. No recurrence, no convolution.
§ 00 · THE LOOKUP PROBLEM
Why a model needs to ask questions of its own input
Read this sentence: The cat sat on the warm stone. When you get to warm, you know it’s modifying stone. Not cat. Not sat. Your eyes and brain don’t do anything special — you just know. A model can’t. The word warm in isolation is just a vector pointing at “a property somewhere on the heat dimension.” To know what it attaches to, the model has to look at the other words.
Before attention, models did this by passing information left-to-right: one word at a time, accumulating state. That’s a recurrent network — it works, but it forgets, and it’s sequential, so it can’t use a GPU efficiently. Attention is a different idea: at every position, look at every other position in parallel and decide, based on the content, which ones matter.
§ 01 · QUERIES, KEYS, VALUES
Three projections of the same thing
Start with a sequence of token vectors x₁, x₂, …, xₙ. For each position i, learn three linear projections of the vector:
- qᵢ = Wq · xᵢ — the query: what this position is looking for in other positions. It encodes the question “who is relevant to me?”
- kᵢ = Wk · xᵢ — the key: what this position offers as an identifier. Other positions compare their queries against keys to decide relevance.
- vᵢ = Wv · xᵢ — the value: the actual content this position will hand over if attended to. Keys are labels for matching; values are the payload itself.
Three matrices Wq, Wk, Wv — all learned, all the same shape. The key idea that takes a moment to accept: the same input xᵢ produces all three. Every token is simultaneously asking, advertising, and providing.
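To make the three roles concrete, here is a minimal NumPy sketch of the projections (the sizes, variable names, and random stand-in weights are illustrative assumptions, not values from this lesson):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4          # toy sizes for illustration
n = 5                        # five tokens in the sequence

X = rng.normal(size=(n, d_model))    # one row per token vector x_i

# Three learned projection matrices, all the same shape (random stand-ins here)
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q   # queries: what each position is looking for
K = X @ W_k   # keys:    what each position advertises
V = X @ W_v   # values:  what each position hands over if attended to

print(Q.shape, K.shape, V.shape)   # (5, 4) each: same input, three roles
```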
§ 02 · SCALED DOT-PRODUCT
The whole formula on one line
Compute the attention output for position i in three steps:
- Score every other position. The score from i looking at j is the dot product sᵢⱼ = qᵢ · kⱼ. A high dot product means the query and key vectors point in similar directions — the question matches the advertised label.
- Normalize. Scale by √dₖ (the dimension of the key vectors), then take a softmax along the row: αᵢⱼ = softmax(sᵢⱼ / √dₖ). Now the weights for position i sum to 1.
- Weighted sum of values. outᵢ = Σⱼ αᵢⱼ · vⱼ. The output is a mixture of the values, weighted by how much position i attended to each.
Written compactly: Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V. One operation. The whole transformer is this, stacked.
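As a sanity check, the whole formula fits in a few lines of NumPy. This is an illustrative sketch with toy sizes, not any library’s implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) · V, computed row by row."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # s_ij = q_i · k_j, scaled
    scores -= scores.max(axis=-1, keepdims=True)     # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over j: each row sums to 1
    return weights @ V, weights                      # out_i = sum_j alpha_ij * v_j

rng = np.random.default_rng(0)
n, d_k = 5, 4                                        # toy sizes, as in the sketch above
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
out, alpha = scaled_dot_product_attention(Q, K, V)
print(out.shape, alpha.sum(axis=-1))                 # (5, 4) and rows of 1.0
```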
[Interactive demo: Attention, Visualized. Click any word and the colored bars show how strongly it attends to the other words; different heads specialize in different patterns. Each heatmap row is a single query asking the sequence what to look at, and the dark cells are the values that flowed into that position’s output.]
§ 03 · WHY MULTIPLE HEADS
Different questions, asked in parallel
A single attention layer learns one set of Wq, Wk, Wv. It can only ask one kind of question per token. But a token has many things it might want to know: which word is my subject? Which adjective modifies me? Which noun did the pronoun refer to? Lumping all of these into one query muddles them.
Multi-head attention runs h independent attention operations in parallel — typically h = 8, 16, or 32 — each with its own learned Q/K/V projections, each operating on a smaller slice of the embedding dimension. Then concatenate the outputs; each head can specialize in a different relationship.
[Heatmap: a subject↔verb head — the verb ‘sat’ attends back to its subject ‘cat’. Rows = query (the token asking the question); columns = key (who provides the answer); darker cells = more attention weight. Real LLM heads aren’t this clean, but many are surprisingly close.]
When researchers probe trained transformers, they find heads that specialize: one head attends from each pronoun to its antecedent; another from verbs to their subjects; another almost always copies from the previous token. Not every head is interpretable, but enough are to confirm: the parallelism is buying real diversity, not just redundancy.
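The multi-head operation described above, as a loop-based NumPy sketch. Toy sizes and random weights are assumptions for illustration; real implementations fuse the per-head work, as the Going Deeper section discusses:

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """h independent attention ops on d/h-wide slices, concatenated, then projected."""
    n, d = X.shape
    d_head = d // h
    heads = []
    for i in range(h):                                   # one independent "ask" per head
        sl = slice(i * d_head, (i + 1) * d_head)
        Q, K, V = X @ W_q[:, sl], X @ W_k[:, sl], X @ W_v[:, sl]
        scores = Q @ K.T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        heads.append(weights @ V)                        # (n, d_head) output per head
    return np.concatenate(heads, axis=-1) @ W_o          # concat back to (n, d), project

rng = np.random.default_rng(0)
n, d, h = 6, 16, 4                                       # toy sizes for illustration
X = rng.normal(size=(n, d))
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, h).shape)   # (6, 16)
```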
§ 04 · WHAT CHANGED BECAUSE OF THIS
Why “Attention Is All You Need” was the right title
Three things attention does that recurrence and convolution don’t do as well:
- Direct long-range dependencies. A token at position 1,000 can attend to position 1 in one step. No information has to be passed through 999 intermediate steps, losing detail along the way.
- Embarrassingly parallel. Computing attention for all positions doesn’t require any sequential dependency. GPUs eat this for breakfast. Training throughput improved by an order of magnitude over LSTMs at the same accuracy.
- Compositional. Stack L layers of attention and each token’s representation has been informed by a complex mixture of every other token’s representations, mediated by learned, content-dependent weights. The depth of composition is what gives modern LLMs their abstraction.
There’s a downside, and you’ll meet it everywhere in production AI: attention is quadratic in sequence length. Computing scores between every pair of tokens is n² operations and n² memory — which gets uncomfortable past 4k tokens, painful past 32k, and motivates a whole subfield of approximate attention variants. Most of the engineering since 2017 has been about making this manageable: KV caching, FlashAttention, sliding-window attention, sparse attention, and now hybrid recurrent-attention architectures.
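A back-of-the-envelope sketch of why the quadratic term hurts, assuming fp16 scores and 32 heads (both assumptions chosen for illustration):

```python
def attn_matrix_bytes(n_tokens, n_heads=32, bytes_per_score=2):
    """Memory to materialize one layer's full attention matrix: n^2 scores per head."""
    return n_tokens ** 2 * n_heads * bytes_per_score

for n in (1_024, 4_096, 32_768, 131_072):
    gib = attn_matrix_bytes(n) / 2**30
    print(f"n = {n:>7,} tokens: ~{gib:,.1f} GiB per layer")
```

At 4k tokens the matrix is about a gibibyte per layer under these assumptions; at 128k it is roughly a tebibyte, which is why long-context kernels avoid ever writing it out.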
§ 05 · TAKING THIS FORWARD
What comes after attention
Attention is the centerpiece, but it’s embedded in a layer that also has a residual connection, LayerNorm, and an MLP. Stack 32–96 of those layers and you have a modern LLM. The transformer lesson breaks the full block apart; FlashAttention and KV cache lessons cover the engineering that makes attention fast enough to serve at scale.
One mental model worth keeping: attention is a content-addressable memory operation. Queries ask, keys advertise, values transfer. Anywhere you need information to flow between elements based on what they contain (not where they sit), attention is the tool. That’s why it appears in vision (ViT), in retrieval (RAG, next two lessons), in protein folding (AlphaFold), and in graph networks. The mechanism generalizes.
§ · GOING DEEPER
Multi-head, FlashAttention, and the long-context arc
Multi-head attention is the practical move that made the original transformer work at scale. Instead of one query per token, you compute h queries, keys, and values in parallel, each at d/h dimensions. Different heads learn different relationships — Voita et al. (2019) showed heads specialize for positional, syntactic, and rare-word lookups in trained models. Concatenating their outputs and projecting back is mathematically a sum of low-rank approximations; computationally, it’s a single big matrix multiplication that hardware loves.
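A sketch of that fused view: compute each projection once at full width, split the last axis into heads, and let one batched matrix multiplication score every head at once (toy sizes assumed; the three projections could be fused further into a single matmul, omitted here for clarity):

```python
import numpy as np

def fused_multi_head(X, W_q, W_k, W_v, W_o, h):
    """All h heads at once: one projection per matrix, then a batched (h, n, n) attention."""
    n, d = X.shape
    d_head = d // h

    def split_heads(W):
        # one big matmul, then reshape the last axis into h separate heads
        return (X @ W).reshape(n, h, d_head).transpose(1, 0, 2)   # (h, n, d_head)

    Q, K, V = split_heads(W_q), split_heads(W_k), split_heads(W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)           # (h, n, n), every head at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ V                                             # (h, n, d_head)
    out = out.transpose(1, 0, 2).reshape(n, d)                    # concatenate the heads
    return out @ W_o

rng = np.random.default_rng(0)
n, d, h = 6, 16, 4
X = rng.normal(size=(n, d))
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) for _ in range(4))
print(fused_multi_head(X, W_q, W_k, W_v, W_o, h).shape)          # (6, 16)
```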
Modern attention engineering is mostly about avoiding the O(n²) materialization of the attention matrix. FlashAttention (Dao et al. 2022) tiles the computation and never writes the attention matrix to HBM — it’s the same math, 2–4× wall-clock faster on long sequences. Sparse attention (Sparse Transformers, Longformer, BigBird) restricts which positions each token attends to, trading exactness for linear scaling. These are the two paths to long context; almost every long-context model uses one or both.
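A toy, single-query sketch of the online-softmax idea that FlashAttention builds on. It is NumPy rather than a GPU kernel, and the block size is an arbitrary assumption, but it shows how the full row of scores never needs to exist at once:

```python
import numpy as np

def streaming_attention(q, K, V, block=128):
    """One query's attention output, scanning keys/values block by block.

    Same math as softmax(q·Kᵀ/√d_k)·V, but only a running max, running
    denominator, and running weighted sum are kept, rescaled as each block arrives.
    """
    d_k = q.shape[-1]
    m, denom = -np.inf, 0.0
    acc = np.zeros(V.shape[-1])
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q / np.sqrt(d_k)   # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                       # exp(-inf) = 0 on the first block
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum()
        acc = acc * scale + p @ V[start:start + block]
        m = m_new
    return acc / denom

rng = np.random.default_rng(0)
n, d_k = 1_000, 4
q, K, V = rng.normal(size=d_k), rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k))
scores = K @ q / np.sqrt(d_k)                           # reference: full, non-streamed version
w = np.exp(scores - scores.max()); w /= w.sum()
print(np.allclose(streaming_attention(q, K, V), w @ V))  # True: identical result
```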
§ · FURTHER READING
References & deeper sources
- Vaswani et al. (2017). Attention Is All You Need · NeurIPS
- Bahdanau et al. (2014). Neural Machine Translation by Jointly Learning to Align and Translate · ICLR
- Dao et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness · NeurIPS
- Voita et al. (2019). Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting · ACL
- Beltagy et al. (2020). Longformer: The Long-Document Transformer · arXiv
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.