Transformers, From First Principles
Strip away the diagrams, the residuals, the layer norms. What remains is one operation, repeated. The transformer is, at its core, a learnable way to ask: which of these things should I look at?
The five-bullet version
- A transformer replaces sequential recurrence with parallel attention — every token sees every other token at once.
- Each token produces a query, a key, and a value. All three are linear projections of the same embedding.
- The core operation is softmax(QKᵀ/√d) · V — a weighted sum where the weights come from how much each query matches each key.
- Multiple heads run this in parallel; different heads specialize on different patterns.
- Stack 12, 24, or 96 of these blocks with residuals and layer norms — that’s a modern LLM.
§ 00 · WHY ATTENTION WON
The architecture that ate everything
Before 2017, sequence modeling meant recurrent neural networks — LSTMs, GRUs — that read text one word at a time, maintaining a hidden state. They worked, but they were slow to train (you can’t parallelize what’s inherently sequential) and they forgot: long-range dependencies were a known weakness. The transformer, introduced in a paper called “Attention Is All You Need,” replaced the recurrence with something parallelizable: every token sees every other token, all at once.
Eight years later, the same architecture runs in your phone, in datacenter clusters with hundreds of thousands of GPUs, and quietly underneath every chatbot, image generator, and protein folder you’ve heard of. It is the most consequential architectural idea since the convolution.
§ 01 · THE Q, K, V TRINITY
Three projections of the same input
For each token, the transformer computes three things: a query, a key, and a value. All three are just linear projections of the token’s embedding — three different matrices, same input.
- Query: “what am I looking for?”
- Key: “what do I have to offer?”
- Value: “if you pick me, here’s what you get.”
Every token sends out its query and asks every other token’s key: do you match what I’m looking for? The match score is a dot product. Tokens whose keys score highly contribute their values to the output. Everything is differentiable, so the model learns by gradient descent which projections produce useful queries and keys.
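The three projections and the match score fit in a few lines of numpy — a minimal sketch, where the dimensions and random weights are purely illustrative, not from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_head = 4, 8, 8               # 4 tokens; toy sizes

X = rng.normal(size=(n, d_model))          # token embeddings
W_q = rng.normal(size=(d_model, d_head))   # three different matrices...
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

Q, K, V = X @ W_q, X @ W_k, X @ W_v        # ...same input
scores = Q @ K.T                           # every query dotted with every key
```

`scores[i, j]` is how well token i’s query matches token j’s key; the whole n×n grid comes out of one matrix multiply.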
Attention is a soft, differentiable lookup table.
— A view that has helped many students
§ 02 · ONE MATRIX MULTIPLY
Where the magic actually lives
The score between every query and every key, all at once, is one matrix multiply: Q · Kᵀ. The lab below shows it in three dimensions. Real models use 64 or 128 dimensions per head, but the operation is identical.
The One Equation: Q · Kᵀ
Attention is one matrix multiply. Click a cell on the right to see which row of Q and which row of K combined to make it. Drag any Q value to watch the result update.
After this multiply, you divide by √d (a scaling trick to keep gradients well-behaved), apply a softmax (so each row sums to 1, becoming a probability distribution over which tokens to attend to), and multiply by the value matrix. That whole thing is scaled dot-product attention: softmax(QKᵀ / √d) · V, the single most-used equation in modern AI. Every transformer everywhere is built from this.
```python
# Scaled dot-product attention, in PyTorch-ish pseudocode.
def attention(Q, K, V):
    scores = Q @ K.transpose(-2, -1)     # (n, n)
    scores = scores / sqrt(Q.shape[-1])  # scale
    weights = softmax(scores, dim=-1)    # (n, n), rows sum to 1
    return weights @ V                   # (n, d)
```
Real transformers don’t run one attention. They run many in parallel, each with its own learned Q, K, and V projections. These are attention heads, and the surprising empirical fact about them is that they specialize. One head in a trained model might track grammatical subjects; another might attend to the start of the sentence; a third might link pronouns to their referents.
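The head mechanism itself is simple to sketch. Here is a minimal runnable version, assuming toy sizes; real implementations do one big projection and a reshape instead of a Python loop, but the math is the same:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads, rng):
    """Split d_model across heads, attend independently, concatenate."""
    n, d_model = X.shape
    d_head = d_model // n_heads
    outputs = []
    for _ in range(n_heads):
        W_q = rng.normal(size=(d_model, d_head))   # each head has its own
        W_k = rng.normal(size=(d_model, d_head))   # learned projections
        W_v = rng.normal(size=(d_model, d_head))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        outputs.append(weights @ V)                # (n, d_head) per head
    return np.concatenate(outputs, axis=-1)        # back to (n, d_model)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                       # 5 tokens, d_model = 16
out = multi_head_attention(X, n_heads=4, rng=rng)
```

Because each head only gets d_model / n_heads dimensions, running many heads costs about the same as running one full-width attention.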
§ 03 · SAMPLING THE OUTPUT
From logits to language
After the final transformer layer, you have a vector per position. A linear projection turns each vector into logits — raw, unnormalized scores, one per token in the vocabulary. A softmax turns those logits into a probability distribution over what comes next.
Now you have to sample. You could pick the token with the highest probability (greedy), but that produces text that’s flat and repetitive. So we reshape the distribution with temperature and truncate its tail with top-p. Move the knobs in the lab below.
Temperature & Top-p, Live
Two knobs control how an LLM samples its next token. Temperature reshapes the distribution; top-p truncates its tail.
At temperature → 0 the model becomes deterministic. At temperature → ∞ it becomes uniform random. Most LLM products run somewhere between 0.5 and 1.0 with top-p around 0.9. The exact knobs matter more than people realize for the “feel” of a model.
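Both knobs fit in one small function. A sketch under toy assumptions (a 5-token vocabulary and made-up logits); production samplers work on tensors of thousands of logits, but the logic is identical:

```python
import numpy as np

def sample_next(logits, temperature=0.8, top_p=0.9, rng=None):
    """Temperature reshapes the distribution; top-p keeps the smallest
    set of tokens whose cumulative probability reaches top_p, then samples."""
    rng = rng or np.random.default_rng()
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # most probable first
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p)
    keep = order[: cutoff + 1]                       # survivors of the top-p cut
    kept = probs[keep] / probs[keep].sum()           # renormalize survivors
    return int(keep[rng.choice(len(keep), p=kept)])

logits = np.array([4.0, 3.5, 1.0, 0.2, -1.0])        # toy 5-token vocabulary
token = sample_next(logits, rng=np.random.default_rng(0))
```

Shrinking top_p toward 0 collapses the choice to the single most probable token — the greedy behavior described above.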
For a more direct view of what each knob does to the distribution itself — and to actually sample a token live — try the studio below. The faded bars show the original probabilities; the solid bars show what survives after your top-k and top-p cutoffs.
§ 04 · PUTTING IT TOGETHER
The full block, end to end
A transformer block is: attention, then a feed-forward network, with residual connections and layer norm around each. You stack 12, 32, or 96 of these. That’s the whole architecture. There are variants — encoder-only (BERT), decoder-only (GPT), encoder-decoder (T5), mixture-of-experts (which swaps the dense FFN for routed experts) — but the core block is identical.
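The wiring of the block is short enough to write out. This sketch uses pre-norm placement (the modern convention; the 2017 paper put the norms after each sublayer), and `toy_attn` / `toy_ffn` are stand-ins for the real sublayers described above:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def transformer_block(x, attn, ffn):
    """One block: normalize, transform, add back the residual. Twice."""
    x = x + attn(layer_norm(x))   # residual connection around attention
    x = x + ffn(layer_norm(x))    # residual connection around the FFN
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))      # 5 tokens, d_model = 16
toy_attn = lambda h: 0.5 * h      # stand-ins for the real sublayers
toy_ffn = lambda h: np.tanh(h)
out = transformer_block(x, toy_attn, toy_ffn)
```

The residuals are what make a 96-layer stack trainable: each block only has to learn a correction on top of what flows through unchanged.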
And to get from this block to a model that can write essays, write code, and reason about physics? You train it on a substantial fraction of the public internet, for months, on tens of thousands of accelerators. The architecture is the easy part now.
Train a Tiny Neural Network
A 2-layer network with 4 hidden units learns XOR — the toy problem that famously can’t be solved by a single layer. Press play and watch the loss curve.
| x₁ | x₂ | target |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
References & further reading
- Vaswani, A. et al. (2017). Attention Is All You Need. arXiv:1706.03762.
- Bahdanau, D., Cho, K. & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473.
- Elhage, N. et al. (2021). A Mathematical Framework for Transformer Circuits. Anthropic.
- Karpathy, A. (2023). Let’s build GPT: from scratch, in code, spelled out. YouTube / nanoGPT.
- Olsson, C. et al. (2022). In-context Learning and Induction Heads. Anthropic.
- Alammar, J. (2018). The Illustrated Transformer.
§ · GOING DEEPER
Modern transformer engineering
The 2017 transformer is almost a museum piece. Production LLMs ship with a handful of upgrades that nearly every frontier release has adopted:
- RoPE for position encoding (rotates Q and K vectors by position-dependent angles; generalizes to unseen lengths better than sinusoidal embeddings).
- GQA for KV-cache savings (shares keys and values across groups of attention heads).
- SwiGLU in the FFN (a gated linear unit variant that outperforms ReLU at scale).
- RMSNorm in place of LayerNorm (cheaper, no mean-centering, no measurable quality loss).
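The RMSNorm swap is the easiest to show concretely. A sketch of both norms side by side, with the learnable gain and bias omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Subtract the mean, divide by the standard deviation.
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def rms_norm(x, eps=1e-5):
    # No mean-centering: just divide by the root-mean-square of the features.
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

x = np.array([[1.0, 2.0, 3.0, 4.0]])
```

Dropping the mean subtraction removes one reduction per normalization — a real saving when the op runs twice per block across 96 blocks.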
Two more architectural threads: mixture-of-experts replaces the dense FFN with many specialized FFNs and a router that picks two-of-N per token — total params grow, per-token compute stays roughly flat. And the rise of long-context models depends as much on these engineering choices (RoPE interpolation, position scaling tricks like YaRN, attention sinks) as on raw architecture. Reading the Llama 2/3, Mistral, and DeepSeek-V3 technical reports back-to-back is the fastest way to learn what production transformers look like in 2026.
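The two-of-N routing idea can be sketched for a single token. Everything here is illustrative — toy experts, toy shapes, a plain softmax router — and real MoE layers add load-balancing losses and batched dispatch on top:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_ffn(x, experts, W_router, k=2):
    """Route one token through its top-k experts, mixing by router weight."""
    gate = softmax(x @ W_router)              # one score per expert
    top = np.argsort(gate)[-k:]               # pick two-of-N
    w = gate[top] / gate[top].sum()           # renormalize over chosen experts
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
make_expert = lambda W: (lambda x: np.tanh(x @ W))   # each expert: a tiny FFN
experts = [make_expert(rng.normal(size=(d, d))) for _ in range(n_experts)]
W_router = rng.normal(size=(d, n_experts))
out = moe_ffn(rng.normal(size=d), experts, W_router)
```

Only k of the N expert FFNs ever execute for a given token — which is exactly why total parameters can grow while per-token compute stays roughly flat.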
§ · FURTHER READING
References & deeper sources
- Vaswani, A. et al. (2017). Attention Is All You Need · NeurIPS
- Su, J. et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding (RoPE) · arXiv
- Ainslie, J. et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints · EMNLP
- Shazeer, N. (2020). GLU Variants Improve Transformer (SwiGLU) · arXiv
- Touvron, H. et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models · arXiv
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.