Architectures · Module 03 · 14 min read

Transformers, From First Principles

Strip away the diagrams, the residuals, the layer norms. What remains is one operation, repeated. The transformer is, at its core, a learnable way to ask: which of these things should I look at?

The five-bullet version

  • A transformer replaces sequential recurrence with parallel attention — every token sees every other token at once.
  • Each token produces a query, a key, and a value. All three are linear projections of the same embedding.
  • The core operation is softmax(QKᵀ/√d) · V — a weighted sum where the weights come from how much each query matches each key.
  • Multiple heads run this in parallel; different heads specialize on different patterns.
  • Stack 12, 24, or 96 of these blocks with residuals and layer norms — that’s a modern LLM.

§ 00 · WHY ATTENTION WON · The architecture that ate everything

Before 2017, sequence modeling meant recurrent neural networks — LSTMs, GRUs — that read text one word at a time, maintaining a hidden state. They worked, but they were slow to train (you can’t parallelize what’s inherently sequential) and they forgot. Long-range dependencies were a known weakness. The transformer, introduced in a paper called “Attention Is All You Need,” replaced the recurrence with something parallelizable: every token sees every other token, all at once.

Eight years later, the same architecture runs in your phone, in datacenter clusters with hundreds of thousands of GPUs, and quietly underneath every chatbot, image generator, and protein folder you’ve heard of. It is the most consequential architectural idea since the convolution.[1]

§ 01 · THE Q, K, V TRINITY · Three projections of the same input

For each token, the transformer computes three things: a query, a key, and a value. All three are just linear projections of the token’s embedding — three different matrices, same input.

Every token sends out its query and asks every other token’s key: do you match what I’m looking for? The match score is a dot product. Tokens whose keys score highly contribute their values to the output. Everything is differentiable, so the model learns by gradient descent which projections produce useful queries and keys.[2]
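In code, the trinity is literally three weight matrices applied to one tensor. A minimal sketch (the sizes and variable names here are illustrative, not from any particular model):

# Q, K, V: three linear projections of the same embeddings.
import torch
import torch.nn as nn

d = 64                              # embedding dimension (illustrative)
W_q = nn.Linear(d, d, bias=False)   # three different matrices...
W_k = nn.Linear(d, d, bias=False)
W_v = nn.Linear(d, d, bias=False)

x = torch.randn(10, d)              # ...same input: ten token embeddings
Q, K, V = W_q(x), W_k(x), W_v(x)    # queries, keys, values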

“Attention is a soft, differentiable lookup table.” — a view that has helped many students

§ 02 · ONE MATRIX MULTIPLY · Where the magic actually lives

The score between every query and every key, all at once, is one matrix multiply: Q · Kᵀ. The lab below shows it in three dimensions. Real models use 64 or 128 dimensions per head, but the operation is identical.

Lab 05 · Live

The One Equation: Q · Kᵀ

Attention is one matrix multiply. In the live lab, click a cell of the result to see which row of Q and which row of K combined to make it, and drag any Q value to watch the result update. A static snapshot, for the tokens “the”, “cat”, “sat” in three dimensions:

         Q · queries           Kᵀ · keys (transposed)     attention scores
  the  [ 0.80 0.20 0.10 ]      [ 0.90 0.20 0.30 ]         [ 0.76 0.33 0.36 ]
  cat  [ 0.10 0.70 0.30 ]   ×  [ 0.10 0.80 0.20 ]    =    [ 0.22 0.61 0.41 ]
  sat  [ 0.30 0.40 0.90 ]      [ 0.20 0.10 0.80 ]         [ 0.49 0.47 0.89 ]
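If you want to check the lab’s arithmetic yourself, the snapshot reproduces in a few lines of PyTorch (values copied straight from the lab):

# Reproduce the lab's 3x3 example: one matrix multiply.
import torch

Q = torch.tensor([[0.80, 0.20, 0.10],   # the
                  [0.10, 0.70, 0.30],   # cat
                  [0.30, 0.40, 0.90]])  # sat
K_T = torch.tensor([[0.90, 0.20, 0.30],
                    [0.10, 0.80, 0.20],
                    [0.20, 0.10, 0.80]])

print(Q @ K_T)  # [[0.76, 0.33, 0.36], [0.22, 0.61, 0.41], [0.49, 0.47, 0.89]], up to float rounding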

After this multiply, you divide by √d (a scaling trick to keep gradients well-behaved), apply a softmax (so each row sums to 1, becoming a probability distribution over which tokens to attend to), and multiply by the value matrix. That whole thing is scaled dot-product attention: softmax(QKᵀ/√d) · V, the single most-used equation in modern AI. Every transformer everywhere is built from this.

# Scaled dot-product attention, runnable PyTorch.
import math
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    scores = Q @ K.transpose(-2, -1)          # (n, n) raw query-key match scores
    scores = scores / math.sqrt(Q.shape[-1])  # scale by sqrt(d)
    weights = F.softmax(scores, dim=-1)       # (n, n), rows sum to 1
    return weights @ V                        # (n, d) weighted sum of values

Real transformers don’t run one attention. They run many in parallel, each with its own learned Q, K, and V projections. These are attention heads, and the surprising empirical fact about them is that they specialize. One head in a trained model might track grammatical subjects; another might attend to the start of the sentence; a third might link pronouns to their referents.
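Here is a sketch of how the heads fit together, reusing the attention() function above. Real implementations add batching, causal masking, and fused kernels, but the skeleton is this (names and shapes are illustrative):

# Multi-head attention: split d dims into h heads, attend in parallel, concat, project.
import torch

def multi_head(x, W_q, W_k, W_v, W_o, h):
    n, d = x.shape
    def split(t):                               # (n, d) -> (h, n, d // h)
        return t.view(n, h, d // h).transpose(0, 1)
    Q, K, V = split(W_q(x)), split(W_k(x)), split(W_v(x))
    out = attention(Q, K, V)                    # batched over heads: (h, n, d // h)
    out = out.transpose(0, 1).reshape(n, d)     # concatenate the heads back together
    return W_o(out)                             # final output projection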

LAB 05
Attention Pattern Inspector
Different heads learn different jobs. Click one to see its pattern light up. The example sentence is “The cat sat on the mat because it was tired”; rows of the attention map are queries, columns are keys.
Previous-token head: looks at the token immediately before.
Try: on the cat sentence, switch to “Subject-tracker” — watch the row for “it” light up the column for “cat”.

§ 03 · SAMPLING THE OUTPUT · From logits to language

After the final transformer layer, you have a vector per position. A linear projection turns each vector into logits: raw, unnormalized scores, one per token in the vocabulary. Apply a softmax and that’s a probability distribution over what comes next.

Now you have to sample. You could pick the token with the highest probability (greedy), but that produces text that’s flat and repetitive. So we reshape the distribution with temperature and truncate its tail with top-p. Move the knobs in the lab below.

Lab 07 · Live

Temperature & Top-p, Live

Two knobs control how an LLM samples its next token. Temperature reshapes the distribution; top-p truncates its tail. The lab starts from this next-token distribution:

  cat 38.5% · dog 28.5% · horse 12.8% · robot 9.5% · tree 5.2% · stone 2.6% · song 1.9% · dream 1.1%

Knobs: Temperature (default 1.00; low = greedy, high = chaotic) and Top-p, the nucleus cutoff (default 1.00; keep tokens until their cumulative probability reaches p).

At temperature → 0 the model becomes deterministic. At temperature → ∞ it becomes uniformly random. Most LLM products run somewhere between 0.5 and 1.0, with top-p around 0.9. The exact knobs matter more than people realize for the “feel” of a model.
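Both knobs are a few lines of code. A minimal sketch of the whole sampling step, taking raw logits in and returning a token index (the function name and defaults are mine, not a library API):

# Temperature + top-p (nucleus) sampling from raw logits.
import torch

def sample_next_token(logits, temperature=1.0, top_p=0.9):
    probs = torch.softmax(logits / temperature, dim=-1)     # temperature reshapes
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Keep the smallest set of tokens whose cumulative mass reaches top_p.
    cutoff = int((cumulative < top_p).sum().item()) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalize survivors
    choice = torch.multinomial(kept, num_samples=1)             # sample one token
    return int(sorted_idx[choice])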

For a more direct view of what each knob does to the distribution itself — and to actually sample a token live — try the studio below. The faded bars show the original probabilities; the solid bars show what survives after your top-k and top-p cutoffs.

LAB 02
Sampling Studio
Reshape the next-token distribution, then sample from it.
Prompt: “The cat sat on the ___” (default preset: balanced, no cutoff)
Starting distribution: mat 39.5% · couch 21.7% · floor 13.2% · chair 9.7% · windowsill 5.4% · rug 4.0% · table 2.7% · stairs 1.5% · chair_b 1.1% · fence 0.9% · rocket 0.3% · philosophy 0.1%
Try: set temperature to 0.1 — the distribution collapses to “mat”. Set it to 2.0 — even “rocket” has a real shot.

§ 04 · PUTTING IT TOGETHER · The full block, end to end

A transformer block is: attention, then a feed-forward network, with residual connections and layer norm around each. You stack 12, 32, 96 of these. That’s the whole architecture. There are variants — encoder-only (BERT), decoder-only (GPT), encoder-decoder (T5), mixture-of-experts (switches in the FFN) — but the core block is identical.

And to get from this block to a model that can write essays, write code, and reason about physics? You train it on a substantial fraction of the public internet, for months, on tens of thousands of accelerators. The architecture is the easy part now.

Input embeddings + positional encoding → LayerNorm → Multi-Head Self-Attention (softmax(QKᵀ/√d)·V, h heads in parallel) → + residual → LayerNorm → Feed-Forward Network → + residual → repeat × N layers
Fig 1 · One transformer block. Modern frontier LLMs stack ~80 of these, with rotary position embeddings, grouped-query attention, and SwiGLU FFNs as common upgrades.
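As code, Fig 1 is about a dozen lines. A minimal pre-LayerNorm block; the GELU activation and the 4× FFN expansion are common defaults, not requirements:

# One transformer block (pre-LayerNorm), matching Fig 1. No masking or dropout.
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d, h):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, h, batch_first=True)
        self.ln2 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        a = self.ln1(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]  # attention + residual
        x = x + self.ffn(self.ln2(x))                      # feed-forward + residual
        return x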
TRY IT · Open the train-a-net lab below. The XOR network is the smallest neural network worth training. The transformer is just this idea, scaled up by a factor of ten million.
Lab 06 · Live

Train a Tiny Neural Network

A 2-layer network with 4 hidden units learns XOR — the toy problem that famously can’t be solved by a single layer. Press play and watch the loss curve. The lab starts at loss 1.0000, epoch 0, with an adjustable learning rate (default 0.5) and this prediction table:

  x₁  x₂  target  predicted
  0   0   0       0.000
  0   1   1       0.000
  1   0   1       0.000
  1   1   0       0.000
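The lab’s setup fits in a screenful of PyTorch if you want to run it offline. The 2-4-1 shape and the 0.5 learning rate come from the lab; the tanh/sigmoid activations and the epoch count are my assumptions:

# Train a 2-4-1 network on XOR with plain SGD.
import torch
import torch.nn as nn

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

model = nn.Sequential(nn.Linear(2, 4), nn.Tanh(), nn.Linear(4, 1), nn.Sigmoid())
opt = torch.optim.SGD(model.parameters(), lr=0.5)
loss_fn = nn.BCELoss()

for epoch in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

print(model(X).detach().round().squeeze())  # usually tensor([0., 1., 1., 0.]); rerun if stuck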
You have read the transformer. You won’t have implemented one yet — that’s the next module — but the picture in your head should now be: queries, keys, values, one matmul, softmax, repeat.

References & further reading

  1. Vaswani, A. et al. (2017). Attention Is All You Need. arXiv:1706.03762.
  2. Bahdanau, D., Cho, K. & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473.
  3. Elhage, N. et al. (2021). A Mathematical Framework for Transformer Circuits. Anthropic.
  4. Karpathy, A. (2023). Let’s build GPT: from scratch, in code, spelled out. YouTube / nanoGPT.
  5. Olsson, C. et al. (2022). In-context Learning and Induction Heads. Anthropic.
  6. Alammar, J. (2018). The Illustrated Transformer.

§ · GOING DEEPER · Modern transformer engineering

The 2017 transformer is almost a museum piece. Production LLMs ship with a handful of upgrades that nearly every frontier release has adopted: RoPE for position encoding (rotates Q and K vectors by position-dependent angles, generalizes to unseen lengths better than sinusoidal), GQA for KV-cache savings (share keys/values across attention head groups), SwiGLU in the FFN (gated linear unit variant that outperforms ReLU at scale), and RMSNorm in place of LayerNorm (cheaper, no mean-centering, no measurable quality loss).
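Of the four, RMSNorm is the easiest to show in full. A minimal sketch (the eps value is a common default, not taken from any particular model):

# RMSNorm: scale by the root-mean-square, skip LayerNorm's mean-centering.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-dimension gain

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight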

Two more architectural threads: mixture-of-experts replaces the dense FFN with many specialized FFNs and a router that picks two-of-N per token — total params grow, per-token compute stays roughly flat. And the rise of long-context models depends as much on these engineering choices (RoPE interpolation, position scaling tricks like YaRN, attention sinks) as on raw architecture. Reading the Llama 2/3, Mistral, and DeepSeek-V3 technical reports back-to-back is the fastest way to learn what production transformers look like in 2026.
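A naive sketch of that top-2 routing, to make the “total params grow, per-token compute stays roughly flat” point concrete. Real MoE layers batch tokens by expert instead of looping, and usually renormalize the gate weights; all sizes here are hypothetical:

# Mixture-of-experts FFN: a router picks k of N experts per token.
import torch
import torch.nn as nn

class MoE(nn.Module):
    def __init__(self, d, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                         # x: (n, d) tokens
        gates = self.router(x).softmax(dim=-1)    # (n, n_experts) routing weights
        weights, idx = gates.topk(self.k, dim=-1) # top-k experts per token
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):               # naive per-token loop
            for w, e in zip(weights[t], idx[t]):
                out[t] += w * self.experts[int(e)](x[t])
        return out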

§ · FURTHER READING · References & deeper sources

  1. Vaswani et al. (2017). Attention Is All You Need · NeurIPS
  2. Su, Lu, Pan, Murtadha, Wen, Liu (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding (RoPE) · arXiv
  3. Ainslie et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints · EMNLP
  4. Shazeer (2020). GLU Variants Improve Transformer (SwiGLU) · arXiv
  5. Touvron et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models · arXiv

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.