One-Line Summary: The journey from every attention head having its own memory to groups sharing compressed memory — a relentless drive to make attention cheaper without making it dumber.
Prerequisites: 02-positional-encoding-evolution.md, 03-flash-attention-and-hardware-aware-computing.md
What Is the Attention Mechanism Evolution?
Imagine a classroom where every student takes their own complete set of notes for every lecture. That is Multi-Head Attention — thorough but expensive. Now imagine pairs of students sharing notes, then whole study groups sharing a single set, and finally the entire class compressing all notes into a slim cheat sheet that somehow captures everything. That is the trajectory of attention mechanisms from 2017 to 2024.
Attention is the core operation of the Transformer. It determines how each token in a sequence "looks at" every other token to decide what information to gather. The original Multi-Head Attention (MHA) formulation gave each head its own Query, Key, and Value projections — maximally expressive but maximally memory-hungry. Every subsequent innovation has been a carefully negotiated trade: give up some redundancy in exchange for dramatically lower memory and faster inference, while fighting to keep quality intact.
This evolution matters because attention's memory cost is the single biggest bottleneck for serving LLMs at scale. The KV cache — the stored Key and Value tensors from previous tokens — grows linearly with sequence length and model size. At 128K context with a 175B-parameter model, a naive KV cache runs to hundreds of gigabytes per request. The architectural innovations traced here are the field's answer to that crisis.
How It Works
Evolution of Attention: From MHA to MLA:

| | MHA (2017) | MQA (2019) | GQA (2023) | MLA (2024) |
| --- | --- | --- | --- | --- |
| Key/Value layout | Each head has its own K, V | All heads share one K, V | Each group of query heads shares one K, V | K, V compressed to a latent, then decompressed |
| KV cache size | 100% (baseline) | 1/H (e.g., 1/96) | 1/G (e.g., 1/8) | ~1/57 (DeepSeek V2) |

Moving left to right, memory savings increase, and so does implementation complexity.

Multi-Head Attention (MHA) — The Original (2017)
Vaswani et al.'s "Attention Is All You Need" introduced MHA with 8 heads, each computing its own Q, K, and V projections from the input. For a model with dimension d_model = 512 and 8 heads, each head operates on d_k = d_v = 64 dimensions. The computation is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V, performed independently per head, then concatenated and linearly projected. This design lets different heads learn different relationship patterns — one head might track syntactic dependencies while another tracks semantic similarity. MHA was used in the original Transformer, BERT (12 heads), GPT-2 (12-25 heads), GPT-3 (96 heads), and remains the baseline against which all variants are measured. The cost: KV cache scales as 2 * n_layers * n_heads * d_head * seq_len per request.
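To make the per-head computation concrete, here is a minimal single-sequence sketch in PyTorch, with no masking, batching, or caching; the weight names (w_q, w_k, w_v, w_o) are illustrative, not taken from any particular codebase.

```python
import torch.nn.functional as F

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """Standard MHA: every head gets its own K and V, so the full
    (n_heads, seq_len, d_head) K and V tensors must be cached per layer."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    # Project, then split into heads: (n_heads, seq_len, d_head)
    q = (x @ w_q).view(seq_len, n_heads, d_head).transpose(0, 1)
    k = (x @ w_k).view(seq_len, n_heads, d_head).transpose(0, 1)
    v = (x @ w_v).view(seq_len, n_heads, d_head).transpose(0, 1)
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, independently per head
    scores = q @ k.transpose(-2, -1) / d_head**0.5
    out = F.softmax(scores, dim=-1) @ v
    # Concatenate heads and apply the output projection
    return out.transpose(0, 1).reshape(seq_len, d_model) @ w_o
```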
Multi-Query Attention (MQA) — Sharing Everything (2019)
Noam Shazeer's 2019 paper proposed a radical simplification: keep multiple Query heads but share a single Key and single Value projection across all heads. In a 96-head model, this reduces KV cache by 96x. The intuition is that Keys and Values carry contextual information that is largely redundant across heads — it is the Queries that need diversity to ask different questions. MQA was adopted by PaLM (540B, 2022), Falcon (40B, 2023), and StarCoder. In practice, MQA achieves 1.5-2x faster inference with modest quality degradation (typically 0.5-1% on benchmarks). The trade-off was sometimes too aggressive: some tasks showed noticeable quality loss, motivating the next step.
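Under the same simplifying assumptions, MQA changes only the K and V projections, which now produce a single head of width d_head that is broadcast against every query head; the names remain illustrative.

```python
import torch.nn.functional as F

def multi_query_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """MQA: n_heads query heads share one K head and one V head,
    so only a (seq_len, d_head) tensor is cached for K and one for V."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    q = (x @ w_q).view(seq_len, n_heads, d_head).transpose(0, 1)  # (H, T, d_head)
    k = x @ w_k  # (T, d_head) -- w_k maps d_model -> d_head
    v = x @ w_v  # (T, d_head)
    # The single K/V pair is broadcast across all H query heads.
    scores = q @ k.transpose(0, 1) / d_head**0.5                  # (H, T, T)
    out = F.softmax(scores, dim=-1) @ v                           # (H, T, d_head)
    return out.transpose(0, 1).reshape(seq_len, d_model) @ w_o
```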
Grouped-Query Attention (GQA) — The Compromise (2023)
Ainslie et al. (2023) proposed GQA as the middle ground: instead of all heads sharing one KV pair (MQA) or each having its own (MHA), groups of heads share KV projections. With 32 heads and 8 KV groups, each group of 4 query heads shares one K and one V. This achieves most of MQA's memory savings (4-8x KV cache reduction) while recovering nearly all of MHA's quality. GQA became the de facto standard for open models: LLaMA 2 (70B, 8 KV heads for 64 query heads), LLaMA 3 (8B-405B), Mistral 7B, Gemma, and Qwen 2. The paper showed GQA-8 matches MHA quality while matching MQA speed — the rare win-win.
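A sketch of the grouped variant, with the same assumptions and illustrative names as above: only n_kv_heads Key/Value heads are computed and cached, and each one is reused by its group of query heads.

```python
import torch.nn.functional as F

def grouped_query_attention(x, w_q, w_k, w_v, w_o, n_heads, n_kv_heads):
    """GQA: each group of n_heads // n_kv_heads query heads shares one K/V head."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    group_size = n_heads // n_kv_heads
    q = (x @ w_q).view(seq_len, n_heads, d_head).transpose(0, 1)      # (H, T, d_head)
    # Only these two tensors go into the KV cache: (n_kv_heads, T, d_head)
    k = (x @ w_k).view(seq_len, n_kv_heads, d_head).transpose(0, 1)
    v = (x @ w_v).view(seq_len, n_kv_heads, d_head).transpose(0, 1)
    # Expand each KV head across its group of query heads at compute time
    k = k.repeat_interleave(group_size, dim=0)                        # (H, T, d_head)
    v = v.repeat_interleave(group_size, dim=0)
    scores = q @ k.transpose(-2, -1) / d_head**0.5
    out = F.softmax(scores, dim=-1) @ v
    return out.transpose(0, 1).reshape(seq_len, d_model) @ w_o
```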
Multi-head Latent Attention (MLA) — Compression (2024)
DeepSeek V2 introduced MLA, the most radical rethinking since the original. Instead of storing full K and V tensors, MLA compresses them into a low-dimensional latent vector via a learned down-projection. At inference, these latent vectors are up-projected back to full K and V (and in practice the up-projection matrices can be folded into the query and output projections, so full K and V are never materialized). The compression ratio is dramatic: DeepSeek V2 achieves a 57x reduction in KV cache compared to standard MHA. For DeepSeek V3 (671B parameters), this means serving long-context requests with a fraction of the memory that would otherwise be required. Because RoPE does not commute with this low-rank factorization, MLA uses a decoupled design: a small additional rotary key per token carries the positional information and is cached alongside the latent. The quality impact is minimal — DeepSeek V2/V3 match or exceed comparably-sized models with standard attention.
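A deliberately stripped-down sketch of the core idea, not DeepSeek's implementation: it omits query compression, the decoupled rotary key, and the matrix-absorption trick, and all parameter names are illustrative. The point is that only the small latent tensor needs to be cached.

```python
import torch.nn.functional as F

def latent_attention(x, w_q, w_down, w_up_k, w_up_v, w_o, n_heads):
    """MLA-style attention (simplified): cache a low-dimensional latent per token,
    then up-project it to full K and V when attention is computed."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    q = (x @ w_q).view(seq_len, n_heads, d_head).transpose(0, 1)
    # Down-projection: this (seq_len, d_latent) tensor is the entire KV cache,
    # with d_latent much smaller than n_heads * d_head.
    latent = x @ w_down
    # Up-project back to per-head K and V (real MLA folds these matrices into
    # the query and output projections instead of materializing K and V).
    k = (latent @ w_up_k).view(seq_len, n_heads, d_head).transpose(0, 1)
    v = (latent @ w_up_v).view(seq_len, n_heads, d_head).transpose(0, 1)
    scores = q @ k.transpose(-2, -1) / d_head**0.5
    out = F.softmax(scores, dim=-1) @ v
    return out.transpose(0, 1).reshape(seq_len, d_model) @ w_o
```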
Why It Matters
Enabling Long-Context Inference
Without attention efficiency innovations, 128K-token contexts would be impractical for production serving. With standard MHA, a 70B-class model (80 layers, 64 heads of dimension 128, fp16) needs roughly 340 GB of KV cache for a single 128K-token request. GQA-8 cuts that to about 43 GB, and MLA-style 57x compression brings it down to roughly 6 GB. This is what makes very long, and eventually million-token, contexts feasible.
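The arithmetic behind these figures is simple enough to check directly; the snippet below assumes a LLaMA-2-70B-like shape (80 layers, 64 heads of dimension 128) stored in fp16, as an illustrative configuration rather than any particular serving setup.

```python
def kv_cache_gb(n_layers, n_kv_heads, d_head, seq_len, bytes_per_elem=2):
    """KV cache size in GB for one sequence: K and V, across all layers and cached heads."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_elem / 1e9

SEQ = 128 * 1024  # 128K-token context
print(kv_cache_gb(80, 64, 128, SEQ))       # MHA, 64 KV heads:  ~343.6 GB
print(kv_cache_gb(80, 8, 128, SEQ))        # GQA-8, 8 KV heads: ~43.0 GB
print(kv_cache_gb(80, 64, 128, SEQ) / 57)  # MLA-style 57x compression: ~6.0 GB
```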
Democratizing Deployment
GQA and MLA make it possible to serve large models on fewer GPUs. A model that would require 8 GPUs for KV cache alone under MHA might fit on 2 with GQA. This directly translates to lower serving costs — the difference between a viable product and an unaffordable one.
The Design Pattern: Graceful Degradation
The attention evolution established a broader principle in LLM architecture: find the redundancy, share it, compress it, and verify quality holds. This same pattern appears in quantization, pruning, and distillation. Attention was the proving ground.
Key Technical Details
- MHA (2017): Each head has independent Q, K, V. KV cache = 2 * L * H * d_head * seq_len. Used in GPT-3 (96 heads, 128 dim each).
- MQA (2019): All heads share 1 K, 1 V. KV cache reduced by factor of H (e.g., 96x for PaLM). ~0.5-1% quality loss.
- GQA (2023): G groups of heads share K, V. Typical config: 8 KV groups for 32-64 query heads. 4-8x cache reduction.
- MLA (2024): Compress KV to latent vector of dimension c << d_model. DeepSeek V2 achieves 57x cache reduction.
- LLaMA 2 70B: 64 query heads, 8 KV heads (GQA-8). LLaMA 3 continues GQA across all sizes.
- Mistral 7B: 32 query heads, 8 KV heads (GQA-8). Sliding window attention further reduces memory.
- DeepSeek V3 (671B): MLA with a compact KV latent dimension, enabling very long contexts at a fraction of MHA's memory cost.
- The evolution happened in just 7 years (2017-2024), driven primarily by inference cost pressure at scale.
Common Misconceptions
- "MQA is strictly worse than MHA." MQA trades minimal quality (often <1% on benchmarks) for massive inference speedup. For many applications, the trade-off is clearly worth it, and GQA recovers nearly all quality anyway.
- "These attention variants change how the model learns." The quality differences are primarily visible at inference time through KV cache efficiency. Training compute is similar — the gains are almost entirely about serving cost and latency.
- "Newer always means better." MHA is still appropriate for small models where KV cache is not a bottleneck. GQA is overkill for a 125M-parameter model. The right choice depends on model size and deployment constraints.
- "MLA makes GQA obsolete." MLA is more complex to implement and has been validated primarily on DeepSeek's architecture. GQA remains the safe, well-understood default for most teams. MLA may become standard, but adoption takes time.
Connections to Other Concepts
- 06-kv-cache-and-serving-optimization.md — KV cache is the primary motivation for these innovations; PagedAttention and vLLM manage whatever cache remains after architectural compression
- 03-flash-attention-and-hardware-aware-computing.md — the complementary hardware-level optimization; FlashAttention makes any attention variant faster, while GQA/MLA reduce what needs to be stored
- 02-positional-encoding-evolution.md — MLA has to handle RoPE through a decoupled rotary key, showing how architectural innovations interact and co-evolve
- 07-long-context-techniques.md — the direct beneficiary; without efficient attention, million-token contexts are economically infeasible
- 04-mixture-of-experts-evolution.md — a parallel efficiency strategy; MoE reduces compute per token while attention variants reduce memory per token
- 01-deepseek-v2-and-mla.md — the full story of how MLA was developed and its impact on the DeepSeek model family
Further Reading
- Vaswani et al., "Attention Is All You Need" (2017) — the foundational paper introducing Multi-Head Attention.
- Shazeer, "Fast Transformer Decoding: One Write-Head is All You Need" (2019) — the Multi-Query Attention proposal.
- Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (2023) — GQA as the practical middle ground.
- DeepSeek-AI, "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model" (2024) — introduces Multi-head Latent Attention.
- Pope et al., "Efficiently Scaling Transformer Inference" (2022) — comprehensive analysis of attention variants for serving.