Core Concepts · Module 10 · 9 min read

KV Cache

An LLM emits one token at a time. Without a small but crucial trick, every token would cost more than the last — by the thousandth word you’d be paying a thousand times more compute. The KV cache is why ChatGPT doesn’t crawl to a halt mid-paragraph.

The five-bullet version

  • LLMs generate one token at a time — autoregressively — each new token attending over every previous token.
  • Naively, each new token would recompute attention over the entire context: O(n²) per step, O(n³) per sequence.
  • The keys and values of past tokens never change once computed. Cache them.
  • With caching, each new token costs O(n) — only the new K and V are computed; the rest are looked up.
  • The cache itself takes serious memory: GB-scale for long contexts. Compression and reuse strategies are an active engineering area.

§ 00 · THE AUTOREGRESSIVE COST TRAP
Why one token at a time is dangerous

An LLM emits tokens one at a time. After it generates token t, that token gets appended to the context, and the model runs again to produce token t + 1. To produce a 200-token answer, the model runs 200 times.

That’s a problem, because each run involves attention, and attention (the transformer’s core operation: each token computes a weighted sum over every other token in the context, with weights determined by query-key dot products; see the Attention lesson for the full mechanics) is quadratic in sequence length. The math: at decode step t, naive decoding computes attention with t queries against t keys — roughly t² operations. Across the full generation: 1² + 2² + 3² + … + 200² ≈ 2.7 million operations, just for the attention step in a single layer. With 32 layers and 1,000 tokens of context, you’re burning compute at a rate that scales cubically with output length.

For a 4,000-token generation, naive autoregression would be orders of magnitude slower than necessary. ChatGPT doesn’t do this. Neither does any production LLM. The reason is a small observation that turns out to be enormous.
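To see the blow-up concretely, here is a back-of-the-envelope counter in Python (a sketch, not a benchmark: it counts query-key dot products in one attention layer and ignores everything else):

def naive_attention_ops(n_tokens):
    # Without a cache, step t re-runs attention with t queries against t keys.
    return sum(t * t for t in range(1, n_tokens + 1))

def cached_attention_ops(n_tokens):
    # With a cache, step t computes one new query against t cached keys.
    return sum(t for t in range(1, n_tokens + 1))

print(naive_attention_ops(200))    # 2,686,700  (~2.7 million)
print(cached_attention_ops(200))   # 20,100     (~130x less)

At 4,000 tokens the gap grows to roughly 2,700:1 per layer, which is where the "orders of magnitude" below comes from.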

§ 01 · WHAT GETS RECOMPUTED (AND WHY)
Spotting the redundancy

Recall how attention works at one position. For a token at position t with embedding xₜ, you compute three projections of that embedding:

qₜ = W_Q xₜ (query)        kₜ = W_K xₜ (key)        vₜ = W_V xₜ (value)

Then attention computes softmax(qₜ · Kᵀ / √d) · V, where K and V are matrices of all the keys and values up to and including position t.

Here’s the key observation. At decode step t + 1, the new token contributes only three new vectors: qₜ₊₁, kₜ₊₁, and vₜ₊₁, all computed from one new embedding.

Every previous key and value is unchanged. k₁, k₂, …, kₜ are the keys for tokens that already exist. They were computed from those tokens’ embeddings, which are fixed. Nothing about position t + 1 changes them.

And yet, naive autoregression recomputes them every step. That’s the redundancy. Cache them once and reuse.

§ 02 · CACHE K AND V, SKIP RECOMPUTATION
The trick in one sentence

The KV cache is a per-layer memory of all the keys and values computed during the current generation, indexed by token position; it lets autoregressive decoding skip recomputing past tokens' attention contributions. At every decode step, the model:

  1. Computes q, k, v for just the newest token.
  2. Appends the new k and v to the cache.
  3. Uses the full cached K and V matrices plus the new q for attention.
  4. Generates one token. Repeat.

Cost per step drops from O(n²) to O(n) — a single dot product against a cached matrix instead of recomputing the matrix from scratch.
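Here is a minimal sketch of that loop in NumPy — one layer, one head, random weights, every name hypothetical. It exists only to show where the cache lives and what gets appended each step, not how a real inference engine is written.

import numpy as np

d = 64                                    # head dimension (illustrative)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))

K_cache, V_cache = [], []                 # the KV cache: one row per token seen so far

def attend(q, K, V):
    # Scaled dot-product attention for a single query vector q against cached K, V.
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def decode_step(x):
    # x is the embedding of the newest token only; nothing older is recomputed.
    q, k, v = W_q @ x, W_k @ x, W_v @ x   # step 1: project just the new token
    K_cache.append(k)                     # step 2: append the new k and v to the cache
    V_cache.append(v)
    return attend(q, np.stack(K_cache), np.stack(V_cache))  # step 3: attend over the cache

for _ in range(5):                        # step 4: one token per iteration, repeat
    out = decode_step(rng.standard_normal(d))

A real model feeds out through the rest of the layer stack, samples the next token, embeds it, and runs the equivalent of decode_step again; the cache is kept per layer and per head, but the append-and-reuse pattern is exactly this.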

Lab · autoregressive decoding
Step through generation · toggle KV cache on/off
[Interactive widget: a decode-step counter with a KV-cache toggle and a per-step work readout, over key/value rows (one row per token in context); cached prompt rows p1, p2, p3 are shown in gray.]

With cache: only the new token’s K and V are computed; old rows are kept in memory and reused. Without cache: every step redoes the full prefix from scratch. By step 8 you’ve done 1 + 2 + 3 + … + 8 = 36 tokens’ worth of work without the cache, versus 8 with it.

Switch the cache off and watch every step turn orange: the model is recomputing everything. Switch it on and watch the prefix go gray (cached) with only the new step in accent, a single token of work no matter how long the context.

§ 03 · WHAT THE CACHE ACTUALLY COSTS
Memory, not compute

The deal you’ve struck is to trade memory for compute: spend memory so you never recompute. Now the question: how much memory?

Per layer, you store one key matrix and one value matrix, each of shape [context_length × num_heads × head_dim]. For a typical modern LLM (32 layers, 32 heads, head dim 128, bf16):

per layer per token = 2 (K and V) × 32 heads × 128 (head dim) × 2 bytes (bf16)
                    = 16,384 bytes ≈ 16 KB per token per layer
total per token     = 16 KB × 32 layers = 512 KB
8k context          = 8000 × 512 KB ≈ 4 GB
32k context         = 32000 × 512 KB ≈ 16 GB
128k context        = ≈ 65 GB

That’s per-request. A model that serves 100 concurrent users with 32k context each is staring at 1.6 TB of KV cache, separate from the model weights. KV cache memory dominates inference cost for any reasonable batch size and context length.
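The same arithmetic as a throwaway helper; the defaults mirror the hypothetical 32-layer model above, so swap in your own model’s config (in particular n_kv_heads, which is what the variants in the next section shrink):

def kv_cache_gb(context_len, n_layers=32, n_kv_heads=32, head_dim=128,
                bytes_per_elem=2, batch_size=1):
    # 2x for K and V; bytes_per_elem=2 assumes bf16/fp16 cache entries.
    per_token = 2 * n_kv_heads * head_dim * bytes_per_elem * n_layers
    return context_len * per_token * batch_size / 1e9

print(kv_cache_gb(8_000))                    # ~4.2 GB per request (the ≈4 GB above)
print(kv_cache_gb(32_000))                   # ~16.8 GB per request
print(kv_cache_gb(32_000, batch_size=100))   # ~1,678 GB, the ~1.6 TB figure give or take rounding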

Three implications, all visible in modern inference engines: compress the cache, share it across requests, and evict what no longer pays for itself. The next section takes each in turn.

§ 04 · MODERN VARIANTS
Compression, sharing, eviction

Most of the post-2023 engineering around inference is about reducing KV cache cost. Fig 1 compares the big themes by how much cache they leave behind:

Fig 1 · Relative KV cache footprint for the same model architecture at the same context length, under different attention variants:

  • MHA (baseline): 100%
  • GQA (groups = 4): 25%
  • MLA (DeepSeek): 10%
  • MQA (single KV head): 3%
  • Sliding window: shown at ~5% for scale, but really a constant size rather than a percentage of context

Modern frontier models almost universally use GQA or MLA.

For a working engineer: when you see “Llama 3.1 70B with 128k context” advertised, every bit of that is GQA + clever cache management. Without those, the same model would need 4–8× the VRAM for the same context length.
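To put numbers on that, reusing the hypothetical kv_cache_gb helper sketched in § 03 (group counts are illustrative, not any specific model’s config):

print(kv_cache_gb(128_000))                  # MHA, 32 KV heads:   ~67 GB
print(kv_cache_gb(128_000, n_kv_heads=8))    # GQA, groups of 4:   ~17 GB
print(kv_cache_gb(128_000, n_kv_heads=1))    # MQA, one shared KV: ~2.1 GB

Only the K/V heads are cached; at attention time each cached head is broadcast to its whole group of query heads (the step often called repeat_kv in open-source implementations), so the compute stays roughly the same while the cache shrinks.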

CHECK
You serve a chat model and notice latency growing linearly with conversation length. What's the most likely cause?

§ 05 · TAKING THIS FORWARD
Where to look next

Two practical follow-ups sit just below: the Going Deeper note on the MHA → GQA → MLA cache-size arms race, and the Further Reading list of primary sources.

The deeper story: as context windows have grown from 2k to 128k to 1M and beyond, the engineering challenge has shifted from “can we run this model at all” to “can we serve this model at scale with reasonable cost”. The KV cache is at the center of that shift. Every order-of-magnitude reduction in cache footprint directly translates to either lower cost per request or longer contexts at the same cost.

§ · GOING DEEPER
From MHA to GQA to MLA — the cache-size arms race

The KV cache stores the keys and values from every prior token so the next-token forward pass doesn’t recompute them. For a 70B model with full multi-head attention at 32k context, the cache is on the order of tens of GB per request — often the dominant cost on the serving GPU. Three architecture moves attack this.

MQA (Shazeer 2019) shares one K/V projection across all heads — minimal cache, mild quality regression. GQA (Ainslie et al. 2023) is the compromise every major open model uses: group the heads, share K/V within each group, recover most of MHA’s quality at a fraction of the memory. MLA (DeepSeek-V2 2024) uses a low-rank latent projection that compresses K/V further and reportedly outperforms GQA at the same memory budget. On the serving side, PagedAttention (vLLM, Kwon et al. 2023) treats the KV cache as a paged virtual address space — it eliminates memory fragmentation and enables prefix sharing.
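As a flavor of the PagedAttention idea, here is a toy block-table allocator; the class, names, and sizes are made up for illustration, and the real vLLM machinery (GPU kernels, copy-on-write, scheduling) is far more involved:

class PagedKVCache:
    def __init__(self, num_blocks, block_tokens=16):
        self.block_tokens = block_tokens
        self.free = list(range(num_blocks))   # pool of physical KV blocks
        self.tables = {}                      # seq_id -> list of physical block ids
        self.lengths = {}                     # seq_id -> tokens written so far

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_tokens == 0:        # last block is full: grab exactly one more
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1          # the new K/V rows land in tables[seq_id][-1]

    def release(self, seq_id):                # request finished: blocks return to the pool
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=1024)
for _ in range(40):                           # 40 tokens -> ceil(40 / 16) = 3 blocks
    cache.append_token("req-0")
print(cache.tables["req-0"])                  # three physical block ids; non-contiguous is fine

Because a request only ever holds the blocks it has actually filled, there is no per-request over-reservation, and two requests that share a prefix (say, the same system prompt) can point their block tables at the same physical blocks until they diverge.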

§ · FURTHER READING
References & deeper sources

  1. Shazeer (2019). Fast Transformer Decoding: One Write-Head is All You Need (MQA) · arXiv
  2. Ainslie et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints · EMNLP
  3. DeepSeek-AI (2024). DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (MLA) · arXiv
  4. Kwon et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM) · SOSP
  5. Dao (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning · arXiv

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.