Core Concepts · Module 10 · 9 min read

KV Cache

An LLM emits one token at a time. Without a small but crucial trick, every token would cost more than the last — by the thousandth word you’d be paying a thousand times more compute. The KV cache is why ChatGPT doesn’t crawl to a halt mid-paragraph.

The five-bullet version

  • LLMs generate one token at a time — autoregressively — each new token attending over every previous token.
  • Naively, each new token would recompute attention over the entire context: O(n²) per step, O(n³) per sequence.
  • The keys and values of past tokens never change once computed. Cache them.
  • With caching, each new token costs O(n) — only the new K and V are computed; the rest are looked up.
  • The cache itself takes serious memory: GB-scale for long contexts. Compression and reuse strategies are an active engineering area.

§ 00 · THE AUTOREGRESSIVE COST TRAP
Why one token at a time is dangerous

An LLM emits tokens one at a time. After it generates token t, that token gets appended to the context, and the model runs again to produce token t + 1. To produce a 200-token answer, the model runs 200 times.

That’s a problem, because each run involves attention, and attention (the transformer’s core operation: each token computes a weighted sum over every other token in the context, with weights determined by query-key dot products; see the Attention lesson for the full mechanics) is quadratic in sequence length. The math: at decode step t, naive decoding computes attention with t queries against t keys — roughly t² operations. Across the full generation: 1² + 2² + 3² + … + 200² ≈ 2.7 million operations, just for the attention step in a single layer. With 32 layers and 1,000 tokens of context, you’re burning compute at a rate that scales cubically with output length.

For a 4,000-token generation, naive autoregression would be orders of magnitude slower than necessary. ChatGPT doesn’t do this. Neither does any production LLM. The reason is a small observation that turns out to be enormous.
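To see the blow-up concretely, here is a back-of-the-envelope counter in Python (a sketch, not a benchmark: it counts query-key dot products in one attention layer and ignores everything else):

def naive_attention_ops(n_tokens):
    # Without a cache, step t re-runs attention with t queries against t keys.
    return sum(t * t for t in range(1, n_tokens + 1))

def cached_attention_ops(n_tokens):
    # With a cache, step t computes one new query against t cached keys.
    return sum(t for t in range(1, n_tokens + 1))

print(naive_attention_ops(200))    # 2,686,700  (~2.7 million)
print(cached_attention_ops(200))   # 20,100     (~130x less)

At 4,000 tokens the gap grows to roughly 2,700:1 per layer, which is where the "orders of magnitude" below comes from.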

§ 01 · WHAT GETS RECOMPUTED (AND WHY)
Spotting the redundancy

Recall how attention works at one position. For a token at position t with embedding xₜ, you compute three projections of that embedding:

qₜ = W_Q xₜ (query)        kₜ = W_K xₜ (key)        vₜ = W_V xₜ (value)

Then attention computes softmax(qₜ · Kᵀ / √d) · V, where K and V are matrices of all the keys and values up to and including position t.

Here’s the key observation. At decode step t + 1, the new token contributes only three new vectors: qₜ₊₁, kₜ₊₁, and vₜ₊₁, all computed from one new embedding.

Every previous key and value is unchanged. k₁, k₂, …, kₜ are the keys for tokens that already exist. They were computed from those tokens’ embeddings, which are fixed. Nothing about position t + 1 changes them.

And yet, naive autoregression recomputes them every step. That’s the redundancy. Cache them once and reuse.

§ 02 · CACHE K AND V, SKIP RECOMPUTATION
The trick in one sentence

The KV cache is a per-layer memory of all the keys and values computed during the current generation, indexed by token position; it lets autoregressive decoding skip recomputing past tokens' attention contributions. At every decode step, the model:

  1. Computes q, k, v for just the newest token.
  2. Appends the new k and v to the cache.
  3. Uses the full cached K and V matrices plus the new q for attention.
  4. Generates one token. Repeat.

Cost per step drops from O(n²) to O(n) — a single dot product against a cached matrix instead of recomputing the matrix from scratch.
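Here is a minimal sketch of that loop in NumPy — one layer, one head, random weights, every name hypothetical. It exists only to show where the cache lives and what gets appended each step, not how a real inference engine is written.

import numpy as np

d = 64                                    # head dimension (illustrative)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))

K_cache, V_cache = [], []                 # the KV cache: one row per token seen so far

def attend(q, K, V):
    # Scaled dot-product attention for a single query vector q against cached K, V.
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def decode_step(x):
    # x is the embedding of the newest token only; nothing older is recomputed.
    q, k, v = W_q @ x, W_k @ x, W_v @ x   # step 1: project just the new token
    K_cache.append(k)                     # step 2: append the new k and v to the cache
    V_cache.append(v)
    return attend(q, np.stack(K_cache), np.stack(V_cache))  # step 3: attend over the cache

for _ in range(5):                        # step 4: one token per iteration, repeat
    out = decode_step(rng.standard_normal(d))

A real model feeds out through the rest of the layer stack, samples the next token, embeds it, and runs the equivalent of decode_step again; the cache is kept per layer and per head, but the append-and-reuse pattern is exactly this.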

Lab · autoregressive decoding
Step through generation · toggle KV cache on/off
[Interactive widget: a decode-step counter with a KV-cache toggle and a per-step work readout, over key/value rows (one row per token in context); cached prompt rows p1, p2, p3 are shown in gray.]

With cache: only the new token’s K and V are computed; old rows are kept in memory and reused. Without cache: every step redoes the full prefix from scratch. By step 8 you’ve done 1 + 2 + 3 + … + 8 = 36 tokens’ worth of work without the cache, versus 8 with it.

Switch the cache off and watch every step turn orange: the model is recomputing everything. Switch it on and watch the prefix go gray (cached) with only the new step in accent, a single token of work no matter how long the context.

§ 03 · WHAT THE CACHE ACTUALLY COSTS
Memory, not compute

The deal you’ve struck is to trade memory for compute: spend memory so you never recompute. Now the question: how much memory?

Per layer, you store one key matrix and one value matrix, each of shape [context_length × num_heads × head_dim]. For a typical modern LLM (32 layers, 32 heads, head dim 128, bf16):

per layer per token = 2 (K and V) × 32 heads × 128 (head dim) × 2 bytes (bf16)
                    = 16,384 bytes ≈ 16 KB per token per layer
total per token     = 16 KB × 32 layers = 512 KB
8k context          = 8000 × 512 KB ≈ 4 GB
32k context         = 32000 × 512 KB ≈ 16 GB
128k context        = ≈ 65 GB

That’s per-request. A model that serves 100 concurrent users with 32k context each is staring at 1.6 TB of KV cache, separate from the model weights. KV cache memory dominates inference cost for any reasonable batch size and context length.
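The same arithmetic as a throwaway helper; the defaults mirror the hypothetical 32-layer model above, so swap in your own model’s config (in particular n_kv_heads, which is what the variants in the next section shrink):

def kv_cache_gb(context_len, n_layers=32, n_kv_heads=32, head_dim=128,
                bytes_per_elem=2, batch_size=1):
    # 2x for K and V; bytes_per_elem=2 assumes bf16/fp16 cache entries.
    per_token = 2 * n_kv_heads * head_dim * bytes_per_elem * n_layers
    return context_len * per_token * batch_size / 1e9

print(kv_cache_gb(8_000))                    # ~4.2 GB per request (the ≈4 GB above)
print(kv_cache_gb(32_000))                   # ~16.8 GB per request
print(kv_cache_gb(32_000, batch_size=100))   # ~1,678 GB, the ~1.6 TB figure give or take rounding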

Three implications, all visible in modern inference engines: compress the cache, share it across requests, and evict what no longer pays for itself. The next section takes each in turn.

§ 04 · MODERN VARIANTS
Compression, sharing, eviction

Most of the post-2023 engineering around inference is about reducing KV cache cost. Fig 1 compares the big themes by how much cache they leave behind:

Fig 1 · Relative KV cache footprint for the same model architecture at the same context length, under different attention variants:

  • MHA (baseline): 100%
  • GQA (groups = 4): 25%
  • MLA (DeepSeek): 10%
  • MQA (single KV head): 3%
  • Sliding window: shown at ~5% for scale, but really a constant size rather than a percentage of context

Modern frontier models almost universally use GQA or MLA.

For a working engineer: when you see “Llama 3.1 70B with 128k context” advertised, every bit of that is GQA + clever cache management. Without those, the same model would need 4–8× the VRAM for the same context length.
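To put numbers on that, reusing the hypothetical kv_cache_gb helper sketched in § 03 (group counts are illustrative, not any specific model’s config):

print(kv_cache_gb(128_000))                  # MHA, 32 KV heads:   ~67 GB
print(kv_cache_gb(128_000, n_kv_heads=8))    # GQA, groups of 4:   ~17 GB
print(kv_cache_gb(128_000, n_kv_heads=1))    # MQA, one shared KV: ~2.1 GB

Only the K/V heads are cached; at attention time each cached head is broadcast to its whole group of query heads (the step often called repeat_kv in open-source implementations), so the compute stays roughly the same while the cache shrinks.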

CHECK
You serve a chat model and notice latency growing linearly with conversation length. What's the most likely cause?

§ 05 · TAKING THIS FORWARD
Where to look next

Two practical follow-ups sit just below: the Going Deeper note on the MHA → GQA → MLA cache-size arms race, and the Further Reading list of primary sources.

The deeper story: as context windows have grown from 2k to 128k to 1M and beyond, the engineering challenge has shifted from “can we run this model at all” to “can we serve this model at scale with reasonable cost”. The KV cache is at the center of that shift. Every order-of-magnitude reduction in cache footprint directly translates to either lower cost per request or longer contexts at the same cost.

§ · GOING DEEPER
From MHA to GQA to MLA — the cache-size arms race

The KV cache stores the keys and values from every prior token so the next-token forward pass doesn’t recompute them. For a 70B model with full multi-head attention at 32k context, the cache is on the order of tens of GB per request — often the dominant cost on the serving GPU. Three architecture moves attack this.

MQA (Shazeer 2019) shares one K/V projection across all heads — minimal cache, mild quality regression. GQA (Ainslie et al. 2023) is the compromise every major open model uses: group the heads, share K/V within each group, recover most of MHA’s quality at a fraction of the memory. MLA (DeepSeek-V2 2024) uses a low-rank latent projection that compresses K/V further and reportedly outperforms GQA at the same memory budget. On the serving side, PagedAttention (vLLM, Kwon et al. 2023) treats the KV cache as a paged virtual address space — it eliminates memory fragmentation and enables prefix sharing.
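As a flavor of the PagedAttention idea, here is a toy block-table allocator; the class, names, and sizes are made up for illustration, and the real vLLM machinery (GPU kernels, copy-on-write, scheduling) is far more involved:

class PagedKVCache:
    def __init__(self, num_blocks, block_tokens=16):
        self.block_tokens = block_tokens
        self.free = list(range(num_blocks))   # pool of physical KV blocks
        self.tables = {}                      # seq_id -> list of physical block ids
        self.lengths = {}                     # seq_id -> tokens written so far

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_tokens == 0:        # last block is full: grab exactly one more
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1          # the new K/V rows land in tables[seq_id][-1]

    def release(self, seq_id):                # request finished: blocks return to the pool
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=1024)
for _ in range(40):                           # 40 tokens -> ceil(40 / 16) = 3 blocks
    cache.append_token("req-0")
print(cache.tables["req-0"])                  # three physical block ids; non-contiguous is fine

Because a request only ever holds the blocks it has actually filled, there is no per-request over-reservation, and two requests that share a prefix (say, the same system prompt) can point their block tables at the same physical blocks until they diverge.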

§ · FURTHER READING
References & deeper sources

  1. Shazeer (2019). Fast Transformer Decoding: One Write-Head is All You Need (MQA) · arXiv
  2. Ainslie et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints · EMNLP
  3. DeepSeek-AI (2024). DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (MLA) · arXiv
  4. Kwon et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM) · SOSP
  5. Dao (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning · arXiv

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.