One-Line Summary: How the field borrowed operating system concepts — virtual memory, paging, demand allocation — to solve the memory crisis of storing every token's past for every concurrent request.
Prerequisites: 01-attention-mechanism-evolution.md, 03-flash-attention-and-hardware-aware-computing.md
What Is the KV Cache Problem?
Imagine you are a librarian. Every visitor to the library opens several books and leaves them spread across a desk — they need to reference earlier pages as they read. With one visitor, this is fine. With a hundred visitors simultaneously, every desk is covered, the library runs out of space, and half the visitors are turned away. Now imagine the books are getting longer every year. That is the KV cache problem in LLM serving.
When a Transformer generates text, each layer must attend to all previous tokens. Rather than recomputing the Key and Value projections for every past token at every generation step, models cache these tensors: the KV cache. This trades an O(N^2) recomputation (O(N) projection work redone at each of N steps) for an O(N) store-and-lookup, but the memory cost is substantial: for each request, the KV cache grows linearly with sequence length and scales with the number of layers, KV heads, and head dimension. For a 175B-parameter model processing a 128K-token sequence, the KV cache alone can consume tens of gigabytes.
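To make the scaling concrete, here is a minimal sizing sketch (the helper is hypothetical; the hyperparameters are LLaMA-2-70B-style values used throughout this note: 80 layers, 8 KV heads under GQA, head dimension 128, FP16 storage):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """KV cache size for one request: 2 tensors (K and V) per layer, each holding
    seq_len * num_kv_heads * head_dim elements at dtype_bytes per element."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# LLaMA-2-70B-style configuration: 80 layers, GQA with 8 KV heads, head_dim 128
per_token = kv_cache_bytes(80, 8, 128, seq_len=1)           # ~320 KiB per token
at_4k     = kv_cache_bytes(80, 8, 128, seq_len=4096)        # ~1.3 GB per request
at_128k   = kv_cache_bytes(80, 8, 128, seq_len=128 * 1024)  # ~43 GB per request

print(f"per token: {per_token / 2**10:.0f} KiB,  "
      f"4K tokens: {at_4k / 1e9:.2f} GB,  128K tokens: {at_128k / 1e9:.0f} GB")
```

These are the figures quoted in the Key Technical Details section below.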
The evolution of KV cache management is a story of borrowing increasingly sophisticated ideas from systems engineering — moving from naive pre-allocation to virtual memory, paging, and hardware-assisted demand allocation — to make LLM serving economically viable.
How It Works
KV Cache Management: From Naive to PagedAttention:
```
Naive Pre-Allocation                PagedAttention (vLLM)
════════════════════                ════════════════════
GPU Memory:                         GPU Memory:
┌────────────────────┐              ┌────────────────────┐
│ Model Weights      │              │ Model Weights      │
├────────────────────┤              ├────────────────────┤
│ Req 1 KV [████░░░░]│ ◀─ wasted    │ Page Pool:         │
│ Req 2 KV [██░░░░░░]│ ◀─ wasted    │ ┌──┬──┬──┬──┬──┬──┐│
│ Req 3 KV [█████░░░]│ ◀─ wasted    │ │R1│R2│R1│R3│R2│R1││
│                    │              │ └──┴──┴──┴──┴──┴──┘│
│ ░ = pre-allocated  │              │ Pages allocated on │
│   but unused       │              │ demand, freed when │
│                    │              │ request completes  │
│ [FRAGMENTED]       │              │                    │
├────────────────────┤              │ [COMPACT]          │
│ Can't fit Req 4!   │              │ Room for Req 4!    │
└────────────────────┘              └────────────────────┘
~4-5 concurrent requests            ~15+ concurrent requests
87% memory wasted on                Near-zero waste
short sequences                     2-4x throughput gain

Copy-on-Write (shared prefix):
Req A: [System Prompt Pages] ──▶ [Unique Pages A]
Req B: [Same Physical Pages] ──▶ [Unique Pages B]
         ▲ shared until divergence
```

Naive Pre-Allocation — Contiguous Memory
Early serving systems (and many simple implementations today) pre-allocate a contiguous block of GPU memory for each request's KV cache, sized for the maximum possible sequence length. If the maximum is 4096 tokens and the model has 80 layers with 128-dimensional KV per head, that allocation is fixed regardless of how many tokens the request actually uses. For a 70B-class model with GQA (8 KV heads), each slot is roughly 2 * 80 * 8 * 128 * 4096 * 2 bytes (FP16), about 1.3 GB; with the full multi-head attention of older models (64 KV heads), the same slot balloons past 10 GB. Once the model weights have taken their share of GPU memory, only a handful of such worst-case slots fit, and utilization is terrible: a request using 500 tokens wastes 87% of its allocation. Worse, memory fragmentation prevents even this low concurrency from being achieved reliably.
PagedAttention and vLLM — Virtual Memory for KV Cache (June 2023)
Kwon et al. from UC Berkeley introduced PagedAttention, implemented in the vLLM serving system. The key insight was directly borrowed from operating system virtual memory: instead of contiguous allocation, divide KV cache into fixed-size "pages" (blocks of, say, 16 tokens). Each request gets a page table mapping logical token positions to physical memory locations. Pages are allocated on demand as the sequence grows and freed immediately when the request completes. Non-contiguous physical pages can serve a logically contiguous sequence.
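A toy sketch of the bookkeeping involved, not vLLM's actual code or API (the class, method names, and 16-token block size are illustrative):

```python
BLOCK_SIZE = 16  # tokens per page; vLLM uses small fixed-size blocks of this order

class PagedKVAllocator:
    """Toy block-table allocator: maps each request's logical blocks to physical
    block IDs drawn from one shared free pool."""

    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}  # request_id -> list of physical block IDs

    def append_token(self, request_id, num_tokens_so_far):
        """Grab a new physical block only when the request crosses a block boundary."""
        table = self.block_tables.setdefault(request_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:  # all existing blocks are full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks: request must queue or be preempted")
            table.append(self.free_blocks.pop())
        return table[-1]  # physical block that will hold this token's K and V

    def free(self, request_id):
        """Return a finished request's blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
```

The attention kernel then gathers K/V through the block table, so logically contiguous tokens can live in scattered physical blocks.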
The results were dramatic: 2-4x throughput improvement over naive serving (HuggingFace text-generation-inference at the time) through near-complete elimination of memory waste. A further innovation was copy-on-write for shared prefixes: if multiple requests share the same system prompt, their KV caches for that prefix share physical pages until one request diverges — exactly like fork() in Unix. vLLM rapidly became the standard open-source serving engine, with adoption by Anyscale, AWS, and numerous production deployments.
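Sharing a prefix only requires reference counts on top of the same block table. Building on the sketch above (again illustrative, not the real vLLM implementation):

```python
class CowBlockPool(PagedKVAllocator):
    """Adds reference counts so prefix blocks are shared across requests and only
    copied when one request needs to write into a shared block (copy-on-write)."""

    def __init__(self, num_physical_blocks):
        super().__init__(num_physical_blocks)
        self.ref_count = {}  # physical block ID -> number of requests mapping it

    def fork(self, parent_id, child_id):
        """Child reuses the parent's physical blocks, e.g. a shared system prompt."""
        parent_table = self.block_tables[parent_id]
        self.block_tables[child_id] = list(parent_table)
        for block in parent_table:
            self.ref_count[block] = self.ref_count.get(block, 1) + 1

    def write_block(self, request_id, logical_index):
        """Before writing into a shared block, give this request a private copy.
        (A full implementation would also decrement counts in free() and copy the
        actual K/V data into the new block.)"""
        block = self.block_tables[request_id][logical_index]
        if self.ref_count.get(block, 1) > 1:
            new_block = self.free_blocks.pop()
            self.ref_count[block] -= 1
            self.ref_count[new_block] = 1
            self.block_tables[request_id][logical_index] = new_block
        return self.block_tables[request_id][logical_index]
```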
Continuous Batching — Maximizing GPU Utilization
Orca (Yu et al. 2022) introduced continuous batching (also called iteration-level scheduling): instead of waiting for an entire batch of requests to finish before starting the next batch, new requests are inserted into the batch as soon as a slot opens. This eliminated the "convoy effect" where short requests waited for long ones. Combined with PagedAttention, continuous batching keeps the GPU busy at nearly 100% utilization. vLLM adopted continuous batching as a core feature. The combination of paged memory + continuous batching is now the baseline for any serious LLM serving system.
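A schematic of iteration-level scheduling, with the queue discipline and function names simplified for illustration (a real scheduler would also check free KV blocks before admitting a request):

```python
from collections import deque

def serve_loop(model_step, waiting, max_batch_slots):
    """Continuous batching sketch: after every decode step, finished requests leave
    the batch and waiting requests are admitted at once, instead of waiting for the
    whole batch to drain (the convoy effect). `model_step(batch)` is assumed to run
    one decode iteration and return the set of requests that just finished."""
    running = []
    while running or waiting:
        # Admit new requests whenever a batch slot is available.
        while waiting and len(running) < max_batch_slots:
            running.append(waiting.popleft())

        finished = model_step(running)  # one batched decode iteration

        # Retire finished requests right away; their KV pages would be freed here.
        running = [r for r in running if r not in finished]

# Example: three requests that finish after 2, 5, and 3 decode steps respectively.
steps_left = {"a": 2, "b": 5, "c": 3}
def fake_step(batch):
    done = set()
    for r in batch:
        steps_left[r] -= 1
        if steps_left[r] == 0:
            done.add(r)
    return done

serve_loop(fake_step, deque(["a", "b", "c"]), max_batch_slots=2)
```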
vAttention — Hardware-Assisted Demand Paging (2024)
Prabhu et al. introduced vAttention, which pushed the OS analogy further by using the GPU's own virtual memory system. Instead of software-managed page tables (as in vLLM), vAttention allocates each request's KV cache in a contiguous virtual address range and uses the GPU's low-level virtual memory support to map physical pages into that range on demand. This eliminates the software overhead of block-table indirection during attention computation, with reported throughput up to 1.97x over vLLM on certain workloads. The approach also simplifies integration with FlashAttention and other optimized kernels, which expect contiguous memory layouts.
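The difference boils down to the address-translation path the attention kernel must follow; the sketch below is a pure-Python illustration of that contrast, not the actual vAttention mechanism:

```python
def lookup_software_paged(block_table, block_size, token_pos):
    """vLLM-style software paging: the kernel resolves every K/V access through the
    request's block table (an extra indirection inside the attention kernel)."""
    physical_block = block_table[token_pos // block_size]
    return physical_block * block_size + token_pos % block_size

def lookup_contiguous(base_offset, token_pos):
    """vAttention-style layout: the request owns a contiguous virtual range, and the
    GPU's memory-management hardware maps physical pages into it on demand, so the
    kernel indexes K/V with plain pointer arithmetic and no block table."""
    return base_offset + token_pos
```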
Architectural Solutions — Compressing the Cache Itself
While serving systems optimize how KV caches are managed in memory, architectural innovations reduce the size of the cache itself. Multi-Query Attention reduces KV cache by the number of heads (up to 96x). Grouped-Query Attention typically achieves 4-8x reduction. Multi-head Latent Attention (DeepSeek) achieves 57x compression by storing latent vectors instead of full KV pairs. These approaches, detailed in 01-attention-mechanism-evolution.md, are complementary to memory management — they reduce the total data while PagedAttention and vAttention optimize its storage.
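To see how the architectural levers compose with memory management, here is a rough per-request comparison for a 128K-token sequence on a 70B-class model (80 layers, head dimension 128, FP16; the MLA line simply divides the MHA size by the ~57x factor reported for DeepSeek-V2):

```python
def kv_gb(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Per-request KV cache size in (decimal) gigabytes."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes / 1e9

SEQ = 128 * 1024  # a 128K-token request

mha = kv_gb(80, 64, 128, SEQ)  # ~344 GB: the "300+ GB" full-MHA case discussed below
gqa = kv_gb(80, 8, 128, SEQ)   # ~43 GB: 8x smaller with 8-KV-head GQA
mqa = kv_gb(80, 1, 128, SEQ)   # ~5.4 GB: a single shared KV head (MQA)
mla = mha / 57                 # ~6 GB if the reported ~57x MLA factor is taken at face value

print(f"MHA {mha:.0f} GB | GQA-8 {gqa:.0f} GB | MQA {mqa:.1f} GB | MLA ~{mla:.0f} GB")
```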
Why It Matters
The Economics of Serving
LLM inference is the dominant cost for AI companies. A single H100 GPU costs roughly $2-3 per hour in cloud pricing. If naive KV cache management lets you serve 5 concurrent requests and PagedAttention lets you serve 15, your cost per request drops by 3x. At the scale of millions of daily requests, this difference is millions of dollars per year.
Enabling Long Context in Production
Long-context models are useless if their KV caches cannot fit in memory during serving. A 128K-token request on a 70B model with MHA would require roughly 300+ GB of KV cache — more than any single GPU. GQA reduces this to manageable levels; paged allocation ensures memory is used efficiently; together they make 128K-token inference practical on 4-8 GPUs.
The Systems-ML Convergence
KV cache optimization exemplifies a broader trend: the most impactful advances in LLM deployment are increasingly systems engineering rather than ML research. PagedAttention did not change the model at all — it changed how memory was managed, borrowing directly from OS design. This suggests that future breakthroughs may come as much from systems thinking as from architecture innovation.
Key Technical Details
- KV cache per token per layer (FP16): 2 * n_kv_heads * d_head * 2 bytes. For LLaMA 2 70B (GQA-8, d_head=128): 4 KB/token/layer, ~320 KB/token total (80 layers).
- LLaMA 2 70B at 4096 tokens: ~1.3 GB KV cache per request. At 128K tokens: ~40 GB per request.
- PagedAttention (vLLM, Jun 2023): 2-4x throughput over naive serving. Near-zero memory waste. Copy-on-write for shared prefixes.
- Page size: typically 16-64 tokens per block. Smaller pages = less waste, but more metadata overhead.
- vAttention (2024): 1.97x throughput over vLLM on tested workloads. Hardware demand paging eliminates software page table overhead.
- Continuous batching (Orca, 2022): Iteration-level scheduling. Eliminates convoy effect. Now standard in all serving frameworks.
- MLA compression (DeepSeek V2, 2024): 57x KV cache reduction through learned latent compression.
- Memory breakdown for serving a 70B model on 8x H100 (80 GB each, 640 GB total): ~140 GB for FP16 weights, leaving roughly 450-500 GB for KV cache across all GPUs (less in practice after activation buffers and the serving framework's reserved headroom).
Common Misconceptions
- "KV cache is a small overhead." For long sequences, the KV cache can rival or exceed the model weights in memory consumption: at 128K tokens, the full-MHA cache for a 70B model (300+ GB) is more than double its FP16 weights (~140 GB).
- "PagedAttention approximates attention." PagedAttention computes exact attention — it only changes how memory is allocated and accessed. The mathematical computation is identical.
- "Bigger GPUs solve the KV cache problem." Bigger GPUs help but do not solve the fundamental scaling issue. Doubling GPU memory doubles your capacity, but doubling sequence length doubles the cache per request. Efficient management is still essential.
- "KV cache optimization only matters for long-context models." Even at 4K tokens, memory management determines how many concurrent requests a GPU can serve. The throughput gains from PagedAttention apply regardless of sequence length.
- "State space models eliminate this problem entirely." SSMs like Mamba have fixed-size state (no growing cache), but they have not yet matched Transformer quality at scale. For now, optimizing KV cache remains essential for production systems.
Connections to Other Concepts
01-attention-mechanism-evolution.md describes the architectural approaches to reducing KV cache size (MQA, GQA, MLA) — complementary to the memory management approach here. 03-flash-attention-and-hardware-aware-computing.md shares the philosophy of hardware-aware optimization, and FlashAttention's tiling must interoperate with paged memory layouts. 07-long-context-techniques.md is the capability that KV cache optimization enables — without efficient serving, long context is a paper feature. 05-state-space-models-and-mamba.md represents the radical alternative: no KV cache at all. 09-speculative-decoding-and-inference-speedups.md interacts with KV cache management because the draft model's KV cache must also be managed efficiently.
Further Reading
- Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (2023) — the vLLM paper.
- Yu et al., "Orca: A Distributed Serving System for Transformer-Based Generative Models" (2022) — continuous batching.
- Prabhu et al., "vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention" (2024) — hardware demand paging.
- Pope et al., "Efficiently Scaling Transformer Inference" (2022) — analysis of memory bottlenecks in serving.
- DeepSeek-AI, "DeepSeek-V2" (2024) — MLA as an architectural solution to KV cache compression.