One-Line Summary: Sliding window attention restricts each token's attention to a fixed-size local window of neighboring tokens, reducing the quadratic memory cost of full attention to linear while preserving long-range information flow through layer stacking -- where each additional layer extends the effective receptive field by another W tokens.

Prerequisites: Self-attention mechanism, multi-head attention, the quadratic cost problem of standard attention (O(n²)), KV cache for autoregressive inference, and an understanding of receptive fields from convolutional neural networks.

What Is Sliding Window Attention?

Standard self-attention lets every token attend to every other token in the sequence. This is powerful but expensive: for a sequence of length n, attention requires O(n²) memory and compute. Double the sequence length, and you quadruple the cost.
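
To make the scaling concrete, here is a throwaway arithmetic sketch in Python (nothing model-specific is assumed):

for n in (8_192, 16_384, 32_768):
    print(f"n={n:>6}: {n * n:>13,} attention-matrix entries")
# Doubling n quadruples the entry count: ~67M -> ~268M -> ~1.07B.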

flowchart TD
    L1["Each layer: local window of W tokens"]
    L2["Stacked layers: effective receptive field grows by W per layer"]
    L1 --> L2

Sliding window attention takes a pragmatic approach borrowed from convolutional neural networks: each token attends only to its nearest neighbors. Think of it like reading a book through a magnifying glass that shows exactly W words at a time.

The key realization is that in a deep transformer, these local windows compound. Just as stacking convolutional layers builds a larger receptive field, stacking k attention layers with window size W means information can propagate roughly k · W tokens across those layers.

Mistral 7B used W = 4096 with 32 layers, creating an effective receptive field of roughly 131K tokens (32 × 4096 = 131,072) while keeping per-layer memory costs fixed. The model outperformed the significantly larger Llama 2 13B while running roughly twice as fast at inference.

How It Works

flowchart LR
    S1["Rolling buffer KV cache"]
    S2["Fixed-size buffer of W slots"]
    S3["Slot index = position mod W"]
    S1 --> S2
    S2 --> S3

The Attention Mask

In standard causal self-attention, token i attends to all tokens j <= i. With sliding window attention, token i attends only to tokens j with max(0, i - W + 1) <= j <= i. The attention computation becomes:

Attention(q_i, K, V) = softmax(q_i · K[i-W+1 : i]^T / sqrt(d_k)) · V[i-W+1 : i]

This is implemented via a banded attention mask -- a diagonal band of width W in the attention matrix. Tokens outside the window are masked to negative infinity before the softmax.

The memory reduction is immediate: instead of an n × n attention matrix, you store n × W. For n = 32K and W = 4K, this is an 8x reduction. Compute scales identically: O(n·W) instead of O(n²).
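
As a minimal sketch (NumPy, illustrative names only, not any particular library's API), the banded mask and the memory ratio look like this:

import numpy as np

def sliding_window_mask(n, w):
    # True where query position i may attend to key position j:
    # causal (j <= i) and within the last w positions (j > i - w).
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - w)

print(sliding_window_mask(n=6, w=3).astype(int))   # each row has at most 3 ones

# The memory ratio quoted above: full n x n vs. banded n x W.
n, W = 32_768, 4_096
print(n * n / (n * W))   # 8.0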

Effective Receptive Field Through Layer Stacking

At layer 1, token i can access tokens in [i - W + 1, i]. But the token at position i - W + 1 itself attended to tokens in [i - 2W + 2, i - W + 1] at the previous layer. At layer 2, token i therefore has indirect access to tokens as far back as roughly i - 2W.

Generalizing, at layer k, the effective receptive field extends back by roughly k · W positions:

receptive_field(k) ≈ k · W

Information "flows" through intermediate tokens across successive layers, analogous to how information propagates through the layers of a deep CNN.

Rolling Buffer KV Cache

The KV cache stores previously computed key-value pairs to avoid recomputation during autoregressive inference. In full attention, this cache grows linearly with sequence length -- at token 100K, you store 100K KV pairs per layer.

Sliding window attention enables a rolling buffer (circular buffer) of fixed size W:

def cache_slot(token_position, W):
    return token_position % W
# Token 0 → slot 0, Token 1 → slot 1, ...
# Token W → slot 0 (overwrites token 0), Token W+1 → slot 1, ...

This bounds KV cache memory at exactly W entries per layer regardless of sequence length. For Mistral 7B (W = 4096, head dimension 128, 32 layers, 8 KV heads in GQA), this is a fixed cache of roughly 0.5 GB in 16-bit precision whether you have generated 100 tokens or 1 million.
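
Here is a minimal rolling-buffer sketch for a single layer, plus the memory arithmetic behind the figure above (assumptions: fp16 storage, batch size 1; the class and method names are illustrative, not Mistral's actual implementation):

import numpy as np

class RollingKVCache:
    def __init__(self, window, n_kv_heads, head_dim):
        self.window = window
        self.k = np.zeros((window, n_kv_heads, head_dim), dtype=np.float16)
        self.v = np.zeros((window, n_kv_heads, head_dim), dtype=np.float16)

    def append(self, pos, k, v):
        slot = pos % self.window          # position modulo addressing
        self.k[slot] = k                  # overwrites the entry from pos - window
        self.v[slot] = v

# Memory estimate: K and V, 32 layers, 8 KV heads, head dim 128, W = 4096, 2 bytes each.
W, layers, kv_heads, head_dim = 4096, 32, 8, 128
print(2 * layers * kv_heads * head_dim * W * 2 / 2**30)   # ≈ 0.5 GiB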

Pre-fill Chunking

For long input prompts, the pre-fill phase can be memory-intensive. Sliding window attention enables chunked pre-fill: the prompt is processed in chunks of size W, with each chunk attending to itself and the KV cache from the previous chunk.

For a 32K prompt with W = 4K, this means 8 chunks of 4K tokens each, rather than one 32K x 32K attention computation.
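
A sketch of the chunking loop, with hypothetical stand-ins (model_forward, cache) rather than any real API:

def chunked_prefill(prompt_tokens, W, model_forward, cache):
    # Process the prompt W tokens at a time; each chunk attends to itself
    # plus whatever the previous chunk left in the rolling KV cache.
    for start in range(0, len(prompt_tokens), W):
        chunk = prompt_tokens[start:start + W]
        model_forward(chunk, cache)
    return cache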

Comparison with Other Efficient Attention Methods

  • Longformer: Local windows + designated global attention tokens. More complex masking, additional design decisions about which tokens are global.
  • BigBird: Local windows + random connections. Theoretical connectivity guarantees but harder to implement efficiently.
  • Sparse Transformers: Strided patterns where every k-th token attends globally. Effective but requires careful stride tuning.
  • Linear Attention: O(n) cost via kernel approximation, but typically with quality degradation on language tasks.

Sliding window's advantage is simplicity: one hyperparameter (W), straightforward masking, and natural compatibility with existing FlashAttention kernels.

Why It Matters

  1. Linear memory scaling: Memory grows as O(n·W) instead of O(n²), enabling much longer sequences on the same hardware.
  2. Fixed-size inference cache: Constant memory regardless of tokens generated -- ideal for streaming and multi-turn conversations.
  3. Effective long-range modeling: Effective receptive field of W × layers (4096 × 32) reaches ~131K tokens for Mistral 7B.
  4. Superior quality-efficiency tradeoff: Mistral 7B (60.1% MMLU) outperforms Llama 2 13B (55.4%) at 2x inference speed.
  5. Composability: Combines with GQA, FlashAttention, and attention sinks for compounding efficiency gains.

Key Technical Details

  • Window size: W = 4096 in Mistral 7B. Larger windows improve quality but increase per-layer cost.
  • Mistral 7B benchmarks: 60.1% MMLU, 81.3% HellaSwag, 55.5% ARC-Challenge (as reported in the Mistral 7B paper).
  • Attention pattern sparsity: Full-attention models concentrate most mass in local windows anyway -- sliding window formalizes this.
  • Inference speedup: Reduces per-layer attention compute from O(n²) to O(n·W), roughly a 2x wall-clock speedup for long sequences.
  • Hybrid approaches: Some models (e.g., Gemma 2) alternate sliding window and full attention layers for both local detail and global context.
  • FlashAttention compatibility: The banded mask pattern integrates naturally with FlashAttention's tiling.
  • Position encoding: RoPE works naturally with sliding window, as relative positions within the window are preserved.

Common Misconceptions

  • "Sliding window attention can't handle long-range dependencies." Through layer stacking, information propagates across the full effective receptive field. The model learns to relay important information through intermediate tokens.
  • "The window size must match the context length." Mistral 7B processes 32K+ tokens with a 4K window, relying on stacked layers for long-range propagation.
  • "This is the same as Longformer or BigBird." Those combine local windows with global or random attention. Sliding window (as in Mistral) is purely local, relying entirely on layer stacking.
  • "You lose the first tokens with the rolling buffer." True and intentional. For applications needing persistent access to initial tokens, attention sinks can be combined with the rolling buffer.

Connections to Other Concepts

  • self-attention.md: Sliding window is a strict subset -- same mechanism, restricted key-value set.
  • kv-cache.md: The rolling buffer transforms a growing cache into a fixed-size circular buffer.
  • sparse-attention.md: Sliding window is one pattern; others include strided, local+global, and random.
  • grouped-query-attention.md: Mistral 7B combines sliding window with GQA for compounding memory savings.
  • attention-sinks.md: StreamingLLM adds persistent sink tokens to the rolling buffer, preventing perplexity degradation.

Further Reading

  1. "Mistral 7B" (Jiang et al., 2023, arXiv:2310.06825) -- Introduces sliding window attention with rolling buffer KV cache in practice.
  2. "Longformer: The Long-Document Transformer" (Beltagy et al., 2020, arXiv:2004.05150) -- Combines sliding windows with global attention tokens for document-level tasks.
  3. "Generating Long Sequences with Sparse Transformers" (Child et al., 2019, arXiv:1904.10509) -- Explores sparse attention patterns including local windows and strided attention.