One-Line Summary: Applying self-attention to images requires careful handling of 2D spatial structure, patch size tradeoffs, and the quadratic cost of attention over thousands of visual tokens -- design choices that fundamentally shape every vision Transformer.

Prerequisites: Self-attention, Vision Transformer (ViT), computational complexity, convolutional neural networks, positional encoding

What Is Attention in Vision?

Imagine a security guard monitoring a wall of surveillance screens. With a few screens (say 4), the guard can easily compare every pair of feeds to spot coordinated activity. With 100 screens, comparing all pairs becomes overwhelming -- there are 4,950 pairs to track. With 1,000 screens, it's 499,500 pairs. This is the fundamental challenge of self-attention in vision: images naturally decompose into many more tokens than text sequences, and the pairwise cost grows quadratically.

A 224x224 image with 16x16 patches produces 196 tokens -- manageable. But a 1024x1024 medical image with the same patch size yields 4,096 tokens, and attention cost scales as O(N²). At 2048x2048 (common in pathology), the 16,384 tokens make standard attention infeasible. This section covers the core design decisions that make attention work for images: how to encode spatial position in 2D, how patch size trades off resolution for efficiency, why the quadratic cost is particularly painful for vision, and how windowed attention provides a practical escape.

How It Works

2D Positional Encoding

Unlike text, image tokens have spatial structure in two dimensions. Several encoding strategies exist:

Learnable 1D embeddings (ViT): Flatten patches into a 1D sequence and learn a separate embedding for each position. Surprisingly, this works well -- the model learns 2D structure from data. Dosovitskiy et al. (2020) reported negligible benefit from explicit 2D encodings.

Learnable 2D embeddings: Assign separate embeddings for row and column indices. A patch at position (i, j) receives e_row[i] + e_col[j]. This reduces the number of positional parameters from h·w (one embedding per position) to h + w (one per row plus one per column) for an h x w patch grid.
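
A minimal PyTorch sketch of this factorized scheme (the class name is illustrative; ViT's 1D variant would instead be a single learnable table of shape (1, h*w, dim)):

```python
import torch
import torch.nn as nn

class PositionalEmbedding2D(nn.Module):
    """Factorized learnable 2D position embeddings: one table per axis (sketch)."""
    def __init__(self, grid_h: int, grid_w: int, dim: int):
        super().__init__()
        self.row_embed = nn.Parameter(torch.zeros(grid_h, dim))  # h entries
        self.col_embed = nn.Parameter(torch.zeros(grid_w, dim))  # w entries
        nn.init.trunc_normal_(self.row_embed, std=0.02)
        nn.init.trunc_normal_(self.col_embed, std=0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, h*w, dim), flattened in row-major order
        h, w = self.row_embed.shape[0], self.col_embed.shape[0]
        pos = self.row_embed[:, None, :] + self.col_embed[None, :, :]  # (h, w, dim)
        return tokens + pos.reshape(1, h * w, -1)
```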

Sinusoidal 2D embeddings: Extend the original sinusoidal encoding from Vaswani et al. (2017) to two dimensions, using different frequency bands for height and width.

Relative position bias (Swin, CoAtNet): Instead of absolute positions, encode the relative offset between pairs of tokens as a bias added to the attention logits:

Attention(Q, K, V) = softmax(QKᵀ/√d + B)V

where B is a learnable bias table indexed by the relative position between each query and key. This approach generalizes better to different resolutions and is the dominant choice in modern architectures.
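A minimal sketch of the bias table for a single M x M attention window, following the indexing idea used by Swin (class name and details are illustrative, not the reference implementation):

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """Learnable relative position bias for one M x M window (sketch)."""
    def __init__(self, window_size: int, num_heads: int):
        super().__init__()
        M = window_size
        # One learnable bias per head for each possible (dy, dx) offset.
        self.bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))
        nn.init.trunc_normal_(self.bias_table, std=0.02)

        # For every (query, key) pair in the window, precompute which table
        # entry its relative offset maps to.
        coords = torch.stack(torch.meshgrid(
            torch.arange(M), torch.arange(M), indexing="ij"), dim=-1).reshape(-1, 2)
        rel = coords[:, None, :] - coords[None, :, :] + (M - 1)   # offsets shifted to [0, 2M-2]
        self.register_buffer("index", rel[..., 0] * (2 * M - 1) + rel[..., 1])

    def forward(self) -> torch.Tensor:
        # Bias of shape (num_heads, M*M, M*M), added to the attention logits.
        return self.bias_table[self.index].permute(2, 0, 1).contiguous()
```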

Rotary Position Embedding (RoPE): Originally from NLP, RoPE has been adapted for 2D images by applying separate rotations for x and y coordinates. Used in EVA and some recent ViT variants, RoPE supports variable resolution without interpolation.

Patch Size Tradeoffs

Patch size determines the number of tokens and directly controls the resolution-efficiency tradeoff:

Patch Size    Tokens (224x224)    Tokens (512x512)    Attention FLOPs Ratio
32x32         49                  256                 1x
16x16         196                 1,024               16x
14x14         256                 1,369               27x
8x8           784                 4,096               256x

Key observations:

  • Smaller patches preserve finer detail but dramatically increase compute. ViT-B/16 processes 4x as many tokens as ViT-B/32, making it roughly 4x more expensive overall (the attention term alone grows 16x).
  • Larger patches lose spatial information. For dense prediction tasks (segmentation, detection), 16x16 or 8x8 patches are the practical minimum.
  • Non-square patches are rarely used because standard image augmentations assume square spatial structure.
  • The patch embedding projection is a single convolution with kernel size = stride = patch size (see the sketch after this list). Some architectures use overlapping patches (stride < kernel) for slightly better features at marginally higher cost.
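
A minimal sketch of the standard non-overlapping patch embedding, which reproduces the token counts in the table above:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Non-overlapping patch embedding: a Conv2d with kernel size == stride == patch size."""
    def __init__(self, patch_size: int = 16, in_chans: int = 3, dim: int = 768):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                      # (batch, dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)   # (batch, N, dim)

tokens = PatchEmbed(patch_size=16)(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```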

The Quadratic Cost Problem

Standard self-attention computes pairwise similarities among all N tokens:

Attention(Q, K, V) = softmax(QKᵀ/√d)V

where QKᵀ is an N x N matrix of pairwise scores. For a 224x224 image with 16x16 patches: N = 196, yielding 196² ≈ 38,000 pairwise interactions per head. This is comparable to typical NLP sequence lengths and remains tractable.

For a 1024x1024 image: N = 4,096, yielding roughly 16.8 million interactions -- about 437x the cost of the 224-pixel case. This makes standard ViT impractical for high-resolution inputs without modification.

Memory is an equally critical bottleneck. The attention matrix per head consumes 4N² bytes in float32. For N = 4,096 and 12 heads: 4,096² x 4 bytes x 12 ≈ 800 MB just for attention weights.
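
A quick back-of-the-envelope calculator (a sketch; it ignores the CLS token) that reproduces the numbers above:

```python
def attention_cost(image_size: int, patch_size: int, num_heads: int = 12,
                   bytes_per_elem: int = 4) -> dict:
    """Token count, pairwise interactions per head, and float32 attention-matrix memory."""
    n = (image_size // patch_size) ** 2               # number of patch tokens
    pairs = n * n                                     # pairwise interactions per head
    attn_bytes = pairs * bytes_per_elem * num_heads   # full attention matrices across heads
    return {"tokens": n, "pairs_per_head": pairs, "attn_matrix_MB": attn_bytes / 1e6}

print(attention_cost(224, 16))    # 196 tokens, ~38k pairs, ~1.8 MB
print(attention_cost(1024, 16))   # 4,096 tokens, ~16.8M pairs, ~800 MB
```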

Windowed and Efficient Attention

Several strategies address the quadratic cost:

Windowed attention (Swin Transformer): Restrict attention to local windows (typically 7x7 tokens), reducing cost from O(N²) to O(N·M²) for window size M -- linear in the number of tokens. Shifted windows in alternating layers restore cross-window communication.
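
A sketch of the non-shifted case, assuming the token grid divides evenly into windows (the helper name is illustrative):

```python
import torch
import torch.nn.functional as F

def windowed_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                       grid_h: int, grid_w: int, window: int = 7) -> torch.Tensor:
    """Partition the token grid into window x window blocks and run standard
    attention independently in each block. q, k, v: (batch, heads, grid_h*grid_w, dim)."""
    b, h, n, d = q.shape

    def partition(x):
        x = x.reshape(b, h, grid_h // window, window, grid_w // window, window, d)
        x = x.permute(0, 2, 4, 1, 3, 5, 6)             # (b, nWh, nWw, h, win, win, d)
        return x.reshape(-1, h, window * window, d)    # each window becomes its own "batch"

    out = F.scaled_dot_product_attention(partition(q), partition(k), partition(v))

    # Undo the partition back to (batch, heads, N, dim).
    out = out.reshape(b, grid_h // window, grid_w // window, h, window, window, d)
    return out.permute(0, 3, 1, 4, 2, 5, 6).reshape(b, h, n, d)
```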

Dilated/strided attention: Attend to every k-th token, reducing the effective sequence length by a factor of k. Used in some efficient Transformer variants.
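
One simple realization (a sketch, not any specific paper's formulation) keeps every query but subsamples the keys and values with stride k:

```python
import torch
import torch.nn.functional as F

def strided_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                      stride: int = 2) -> torch.Tensor:
    """Each query attends only to every `stride`-th key/value token (sketch).
    q, k, v: (batch, heads, N, dim); cost drops from N^2 to N * (N / stride)."""
    return F.scaled_dot_product_attention(q, k[:, :, ::stride], v[:, :, ::stride])
```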

Linear attention: Replace softmax attention with kernel-based approximations, reducing complexity to O(N). Methods like Performer and linear attention variants achieve this but often sacrifice accuracy.
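
A sketch using the simplest kernel feature map, phi(x) = elu(x) + 1 (in the spirit of linear-attention variants; Performer uses random features instead):

```python
import torch
import torch.nn.functional as F

def linear_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     eps: float = 1e-6) -> torch.Tensor:
    """O(N) attention via a kernel feature map phi(x) = elu(x) + 1 (sketch).
    Computes phi(Q) (phi(K)^T V) instead of (phi(Q) phi(K)^T) V, so the
    N x N matrix is never formed. q, k, v: (batch, heads, N, dim)."""
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0
    kv = torch.einsum("bhnd,bhne->bhde", k, v)              # (b, h, dim, dim) key/value summary
    norm = torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2))   # per-query normalizer
    return torch.einsum("bhnd,bhde->bhne", q, kv) / (norm[..., None] + eps)
```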

Flash Attention (Dao et al., 2022): Not a change to the attention pattern but an IO-aware implementation that fuses the attention computation to avoid materializing the N x N matrix in GPU HBM. Reduces memory from O(N²) to O(N) and provides a 2-4x wall-clock speedup. This has become the default attention implementation.
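
In PyTorch 2.0+, torch.nn.functional.scaled_dot_product_attention dispatches to a fused Flash-Attention-style kernel when hardware and dtypes allow; a minimal usage sketch:

```python
import torch
import torch.nn.functional as F

# 4,096 tokens (a 1024x1024 image with 16x16 patches), 12 heads of dim 64.
q, k, v = [torch.randn(1, 12, 4096, 64, device="cuda", dtype=torch.float16)
           for _ in range(3)]

# Fused attention: the full 4096 x 4096 matrix is never materialized in HBM.
out = F.scaled_dot_product_attention(q, k, v)
```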

Neighborhood attention (NAT, Hassani et al., 2023): Each token attends to its nearest spatial neighbors. More flexible than rigid windows; can be implemented efficiently with NATTEN kernels.

Multi-Scale Attention

Some architectures vary the attention scope across layers:

  • Early layers: Local/windowed attention (cheap, captures local structure)
  • Later layers: Global attention (expensive, captures semantics)

This mirrors the receptive field growth in CNNs and is used in architectures like CrossFormer and MaxViT (which alternates between block and grid attention within each stage).

Why It Matters

  1. Determines practical resolution limits: The attention mechanism directly constrains the maximum image resolution a vision Transformer can handle.
  2. Dictates architecture design: Every major design difference between ViT, Swin, and their successors comes down to how they handle attention's cost and spatial structure.
  3. Affects what features are learned: Global attention from layer 1 (ViT) produces different representations than progressive local-to-global attention (Swin, CNN hybrids).
  4. Enables new applications: Efficient attention mechanisms make Transformers viable for video (spatiotemporal tokens), 3D medical imaging, and gigapixel pathology slides.

Key Technical Details

  • Flash Attention reduces peak memory from O(N²) to O(N) by computing attention in tiles and never materializing the full attention matrix. It is now the standard implementation in PyTorch 2.0+.
  • Relative position biases generalize better to unseen resolutions than absolute embeddings. When transferring ViT from 224 to 384 resolution, absolute embeddings require interpolation (see the sketch after this list); relative biases need no modification if the window size stays constant.
  • For dense prediction, most architectures use 16x16 patches and upsample features with FPN or decoder heads. Using 4x4 patches directly would yield 3,136 tokens for a 224x224 image -- feasible but expensive.
  • The attention pattern in early ViT layers often resembles local convolutions (attending primarily to nearby patches), while later layers exhibit global patterns. This has been empirically confirmed by Raghu et al. (2021).
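
A sketch of the absolute-embedding interpolation mentioned above, assuming a learnable position table with no CLS token (the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def resize_abs_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int) -> torch.Tensor:
    """Bicubically resample a (1, old_grid**2, dim) absolute position embedding
    to a new grid, e.g. 14x14 (224/16) -> 24x24 (384/16)."""
    dim = pos_embed.shape[-1]
    pos = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)  # (1, dim, g, g)
    pos = F.interpolate(pos, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    return pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

new_pe = resize_abs_pos_embed(torch.randn(1, 196, 768), old_grid=14, new_grid=24)
print(new_pe.shape)  # torch.Size([1, 576, 768])
```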

Common Misconceptions

  • "Self-attention gives every patch global context from layer 1." While theoretically true, in practice early-layer attention weights are sharply peaked on local neighbors. The effective receptive field grows gradually across layers, similar to CNNs, just with a softer boundary.
  • "Windowed attention sacrifices too much quality." Swin Transformer matches or exceeds ViT's accuracy with windowed attention plus shifting. The combination of local attention with cross-window communication is sufficient for strong performance on nearly all benchmarks.
  • "You need special 2D positional encodings for vision." The original ViT with simple 1D learnable embeddings achieved top results. The model's patch layout is fixed, so 1D position indices implicitly encode 2D structure. That said, relative position bias does provide a measurable improvement (~1%).

Connections to Other Concepts

  • vision-transformer.md: Uses global self-attention with 1D positional embeddings -- the simplest attention-in-vision design.
  • swin-transformer.md: The most influential instance of windowed attention for vision.
  • hybrid-cnn-transformer.md: CNN early layers avoid the attention cost problem entirely at high resolutions.
  • masked-image-modeling.md: MAE's 75% masking dramatically reduces the number of tokens the encoder must attend to, sidestepping the quadratic cost during pre-training.
  • vision-transformer-scaling.md: Efficient attention mechanisms determine how far vision Transformers can scale in resolution and sequence length.

Further Reading

  • Vaswani et al., "Attention Is All You Need" (2017) -- Original self-attention mechanism.
  • Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (2022) -- The dominant efficient attention implementation.
  • Raghu et al., "Do Vision Transformers See Like Convolutional Neural Networks?" (2021) -- Analysis of ViT attention patterns versus CNN features.
  • Hassani et al., "Neighborhood Attention Transformer" (2023) -- Flexible local attention with NATTEN kernels.
  • Tu et al., "MaxViT: Multi-Axis Vision Transformer" (2022) -- Alternating block and grid attention for multi-scale processing.