One-Line Summary: The Swin Transformer computes self-attention within local windows and shifts those windows between layers to achieve hierarchical feature maps and linear computational complexity with respect to image size.

Prerequisites: Vision Transformer (ViT), self-attention, feature pyramids, object detection, semantic segmentation

What Is the Swin Transformer?

Imagine reading a newspaper by focusing on one column at a time, then shifting your gaze half a column over to catch information that spans column boundaries. You never try to read the entire page at once -- that would be overwhelming -- but by alternating your reading window, you eventually process every cross-column connection. The Swin Transformer applies this principle to image patches: it restricts attention to small local windows, then shifts those windows in alternating layers so information flows across boundaries.

The Swin Transformer, introduced by Liu et al. (2021) at Microsoft Research Asia, addresses two fundamental limitations of the original ViT: (1) ViT's self-attention cost grows quadratically with the number of patches, making it impractical for high-resolution dense prediction tasks, and (2) ViT produces single-scale features, while tasks like object detection and segmentation need multi-scale feature pyramids. Swin solves both problems, becoming the first general-purpose Transformer backbone competitive with CNNs on detection, segmentation, and classification simultaneously.

How It Works

Hierarchical Feature Maps via Patch Merging

Swin builds a feature pyramid analogous to a CNN backbone:

  • Stage 1: $4 \times 4$ patch embedding, producing $\frac{H}{4} \times \frac{W}{4}$ tokens with dimension $C$
  • Stage 2: patch merging (concatenating $2 \times 2$ neighboring patches and projecting), producing $\frac{H}{8} \times \frac{W}{8}$ tokens with dimension $2C$
  • Stage 3: $\frac{H}{16} \times \frac{W}{16}$ tokens, dimension $4C$
  • Stage 4: $\frac{H}{32} \times \frac{W}{32}$ tokens, dimension $8C$

This produces feature maps at 4 scales, directly compatible with FPN, UNet, and other multi-scale architectures.
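The following PyTorch sketch illustrates the patch-merging step, assuming a (B, H, W, C) tensor layout with even H and W; the official implementation operates on flattened (B, H*W, C) token sequences, but the channel arithmetic (concatenate to 4C, project to 2C) is the same.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Downsample by concatenating each 2x2 group of neighboring tokens
    (4C channels) and linearly projecting to 2C, halving H and W."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):  # x: (B, H, W, C), H and W assumed even
        x0 = x[:, 0::2, 0::2, :]  # top-left token of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]  # bottom-left
        x2 = x[:, 0::2, 1::2, :]  # top-right
        x3 = x[:, 1::2, 1::2, :]  # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))      # (B, H/2, W/2, 2C)
```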

Window-Based Multi-Head Self-Attention (W-MSA)

Instead of global self-attention over all tokens, Swin partitions the feature map into non-overlapping windows of size $M \times M$ (default $M = 7$). Attention is computed independently within each window:

$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + B\right)V$$

where $B$ is a learnable relative position bias. The complexity of global self-attention is $\Omega(\mathrm{MSA}) = 4hwC^2 + 2(hw)^2C$, where $hw$ is the number of tokens and $C$ the channel dimension. Window attention reduces this to:

$$\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC,$$

which is linear in image size $hw$ for a fixed window size $M$.
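A minimal sketch of the window partition and the resulting cost arithmetic; shapes and names here are illustrative, not the paper's reference code.

```python
import torch

def window_partition(x, M):
    """Split a feature map (B, H, W, C) into non-overlapping M x M windows,
    returning (B * num_windows, M*M, C) so attention runs per window."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

# Cost comparison at Swin-T stage-1 resolution (56x56 tokens, C=96, M=7):
h = w = 56; C = 96; M = 7
global_cost = 4 * h * w * C**2 + 2 * (h * w) ** 2 * C  # ~2.0e9 (quadratic in hw)
window_cost = 4 * h * w * C**2 + 2 * M**2 * h * w * C  # ~1.5e8 (linear in hw)
```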

Shifted Window Multi-Head Self-Attention (SW-MSA)

Window attention alone would isolate each window. Swin alternates between two configurations:

  • Layer $l$: regular window partition (W-MSA)
  • Layer $l + 1$: windows shifted by $(\lfloor M/2 \rfloor, \lfloor M/2 \rfloor)$ pixels (SW-MSA)

The shift creates cross-window connections. To handle the uneven windows at image boundaries efficiently, Swin uses a cyclic shift with attention masking, avoiding any padding overhead.
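The cyclic-shift trick can be sketched as follows, reusing window_partition from the sketch above. The region-labeling mask mirrors the approach taken in the official implementation, but the helper name attn_mask is illustrative.

```python
import torch

def attn_mask(H, W, M, shift):
    """(num_windows, M*M, M*M) mask: 0 where attention is allowed,
    -inf between tokens that were not neighbors before the roll."""
    region = torch.zeros(1, H, W, 1)
    cnt = 0
    for hs in (slice(0, -M), slice(-M, -shift), slice(-shift, None)):
        for ws in (slice(0, -M), slice(-M, -shift), slice(-shift, None)):
            region[:, hs, ws, :] = cnt  # label each pre-shift region
            cnt += 1
    win = window_partition(region, M).squeeze(-1)  # (nW, M*M) region ids
    diff = win.unsqueeze(1) - win.unsqueeze(2)     # (nW, M*M, M*M)
    return diff.masked_fill(diff != 0, float("-inf"))

M = 7; shift = M // 2
x = torch.randn(2, 56, 56, 96)  # (B, H, W, C) features
# Rolling turns the shifted partition back into a regular one, so the
# same W-MSA kernel is reused with the mask added to the attention logits.
x_shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
# ... run W-MSA on x_shifted, adding attn_mask(56, 56, M, shift) ...
x = torch.roll(x_shifted, shifts=(shift, shift), dims=(1, 2))  # undo the roll
```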

Relative Position Bias

Rather than absolute positional embeddings, Swin uses a learnable relative position bias indexed from a bias table of size $(2M - 1) \times (2M - 1)$ per attention head. This relative encoding is critical -- removing it drops ImageNet top-1 accuracy by ~1.2%.
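A sketch of how the bias table is indexed, following the construction described in the paper; variable names are illustrative.

```python
import torch
import torch.nn as nn

M = 7          # window size
num_heads = 3
# One learnable bias per relative offset per head: (2M-1)^2 entries.
bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))

# For every (query, key) token pair in a window, precompute the index
# of its relative offset into the bias table.
coords = torch.stack(torch.meshgrid(
    torch.arange(M), torch.arange(M), indexing="ij"))  # (2, M, M)
coords = coords.flatten(1)                             # (2, M*M)
rel = coords[:, :, None] - coords[:, None, :]          # (2, M*M, M*M)
rel = rel.permute(1, 2, 0) + (M - 1)                   # shift offsets to >= 0
index = rel[:, :, 0] * (2 * M - 1) + rel[:, :, 1]      # (M*M, M*M)

# B in Attention(Q,K,V) = SoftMax(QK^T/sqrt(d) + B)V, one map per head:
B = bias_table[index.view(-1)].view(M * M, M * M, num_heads).permute(2, 0, 1)
```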

Model Variants

| Variant | C (base channels) | Blocks per stage | Params | ImageNet Top-1 |
|---------|-------------------|------------------|--------|----------------|
| Swin-T  | 96                | [2, 2, 6, 2]     | 29M    | 81.3% |
| Swin-S  | 96                | [2, 2, 18, 2]    | 50M    | 83.0% |
| Swin-B  | 128               | [2, 2, 18, 2]    | 88M    | 83.5% |
| Swin-L  | 192               | [2, 2, 18, 2]    | 197M   | 86.3% (ImageNet-22K pre-train) |
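These variants are available pretrained through libraries such as timm; a minimal usage sketch, assuming timm is installed and its published checkpoint name for Swin-T:

```python
import timm
import torch

# Load a pretrained Swin-T classifier (ImageNet-1K head).
model = timm.create_model("swin_tiny_patch4_window7_224", pretrained=True)
model.eval()

x = torch.randn(1, 3, 224, 224)      # one 224x224 RGB image
with torch.no_grad():
    logits = model(x)                # (1, 1000) class logits
```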

Why It Matters

  1. General-purpose backbone: Swin was the first Transformer to serve as a drop-in replacement for CNN backbones across classification, detection, and segmentation.
  2. Linear complexity: The window attention mechanism makes Swin practical for high-resolution inputs (1024x1024 and beyond) needed in medical imaging and satellite imagery.
  3. State-of-the-art dense prediction: Swin-L achieved 58.7 box AP on COCO object detection and 53.5 mIoU on ADE20K semantic segmentation at the time of publication.
  4. Influenced subsequent designs: Swin's hierarchical window approach inspired Swin V2, CSwin, and many other efficient Transformer architectures.

Key Technical Details

  • Default window size is $7 \times 7$, yielding $49$ tokens per window -- small enough for efficient attention.
  • The cyclic shift implementation avoids padding by rolling the feature map and using attention masks, adding negligible overhead.
  • Swin uses pre-norm (LayerNorm before attention) rather than post-norm; see the sketch after this list.
  • Training uses AdamW, cosine schedule, 20-epoch warmup, and augmentations similar to DeiT (RandAugment, Mixup, CutMix, random erasing).
  • Swin-B with $384 \times 384$ input pre-trained on ImageNet-22K reaches 86.4% top-1 on ImageNet-1K.
  • Swin V2 (Liu et al., 2022) scales to 3 billion parameters and $1536 \times 1536$ resolution using log-spaced continuous relative position bias and residual post-normalization.
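A minimal sketch of the pre-norm residual layout mentioned above, with the attention and MLP submodules elided and names chosen for illustration:

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm residual structure of a Swin block: LayerNorm is applied
    before (S)W-MSA and before the MLP, not after the residual add."""
    def __init__(self, dim, attn, mlp):
        super().__init__()
        self.norm1, self.attn = nn.LayerNorm(dim), attn
        self.norm2, self.mlp = nn.LayerNorm(dim), mlp

    def forward(self, x):
        x = x + self.attn(self.norm1(x))  # pre-norm attention branch
        x = x + self.mlp(self.norm2(x))   # pre-norm MLP branch
        return x
```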

Common Misconceptions

  • "Shifted windows are just a different form of dilated convolution." Dilated convolutions sample sparse pixels at fixed offsets. Shifted window attention computes full pairwise attention among all tokens within the shifted window -- it captures arbitrary relationships, not just fixed spatial patterns.
  • "Swin doesn't have global attention, so it can't model long-range dependencies." Information propagates across the entire image through successive shifted-window layers. After a few layers, the effective receptive field spans the full image, similar to how stacked convolutions eventually reach global scope.
  • "Window size of 7 is always optimal." The optimal window size depends on the task and resolution. Larger windows (e.g., 12 or 16) can improve accuracy at the cost of more compute, and some tasks benefit from adaptive window sizes.

Connections to Other Concepts

  • vision-transformer.md: Swin addresses ViT's quadratic cost and single-scale limitation by introducing local windows and hierarchical structure.
  • attention-in-vision.md: Swin's windowed attention is a key instance of the broader design space for efficient attention in vision.
  • hybrid-cnn-transformer.md: Swin's hierarchical design mirrors CNN feature pyramids, making it a natural bridge between CNN and Transformer paradigms.
  • vision-transformer-scaling.md: Swin V2 demonstrates how windowed Transformers scale to billions of parameters and very high resolutions.

Further Reading

  • Liu et al., "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" (2021) -- The original Swin paper.
  • Liu et al., "Swin Transformer V2: Scaling Up Capacity and Resolution" (2022) -- Scaling Swin to 3B parameters.
  • Dong et al., "CSwin Transformer: A General Vision Transformer Backbone with Cross-Shaped Window Self-Attention" (2022) -- Cross-shaped windows as an alternative to shifted windows.
  • Yang et al., "Focal Self-attention for Local-Global Interactions in Vision Transformers" (2021) -- Another approach to bridging local and global attention.