One-Line Summary: The Vision Transformer splits an image into fixed-size patches, treats each patch as a token, and processes the sequence with a standard Transformer encoder to perform image classification.
Prerequisites: Self-attention, positional encoding, Transformer architecture, image classification, transfer learning
What Is the Vision Transformer?
Imagine cutting a photograph into a grid of small squares, then reading those squares left-to-right, top-to-bottom like words in a sentence. A language model could then "read" the image the same way it reads text. That is essentially what the Vision Transformer (ViT) does: it converts a 2D image into a 1D sequence of patch embeddings and feeds them into a Transformer encoder originally designed for natural language.
Formally, an H × W × C image is reshaped into a sequence of N = HW/P² flattened patches, each of size P²·C. Each patch is linearly projected into a D-dimensional embedding. A learnable [class] token is prepended, and learnable 1D positional embeddings are added before the sequence enters L layers of multi-head self-attention.
ViT was introduced by Dosovitskiy et al. (2020) at Google Brain and demonstrated that a pure Transformer, with no convolutional layers whatsoever, can match or exceed state-of-the-art CNNs when pre-trained on sufficient data.
How It Works
Patch Embedding
Given an H × W × C image and patch size P, the image is split into N = HW/P² non-overlapping P × P patches.
Each patch is flattened to a vector of length P²·C and projected through a linear layer to produce the patch embedding. This linear projection is equivalent to a single convolutional layer with kernel size and stride both equal to P.
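To make the projection concrete, here is a minimal PyTorch sketch of patch embedding, assuming ViT-B/16 defaults (224 × 224 RGB input, P = 16, D = 768); the variable names are illustrative, not taken from any reference implementation.

```python
import torch
import torch.nn as nn

# Patch embedding as a strided convolution: kernel size = stride = patch size.
patch_size, embed_dim = 16, 768
proj = nn.Conv2d(in_channels=3, out_channels=embed_dim,
                 kernel_size=patch_size, stride=patch_size)

x = torch.randn(1, 3, 224, 224)              # (batch, channels, height, width)
patches = proj(x)                            # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 768): N = (224/16)^2 patch tokens
print(tokens.shape)
```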
Class Token and Positional Embeddings
A learnable [class] token embedding is prepended, yielding a sequence of length N + 1. Learnable 1D positional embeddings are added element-wise. The authors found that 2D-aware positional embeddings offered negligible improvement over simple 1D embeddings.
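Under the same ViT-B/16 assumptions, a minimal sketch of prepending the [class] token and adding positional embeddings (shapes and names are illustrative):

```python
import torch
import torch.nn as nn

# Assume 196 patch tokens of dimension 768 (ViT-B/16 at 224 pixels).
num_patches, embed_dim = 196, 768
tokens = torch.randn(1, num_patches, embed_dim)          # patch embeddings from the previous step

cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))           # learnable [class] token
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))  # learnable 1D positions

cls = cls_token.expand(tokens.shape[0], -1, -1)          # one [class] token per image
tokens = torch.cat([cls, tokens], dim=1)                 # (1, 197, 768)
tokens = tokens + pos_embed                              # element-wise positional embedding
```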
Transformer Encoder
The sequence passes through L identical blocks (a minimal sketch follows the list), each consisting of:
- Layer normalization
- Multi-head self-attention (MSA)
- Residual connection
- Layer normalization
- MLP (two linear layers with GELU activation)
- Residual connection
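As referenced above, here is a minimal sketch of one pre-norm encoder block in PyTorch, assuming ViT-B/16 dimensions (768-dim embeddings, 12 heads, MLP ratio 4); it uses torch.nn.MultiheadAttention for brevity rather than a from-scratch attention implementation.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-norm ViT encoder block: LayerNorm -> MSA -> residual,
    then LayerNorm -> MLP (GELU) -> residual."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # MSA with residual connection
        x = x + self.mlp(self.norm2(x))                     # MLP with residual connection
        return x

block = EncoderBlock()
out = block(torch.randn(1, 197, 768))  # (batch, tokens incl. [class], dim)
```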
Classification Head
The final representation of the [class] token is fed to the classification head: an MLP with one hidden layer during pre-training and a single linear layer during fine-tuning.
Model Variants
| Variant | Layers | Hidden Dim | Heads | Params |
|---|---|---|---|---|
| ViT-Base (ViT-B/16) | 12 | 768 | 12 | 86M |
| ViT-Large (ViT-L/16) | 24 | 1024 | 16 | 307M |
| ViT-Huge (ViT-H/14) | 32 | 1280 | 16 | 632M |
The notation ViT-B/16 means the Base model with 16×16 patches. ViT-B/32 uses 32×32 patches, yielding only 49 tokens for a 224-pixel image and running roughly 4× faster.
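A quick check of the sequence-length arithmetic for a 224-pixel input, using the patch sizes from the variants above:

```python
# Number of patch tokens N = (image_size / patch_size)^2 for a square image.
image_size = 224
for patch in (32, 16, 14):
    n = (image_size // patch) ** 2
    print(f"patch {patch}: {n} patch tokens (+1 [class] token)")
# patch 32: 49 tokens, patch 16: 196 tokens, patch 14: 256 tokens
```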
Why It Matters
- Architectural unification: ViT showed that the same Transformer architecture used for language can handle vision, enabling shared tooling and multi-modal models.
- Scalability: ViT scales more gracefully than CNNs -- performance improves log-linearly with compute and data without saturating as quickly.
- Global receptive field: Every patch attends to every other patch from layer 1, unlike CNNs that build receptive fields gradually.
- Foundation for multi-modal AI: Models like CLIP, DALL-E, and GPT-4V build directly on ViT-style vision encoders.
Key Technical Details
- Pre-trained on JFT-300M (Google internal), ViT-H/14 achieved 88.55% top-1 on ImageNet, exceeding the prior CNN SOTA.
- When trained only on ImageNet-1K (1.28M images), ViT-B/16 underperforms a comparable ResNet -- large-scale pre-training is critical.
- Fine-tuning at higher resolution (e.g., 384 or 512 pixels) is common; the positional embeddings are bilinearly interpolated to handle the longer patch sequence (see the sketch after this list).
- Training uses learning rate warmup, cosine decay, and heavy regularization (dropout, stochastic depth) for the smaller data regimes.
- Inference throughput for ViT-B/16 is comparable to ResNet-50 on modern GPU hardware despite higher FLOPs, because its large dense matrix multiplications utilize accelerators more efficiently than many small convolutions.
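The resolution point above can be illustrated with a small sketch of positional-embedding interpolation, assuming a ViT-B/16 checkpoint fine-tuned from 224 to 384 pixels; the function name and exact reshaping are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid=14, new_grid=24):
    """Bilinearly interpolate ViT positional embeddings for a new resolution.
    pos_embed: (1, 1 + old_grid**2, D); the [class] position is kept as-is."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    d = patch_pos.shape[-1]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bilinear", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)
    return torch.cat([cls_pos, patch_pos], dim=1)

# 224px/16 -> 14x14 = 196 patch positions; 384px/16 -> 24x24 = 576 patch positions
new_pos = resize_pos_embed(torch.randn(1, 197, 768), old_grid=14, new_grid=24)
print(new_pos.shape)  # (1, 577, 768)
```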
Common Misconceptions
- "ViT has no inductive bias for images, so it can never work on small datasets." While ViT lacks the translation equivariance of convolutions, techniques like DeiT, strong augmentation, and distillation have closed the gap on ImageNet-1K training. The inductive bias gap narrows with more data but is not an absolute barrier.
- "The class token is essential." Global average pooling over all patch tokens works equally well or better in many settings; the class token is a design choice inherited from BERT, not a requirement.
- "Patch size doesn't matter much." Reducing patch size from 16 to 14 increases the sequence length from 196 to 256, raising compute quadratically. Patch size is a critical accuracy-efficiency tradeoff.
Connections to Other Concepts
- deit.md: Demonstrated that ViT can be trained effectively on ImageNet-1K alone using distillation and strong augmentation.
- swin-transformer.md: Introduced hierarchical features and windowed attention to address ViT's quadratic cost and single-resolution limitation.
- attention-in-vision.md: ViT's core mechanism; understanding 2D positional encoding and patch-size tradeoffs deepens understanding of ViT design choices.
- dino.md: Uses ViT as the backbone for self-supervised learning, revealing that attention heads learn semantic segmentation without labels.
- vision-transformer-scaling.md: Quantifies how much data ViT needs to overtake CNNs and how performance grows with model size.
Further Reading
- Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (2020) -- The original ViT paper.
- Vaswani et al., "Attention Is All You Need" (2017) -- The Transformer architecture that ViT adapts.
- Steiner et al., "How to Train Your ViT? Data, Augmentation, and Regularization in Vision Transformers" (2021) -- Practical training recipes for ViT.
- Beyer et al., "Better plain ViT baselines for ImageNet-1k" (2022) -- Shows that simple tweaks make ViT competitive on ImageNet-1K without distillation.