Vision Transformer
In 2020, a paper out of Google asked: what if you ignored almost everything we know about images and just tokenized them into patches? The Vision Transformer’s answer was that, given enough data, you don’t need convolution.
The five-bullet version
- ViT cuts an image into fixed-size patches (e.g. 16×16), flattens each patch into a vector, and treats the sequence of patch vectors as if they were word tokens.
- A standard transformer encoder takes it from there. Add a learnable positional encoding for patch coordinates.
- No convolution. No translation invariance baked in. The model has to learn what convolutional networks got for free.
- With enough pretraining data (JFT-300M scale), ViT matches and exceeds ConvNet accuracy.
- The basis for CLIP, DINO, SAM, and most modern multimodal models.
§ 00 · TREAT AN IMAGE AS A SEQUENCE · The whole idea
Transformers were built for sequences. Words follow words; the model attends across them. Images are 2D arrays of pixels — different topology. Through the 2010s, every successful vision model relied on convolutions, which exploit 2D locality.
The Vision Transformer (ViT; Dosovitskiy et al., 2020) proposed something unintuitive: stop thinking of images as 2D arrays. Cut the image into patches. Linearize each patch into a vector. Treat the resulting sequence of patch vectors as if they were word embeddings. Run a standard transformer encoder over them. At the end, classify (or whatever your downstream task is).
No convolutions. No 2D-specific operations. Just patches and attention.
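To make that concrete, here is the tokenization step as a few lines of PyTorch. The einops rearrange is an assumed convenience; plain reshapes and permutes work too.

```python
import torch
from einops import rearrange  # assumed helper; plain .reshape/.permute works too

img = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)

# cut into non-overlapping 16x16 patches and flatten each into one vector:
# (1, 3, 224, 224) -> (1, 196, 768), a sequence of 196 "word-like" tokens
patches = rearrange(img, "b c (h p1) (w p2) -> b (h w) (p1 p2 c)", p1=16, p2=16)
print(patches.shape)  # torch.Size([1, 196, 768])
```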
§ 01 · PATCHES AS TOKENS · From image to sequence
Concretely: a 224×224 RGB image becomes a 14×14 grid of 16×16-pixel patches — 196 patches total. Each patch is 16×16×3 = 768 numbers, flattened into a vector. Run a learned linear projection on each patch vector to produce the “token embedding.” Prepend a special [CLS] token (à la BERT) whose final embedding is used for classification.
Add a learnable position embedding per patch position — without it, the transformer wouldn’t know which patch came from where in the image. The transformer is permutation-invariant; position information has to be added explicitly.
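Put together, the embedding stage looks roughly like this in PyTorch. A minimal sketch with ViT-Base dimensions (16-pixel patches, width 768); the class and variable names are mine, not the paper's, and the zero initializations are for brevity only.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Image -> sequence of patch tokens, plus [CLS] and position embeddings."""
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.patch = patch
        self.n_patches = (img_size // patch) ** 2          # 14 * 14 = 196
        self.proj = nn.Linear(patch * patch * in_ch, dim)  # learned linear projection
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))    # [CLS] token, à la BERT
        # one learnable position embedding per position (incl. [CLS])
        self.pos = nn.Parameter(torch.zeros(1, self.n_patches + 1, dim))

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        p = self.patch
        # (B, C, H, W) -> (B, C, H/p, W/p, p, p) via two unfolds
        x = x.unfold(2, p, p).unfold(3, p, p)
        # -> (B, n_patches, C*p*p): one flat vector per patch
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        x = self.proj(x)                                    # (B, 196, dim)
        cls = self.cls.expand(B, -1, -1)                    # (B, 1, dim)
        x = torch.cat([cls, x], dim=1)                      # prepend [CLS]
        return x + self.pos                                 # add position info
```

From here a stock transformer encoder (e.g. torch.nn.TransformerEncoder) processes the (B, 197, 768) sequence, and a linear head on the final [CLS] embedding does the classification.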
[Interactive figure: attention from a hovered patch. The hovered patch (orange) attends most strongly to its neighbors, plus a stylized far-field spike, mimicking what real ViT heads do: some learn local features (as a convolution would); others learn long-range associations (like text-attention heads).]
§ 02 · NO INDUCTIVE BIAS, LOTS OF DATA · The data efficiency trade-off
Convolutional networks bake in two inductive biases:
- Locality. Conv filters only look at a small spatial neighborhood. Nearby pixels matter more than distant ones.
- Translation equivariance. A feature detected in one location can be detected anywhere. Parameter sharing across space.
These biases are correct for natural images and let ConvNets learn efficiently from modest data. ViT has neither. Every pair of patches is connected at every layer; the model must learn the locality structure (or not) from data.
The consequence: ViTs are data-hungry. On ImageNet alone (1.3M images), ViTs underperform comparable ConvNets. Pretrain on JFT-300M (300M images) or larger, and ViTs match — then surpass — ConvNets. The crossover is around 100M images.
§ 03 · WHAT ATTENTION SEES IN AN IMAGE · Interpreting trained heads
When you probe a trained ViT, you find different heads doing different jobs (a minimal probing sketch follows this list):
- Local heads, especially in early layers, attend to neighboring patches — recovering something like a learned convolution.
- Long-range heads, especially in deeper layers, attend across the whole image. A patch on the left of a face attends to a patch on the right. Convolutions can’t do this in early layers; they have to wait until enough downsampling has happened for the receptive field to span that distance.
- Object-aware heads attend to patches in the same object regardless of position. In the famous DINO visualizations, you can see attention maps that effectively segment objects from background — without ever being trained to do segmentation.
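You can do a rough version of this probing yourself by recomputing one block's attention weights from its fused qkv projection. A sketch assuming timm's ViT-B/16 checkpoint and module layout (model.blocks[i].attn.qkv, num_heads, scale); if your timm version's internals differ, the hook needs adjusting.

```python
import torch
import timm  # assumption: timm's ViT-B/16 layout (model.blocks[i].attn.qkv)

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
attn = model.blocks[-1].attn               # attention module of the last block

maps = {}
def grab(module, inputs, qkv_out):
    # qkv_out: (B, N, 3*dim) from the fused qkv linear; redo the softmax here
    B, N, _ = qkv_out.shape
    qkv = qkv_out.reshape(B, N, 3, attn.num_heads, -1).permute(2, 0, 3, 1, 4)
    q, k = qkv[0], qkv[1]                  # each (B, heads, N, head_dim)
    maps["attn"] = (q @ k.transpose(-2, -1) * attn.scale).softmax(dim=-1)

handle = attn.qkv.register_forward_hook(grab)
with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))     # stand-in for a preprocessed image
handle.remove()

# row 0 is the [CLS] token; columns 1: are the 196 patches. One 14x14
# attention map per head; plot these to see local vs long-range heads.
cls_maps = maps["attn"][0, :, 0, 1:].reshape(attn.num_heads, 14, 14)
```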
§ 04 · WHERE VIT NOW SITS · The vision foundation
ViTs (and direct descendants) are the dominant architecture for modern vision foundation models:
- CLIP — ViT image encoder paired with a text transformer, trained on 400M image-text pairs. Produces aligned embeddings that power image search, zero-shot classification, and multimodal LLM inputs (see the zero-shot sketch after this list).
- DINO / DINOv2 — self-supervised ViTs. No labels; remarkably strong embeddings for downstream tasks.
- SAM — segmentation-anything; ViT image encoder plus a small prompt-conditioned decoder.
- Multimodal LLMs — GPT-4o, Claude, Gemini, and Llama 3.2 Vision all use ViT-style image encoders that feed patch embeddings into the LLM’s context.
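As a taste of what the CLIP bullet means in practice, here is zero-shot classification with the Hugging Face port of CLIP's ViT-B/32 checkpoint; cat.jpg is a hypothetical local file.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor  # Hugging Face port of CLIP

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")              # hypothetical local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# scaled image-text cosine similarities, softmaxed over the candidate captions
probs = out.logits_per_image.softmax(dim=-1)   # shape (1, 3)
print(dict(zip(labels, probs[0].tolist())))
```

No cat-specific training happened anywhere: the class set is just a list of strings, which is what makes the embeddings useful as multimodal LLM inputs.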
ConvNets are still useful — they remain competitive at modest scales and are often easier to deploy. But the dominant trajectory of vision ML since 2022 runs through ViT.
§ 05 · TAKING THIS FORWARD · Where vision is heading
Three threads worth following:
- Unified vision-language backbones. A single transformer that handles text and image tokens in the same stack. The trend in 2026 frontier multimodal models.
- Video transformers. Extend ViT to spatio-temporal patches (a tokenization sketch follows this list). Sora, Veo, and the open-source video generators all use patch-token transformers.
- Mixture-of-experts vision encoders. Scaling ViT past 10B params benefits from MoE the same way LLMs do. Still early; promising.
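The video thread is mostly the same tokenization with one extra axis. A sketch of spatio-temporal "tubelet" patchification, with the 2-frame tubelet depth chosen arbitrarily:

```python
import torch
from einops import rearrange  # same patchify idea as before, plus a time axis

video = torch.randn(1, 3, 16, 224, 224)   # (batch, channels, frames, H, W)

# 2-frame x 16x16 tubelets: (1, 3, 16, 224, 224) -> (1, 1568, 1536),
# i.e. 8 * 14 * 14 tokens of dimension 2 * 16 * 16 * 3
tokens = rearrange(
    video, "b c (t pt) (h p1) (w p2) -> b (t h w) (pt p1 p2 c)",
    pt=2, p1=16, p2=16,
)
print(tokens.shape)  # torch.Size([1, 1568, 1536])
```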
§ · GOING DEEPER · Why ViTs took over and what came after
The Vision Transformer paper (Dosovitskiy et al. 2020) made one surprising claim: with enough data, treating an image as a sequence of patch embeddings and running it through a standard transformer beats CNN inductive biases. The original result required pretraining on a private 300M-image dataset (JFT-300M); on ImageNet-1k alone, ViTs underperformed ResNets. Scale was load-bearing.
Three follow-ups widened the lead. Swin (Liu et al. 2021) introduced shifted-window attention, reintroducing locality as a hierarchical architectural prior. MAE (He et al. 2021) made self-supervised pretraining on images work by masking 75% of patches and reconstructing them. DINO (Caron et al. 2021) and DINOv2 (Oquab et al. 2023) showed that self-supervised ViT features cluster by semantic object: emergent object discovery without labels. Today, ViT-style architectures are the foundation under CLIP (whose text tower, in turn, conditions Stable Diffusion), SAM, and the image encoder of essentially every multimodal LLM.
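MAE's central trick, masking a random 75% of patch tokens before the encoder ever sees them, fits in a few lines. A sketch following the paper's per-sample random-shuffle approach; the function name is mine.

```python
import torch

def random_mask(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random (1 - mask_ratio) fraction of patch tokens per sample.

    Returns the visible tokens, the kept indices, and the inverse
    permutation the decoder needs to scatter tokens back into place.
    """
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)               # one random score per token
    shuffle = noise.argsort(dim=1)         # random permutation per sample
    restore = shuffle.argsort(dim=1)       # inverse permutation
    keep = shuffle[:, :n_keep]             # indices of the visible 25%
    visible = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep, restore

tokens = torch.randn(2, 196, 768)          # patch embeddings, no [CLS]
visible, keep, restore = random_mask(tokens)
print(visible.shape)                       # torch.Size([2, 49, 768])
```

The encoder runs on only the visible quarter of the sequence, which is also why MAE pretraining is cheap relative to full-sequence objectives.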
§ · FURTHER READING · References & deeper sources
- Dosovitskiy et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT) · ICLR
- Liu et al. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows · ICCV
- Caron et al. (2021). Emerging Properties in Self-Supervised Vision Transformers (DINO) · ICCV
- He et al. (2021). Masked Autoencoders Are Scalable Vision Learners (MAE) · CVPR
- Oquab et al. (2023). DINOv2: Learning Robust Visual Features without Supervision · arXiv
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.