Architectures · Module 21 · 9 min read

Vision Transformer

In 2020, a paper out of Google asked: what if you ignored almost everything we know about images and just tokenized them into patches? The Vision Transformer’s answer was that, given enough data, you don’t need convolution.

The five-bullet version

  • ViT cuts an image into fixed-size patches (e.g. 16×16), flattens each patch into a vector, and treats the sequence of patch vectors as if they were word tokens.
  • A standard transformer encoder takes it from there. Add a learnable positional encoding for patch coordinates.
  • No convolution. No translation invariance baked in. The model has to learn what convolutional networks got for free.
  • With enough pretraining data (JFT-300M scale), ViT matches and exceeds ConvNet accuracy.
  • The basis for CLIP, DINO, SAM, and most modern multimodal models.

§ 00 · TREAT AN IMAGE AS A SEQUENCE · The whole idea

Transformers were built for sequences. Words follow words; the model attends across them. Images are 2D arrays of pixels — different topology. Through the 2010s, every successful vision model relied on convolutions, which exploit 2D locality.

The Vision Transformer (ViT, Dosovitskiy et al. 2020) proposed something unintuitive: stop thinking of images as 2D arrays. Cut the image into fixed-size patches. Linearize each patch into a vector. Treat the resulting sequence of patch vectors as if they were word embeddings. Run a standard transformer encoder over them. At the end, classify (or whatever your downstream task is).

No convolutions. No 2D-specific operations. Just patches and attention.

§ 01 · PATCHES AS TOKENS · From image to sequence

Concretely: a 224×224 RGB image becomes a 14×14 grid of 16×16-pixel patches — 196 patches total. Each patch is 16×16×3 = 768 numbers, flattened into a vector. Run a learned linear projection on each patch vector to produce the “token embedding.” Prepend a special [CLS] token (à la BERT) whose final embedding is used for classification.

Add a learnable position embedding per patch position — without it, the transformer wouldn’t know which patch came from where in the image. The transformer is permutation-invariant; position information has to be added explicitly.
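The patch-to-token pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration: random weights stand in for the learned projection, [CLS] token, and position embeddings a trained ViT would have.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 224x224 RGB image, 16x16 patches, embedding width 768 (ViT-Base sizes).
image = rng.standard_normal((224, 224, 3))
P, D = 16, 768

# 1. Cut into a 14x14 grid of patches and flatten each to a 768-vector.
H, W, C = image.shape
patches = image.reshape(H // P, P, W // P, P, C)                   # (14,16,14,16,3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)  # (196, 768)

# 2. Learned linear projection to token embeddings (random stand-in weights).
W_proj = rng.standard_normal((P * P * C, D)) * 0.02
tokens = patches @ W_proj                                          # (196, 768)

# 3. Prepend the [CLS] token, then add a position embedding per position.
cls = rng.standard_normal((1, D)) * 0.02
tokens = np.concatenate([cls, tokens], axis=0)                     # (197, 768)
pos_embed = rng.standard_normal((197, D)) * 0.02
tokens = tokens + pos_embed

print(tokens.shape)  # (197, 768) -- a sequence, ready for a standard encoder
```

From here, nothing downstream knows the input was ever an image: the encoder sees 197 tokens, exactly as it would see 197 words.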

Lab · 14×14 patches · where a single patch's attention goes

A selected patch (orange) attends most strongly to its neighbors, plus a synthetic far-field spike — a stylized version of what real ViT heads do. Some learn local features (like a conv would); others learn long-range associations (like text-attention heads).

§ 02 · NO INDUCTIVE BIAS, LOTS OF DATA · The data efficiency trade-off

Convolutional networks bake in two inductive biases:

  • Locality: each unit sees only a small neighborhood of pixels, so nearby pixels are assumed to be related.
  • Translation equivariance: the same filter weights slide across the whole image, so a feature is detected wherever it appears.

These biases are correct for natural images and let ConvNets learn efficiently from modest data. ViT has neither. Every pair of patches is connected at every layer; the model must learn the locality structure (or not) from data.
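The "every pair of patches is connected" claim can be checked directly: one self-attention layer over 196 patch tokens produces a dense 196×196 weight matrix. A minimal sketch, with small random weights standing in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 196, 64  # 196 patch tokens, per-head dimension (illustrative)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x = rng.standard_normal((N, d))
Wq = rng.standard_normal((d, d)) * 0.1
Wk = rng.standard_normal((d, d)) * 0.1
A = softmax((x @ Wq) @ (x @ Wk).T / np.sqrt(d))  # (196, 196) attention matrix

# Every patch attends, with some weight, to every other patch at layer one.
# Nothing restricts attention to a local neighborhood, unlike a convolution.
print(A.shape)        # (196, 196)
print((A > 0).all())  # True: the connectivity graph is complete
```

If locality matters (and for images it does), the model has to learn to downweight distant patches itself, which is where the data hunger comes from.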

The consequence: ViTs are data-hungry. On ImageNet alone (1.3M images), ViTs underperform comparable ConvNets. Pretrain on JFT-300M (300M images) or larger, and ViTs match — then surpass — ConvNets. The crossover is around 100M images.

§ 03 · WHAT ATTENTION SEES IN AN IMAGE · Interpreting trained heads

When you probe a trained ViT, you find different heads doing different jobs:

  • Local heads attend mostly to neighboring patches, much like a convolution.
  • Global heads bind distant parts of the same object across the image.
  • In self-supervised models like DINO, attention from the [CLS] token traces object boundaries.

Fig 1 · DINO attention maps (input patches → attention rollout → emergent object mask). Self-supervised ViTs learn to attend to objects without ever seeing a segmentation label; segmentation-like attention emerges as a property of pretraining on raw images.

§ 04 · WHERE VIT NOW SITS · The vision foundation

ViTs (and direct descendants) are the dominant architecture for modern vision foundation models:

  • CLIP: contrastive image–text pretraining; its image encoder is a ViT.
  • SAM: promptable segmentation built on a ViT backbone.
  • DINO / DINOv2: self-supervised ViT features that transfer broadly.
  • MAE: masked-patch pretraining that makes ViT self-supervision scale.
  • The vision encoders of multimodal LLMs.

ConvNets are still useful — they remain competitive at modest scales and are often easier to deploy. But the dominant trajectory of vision ML since 2022 runs through ViT.

CHECK · You have 5,000 labeled images of factory parts (10 classes). Limited training data, but accurate classification matters. Best starting point?

§ 05 · TAKING THIS FORWARD · Where vision is heading

Three threads worth following:

  • Self-supervised pretraining (MAE, DINOv2): transferable visual features learned from raw images, no labels.
  • ViT as the vision half of multimodal systems: CLIP-style encoders feeding LLMs.
  • Locality, relearned: hierarchical and windowed designs (Swin and successors) that reintroduce convolution-like structure where it pays.

§ · GOING DEEPER · Why ViTs took over and what came after

The Vision Transformer paper (Dosovitskiy et al. 2020) made one surprising claim: with enough data, treating an image as a sequence of patch embeddings and running it through a standard transformer beats CNN inductive biases. The original result required pretraining on a private 300M-image dataset (JFT-300M); on ImageNet-1k alone, ViTs underperformed ResNets. Scale was load-bearing.

Three follow-ups widened the lead. Swin (Liu et al. 2021) introduced shifted windowed attention — locality back, but learned. MAE (He et al. 2021) made self-supervised pretraining on images work by masking 75% of patches and reconstructing them. DINO (Caron et al. 2021) and DINOv2 (Oquab et al. 2023) showed that self-supervised ViT features cluster by semantic object — emergent object discovery without labels. Today, ViT-style architectures are the foundation under CLIP's image encoder, SAM, DINOv2, and the vision encoders of essentially every multimodal LLM.
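MAE's masking step is simple enough to sketch. The code below drops a random 75% of patch tokens and keeps only the visible 25% for the encoder; random embeddings stand in for real patch tokens, and the real MAE adds a lightweight decoder plus a pixel reconstruction loss on the masked patches.

```python
import numpy as np

rng = np.random.default_rng(0)

# 196 patch embeddings (random stand-ins for a real patchified image).
tokens = rng.standard_normal((196, 768))
mask_ratio = 0.75
n_keep = int(len(tokens) * (1 - mask_ratio))  # 49 visible patches

# Shuffle indices; first n_keep are visible, the rest are masked out.
perm = rng.permutation(len(tokens))
visible_idx = np.sort(perm[:n_keep])
masked_idx = np.sort(perm[n_keep:])

encoder_input = tokens[visible_idx]           # encoder sees only 25% of tokens
print(encoder_input.shape, len(masked_idx))   # (49, 768) 147
```

Because the encoder processes only a quarter of the sequence, pretraining is cheap as well as label-free, which is a large part of why MAE scales.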

§ · FURTHER READING · References & deeper sources

  1. Dosovitskiy et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT) · ICLR
  2. Liu et al. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows · ICCV
  3. Caron et al. (2021). Emerging Properties in Self-Supervised Vision Transformers (DINO) · ICCV
  4. He et al. (2021). Masked Autoencoders Are Scalable Vision Learners (MAE) · CVPR
  5. Oquab et al. (2023). DINOv2: Learning Robust Visual Features without Supervision · arXiv

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.