One-Line Summary: Zero-shot classification recognizes visual categories never seen during training by using natural language descriptions as class prototypes in a shared vision-language embedding space.

Prerequisites: CLIP, embedding spaces, cosine similarity, transfer learning, softmax classification

What Is Zero-Shot Classification?

Consider how you can recognize a pangolin even if you have never seen one in person -- someone describes it as "an armored mammal covered in overlapping scales that rolls into a ball," and you match that description against what you see. Zero-shot classification works on the same principle: instead of learning from labeled images of each category, the model uses text descriptions as stand-ins for visual prototypes.

Formally, zero-shot classification is the task of assigning an image $x$ to one of $C$ categories when none of these categories appeared in the training set. The model has learned a mapping from images and text into a shared embedding space during pretraining (typically on image-caption pairs), and at inference time it computes:

$$P(y = c \mid x) = \frac{\exp(\mathrm{sim}(v, t_c) / \tau)}{\sum_{c'=1}^{C} \exp(\mathrm{sim}(v, t_{c'}) / \tau)}$$

where $v$ is the image embedding, $t_c$ is the text embedding for class $c$, $\mathrm{sim}(\cdot, \cdot)$ is cosine similarity, and $\tau$ is a temperature parameter.
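
In practice the whole pipeline is a few lines of code. Below is a minimal sketch using the Hugging Face transformers CLIP implementation; the checkpoint name and the image path `cat.jpg` are illustrative placeholders.

```python
# Zero-shot classification with CLIP: text descriptions act as class prototypes.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["cat", "dog", "pangolin"]
texts = [f"a photo of a {c}" for c in classes]
image = Image.open("cat.jpg")  # placeholder path

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds temperature-scaled cosine similarities, shape (1, C).
probs = outputs.logits_per_image.softmax(dim=-1)
for c, p in zip(classes, probs[0].tolist()):
    print(f"{c}: {p:.3f}")
```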

How It Works

The Text-as-Classifier Paradigm

Traditional classifiers learn a weight vector $w_c$ for each class $c$ from labeled data. In zero-shot classification, the text encoder generates $t_c$ from a natural language description of class $c$, effectively replacing $w_c$ with a semantically meaningful prototype. This swap is possible because the shared embedding space preserves cross-modal semantic structure.
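
To make the swap concrete, here is a NumPy sketch in which the stacked text embeddings act as the weight matrix of a linear classifier; the random vectors are stand-ins for real encoder outputs.

```python
# Text-as-classifier: stacked, L2-normalized text embeddings form the weight
# matrix W of a linear classifier over image embeddings.
import numpy as np

rng = np.random.default_rng(0)
d, num_classes = 512, 3

W = rng.normal(size=(num_classes, d))          # stand-in text embeddings t_c
W /= np.linalg.norm(W, axis=1, keepdims=True)  # L2-normalize each prototype

v = rng.normal(size=d)                         # stand-in image embedding
v /= np.linalg.norm(v)

tau = 0.01                                     # temperature
logits = W @ v / tau                           # cosine similarities / tau
probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # softmax over classes
print(probs.argmax(), probs)
```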

Prompt Engineering

The text input for each class significantly affects accuracy. Common strategies include:

Simple templates:

"a photo of a {class}"
"a {class}"
"this is a {class}"

Context-enriched templates:

"a photo of a {class}, a type of food"
"a centered satellite photo of {class}"
"a black and white photo of a {class}"

Prompt ensembling: CLIP uses 80 handcrafted templates for ImageNet (with tailored template sets for other datasets) and averages the resulting text embeddings before re-normalizing:

$$\bar{t}_c = \mathrm{normalize}\!\left(\frac{1}{M} \sum_{m=1}^{M} t_c^{(m)}\right)$$

where $t_c^{(m)}$ is the embedding of class $c$ under template $m$.

In the CLIP paper, prompt engineering and ensembling together improved ImageNet zero-shot accuracy by almost 5 points; with them, CLIP ViT-L/14@336px reaches 76.2%.
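
A sketch of prompt ensembling with the transformers CLIP text encoder; the three templates here are illustrative stand-ins for the full 80-template set.

```python
# Prompt ensembling: encode each template per class, average, re-normalize.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

templates = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]
classes = ["cat", "dog", "pangolin"]

prototypes = []
with torch.no_grad():
    for c in classes:
        inputs = processor(text=[t.format(c) for t in templates],
                           return_tensors="pt", padding=True)
        emb = model.get_text_features(**inputs)     # (M, d), one row per template
        emb = emb / emb.norm(dim=-1, keepdim=True)  # normalize each embedding
        mean = emb.mean(dim=0)                      # average over templates
        prototypes.append(mean / mean.norm())       # re-normalize the average
prototypes = torch.stack(prototypes)                # (num_classes, d)
```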

Generalized Zero-Shot Learning (GZSL)

In the generalized setting, the test set contains both seen and unseen categories. This is harder because models are biased toward seen classes. Calibrated stacking addresses this by subtracting a bias term from seen-class scores:

$$\hat{y} = \arg\max_{c} \big( s_c - \gamma \, \mathbb{1}[c \in \mathcal{S}] \big)$$

where $s_c$ is the score for class $c$, $\mathcal{S}$ is the set of seen classes, and $\gamma$ is a calibration constant.
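
Once scores are computed, calibrated stacking is a one-line adjustment; a minimal sketch with made-up scores:

```python
# Calibrated stacking: subtract gamma from seen-class scores before the argmax.
import numpy as np

scores = np.array([2.1, 1.8, 1.9, 1.7])      # similarity scores, one per class
seen = np.array([True, True, False, False])  # classes 0-1 were seen in training
gamma = 0.3                                  # calibration constant, tuned on validation data

calibrated = scores - gamma * seen           # penalize seen classes
print(calibrated.argmax())                   # unseen class 2 now beats seen class 0
```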

Hierarchy of Zero-Shot Approaches

  1. Attribute-based (classic): Map images and classes to a shared attribute space (e.g., "has stripes," "is furry"). Limited by predefined attributes.
  2. Embedding-based (2013-2019): Embed class names with Word2Vec or GloVe and learn a mapping from visual features into that word-embedding space. Limited by text embedding quality.
  3. Vision-language pretraining (2021+): CLIP, ALIGN, SigLIP learn aligned spaces from web-scale data. Dominant paradigm.

SigLIP: Improving the Contrastive Objective

SigLIP (Google, 2023) replaces CLIP's softmax-based contrastive loss with a sigmoid loss applied to each image-text pair independently:

$$\mathcal{L} = -\frac{1}{|\mathcal{B}|} \sum_{i=1}^{|\mathcal{B}|} \sum_{j=1}^{|\mathcal{B}|} \log \sigma\big(z_{ij} \, (t \, x_i \cdot y_j + b)\big)$$

where $x_i$ and $y_j$ are the normalized image and text embeddings, $z_{ij} = 1$ if $(i, j)$ is a matched pair and $-1$ otherwise, and $t$ and $b$ are a learnable temperature and bias.

This removes the need for all-to-all communication within a batch, enabling larger effective batch sizes. SigLIP ViT-B/16 achieves 78.2% zero-shot ImageNet top-1 accuracy, outperforming CLIP ViT-B/16 at 71.1%.
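
A PyTorch sketch of the sigmoid loss, paraphrasing the pseudocode in the SigLIP paper rather than reproducing the official implementation:

```python
# Sigmoid contrastive loss: every image-text pair in the batch is an
# independent binary classification problem (matched vs. unmatched).
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t, b):
    """img_emb, txt_emb: (B, d), L2-normalized; t, b: learnable scalars."""
    logits = img_emb @ txt_emb.T * t + b              # (B, B) pairwise logits
    # z_ij = +1 on the diagonal (matched pairs), -1 everywhere else.
    z = 2 * torch.eye(logits.shape[0], device=logits.device) - 1
    # -log sigmoid(z * logits), summed over pairs, averaged over the batch.
    return -F.logsigmoid(z * logits).sum() / logits.shape[0]

B, d = 8, 512
img = F.normalize(torch.randn(B, d), dim=-1)
txt = F.normalize(torch.randn(B, d), dim=-1)
t, b = torch.tensor(10.0), torch.tensor(-10.0)        # init values from the paper
print(siglip_loss(img, txt, t, b))
```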

Why It Matters

  1. Eliminates per-task annotation: Deploying a classifier for a new set of categories requires only writing text descriptions, not collecting and labeling thousands of images.
  2. Scales to thousands of classes: Text prototypes can be generated for any number of categories at negligible cost, while building labeled datasets grows steadily more expensive with every added class.
  3. Enables rapid prototyping: Product teams can evaluate whether a visual classification task is feasible in minutes rather than weeks.
  4. Handles evolving taxonomies: When new categories emerge (e.g., a new product line), the system adapts immediately through new text descriptions without retraining.

Key Technical Details

  • ImageNet zero-shot benchmarks (top-1 accuracy): CLIP ViT-L/14@336px: 76.2%; OpenCLIP ViT-G/14: 80.1%; SigLIP ViT-SO400M: 83.1%; EVA-CLIP-18B: 83.8%
  • Domain sensitivity: Zero-shot accuracy drops sharply on specialized domains -- CLIP achieves 76.2% on ImageNet but only 58.8% on EuroSAT (satellite) and 43.3% on DTD (textures)
  • Few-shot hybrid: Adding even 1-4 labeled examples per class (few-shot) via linear probing or adapter tuning often boosts accuracy by 10-20 points over pure zero-shot (see the linear-probe sketch after this list)
  • Compute at inference: Zero-shot classification requires only one forward pass per image plus precomputed text embeddings, making it faster than ensemble methods
  • Label granularity: Performance degrades with fine-grained classes; distinguishing dog breeds zero-shot (Stanford Dogs) is much harder than distinguishing broad categories (CIFAR-10)
  • Embedding normalization: Both image and text embeddings must be L2-normalized before similarity computation; skipping this degrades accuracy by 10+ points
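
The few-shot hybrid above is a small amount of code. This sketch uses a scikit-learn logistic-regression probe on random stand-in features; in practice `X_train` would hold precomputed, frozen CLIP image embeddings, and the regularization strength `C` should be tuned on held-out data.

```python
# Few-shot linear probe: fit a logistic-regression head on frozen image features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
num_classes, shots, d = 10, 4, 512

# Stand-ins for precomputed, L2-normalized image embeddings (4 shots per class).
X_train = rng.normal(size=(num_classes * shots, d))
y_train = np.repeat(np.arange(num_classes), shots)

probe = LogisticRegression(max_iter=1000, C=0.316)  # C is a hyperparameter to tune
probe.fit(X_train, y_train)
print(probe.predict(X_train[:3]))
```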

Common Misconceptions

  • "Zero-shot means no training at all." The model is extensively pretrained on hundreds of millions of image-text pairs. "Zero-shot" refers specifically to the target classification task -- no examples from those categories were used during pretraining in a labeled sense.

  • "Zero-shot classification works equally well for all domains." Performance varies enormously by domain. CLIP's accuracy can range from 95%+ on simple benchmarks (CIFAR-10) to below 50% on specialized domains (fine-grained medical, satellite imagery).

  • "Text embeddings are a drop-in replacement for trained classifiers." On in-distribution data, a linear probe trained on even 16 labeled examples per class typically outperforms zero-shot classification by 5-15 percentage points. Zero-shot is powerful when labeled data is unavailable, not when it is plentiful.

  • "Any text description will work." The text must be phrased in a style similar to the pretraining captions. Technical jargon, long descriptions, and negations often degrade performance compared to simple noun-phrase templates.

Connections to Other Concepts

  • clip.md: The dominant model for zero-shot classification; provides the shared embedding space and contrastive training framework.
  • open-vocabulary-detection.md: Extends zero-shot classification from whole images to localized regions within images.
  • image-captioning.md: The inverse task -- generating text from images rather than matching images to text categories.
  • vision-foundation-models.md: Zero-shot capability is a key evaluation metric for foundation models like CLIP, SigLIP, and EVA-CLIP.
  • transfer-learning.md: Zero-shot classification is an extreme form of transfer where no task-specific adaptation occurs.

Further Reading

  • Radford et al., "Learning Transferable Visual Models From Natural Language Supervision" (2021) -- CLIP's zero-shot evaluation protocol.
  • Zhai et al., "Sigmoid Loss for Language Image Pre-Training" (2023) -- SigLIP, improving contrastive training for zero-shot performance.
  • Xian et al., "Zero-Shot Learning -- A Comprehensive Evaluation of the Good, the Bad and the Ugly" (2019) -- Systematic benchmark of pre-CLIP zero-shot methods.
  • Zhou et al., "Learning to Prompt for Vision-Language Models" (CoOp, 2022) -- Learnable prompt tuning to bridge zero-shot and few-shot regimes.