One-Line Summary: Generative model quality is measured by FID (distribution distance, lower is better), Inception Score (diversity and quality), CLIP Score (text-image alignment), LPIPS (perceptual similarity), and KID (unbiased small-sample alternative to FID).

Prerequisites: Generative Models, Convolutional Neural Networks, Feature Extraction, Probability and Statistics Basics

What Are Generative Model Metrics?

Judging whether a generated image is "good" is notoriously subjective. One person might praise the detail; another might notice a distorted hand. Generative model metrics attempt to replace subjective judgment with reproducible, quantitative scores. They answer questions like: Does the distribution of generated images match real images? Are generated images diverse? Do they align with text prompts?

Technically, generative model metrics compare statistics of generated samples against reference datasets or evaluate alignment with conditioning inputs. No single metric captures all aspects of generation quality (fidelity, diversity, novelty, alignment), so multiple metrics are reported together. The field struggles with the fact that these metrics are imperfect proxies for human preference -- a tension that drives ongoing research.

How It Works

Frechet Inception Distance (FID)

FID (Heusel et al., 2017) is the most widely used metric for unconditional and class-conditional image generation.

Procedure:

  1. Extract 2048-dimensional features from the penultimate layer of InceptionV3 for both real and generated images.
  2. Fit a multivariate Gaussian to real features and to generated features.
  3. Compute the Frechet distance (Wasserstein-2 distance between the two Gaussians):

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\bigl(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\bigr)$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the means and covariances of the real and generated features.

Interpretation: FID = 0 means identical distributions; lower is better. FID captures both fidelity (are generated images realistic?) and diversity (do they cover the real distribution?). A model that generates only one perfect image scores poorly because diversity collapses.
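
As a concrete illustration of step 3, here is a minimal sketch of the Frechet distance computed from precomputed feature arrays with NumPy and SciPy. It assumes `real_feats` and `gen_feats` (hypothetical names) are (N, 2048) arrays of InceptionV3 activations and omits the feature-extraction step itself.

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_feats, gen_feats):
    """FID between two feature sets, each of shape (N, 2048)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)

    # Matrix square root of the covariance product; tiny imaginary
    # components from numerical error are discarded.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean)
```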

Practical considerations:

  • Requires a minimum of ~10,000 generated samples for stable estimates; 50,000 is standard.
  • The reference set matters: FID computed against ImageNet validation vs. training gives different values.
  • FID is biased: it systematically overestimates the true distance with small sample sizes.

Typical values: StyleGAN3 on FFHQ-256: FID ~3--4. Latent Diffusion (Stable Diffusion) on LAION: FID ~5--10 on COCO-30K. Human-indistinguishable quality is generally considered FID < 10 on standard benchmarks.

Inception Score (IS)

IS (Salimans et al., 2016) uses the InceptionV3 classifier to evaluate generated images.

$$\mathrm{IS} = \exp\Bigl(\mathbb{E}_{x \sim p_g}\bigl[D_{\mathrm{KL}}\bigl(p(y \mid x) \,\Vert\, p(y)\bigr)\bigr]\Bigr)$$

where $p(y \mid x)$ is the InceptionV3 class prediction for a generated image $x$, and $p(y) = \mathbb{E}_x[p(y \mid x)]$ is the marginal class distribution.

Interpretation: High IS means each image is confidently classified (sharp $p(y \mid x)$, indicating quality) AND the overall distribution covers many classes (uniform $p(y)$, indicating diversity). Higher is better. Maximum IS on ImageNet is ~1,000 (one confident image per class).
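
A minimal sketch of the IS computation, assuming `probs` (hypothetical name) is an (N, 1000) array of InceptionV3 softmax outputs for generated images; the standard 10-split averaging is included.

```python
import numpy as np

def inception_score(probs, n_splits=10):
    """Inception Score from an (N, 1000) array of softmax outputs."""
    scores = []
    for chunk in np.array_split(probs, n_splits):
        p_y = chunk.mean(axis=0, keepdims=True)  # marginal p(y) for this split
        # KL(p(y|x) || p(y)) per image, then exponentiate the mean.
        kl = (chunk * (np.log(chunk + 1e-12) - np.log(p_y + 1e-12))).sum(axis=1)
        scores.append(np.exp(kl.mean()))
    return float(np.mean(scores)), float(np.std(scores))
```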

Limitations:

  • Only measures quality/diversity within ImageNet's 1,000 classes. Ignores intra-class diversity.
  • Does not compare to real data -- IS uses no reference set, so outputs that are confidently classified and class-diverse can score well even if they do not resemble real images.
  • Sensitive to InceptionV3 artifacts; images adversarially optimized for InceptionV3 can achieve high IS without visual quality.

Typical values: BigGAN on ImageNet-128: IS ~171. Real ImageNet images: IS ~331.

CLIP Score

CLIP Score (Hessel et al., 2021) measures text-image alignment for text-to-image generation.

$$\mathrm{CLIPScore}(x, c) = w \cdot \max\bigl(\cos\bigl(E_I(x),\, E_T(c)\bigr),\ 0\bigr)$$

where $E_I$ and $E_T$ are CLIP's image and text encoders, $c$ is the conditioning text prompt, and $w$ is a scaling constant ($w = 2.5$ in Hessel et al.; some implementations use 100).

Interpretation: Higher CLIP Score means the generated image better matches the text description. This metric is essential for evaluating text-to-image models (DALL-E, Stable Diffusion, Midjourney) because FID does not measure prompt adherence.
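
A minimal sketch of the score itself, assuming the image and text embeddings have already been produced by a CLIP model (the encoder calls are omitted because their API varies by library); `w = 100` follows one common convention.

```python
import numpy as np

def clip_score(image_emb, text_emb, w=100.0):
    """Scaled, clipped cosine similarity between CLIP embeddings."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return w * max(float(image_emb @ text_emb), 0.0)
```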

Limitations: CLIP has its own biases -- it may score highly on images that match CLIP's learned associations rather than genuine semantic alignment. CLIP Score can conflict with FID: a model can optimize for prompt alignment at the expense of photorealism.

Learned Perceptual Image Patch Similarity (LPIPS)

LPIPS (Zhang et al., 2018) measures perceptual distance between two specific images (not distributions).

Procedure: Extract features from multiple layers of a pretrained network (AlexNet, VGG, or SqueezeNet), compute weighted L2 distance per layer, and average:

$$d(x, x_0) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \bigl\lVert w_l \odot \bigl(\hat{y}^l_{hw} - \hat{y}^l_{0,hw}\bigr) \bigr\rVert_2^2$$

where $\hat{y}^l_{hw}$ and $\hat{y}^l_{0,hw}$ are channel-normalized features of $x$ and $x_0$ at layer $l$, and $w_l$ are learned per-channel weights.

Interpretation: LPIPS = 0 means identical images; higher values indicate greater perceptual difference. Lower is better when evaluating reconstruction quality (e.g., image super-resolution, style transfer). LPIPS correlates with human perceptual judgments better than PSNR or SSIM (by ~2x in reported agreement rates).

Typical values: Two visually similar but pixel-shifted images: LPIPS ~0.1. Clearly different images: LPIPS ~0.5--0.7.
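
In practice LPIPS is usually computed with the authors' `lpips` package rather than reimplemented. A usage sketch, assuming the package's documented interface (torch tensors of shape (N, 3, H, W) scaled to [-1, 1]):

```python
import torch
import lpips  # pip install lpips

# Learned weights for the AlexNet backbone are downloaded on first use.
loss_fn = lpips.LPIPS(net='alex')

# Two images as float tensors in [-1, 1], shape (N, 3, H, W).
img0 = torch.rand(1, 3, 256, 256) * 2 - 1
img1 = torch.rand(1, 3, 256, 256) * 2 - 1

distance = loss_fn(img0, img1)  # near 0 for perceptually identical images
print(distance.item())
```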

Kernel Inception Distance (KID)

KID (Binkowski et al., 2018) is an alternative to FID that uses the squared Maximum Mean Discrepancy (MMD) with a polynomial kernel:

$$\mathrm{KID} = \mathrm{MMD}^2(P_r, P_g), \qquad k(x, y) = \Bigl(\tfrac{1}{d}\, x^\top y + 1\Bigr)^3$$

where $k$ is a degree-3 polynomial kernel, $d$ is the feature dimension, and $x, y$ are InceptionV3 features of real and generated images.

Advantages over FID:

  • Unbiased: KID has an unbiased estimator, unlike FID which is biased upward for small samples.
  • Works with fewer samples: Reliable estimates from ~1,000 images (vs. 10,000+ for FID).
  • No Gaussian assumption: FID assumes features are Gaussian-distributed; KID does not.

Typical values: KID is often reported multiplied by 10^3 for readability. StyleGAN3 on FFHQ: KID ~1--2 x 10^-3.
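
A minimal sketch of the unbiased MMD^2 estimator with the polynomial kernel above, operating on (N, 2048) InceptionV3 feature arrays (in practice the estimate is averaged over several random subsets):

```python
import numpy as np

def poly_kernel(x, y):
    """Degree-3 polynomial kernel between feature matrices x (N, d) and y (M, d)."""
    d = x.shape[1]
    return (x @ y.T / d + 1.0) ** 3

def kid(real_feats, gen_feats):
    """Unbiased estimate of MMD^2 between real and generated features."""
    k_rr = poly_kernel(real_feats, real_feats)
    k_gg = poly_kernel(gen_feats, gen_feats)
    k_rg = poly_kernel(real_feats, gen_feats)
    n, m = real_feats.shape[0], gen_feats.shape[0]

    # Exclude diagonal terms for the unbiased estimator.
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (n * (n - 1))
    term_gg = (k_gg.sum() - np.trace(k_gg)) / (m * (m - 1))
    return term_rr + term_gg - 2.0 * k_rg.mean()
```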

Human Evaluation

All automated metrics are proxies. Human evaluation remains the gold standard:

  • Side-by-side comparison: Show two generated images; ask which is more realistic or better matches the prompt.
  • Mean Opinion Score (MOS): Rate images on a 1--5 Likert scale.
  • Human evaluations are expensive (~$0.10--0.50 per judgment) and noisy (inter-rater agreement typically 70--80%).

Why It Matters

  1. FID is the gatekeeper metric for publication: a new generative model must demonstrate lower FID than baselines to claim improvement.
  2. CLIP Score has become essential for text-to-image models, as FID alone cannot distinguish a photorealistic but prompt-irrelevant image from a well-aligned one.
  3. Metric limitations drive research: the recognition that FID uses outdated InceptionV3 features has spurred development of FD_DINOv2 and CMMD as potential replacements.
  4. No single metric is sufficient -- responsible evaluation requires reporting multiple metrics plus human evaluation.

Key Technical Details

  • FID is computed using pytorch-fid or cleanfid libraries (usage sketch after this list). Always specify the reference dataset, number of samples, and image resolution.
  • Standard FID benchmarks: CIFAR-10 (50K samples), FFHQ-256 (70K samples), COCO-30K (30K samples, text-conditioned).
  • IS is computed over 50,000 samples, split into 10 groups, reporting mean +/- std.
  • CLIP Score uses ViT-B/32 or ViT-L/14 CLIP encoders. ViT-L/14 is preferred for higher discrimination.
  • LPIPS with AlexNet backbone: ~5 ms per pair on a GPU. VGG backbone is slower but slightly more accurate.
  • FID is sensitive to image preprocessing: resizing method (bilinear vs. bicubic), JPEG compression, and center cropping all affect scores by 1--5 points.
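
A hedged usage sketch for the two FID libraries mentioned above, assuming their documented entry points (a directory of real images and a directory of generated images in, a scalar FID out):

```python
# pytorch-fid is typically invoked from the command line:
#   python -m pytorch_fid path/to/real_images path/to/generated_images

# clean-fid exposes a Python API; sketch assuming its documented interface.
from cleanfid import fid

score = fid.compute_fid("path/to/real_images", "path/to/generated_images")
print(f"FID: {score:.2f}")
```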

Common Misconceptions

  • "Low FID means the model generates perfect images." FID measures distributional similarity, not individual image quality. A model can achieve low FID by generating diverse, slightly blurry images that match the overall statistics.
  • "Inception Score and FID measure the same thing." IS measures quality and diversity using class predictions. FID measures distributional distance using features. They can disagree: a model generating only one class of very realistic images scores high IS but high FID.
  • "LPIPS replaces PSNR and SSIM." LPIPS is better at capturing perceptual similarity, but PSNR and SSIM remain useful for measuring pixel-level and structural fidelity. They are complementary.

Connections to Other Concepts

  • Generative Models: FID and IS are the primary evaluation metrics for GANs, diffusion models, and VAEs.
  • multimodal-models.md: CLIP Score evaluates the text-image alignment that CLIP and similar models are trained to optimize.
  • feature-extraction-and-transformation.md: All metrics (FID, IS, KID, LPIPS) rely on features from pretrained networks as perceptual representations.
  • benchmark-leaderboards.md: Generative model leaderboards typically rank by FID, with CLIP Score and IS as secondary metrics.

Further Reading

  • Heusel et al., "GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium" (2017) -- Introduced FID.
  • Salimans et al., "Improved Techniques for Training GANs" (2016) -- Introduced Inception Score.
  • Zhang et al., "The Unreasonable Effectiveness of Deep Features as a Perceptual Metric" (2018) -- Introduced LPIPS.
  • Jayasumana et al., "Rethinking FID: Towards a Better Evaluation Metric for Image Generation" (2024) -- Proposes CMMD as a modern FID replacement.