Diffusion Models

One-Line Summary: Diffusion models generate images by learning to reverse a gradual noising process, iteratively denoising random Gaussian noise into coherent images, and have dethroned GANs as the dominant paradigm for image synthesis.

Prerequisites: Gaussian distributions, Markov chains, variational inference, U-Net architecture, score functions

What Is a Diffusion Model?

Imagine dropping ink into water: the ink gradually spreads until the water is uniformly colored. Now imagine filming this process and playing it in reverse -- the uniform color coalesces back into a sharp ink drop. Diffusion models work analogously. The forward process progressively adds Gaussian noise to an image until it becomes pure noise. The reverse process -- a learned neural network -- starts from pure noise and iteratively removes it to produce a clean image.

Unlike GANs, which require an adversarial game, diffusion models use a simple denoising objective that is stable to train and covers the full data distribution without mode collapse.

How It Works

Forward Diffusion Process

Given a data point $x_{0}$ , the forward process produces a sequence $x_{1}, x_{2}, \dots, x_{T}$ by adding noise at each step:

$q (x_{t} ∣ x_{t - 1}) = N (x_{t}; \sqrt{1 - β_{t}} \, x_{t - 1}, β_{t} I)$

where $β_{t} \in (0, 1)$ is a noise schedule. A key property allows sampling $x_{t}$ directly from $x_{0}$ :

$q (x_{t} ∣ x_{0}) = N (x_{t}; \sqrt{\bar{α}_{t}} \, x_{0}, (1 - \bar{α}_{t}) I)$

where $α_{t} = 1 - β_{t}$ and $\bar{α}_{t} = \prod_{s = 1}^{t} α_{s}$ . At $T = 1000$ with standard schedules, $\bar{α}_{T} \approx 0$ , so $x_{T}$ is approximately distributed as $N (0, I)$ .
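The closed-form property above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `linear_beta_schedule` and `q_sample`, the 1-indexed timestep convention, and the random stand-in "image" are all assumptions made for the example.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (hypothetical helper)."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-indexed."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bar[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, eps

betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)          # cumulative product of alpha_s

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))  # stand-in for a normalized image
x_T, _ = q_sample(x0, 1000, alpha_bar, rng)  # nearly pure Gaussian noise

print(alpha_bar[-1])  # very close to zero for this schedule
```

Because `q_sample` jumps straight from $x_{0}$ to any $x_{t}$, training never needs to simulate the chain step by step.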

Reverse Process (DDPM)

Ho et al. (2020) parameterize the reverse process as:

$p_{θ} (x_{t - 1} ∣ x_{t}) = N (x_{t - 1}; μ_{θ} (x_{t}, t), Σ_{θ} (x_{t}, t))$

The network $ϵ_{θ} (x_{t}, t)$ predicts the noise $ϵ$ that was mixed into $x_{0}$ to produce $x_{t}$ , and the mean is computed as:

$μ_{θ} (x_{t}, t) = \frac{1}{\sqrt{α_{t}}} \left( x_{t} - \frac{β_{t}}{\sqrt{1 - \bar{α}_{t}}} ϵ_{θ} (x_{t}, t) \right)$
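A single reverse step built on this mean might look like the sketch below. It assumes a fixed variance $σ_{t}^{2} = β_{t}$ in place of a learned $Σ_{θ}$ (one of the choices discussed by Ho et al.), and `eps_model` is a placeholder callable standing in for the trained network; both are assumptions of the example.

```python
import numpy as np

def p_sample_step(eps_model, x_t, t, betas, alpha_bar, rng):
    """One ancestral step x_t -> x_{t-1}; t is 1-indexed."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    a_bar_t = alpha_bar[t - 1]
    eps_hat = eps_model(x_t, t)
    mean = (x_t - beta_t / np.sqrt(1.0 - a_bar_t) * eps_hat) / np.sqrt(alpha_t)
    if t == 1:
        return mean                      # no noise is added at the final step
    sigma_t = np.sqrt(beta_t)            # assumed fixed variance sigma_t^2 = beta_t
    return mean + sigma_t * rng.standard_normal(x_t.shape)

def p_sample_loop(eps_model, shape, betas, rng):
    """Run the full chain from x_T ~ N(0, I) down to x_0."""
    alpha_bar = np.cumprod(1.0 - betas)
    x = rng.standard_normal(shape)
    for t in range(len(betas), 0, -1):
        x = p_sample_step(eps_model, x, t, betas, alpha_bar, rng)
    return x

# Usage with a placeholder network that always predicts zero noise.
betas = np.linspace(1e-4, 0.02, 1000)
rng = np.random.default_rng(0)
zero_model = lambda x_t, t: np.zeros_like(x_t)
sample = p_sample_loop(zero_model, (3, 8, 8), betas, rng)
```

With the zero-noise placeholder the output is meaningless; the point is only to show the control flow that a trained $ϵ_{θ}$ would plug into.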

The training objective simplifies to:

$L_{simple} = E_{t, x_{0}, ϵ} \left[ ∥ ϵ - ϵ_{θ} (\sqrt{\bar{α}_{t}} \, x_{0} + \sqrt{1 - \bar{α}_{t}} \, ϵ, t) ∥^{2} \right]$

This is remarkably simple: sample a timestep $t$ , add noise to a training image, and train the network to predict that noise.
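The recipe can be written out as a loss computation for one mini-batch. This is a sketch under stated assumptions: `eps_model` is a placeholder callable for the network, timesteps are sampled uniformly and 1-indexed, and no optimizer step is shown.

```python
import numpy as np

def simple_loss(eps_model, x0_batch, alpha_bar, rng):
    """Monte Carlo estimate of L_simple on one batch."""
    T = len(alpha_bar)
    B = x0_batch.shape[0]
    t = rng.integers(1, T + 1, size=B)                 # one timestep per example
    eps = rng.standard_normal(x0_batch.shape)          # target noise
    # Broadcast alpha_bar_t over the non-batch dimensions.
    a_bar = alpha_bar[t - 1].reshape(B, *([1] * (x0_batch.ndim - 1)))
    x_t = np.sqrt(a_bar) * x0_batch + np.sqrt(1.0 - a_bar) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
batch = rng.uniform(-1.0, 1.0, size=(64, 3, 8, 8))     # stand-in training images

# An untrained placeholder that predicts zero noise everywhere.
zero_model = lambda x_t, t: np.zeros_like(x_t)
loss = simple_loss(zero_model, batch, alpha_bar, rng)
```

A predictor that always outputs zero scores a loss near 1, the variance of the target noise, which gives a useful baseline when monitoring training.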

Score Matching Perspective

Song and Ermon (2019) showed an equivalent formulation via score matching. The score function $\nabla_{x} \log p (x)$ points toward high-density regions. A score network $s_{θ} (x, t) \approx \nabla_{x_{t}} \log q (x_{t})$ is trained with denoising score matching:

$L_{DSM} = E_{t, x_{0}, x_{t}} \left[ ∥ s_{θ} (x_{t}, t) - \nabla_{x_{t}} \log q (x_{t} ∣ x_{0}) ∥^{2} \right]$

The noise prediction and score matching perspectives are equivalent: $s_{θ} (x_{t}, t) = - ϵ_{θ} (x_{t}, t) / \sqrt{1 - \bar{α}_{t}}$ .
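The equivalence follows from the Gaussian form of $q (x_{t} ∣ x_{0})$ , and a quick numerical check makes it concrete. The helper name `gaussian_score` and the specific value of $\bar{α}_{t}$ are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_score(x_t, x0, a_bar):
    """grad_{x_t} log q(x_t | x_0) for N(sqrt(a_bar) * x0, (1 - a_bar) * I)."""
    return -(x_t - np.sqrt(a_bar) * x0) / (1.0 - a_bar)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
a_bar = 0.3                                            # arbitrary example value
x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

score = gaussian_score(x_t, x0, a_bar)
score_from_eps = -eps / np.sqrt(1.0 - a_bar)           # rescaled true noise

print(np.allclose(score, score_from_eps))
```

Since the regression target of denoising score matching is just the true noise rescaled by $- 1 / \sqrt{1 - \bar{α}_{t}}$ , a network trained on one target can be converted to the other by the same factor.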

Noise Schedules

Linear (DDPM): $β_{t}$ increases linearly from $β_{1} = 10^{-4}$ to $β_{T} = 0.02$ over $T = 1000$ steps.
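Cosine (Nichol and Dhariwal, 2021): $\bar{α}_{t} = f (t) / f (0)$ with $f (t) = \cos^{2} \left( \frac{t / T + s}{1 + s} \cdot \frac{π}{2} \right)$ and $s = 0.008$ . Produces better results, especially at lower resolutions, by destroying information more gradually than the linear schedule, whose final steps are almost pure noise and contribute little to sample quality.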
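The two schedules can be compared directly through $\bar{α}_{t}$ . In the sketch below, the normalization by $f (0)$ and the clipping of $β_{t}$ at 0.999 follow Nichol and Dhariwal's description; the function names are hypothetical.

```python
import numpy as np

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_1..alpha_bar_T for the linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bar(T=1000, s=0.008):
    """alpha_bar_1..alpha_bar_T for the cosine schedule."""
    t = np.arange(0, T + 1)
    f = np.cos((t / T + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return (f / f[0])[1:]            # normalize so alpha_bar_0 = 1, then drop t = 0

def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    """Recover per-step betas; clip to avoid a singular final step."""
    prev = np.concatenate(([1.0], alpha_bar[:-1]))
    return np.clip(1.0 - alpha_bar / prev, 0.0, max_beta)

lin = linear_alpha_bar()
cos = cosine_alpha_bar()
cos_betas = betas_from_alpha_bar(cos)

# Signal retained halfway through the forward process (t = 500).
print(lin[499], cos[499])
```

At the midpoint the cosine schedule keeps noticeably more of the original signal than the linear one, which is the behavior the schedule was designed for.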

Accelerated Sampling

The original DDPM requires 1000 denoising steps. Faster alternatives:

DDIM (Song et al., 2020): Deterministic sampling with a non-Markovian process. Reduces sampling to 50--100 steps with minimal quality loss.
DPM-Solver (Lu et al., 2022): ODE solvers achieving high quality in 10--20 steps.
Consistency Models (Song et al., 2023): Map any noisy input directly to the clean image in a single step, trained either by distilling a pretrained diffusion model (consistency distillation) or from scratch (consistency training).
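To make the DDIM idea concrete, here is a minimal sketch of its deterministic update (the zero-variance case) on an evenly spaced subset of timesteps. The names `ddim_step` and `ddim_sample`, the spacing rule, and the placeholder `eps_model` callable are assumptions for illustration.

```python
import numpy as np

def ddim_step(x_t, eps_hat, a_bar_t, a_bar_prev):
    """Deterministic DDIM update from noise level a_bar_t to a_bar_prev."""
    x0_pred = (x_t - np.sqrt(1.0 - a_bar_t) * eps_hat) / np.sqrt(a_bar_t)
    return np.sqrt(a_bar_prev) * x0_pred + np.sqrt(1.0 - a_bar_prev) * eps_hat

def ddim_sample(eps_model, shape, alpha_bar, num_steps, rng):
    """Sample using only num_steps of the T training timesteps."""
    T = len(alpha_bar)
    # Evenly spaced 1-indexed timesteps from T down to 1.
    timesteps = np.linspace(T, 1, num_steps).round().astype(int)
    x = rng.standard_normal(shape)
    for i, t in enumerate(timesteps):
        a_bar_t = alpha_bar[t - 1]
        # After the last step, jump to the clean image (alpha_bar = 1).
        a_bar_prev = alpha_bar[timesteps[i + 1] - 1] if i + 1 < num_steps else 1.0
        x = ddim_step(x, eps_model(x, t), a_bar_t, a_bar_prev)
    return x

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
zero_model = lambda x_t, t: np.zeros_like(x_t)       # placeholder network
sample = ddim_sample(zero_model, (3, 8, 8), alpha_bar, 50, rng)
```

Each step first estimates the clean image from the predicted noise and then re-noises it to the next, lower noise level, which is why steps can be skipped without retraining the network.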

Classifier-Free Guidance

Ho and Salimans (2022) introduced a technique to trade diversity for quality without a separate classifier. During training, the condition $c$ (e.g., class label or text) is randomly replaced by a null token $\emptyset$ with some probability (e.g., $p = 0.1$ ), so a single network learns both the conditional and the unconditional noise prediction. At inference:

$\hat{ϵ}_{θ} (x_{t}, t, c) = ϵ_{θ} (x_{t}, t, \emptyset) + w \cdot (ϵ_{θ} (x_{t}, t, c) - ϵ_{θ} (x_{t}, t, \emptyset))$

where $w > 1$ is the guidance scale (typically 3--15). Higher $w$ produces outputs more aligned with the condition at the cost of diversity.
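Both halves of the technique, dropping the condition during training and mixing two predictions at inference, fit in a short sketch. The three-argument `eps_model(x_t, t, cond)` signature, the use of `None` as the null condition, and the function names are assumptions of this example.

```python
import numpy as np

def drop_condition(cond_batch, p_drop, rng, null_cond=None):
    """Training time: replace each condition with the null token w.p. p_drop."""
    return [null_cond if rng.random() < p_drop else c for c in cond_batch]

def cfg_eps(eps_model, x_t, t, cond, w, null_cond=None):
    """Inference time: extrapolate from the unconditional toward the conditional prediction."""
    eps_uncond = eps_model(x_t, t, null_cond)
    eps_cond = eps_model(x_t, t, cond)
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy placeholder model: the prediction is a constant that depends on the condition.
toy_model = lambda x_t, t, c: np.full_like(x_t, 0.0 if c is None else float(c))

x_t = np.zeros((2, 2))
guided = cfg_eps(toy_model, x_t, t=10, cond=2, w=3.0)

rng = np.random.default_rng(0)
labels = drop_condition([1, 2, 3, 4], p_drop=0.1, rng=rng)
```

Setting `w = 1` recovers the plain conditional prediction and `w = 0` the unconditional one; values above 1 push past the conditional prediction. Each guided step costs two network evaluations unless the two inputs are batched together.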

Why It Matters

State-of-the-art image quality: Dhariwal and Nichol (2021) demonstrated that diffusion models beat GANs on ImageNet, reporting FID 2.97 at 128x128 and 4.59 at 256x256 with classifier guidance, ending GAN dominance.
Training stability: No adversarial dynamics, no mode collapse. Standard MSE loss with a U-Net works reliably.
Distribution coverage: Diffusion models achieve high recall (diversity) alongside high precision (quality), unlike GANs, which often sacrifice one for the other.
Foundation for commercial tools: DALL-E 2, Stable Diffusion, Midjourney, and Imagen all build on diffusion.
Beyond images: Diffusion has been applied to video (Sora), audio (AudioLDM), 3D shapes (Point-E, DreamFusion), protein structures (RFdiffusion), and molecular design.

Key Technical Details

DDPM architecture: U-Net with ResNet blocks, self-attention at 16x16 resolution, sinusoidal timestep embeddings. ~114M parameters for 256x256.
Training: Adam optimizer, learning rate 2e-4, batch size 256, EMA decay 0.9999. ~500K iterations on ImageNet 256x256 (~3 days on 8 A100s).
Sampling cost: DDPM needs 1000 forward passes (~20 seconds per image on a V100). DDIM reduces this to 50 steps (~1 second). DPM-Solver++ achieves good quality in 15--20 steps.
FID scores on ImageNet 256x256: DDPM = 11.09, ADM (classifier-guided) = 4.59, DiT-XL/2 (classifier-free guidance) = 2.27.
Memory: Training a 256x256 model requires ~32 GB per GPU with batch size 8. Sampling requires ~6 GB.
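Two of the ingredients listed above, sinusoidal timestep embeddings and the exponential moving average (EMA) of the weights, are simple enough to sketch. The layout below (cosines followed by sines, `max_period` of 10000, an even `dim`) is one common convention and is assumed here, as are the function names and the dictionary-of-arrays parameter format.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Map integer timesteps of shape (B,) to (B, dim) sinusoidal features; dim must be even."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.cos(args), np.sin(args)], axis=-1)

def ema_update(ema_params, params, decay=0.9999):
    """Return the EMA of the parameters after one training step."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k] for k in params}

emb = timestep_embedding(np.array([0, 10, 999]), dim=8)

ema = {"w": np.array([1.0, 1.0])}
new = {"w": np.array([0.0, 2.0])}
ema = ema_update(ema, new, decay=0.9)
```

Sampling typically uses the EMA weights instead of the raw ones, since the average smooths out step-to-step noise in the optimizer trajectory.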

Common Misconceptions

"Diffusion models are just denoising autoencoders." While the training loss resembles denoising, diffusion models are a principled probabilistic framework with a specific forward process, variational bound, and iterative sampling -- not a single-shot denoiser.
"More diffusion steps always means better quality." Beyond ~1000 steps for DDPM (or ~50 for DDIM), additional steps provide diminishing returns. The noise schedule matters more than the number of steps.
"Diffusion models are too slow for practical use." With DDIM, DPM-Solver, and distillation techniques, generation takes 1--5 seconds per image, which is practical for most applications.
"Diffusion models replaced GANs entirely." GANs remain competitive for real-time applications requiring single-step generation and for tasks where controllable latent spaces are essential (e.g., face editing).

Connections to Other Concepts

latent-diffusion-and-stable-diffusion.md: Apply diffusion in a compressed latent space for efficiency, enabling high-resolution text-to-image generation.
autoencoders-and-vaes.md: VAEs share the variational framework; diffusion can be viewed as a hierarchical VAE with a fixed encoder.
generative-adversarial-networks.md: Diffusion models surpassed GANs in image quality benchmarks, motivating hybrid approaches.
image-inpainting.md: Diffusion naturally supports inpainting by conditioning the denoising process on known pixels.
image-super-resolution.md: Diffusion-based super-resolution (SR3) produces higher-quality results than GAN-based methods.