Diffusion Models, Denoised
How do you teach a network to draw something it has never seen? You teach it to undo a mess. A surprisingly elegant idea, and the engine behind most of the image generators in use today.
The five-bullet version
- A diffusion model is trained to undo gradual noise added to images.
- Training: take a clean image, add noise, ask the model to predict the noise. Repeat.
- Inference: start with pure noise, predict-and-subtract dozens of times until something appears.
- The model never sees the clean image at inference — only its own intermediate guesses.
- Modern variants (latent diffusion, flow matching) keep the recipe and change the geometry.
§ 00 · THE BIG IDEA · Generation as un-corruption
Imagine you have a clean photograph. You add a tiny bit of static — barely noticeable. Then you add more. And more. After a thousand small steps, the photograph is unrecognizable, indistinguishable from television noise. Now imagine the reverse: starting from pure static, can you remove just enough noise, just slightly, to recover the photograph?
This is a diffusion model: a class of generative models that learn to reverse a gradual noising process, sampling by starting from pure noise and iteratively denoising. You don’t teach it to draw a cat. You teach it to look at a slightly noisy cat and predict the noise. Then at inference, you start with pure noise, ask the model to predict the noise inside it, subtract a small amount of that prediction, and repeat. After a few dozen steps, what’s left is something that wasn’t there before.
Reverse Diffusion, Step by Step
Drag the slider from right to left — from pure Gaussian noise to a clean image. This is roughly what a diffusion model does, but where you scrub a slider, the model predicts the noise to remove at each step.
What you scrubbed by hand is the forward process — the corruption. The model’s job is the reverse. Let’s separate the two.
§ 01 · FORWARD PROCESS · The corruption schedule
The forward process is fixed and not learned. It’s a recipe: at each timestep t, add a small amount of Gaussian noise to the image. The amount is set by a schedule — typically the noise grows slowly at first, then faster.
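A minimal schedule can be written in a few lines. This sketch uses a linear per-step variance (the endpoints 1e-4 and 0.02 are the values from the original DDPM paper; any monotone schedule works):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # per-step noise variance, growing over time
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)          # fraction of the original signal left at step t

# alpha_bar starts near 1 (almost clean) and decays toward 0 (pure noise):
# the image loses signal slowly at first, then faster.
```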
A useful trick: because Gaussian noise added to Gaussian noise is still Gaussian, you don’t have to actually run all t steps to find out what the image looks like at step t. There’s a closed-form expression. You can jump directly to any noise level. This makes training trivial: pick a random t, jump there, ask the model to predict the noise.
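The closed-form jump makes one training example a few lines of NumPy. A toy sketch (the network that would consume this pair is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

# Noise schedule: beta_t rises linearly; alpha_bar_t is the cumulative signal kept.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

def training_example(x0, alpha_bar, rng):
    """Build one (noisy image, true noise, timestep) training triple.

    Uses the closed form x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * eps,
    so no sequential noising is needed.
    """
    t = rng.integers(len(alpha_bar))         # pick a random noise level
    eps = rng.standard_normal(x0.shape)      # the noise the network must predict
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps, t

x0 = rng.standard_normal((8, 8))             # stand-in for a clean image
x_t, eps, t = training_example(x0, alpha_bar, rng)
# The loss would be the mean squared error between model(x_t, t) and eps.
```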
§ 02 · REVERSE PROCESS · The part the network does
Reversing is the hard part. Given a noisy image at step t, what was the slightly-less-noisy image at step t−1? In general this is impossible — many clean images could have produced the same noise. But you can ask the easier question: given this noisy image, what was the noise?
That question is well-posed, and a neural network can be trained to answer it. The architecture is usually a U-Net (a convolutional architecture with skip connections between a downsampling encoder path and an upsampling decoder path, originally built for medical image segmentation), although Diffusion Transformers (DiTs), which scale more cleanly than U-Nets at large parameter counts, have become standard at the frontier.
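Given such a noise-predicting network, sampling is a loop. Here is a DDPM-style ancestral sampling sketch, with a dummy lambda standing in for the trained model:

```python
import numpy as np

def sample(model, shape, betas, rng):
    """DDPM ancestral sampling sketch.

    `model(x, t)` predicts the noise in x at timestep t; a real model
    would be a trained U-Net or DiT, not the toy used below.
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)           # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps_hat = model(x, t)                # predicted noise at this level
        # Posterior mean: subtract the scaled noise estimate, then rescale.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:                            # add fresh noise except at the last step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# Toy "model" that always predicts zero noise, just to run the loop end to end.
rng = np.random.default_rng(0)
out = sample(lambda x, t: np.zeros_like(x), (4, 4), np.linspace(1e-4, 0.02, 50), rng)
```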
“Predicting the noise is equivalent to predicting the score — the gradient of the log-density of the data.” — Yang Song, Score-Based Generative Modeling, 2020
§ 03 · WHAT THE NETWORK LEARNS · A field of arrows
Here’s a way to think about it. The space of all possible images is unimaginably vast. Most of it is noise. A vanishingly small region holds plausible photos. The diffusion model has learned, for any point in this enormous space, an arrow pointing toward the plausible region. To generate, you start anywhere and follow the arrows.
That arrow field is what “the model knows.” It’s why diffusion models can be guided — you can nudge the arrows during inference using a text prompt, a sketch, a pose, a depth map. The arrows themselves don’t change; you just bias which arrow gets followed at each step.
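For toy distributions, the arrow field has a closed form. For a single 2-D Gaussian, the score (the gradient of the log-density) points straight at the mean, and following the arrows walks any starting point home:

```python
import numpy as np

# The "arrow field" for the simplest possible data distribution: a single
# 2-D isotropic Gaussian centered at mu with variance sigma2.
mu = np.array([1.0, -2.0])
sigma2 = 0.5

def score(x):
    """Gradient of log N(x; mu, sigma2 * I): an arrow pointing at the mean."""
    return -(x - mu) / sigma2

# Start anywhere and follow the arrows with small steps.
x = np.array([3.0, 0.0])
for _ in range(100):
    x = x + 0.1 * score(x)
# x is now (numerically) at the mode of the distribution.
```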
Both the arrow-following at sample time and the gradient descent that created those arrows during training are the same operation in different costumes. To make this concrete, here’s a 2D toy of the optimization that built the model in the first place. Drop a marker anywhere; the deeper valleys are the configurations the network considers more plausible. Watch which one it falls into.
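A 1-D stand-in for that toy, assuming a double-well landscape with valleys at ±1: plain gradient descent, where the starting point decides which valley the marker falls into.

```python
def loss(x):
    """A double-well "landscape" with minima at x = -1 and x = +1."""
    return (x**2 - 1.0) ** 2

def grad(x):
    """Derivative of the loss with respect to x."""
    return 4.0 * x * (x**2 - 1.0)

def descend(x, lr=0.05, steps=200):
    """Gradient descent: repeatedly step downhill from the starting point."""
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# A marker dropped at 0.3 falls into the right valley; one dropped at -2.0
# falls into the left valley.
right = descend(0.3)    # close to +1
left = descend(-2.0)    # close to -1
```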
§ 04 · WHERE THIS IS GOING · Faster, sharper, conditional
The 1000-step DDPM of 2020 has been replaced, in production, by 4-step or even 1-step samplers (consistency models, flow matching, rectified flow). The models also got bigger and learned to read text prompts via cross-attention. Video diffusion is the current frontier — a video is just an image with a time axis, and the same machinery works, expensively.
References & further reading
- Sohl-Dickstein, J. et al. (2015). Deep Unsupervised Learning Using Nonequilibrium Thermodynamics. arXiv:1503.03585.
- Ho, J., Jain, A. & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. arXiv:2006.11239.
- Song, Y. et al. (2021). Score-Based Generative Modeling Through Stochastic Differential Equations. arXiv:2011.13456.
- Ho, J. & Salimans, T. (2021). Classifier-Free Diffusion Guidance. arXiv:2207.12598.
- Weng, L. (2021). What are Diffusion Models? Lil’Log.
- Rombach, R. et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752.
- Karras, T. et al. (2022). Elucidating the Design Space of Diffusion-Based Generative Models. arXiv:2206.00364.
- Peebles, W. & Xie, S. (2023). Scalable Diffusion Models with Transformers (DiT). arXiv:2212.09748.
§ · GOING DEEPER · Three threads worth following
The mathematical scaffolding underneath diffusion is more general than “add noise, predict noise.” The same objective can be derived from a variational lower bound (DDPM’s original framing), from a denoising score-matching objective (Song & Ermon), and from a stochastic differential equation (Song et al. 2020). Once you see the SDE formulation, it’s clear why the field has so many samplers — Euler, DDIM, Heun, DPM-Solver — they’re different numerical solvers for the same reverse-time ODE.
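To make the solver point concrete, here is the deterministic DDIM update, sketched under the usual epsilon-prediction parameterization: reconstruct the current estimate of the clean image, then re-noise it to the next (lower) level, with no fresh randomness injected.

```python
import numpy as np

def ddim_step(x_t, eps_hat, ab_t, ab_prev):
    """One deterministic DDIM step (the eta = 0 case).

    ab_t and ab_prev are the cumulative alpha-bar values at the current
    and target noise levels. Because no noise is injected, the same start
    always yields the same sample.
    """
    x0_hat = (x_t - np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(ab_t)  # est. clean image
    return np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps_hat

# Sanity check: with the *true* noise, one step lands exactly on the
# closed-form forward-process image at the lower noise level.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))
eps = rng.standard_normal((4, 4))
ab_t, ab_prev = 0.5, 0.8
x_t = np.sqrt(ab_t) * x0 + np.sqrt(1.0 - ab_t) * eps
x_prev = ddim_step(x_t, eps, ab_t, ab_prev)
```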
The other big lever for real-world image quality is classifier-free guidance: at inference, run the model twice per step — once with a text condition, once without — and push the result in the conditioned direction. It costs you 2× compute per step but dramatically improves fidelity and prompt adherence. And the move from pixel-space to latent-space diffusion (Stable Diffusion) was the engineering decision that made text-to-image affordable on consumer GPUs. None of these are conceptual breaks from the basic recipe; they’re what made it practical.
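The guidance combination itself is one line. A sketch, where the weight w is a sampler hyperparameter whose useful range depends on the model:

```python
import numpy as np

def cfg(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate beyond the conditional prediction.

    w = 0 ignores the prompt, w = 1 is plain conditional sampling, and
    w > 1 pushes harder toward the prompt at some cost in diversity.
    """
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy noise predictions from the two forward passes of the same model.
eps_c = np.array([1.0, 0.0])   # with the text condition
eps_u = np.array([0.0, 0.0])   # without it
guided = cfg(eps_c, eps_u, 2.0)
```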
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.