One-Line Summary: Autoencoders learn compressed latent representations by encoding inputs and reconstructing them, while Variational Autoencoders add a probabilistic structure that enables principled generation of new data.
Prerequisites: Convolutional neural networks, backpropagation, probability distributions, KL divergence basics
What Is an Autoencoder?
Imagine compressing a photograph into a tiny summary -- a few numbers that capture the essence of the image -- then trying to reconstruct the full photograph from only that summary. An autoencoder does exactly this: an encoder network squeezes the input into a low-dimensional latent code $z$, and a decoder network expands it back into the original input space. The bottleneck forces the network to learn which features truly matter.
A standard (deterministic) autoencoder minimizes a reconstruction loss:

$$\mathcal{L}_{\text{rec}} = \lVert x - \hat{x} \rVert^2,$$

where $x$ is the input and $\hat{x}$ is the reconstruction.
A Variational Autoencoder (VAE), introduced by Kingma and Welling (2013), replaces the deterministic latent code with a probability distribution. The encoder outputs parameters $\mu$ and $\sigma$ of a Gaussian, and a latent sample is drawn via the reparameterization trick: $z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.
How It Works
Deterministic Autoencoders
The encoder maps the input $x$ to a fixed-length vector $z$. The decoder maps $z$ back to the input space, producing the reconstruction $\hat{x}$. Training minimizes pixel-wise MSE or binary cross-entropy between input and output. Common variants include the following (a minimal code sketch follows the list):
- Undercomplete autoencoders: Bottleneck dimension is smaller than input dimension, forcing compression.
- Sparse autoencoders: Add an $L_1$ penalty on activations to encourage sparse codes.
- Denoising autoencoders: Corrupt input with noise and train to recover the clean signal (Vincent et al., 2008).
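As a concrete illustration, here is a minimal sketch of an undercomplete autoencoder in PyTorch. The layer sizes, `input_dim=784` (a flattened 28x28 image), and `latent_dim=64` are illustrative assumptions, not values prescribed by the text.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Undercomplete autoencoder: the bottleneck is smaller than the input."""
    def __init__(self, input_dim=784, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),               # bottleneck code z
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),  # pixels in [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

# One training step with a pixel-wise reconstruction loss (MSE; BCE also works).
model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
x = torch.rand(32, 784)          # stand-in batch of flattened images
opt.zero_grad()
x_hat = model(x)
loss = nn.functional.mse_loss(x_hat, x)
loss.backward()
opt.step()
```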
Variational Autoencoders
VAEs optimize the Evidence Lower Bound (ELBO):

$$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right).$$

The first term is the reconstruction likelihood. The second term -- the KL divergence -- regularizes the encoder to produce latent distributions close to the prior $p(z) = \mathcal{N}(0, I)$.
For Gaussian encoder output $q_\phi(z \mid x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$, the KL term has a closed-form solution:

$$D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,\mathcal{N}(0, I)\right) = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right).$$
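The closed form above translates directly into a few lines of code. This sketch assumes the encoder outputs `mu` and `logvar` (the log of $\sigma^2$), a common parameterization though not one stated in the text.

```python
import torch

def gaussian_kl(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1, dim=-1)
```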
The Reparameterization Trick
Sampling $z \sim q_\phi(z \mid x)$ is not differentiable. The trick rewrites $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, pushing the stochasticity outside the computational graph so gradients flow through $\mu$ and $\sigma$.
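A minimal sketch of the trick, again assuming the encoder parameterizes the variance as `logvar`:

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Draw z = mu + sigma * eps with eps ~ N(0, I); gradients flow to mu and logvar."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)   # the only stochastic node, outside the graph
    return mu + std * eps
```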
Beta-VAE and Disentanglement
Higgins et al. (2017) introduced $\beta$-VAE, which scales the KL term by a factor $\beta$:

$$\mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - \beta \, D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right).$$

Higher $\beta$ encourages disentangled latent factors at the cost of reconstruction quality.
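Putting the pieces together, here is a hedged sketch of a $\beta$-VAE training objective (the negative ELBO with a scaled KL term); the Bernoulli/BCE likelihood and `beta=4.0` are illustrative choices, not values from the text, and `beta=1.0` recovers a standard VAE.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_logits, mu, logvar, beta=4.0):
    """Negative ELBO with the KL term scaled by beta."""
    # Reconstruction term: Bernoulli log-likelihood over pixels, summed per example.
    recon = F.binary_cross_entropy_with_logits(
        x_logits, x, reduction="none"
    ).sum(dim=-1)
    # Closed-form KL between N(mu, diag(exp(logvar))) and N(0, I).
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1, dim=-1)
    return (recon + beta * kl).mean()  # average over the batch
```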
VQ-VAE
Van den Oord et al. (2017) proposed the Vector Quantized VAE (VQ-VAE), replacing continuous latents with a discrete codebook of embedding vectors $\{e_k\}_{k=1}^{K}$. The encoder output $z_e(x)$ is quantized by mapping to the nearest codebook entry:

$$z_q(x) = e_k, \quad k = \arg\min_j \lVert z_e(x) - e_j \rVert_2.$$

The training loss combines reconstruction, codebook alignment, and commitment terms:

$$\mathcal{L} = \lVert x - \hat{x} \rVert^2 + \lVert \mathrm{sg}[z_e(x)] - e \rVert_2^2 + \beta \,\lVert z_e(x) - \mathrm{sg}[e] \rVert_2^2,$$

where $\mathrm{sg}[\cdot]$ is the stop-gradient operator and $\beta$ is the commitment coefficient. Gradients for the encoder pass through the quantization via straight-through estimation. VQ-VAE achieves high-fidelity reconstruction and serves as a foundation for later models like DALL-E.
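Below is a hedged sketch of the quantization step with a straight-through gradient. It assumes encoder outputs of shape `(batch, dim)` and a learnable codebook; a real VQ-VAE operates on spatial feature maps, which this simplification omits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)  # embedding vectors {e_k}
        self.beta = beta                              # commitment coefficient

    def forward(self, z_e):
        # Nearest codebook entry for each encoder output vector.
        dists = torch.cdist(z_e, self.codebook.weight)   # (batch, num_codes)
        idx = dists.argmin(dim=-1)
        z_q = self.codebook(idx)
        # Codebook alignment and commitment losses (detach plays the role of sg[.]).
        vq_loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        # Straight-through estimator: copy decoder gradients straight to the encoder.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, vq_loss
```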
Why It Matters
- Dimensionality reduction: Autoencoders learn nonlinear compressions superior to PCA for complex data like images.
- Generative modeling: VAEs provide a principled framework to sample new images by drawing $z \sim \mathcal{N}(0, I)$ and decoding.
- Latent space arithmetic: VAE latent spaces support interpolation -- walking between two face images produces smooth transitions (see the sketch after this list).
- Downstream tasks: Pretrained encoders serve as feature extractors for classification, anomaly detection, and retrieval.
- Foundation for modern architectures: VQ-VAE is a core component of latent diffusion models (Stable Diffusion) and discrete generative models.
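To make the interpolation idea concrete, here is a small sketch that decodes points along a straight line between two latent codes. The `decoder`, `z1`, and `z2` are hypothetical placeholders (e.g. latent means produced by a trained encoder), and spherical interpolation is sometimes preferred over the linear version shown here.

```python
import torch

def interpolate(decoder, z1, z2, steps=8):
    """Decode evenly spaced points on the line between two latent codes."""
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1)
    z = (1 - alphas) * z1 + alphas * z2      # (steps, latent_dim)
    with torch.no_grad():
        return decoder(z)                    # one decoded image per step
```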
Key Technical Details
- Typical latent dimensions: 64--512 for image autoencoders; VQ-VAE uses codebooks of 512--8192 vectors.
- VAEs trained on CelebA 64x64 typically achieve reconstruction FID around 40--60, significantly worse than GANs (~10) at the same resolution.
- The "posterior collapse" problem occurs when the decoder is too powerful and ignores ; mitigation strategies include KL annealing, free bits, and cyclical schedules.
- VQ-VAE-2 (Razavi et al., 2019) achieves FID 31 on 256x256 CelebA-HQ using a hierarchical discrete latent space.
- Training is stable compared to GANs -- standard Adam optimizer with learning rate 1e-4 works reliably.
Common Misconceptions
- "VAEs generate blurry images because the model is bad." The blurriness comes from optimizing pixel-wise reconstruction likelihood under a Gaussian assumption, which averages over modes. Using perceptual losses or adversarial training (VAE-GAN) substantially sharpens outputs.
- "The KL term is just a regularizer you can drop." Without the KL term, the latent space has no structure and sampling from produces garbage. The KL term is what makes a VAE generative.
- "Autoencoders are the same as PCA." A linear autoencoder with MSE loss recovers the PCA subspace, but nonlinear autoencoders learn much richer representations.
- "Larger latent dimensions are always better." Beyond a task-dependent optimum, increasing latent dimension wastes capacity on noise and can degrade generalization. For CelebA faces, latent dimensions of 128--256 typically suffice.
Connections to Other Concepts
- diffusion-models.md: Use a pretrained VAE (or VQ-VAE) encoder/decoder to move diffusion into a compressed latent space.
- generative-adversarial-networks.md: VAE-GAN hybrids use adversarial losses to sharpen VAE reconstructions.
- image-super-resolution.md: Autoencoders provide the backbone architecture for many super-resolution networks.
- neural-style-transfer.md: Encoder features from autoencoders overlap conceptually with VGG features used in style transfer.
Further Reading
- Kingma and Welling, "Auto-Encoding Variational Bayes" (2013) -- The original VAE paper introducing the reparameterization trick and ELBO objective.
- Higgins et al., "beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework" (2017) -- Disentangled representation learning via scaled KL.
- Van den Oord et al., "Neural Discrete Representation Learning" (2017) -- VQ-VAE with discrete codebooks.
- Razavi et al., "Generating Diverse High-Fidelity Images with VQ-VAE-2" (2019) -- Hierarchical VQ-VAE achieving near-GAN quality.
- Vincent et al., "Extracting and Composing Robust Features with Denoising Autoencoders" (2008) -- Denoising autoencoders and their connection to score matching.