One-Line Summary: Autoencoders learn compressed latent representations by encoding inputs and reconstructing them, while Variational Autoencoders add a probabilistic structure that enables principled generation of new data.
Prerequisites: Convolutional neural networks, backpropagation, probability distributions, KL divergence basics
What Is an Autoencoder?
Imagine compressing a photograph into a tiny summary -- a few numbers that capture the essence of the image -- then trying to reconstruct the full photograph from only that summary. An autoencoder does exactly this: an encoder network squeezes the input into a low-dimensional latent code $z$, and a decoder network expands it back into the original input space. The bottleneck forces the network to learn which features truly matter.
A standard (deterministic) autoencoder minimizes a reconstruction loss:

$$\mathcal{L}_{\text{rec}} = \lVert x - \hat{x} \rVert^2,$$

where $x$ is the input and $\hat{x}$ is the reconstruction.
A Variational Autoencoder (VAE), introduced by Kingma and Welling (2013), replaces the deterministic latent code with a probability distribution. The encoder outputs parameters $\mu$ and $\sigma$ of a Gaussian, and a latent sample is drawn via the reparameterization trick: $z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.
How It Works
Deterministic Autoencoders
The encoder maps the input $x$ to a fixed-length vector $z$. The decoder maps $z$ back to the input space, producing the reconstruction $\hat{x}$. Training minimizes pixel-wise MSE or binary cross-entropy between input and output. Common variants include the following (a minimal code sketch follows the list):
- Undercomplete autoencoders: Bottleneck dimension is smaller than input dimension, forcing compression.
- Sparse autoencoders: Add an $L_1$ penalty on activations to encourage sparse codes.
- Denoising autoencoders: Corrupt input with noise and train to recover the clean signal (Vincent et al., 2008).
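As a concrete illustration, here is a minimal sketch of an undercomplete autoencoder in PyTorch. The layer sizes, `input_dim=784` (a flattened 28x28 image), and `latent_dim=64` are illustrative assumptions, not values prescribed by the text.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Undercomplete autoencoder: the bottleneck is smaller than the input."""
    def __init__(self, input_dim=784, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),               # bottleneck code z
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),  # pixels in [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

# One training step with a pixel-wise reconstruction loss (MSE; BCE also works).
model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
x = torch.rand(32, 784)          # stand-in batch of flattened images
opt.zero_grad()
x_hat = model(x)
loss = nn.functional.mse_loss(x_hat, x)
loss.backward()
opt.step()
```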
Variational Autoencoders
VAEs optimize the Evidence Lower Bound (ELBO):

$$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right).$$

The first term is the reconstruction likelihood. The second term -- the KL divergence -- regularizes the encoder to produce latent distributions close to the prior $p(z) = \mathcal{N}(0, I)$.
For Gaussian encoder output $q_\phi(z \mid x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$, the KL term has a closed-form solution:

$$D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,\mathcal{N}(0, I)\right) = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right).$$
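The closed form above translates directly into a few lines of code. This sketch assumes the encoder outputs `mu` and `logvar` (the log of $\sigma^2$), a common parameterization though not one stated in the text.

```python
import torch

def gaussian_kl(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1, dim=-1)
```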
The Reparameterization Trick
Sampling $z \sim q_\phi(z \mid x)$ is not differentiable. The trick rewrites $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, pushing the stochasticity outside the computational graph so gradients flow through $\mu$ and $\sigma$.
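A minimal sketch of the trick, again assuming the encoder parameterizes the variance as `logvar`:

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Draw z = mu + sigma * eps with eps ~ N(0, I); gradients flow to mu and logvar."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)   # the only stochastic node, outside the graph
    return mu + std * eps
```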
Beta-VAE and Disentanglement
Higgins et al. (2017) introduced $\beta$-VAE, which scales the KL term by a factor $\beta$:

$$\mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - \beta \, D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right).$$

Higher $\beta$ encourages disentangled latent factors at the cost of reconstruction quality.
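Putting the pieces together, here is a hedged sketch of a $\beta$-VAE training objective (the negative ELBO with a scaled KL term); the Bernoulli/BCE likelihood and `beta=4.0` are illustrative choices, not values from the text, and `beta=1.0` recovers a standard VAE.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_logits, mu, logvar, beta=4.0):
    """Negative ELBO with the KL term scaled by beta."""
    # Reconstruction term: Bernoulli log-likelihood over pixels, summed per example.
    recon = F.binary_cross_entropy_with_logits(
        x_logits, x, reduction="none"
    ).sum(dim=-1)
    # Closed-form KL between N(mu, diag(exp(logvar))) and N(0, I).
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1, dim=-1)
    return (recon + beta * kl).mean()  # average over the batch
```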
VQ-VAE
Van den Oord et al. (2017) proposed the Vector Quantized VAE (VQ-VAE), replacing continuous latents with a discrete codebook of embedding vectors $\{e_k\}_{k=1}^{K}$. The encoder output $z_e(x)$ is quantized by mapping to the nearest codebook entry:

$$z_q(x) = e_k, \quad k = \arg\min_j \lVert z_e(x) - e_j \rVert_2.$$

The training loss combines reconstruction, codebook alignment, and commitment terms:

$$\mathcal{L} = \lVert x - \hat{x} \rVert^2 + \lVert \mathrm{sg}[z_e(x)] - e \rVert_2^2 + \beta \,\lVert z_e(x) - \mathrm{sg}[e] \rVert_2^2,$$

where $\mathrm{sg}[\cdot]$ is the stop-gradient operator and $\beta$ is the commitment coefficient. Gradients for the encoder pass through the quantization via straight-through estimation. VQ-VAE achieves high-fidelity reconstruction and serves as a foundation for later models like DALL-E.
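Below is a hedged sketch of the quantization step with a straight-through gradient. It assumes encoder outputs of shape `(batch, dim)` and a learnable codebook; a real VQ-VAE operates on spatial feature maps, which this simplification omits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)  # embedding vectors {e_k}
        self.beta = beta                              # commitment coefficient

    def forward(self, z_e):
        # Nearest codebook entry for each encoder output vector.
        dists = torch.cdist(z_e, self.codebook.weight)   # (batch, num_codes)
        idx = dists.argmin(dim=-1)
        z_q = self.codebook(idx)
        # Codebook alignment and commitment losses (detach plays the role of sg[.]).
        vq_loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        # Straight-through estimator: copy decoder gradients straight to the encoder.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, vq_loss
```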
Why It Matters
- Dimensionality reduction: Autoencoders learn nonlinear compressions superior to PCA for complex data like images.
- Generative modeling: VAEs provide a principled framework to sample new images by drawing $z \sim \mathcal{N}(0, I)$ and decoding.
- Latent space arithmetic: VAE latent spaces support interpolation -- walking between two face images produces smooth transitions (see the sketch after this list).
- Downstream tasks: Pretrained encoders serve as feature extractors for classification, anomaly detection, and retrieval.
- Foundation for modern architectures: VQ-VAE is a core component of latent diffusion models (Stable Diffusion) and discrete generative models.
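To make the interpolation idea concrete, here is a small sketch that decodes points along a straight line between two latent codes. The `decoder`, `z1`, and `z2` are hypothetical placeholders (e.g. latent means produced by a trained encoder), and spherical interpolation is sometimes preferred over the linear version shown here.

```python
import torch

def interpolate(decoder, z1, z2, steps=8):
    """Decode evenly spaced points on the line between two latent codes."""
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1)
    z = (1 - alphas) * z1 + alphas * z2      # (steps, latent_dim)
    with torch.no_grad():
        return decoder(z)                    # one decoded image per step
```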
Key Technical Details
- Typical latent dimensions: 64--512 for image autoencoders; VQ-VAE uses codebooks of 512--8192 vectors.
- VAEs trained on CelebA 64x64 typically achieve reconstruction FID around 40--60, significantly worse than GANs (~10) at the same resolution.
- The "posterior collapse" problem occurs when the decoder is too powerful and ignores ; mitigation strategies include KL annealing, free bits, and cyclical schedules.
- VQ-VAE-2 (Razavi et al., 2019) achieves FID 31 on 256x256 CelebA-HQ using a hierarchical discrete latent space.
- Training is stable compared to GANs -- standard Adam optimizer with learning rate 1e-4 works reliably.
Common Misconceptions
- "VAEs generate blurry images because the model is bad." The blurriness comes from optimizing pixel-wise reconstruction likelihood under a Gaussian assumption, which averages over modes. Using perceptual losses or adversarial training (VAE-GAN) substantially sharpens outputs.
- "The KL term is just a regularizer you can drop." Without the KL term, the latent space has no structure and sampling from produces garbage. The KL term is what makes a VAE generative.
- "Autoencoders are the same as PCA." A linear autoencoder with MSE loss recovers the PCA subspace, but nonlinear autoencoders learn much richer representations.
- "Larger latent dimensions are always better." Beyond a task-dependent optimum, increasing latent dimension wastes capacity on noise and can degrade generalization. For CelebA faces, latent dimensions of 128--256 typically suffice.
Connections to Other Concepts
- diffusion-models.md: Use a pretrained VAE (or VQ-VAE) encoder/decoder to move diffusion into a compressed latent space.
- generative-adversarial-networks.md: VAE-GAN hybrids use adversarial losses to sharpen VAE reconstructions.
- image-super-resolution.md: Autoencoders provide the backbone architecture for many super-resolution networks.
- neural-style-transfer.md: Encoder features from autoencoders overlap conceptually with VGG features used in style transfer.
Further Reading
- Kingma and Welling, "Auto-Encoding Variational Bayes" (2013) -- The original VAE paper introducing the reparameterization trick and ELBO objective.
- Higgins et al., "beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework" (2017) -- Disentangled representation learning via scaled KL.
- Van den Oord et al., "Neural Discrete Representation Learning" (2017) -- VQ-VAE with discrete codebooks.
- Razavi et al., "Generating Diverse High-Fidelity Images with VQ-VAE-2" (2019) -- Hierarchical VQ-VAE achieving near-GAN quality.
- Vincent et al., "Extracting and Composing Robust Features with Denoising Autoencoders" (2008) -- Denoising autoencoders and their connection to score matching.