Architectures · Module 22 · 8 min read

U-Net

A medical-imaging paper from 2015 introduced an architecture that turned out to be useful for almost any dense-prediction task — and ended up at the heart of every modern image generator.

The five-bullet version

  • Segmentation is per-pixel classification — every pixel needs a label, not just the image as a whole.
  • U-Net is a symmetric encoder–decoder: contracting path that captures context, expanding path that recovers resolution.
  • Skip connections carry fine spatial detail from the encoder directly to the matching decoder layer.
  • Originally for medical imaging; now the standard backbone for diffusion models, image-to-image, and inpainting.
  • The U shape is the architecture; everything else (number of layers, normalization, bolted-on attention) is a variation.

§ 00 · SEGMENTATION IS DENSE PREDICTION · Every pixel needs a label

Classification asks: what’s in this image? One label for the whole picture. Segmentation asks the much harder version: what’s in every pixel? A per-pixel class label, often with sharp boundaries between regions.

For a 224×224 image, that’s 50,176 classification decisions, all of which need to be spatially consistent with their neighbors. Standard image classifiers — designed to spit out one vector at the end — aren’t the right shape.
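To make "per-pixel classification" concrete, here is a toy sketch (not from the article): given a grid of per-pixel class scores, the label map simply takes the argmax independently at every location. The class names and numbers are invented for illustration.

```python
# Toy illustration: segmentation as per-pixel classification.
# scores[y][x] holds one score per class for that pixel; the label
# map takes the argmax independently at every location.

def segment(scores):
    """scores: H x W x C nested lists -> H x W label map (argmax per pixel)."""
    return [
        [max(range(len(px)), key=px.__getitem__) for px in row]
        for row in scores
    ]

# A 2x2 "image" with 3 hypothetical classes:
# background (0), kidney (1), vessel (2).
scores = [
    [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]],
    [[0.2, 0.7, 0.1],   [0.1, 0.2, 0.7]],
]
labels = segment(scores)
print(labels)  # [[0, 1], [1, 2]]
```

A real network produces those scores with spatial context, which is exactly what this naive per-pixel view lacks; the rest of the module is about the architecture that supplies it.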

§ 01 · THE U: DOWN, THEN BACK UP · Encoder–decoder with conv layers

U-Net, a 2015 architecture for biomedical image segmentation, is shaped like a U. The left side (encoder, contracting path) progressively downsamples the image while building up feature channels: bigger receptive field, lower resolution. The right side (decoder, expanding path) mirrors this: it progressively upsamples, reducing channels, until you're back at the original resolution.

The bottom of the U is the bottleneck: lowest spatial resolution, highest channel count. That’s where the most abstract features live — the ones that capture image-wide context.
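A back-of-envelope sketch of that progression, using the common convention (assumed here, matching the lab figure below) that each encoder level halves the spatial size and doubles the channel count:

```python
# Sketch of U-Net feature-map shapes: each encoder level halves spatial
# resolution and doubles channels; the decoder mirrors it back up.
# The starting size (256) and width (64 channels) are assumptions.

def unet_shapes(size=256, channels=64, levels=4):
    """Return (resolution, channels) pairs down the U and back up."""
    down = [(size >> i, channels << i) for i in range(levels + 1)]
    up = list(reversed(down[:-1]))  # decoder mirrors the encoder
    return down + up

shapes = unet_shapes()
print(shapes[4])              # (16, 1024): bottleneck, lowest res, most channels
print(shapes[0], shapes[-1])  # (256, 64) at both ends of the U
```

The bottleneck pair makes the trade explicit: 16×16 spatial positions, but 1024 channels of abstract, image-wide features at each one.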

Lab · U-Net topology: encoder → bottleneck → decoder, with skip connections. Feature-map resolutions run 256 → 128 → 64 → 32 → 16 (bottleneck), then back up 32 → 64 → 128 → 256. Each encoder layer's output is concatenated to the matching decoder layer's input. That's the "skip connection."

§ 02 · SKIP CONNECTIONS · The trick that makes the U work

If you just downsample and then upsample with convolutions, you lose spatial detail. The decoder has the global context but not the fine-grained where-exactly information. Boundaries come out blurry.

The fix: skip connections, direct connections from each encoder layer to the matching decoder layer, bypassing everything in between. Each encoder layer's output is concatenated to the matching decoder layer's input. So when the decoder is reconstructing high-resolution features, it has direct access to the spatially detailed feature maps from the encoder, plus the abstract features from the bottleneck.
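A minimal sketch of that concatenation, under an assumed toy representation where a feature map is a list of 2D channel grids (real implementations concatenate tensors along the channel axis):

```python
# Toy skip connection: the decoder input at a level is the channel-wise
# concatenation of its own features and the matching encoder features,
# so fine spatial detail rides along unchanged.

def skip_concat(decoder_feats, encoder_feats):
    """Concatenate along the channel axis: C_dec + C_enc channels out."""
    assert len(decoder_feats[0]) == len(encoder_feats[0])  # same height
    return decoder_feats + encoder_feats

dec = [[[5, 5], [5, 5]]]                     # 1 channel: abstract context
enc = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]   # 2 channels: fine detail
merged = skip_concat(dec, enc)
print(len(merged))          # 3 channels feed this decoder level's convs
print(merged[1] == enc[0])  # True: encoder detail passes through unchanged
```

The point of the sketch: concatenation (unlike addition) keeps the encoder's detailed maps intact as separate channels, and the decoder's convolutions learn how to mix them with the abstract ones.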

The result: precise boundaries plus global context. The same model knows that this region is a kidney and exactly where the kidney's edge is.

§ 03 · WHY THIS ARCHITECTURE STUCK · Beyond medical imaging

The original U-Net paper was about segmenting biomedical microscopy. Within a few years, the same architecture (with adjustments) was winning well beyond it: image-to-image translation (Pix2Pix), 3D volumetric segmentation (3D U-Net), and eventually noise prediction in diffusion models.

Three properties made the architecture dominant for dense prediction: the output has the same spatial shape as the input; skip connections preserve fine detail while the bottleneck carries global context; and it trains well from scratch on small datasets.

§ 04 · BEYOND SEGMENTATION: DIFFUSION · How U-Net ended up in every image generator

Around 2020, diffusion models started using U-Net as the noise predictor. The shape of a diffusion model’s job (image-in, same-shape-image-out — predicting the noise to subtract) is identical to the shape of U-Net (image-in, same-shape-mask-out). The architecture transferred directly.
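A toy sketch of that shape match (illustrative only; `predict_noise` is a stand-in constant model, not a real U-Net): the noise predictor maps an image to a same-shape noise estimate, and a denoising step subtracts it.

```python
# Why U-Net fits the diffusion job: the noise predictor is image-in,
# same-shape-image-out, exactly like a segmentation U-Net's mask output.

def predict_noise(x):
    # Stand-in for the U-Net noise predictor. The only structural
    # requirement illustrated here: output shape == input shape.
    return [[1 for _ in row] for row in x]

def denoise_step(x, step_size=1):
    """One simplified denoising step: subtract the predicted noise."""
    eps = predict_noise(x)
    assert len(eps) == len(x) and len(eps[0]) == len(x[0])  # same shape
    return [[xi - step_size * ei for xi, ei in zip(xr, er)]
            for xr, er in zip(x, eps)]

noisy = [[6, 3], [2, 9]]
print(denoise_step(noisy))  # [[5, 2], [1, 8]]
```

Real samplers (DDPM and friends) weight the subtraction with timestep-dependent coefficients and re-add scaled noise between steps; the structural point is only that the predictor's input and output shapes match, which is why the U-Net slots in unchanged.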

Stable Diffusion, DALL·E 2, Imagen, and most open-source image generators have a U-Net at their core. The conv blocks have grown attention layers; the bottleneck has cross-attention to a text embedding (so the model can condition on prompts); the resolution levels have been tuned for latent space rather than pixel space. But the U is there, doing what it always did.

Fig 1 · The modern diffusion U-Net: down 1 → down 2 → down 3 → down 4 → bottleneck (cross-attention to a text embedding, e.g. CLIP) → up 4 → up 3 → up 2 → up 1, with skip connections. Same 2015 U-shape, plus self-attention in the blocks and cross-attention to text. The architecture is recognizable across a decade: the conv blocks gained attention, but the U shape and skip connections are unchanged from 2015.
CHECK · You could in principle drop the skip connections and the network would still run end to end. Why are they essential to U-Net in practice?

§ 05 · TAKING THIS FORWARD · Variants worth knowing

§ · GOING DEEPER · From biomedical segmentation to diffusion's noise predictor

Ronneberger et al.’s 2015 paper introduced U-Net for biomedical image segmentation. The contracting path captures context; the expanding path enables precise localization. The skip connections from contracting to expanding paths preserve fine spatial detail that would otherwise be lost through repeated pooling. The architecture trains well from scratch on small datasets — a critical property for medical imaging where labeled data is scarce.

The same architecture has carried over to image generation. DDPM’s noise prediction network is a U-Net (Ho et al. 2020). Stable Diffusion (Rombach et al. 2022) keeps a U-Net in latent space, with cross-attention layers added to condition on text embeddings. SDXL and later models scale the U-Net up and back down. The U shape with skip connections is one of the most durable architectural patterns of the last decade.

§ · FURTHER READING · References & deeper sources

  1. Ronneberger, Fischer, Brox (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation · MICCAI
  2. Isola, Zhu, Zhou, Efros (2017). Image-to-Image Translation with Conditional Adversarial Networks (Pix2Pix) · CVPR
  3. Ho, Jain, Abbeel (2020). Denoising Diffusion Probabilistic Models (uses U-Net for noise prediction) · NeurIPS
  4. Çiçek et al. (2016). 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation · MICCAI
  5. Rombach et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models · CVPR

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.