Feature Pyramid Network

One-Line Summary: Feature Pyramid Networks (FPN) build a multi-scale feature hierarchy by combining top-down semantically strong features with bottom-up spatially precise features through lateral connections, enabling robust detection of objects at all sizes.

Prerequisites: Convolutional neural networks, residual networks, fast and faster R-CNN, multi-scale detection

What Is Feature Pyramid Network?

Imagine an editor reviewing a satellite image. Looking at the full zoomed-out view, they can identify cities and highways (high-level semantics), but houses and cars are invisible. Zooming into a neighborhood reveals individual structures (spatial detail), but they lose the big-picture context. FPN is like giving the editor a set of annotated overlays: each zoom level retains both the fine detail native to that scale and the contextual understanding from the broader view.

Technically, a Feature Pyramid Network (Lin et al., 2017) augments a standard CNN backbone with a top-down pathway and lateral connections. The bottom-up pathway is the backbone's forward pass, producing feature maps at progressively lower resolutions. The top-down pathway upsamples coarse, semantically rich features and merges them with corresponding bottom-up maps via $1 \times 1$ convolutions, producing a pyramid of feature maps that are all semantically strong.

How It Works

Bottom-Up Pathway

A standard backbone (e.g., ResNet) naturally produces a feature pyramid through its stages. For ResNet, we use the output of each residual block group:

$C_{2}$ : stride 4, $1/4$ spatial resolution
$C_{3}$ : stride 8, $1/8$ spatial resolution
$C_{4}$ : stride 16, $1/16$ spatial resolution
$C_{5}$ : stride 32, $1/32$ spatial resolution

Top-Down Pathway and Lateral Connections

Starting from the coarsest level:

$P_{5} = Conv_{1 \times 1} (C_{5})$ -- reduce $C_{5}$ to $d$ channels (typically $d = 256$ ).
$P_{4} = Conv_{1 \times 1} (C_{4}) + Upsample_{2 \times} (P_{5})$ -- element-wise addition.
$P_{3} = Conv_{1 \times 1} (C_{3}) + Upsample_{2 \times} (P_{4})$
$P_{2} = Conv_{1 \times 1} (C_{2}) + Upsample_{2 \times} (P_{3})$

Each merged map is followed by a $3 \times 3$ convolution to reduce aliasing from upsampling:

$P_{l} = Conv_{3 \times 3} (Conv_{1 \times 1} (C_{l}) + Upsample_{2 \times} (P_{l + 1}))$

An additional $P_{6}$ level is often added via stride-2 convolution on $P_{5}$ for detecting very large objects.

Scale Assignment

Objects are assigned to pyramid levels based on their area:

$l = ⌊ l_{0} + lo g_{2} (w h /224)⌋$

where $l_{0} = 4$ is the canonical level for an object of area $22 4^{2}$ . Small objects go to finer levels ( $P_{2}, P_{3}$ ), large objects to coarser levels ( $P_{4}, P_{5}$ ).

Integration with Detectors

FPN serves as a drop-in feature extractor. Each pyramid level independently feeds:

Faster R-CNN: RPN anchors and RoI heads operate on the assigned level.
RetinaNet: Dense classification and regression heads are applied at every level.
Mask R-CNN: RoI Align extracts features from the appropriate pyramid level.

C2 --[1x1]--> + --[3x3]--> P2 (stride 4)
               ^
C3 --[1x1]--> + --[3x3]--> P3 (stride 8)
               ^
C4 --[1x1]--> + --[3x3]--> P4 (stride 16)
               ^
C5 --[1x1]--> + --[3x3]--> P5 (stride 32)

Why It Matters

FPN improved Faster R-CNN by ~8% AP on COCO (from 33.9% to 36.2% AP with ResNet-50) with negligible extra computation.
Small object detection improved dramatically -- AP for small objects on COCO roughly doubled compared to single-scale baselines.
FPN became ubiquitous: virtually every modern detector (Mask R-CNN, RetinaNet, FCOS, DETR variants) uses FPN or a descendant.
It replaced image pyramids for multi-scale detection, avoiding the 3-4x cost of running the backbone at multiple resolutions.

Key Technical Details

Channel dimension: All pyramid levels use $d = 256$ channels, keeping memory and computation uniform.
Computation overhead: FPN adds ~1-2ms to inference (negligible compared to backbone forward pass).
COCO results (ResNet-101-FPN + Faster R-CNN): 36.2% AP, 59.1% AP50, 39.0% AP75.
Small object AP: 18.2% AP_S with FPN vs. ~10% without, on COCO.
Nearest-neighbor upsampling is used in the original paper; some variants use deconvolution or bilinear interpolation.
FPN variants: PANet (2018) adds a bottom-up path on top of FPN; BiFPN (EfficientDet, 2020) uses weighted bidirectional fusion; NAS-FPN (2019) uses neural architecture search to find the fusion topology.

Common Misconceptions

"FPN replaces the backbone." FPN is built on top of a backbone. It does not change the backbone architecture -- it adds lateral connections and a top-down path.
"Higher-resolution feature maps are always better for small objects." Without FPN's semantic enrichment, high-resolution maps ( $C_{2}$ ) contain mainly low-level features (edges, textures) that are insufficient for classification. FPN's top-down pathway adds the semantic context needed.
"FPN adds significant computation." The lateral connections and top-down path use $1 \times 1$ and $3 \times 3$ convolutions with only 256 channels, adding less than 5% to the backbone's FLOPs.

Connections to Other Concepts

multi-scale-detection.md: FPN is the modern answer to the image pyramid approach, providing multi-scale features without multi-scale input.
fast-and-faster-rcnn.md: FPN is most commonly paired with Faster R-CNN's two-stage design.
ssd.md: Also performs multi-scale detection but uses different feature maps directly from the backbone without top-down enrichment.
focal-loss.md: RetinaNet pairs FPN with focal loss for a strong single-stage detector.
yolo.md: YOLOv3 and later adopted FPN-like feature fusion for multi-scale prediction.