One-Line Summary: Feature Pyramid Networks (FPN) build a multi-scale feature hierarchy by combining top-down semantically strong features with bottom-up spatially precise features through lateral connections, enabling robust detection of objects at all sizes.
Prerequisites: Convolutional neural networks, residual networks, Fast and Faster R-CNN, multi-scale detection
What Is a Feature Pyramid Network?
Imagine an editor reviewing a satellite image. Looking at the full zoomed-out view, they can identify cities and highways (high-level semantics), but houses and cars are invisible. Zooming into a neighborhood reveals individual structures (spatial detail), but they lose the big-picture context. FPN is like giving the editor a set of annotated overlays: each zoom level retains both the fine detail native to that scale and the contextual understanding from the broader view.
Technically, a Feature Pyramid Network (Lin et al., 2017) augments a standard CNN backbone with a top-down pathway and lateral connections. The bottom-up pathway is the backbone's forward pass, producing feature maps at progressively lower resolutions. The top-down pathway upsamples coarse, semantically rich features and merges them with the corresponding bottom-up maps, which are first passed through lateral $1 \times 1$ convolutions. The result is a pyramid of feature maps that are all semantically strong.
How It Works
Bottom-Up Pathway
A standard backbone (e.g., ResNet) naturally produces a feature pyramid through its stages. For ResNet, we use the output of each residual block group:
- $C_2$: stride 4, spatial resolution $H/4 \times W/4$
- $C_3$: stride 8, spatial resolution $H/8 \times W/8$
- $C_4$: stride 16, spatial resolution $H/16 \times W/16$
- $C_5$: stride 32, spatial resolution $H/32 \times W/32$
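To make the strides concrete, the sketch below computes the spatial size of each bottom-up map for a given input. The function name `bottom_up_shapes` is hypothetical, and ceiling division is an assumption: the exact rounding in a real backbone depends on its padding scheme.

```python
import math

# Strides of the ResNet stages used by FPN, as listed above.
STRIDES = {"C2": 4, "C3": 8, "C4": 16, "C5": 32}


def bottom_up_shapes(height, width):
    """Return {level: (H_out, W_out)} for an input of size height x width.

    Assumes each stage's output is ceil(input / stride); real backbones
    may round differently depending on padding.
    """
    return {
        name: (math.ceil(height / stride), math.ceil(width / stride))
        for name, stride in STRIDES.items()
    }


if __name__ == "__main__":
    for name, (h, w) in bottom_up_shapes(800, 1216).items():
        print(f"{name}: {h} x {w}")
```

For an 800 x 1216 input this gives 200 x 304 at the finest level and 25 x 38 at the coarsest, which shows how quickly spatial detail is lost as the stride grows.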
Top-Down Pathway and Lateral Connections
Starting from the coarsest level:
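- $M_5 = \text{Conv}_{1 \times 1}(C_5)$ -- reduce to $d$ channels (typically $d = 256$).
- $M_i = \text{Upsample}_{2\times}(M_{i+1}) + \text{Conv}_{1 \times 1}(C_i)$ for $i = 4, 3, 2$ -- element-wise addition.
Each merged map is followed by a $3 \times 3$ convolution to reduce aliasing from upsampling:
$$P_i = \text{Conv}_{3 \times 3}(M_i)$$
An additional level $P_6$ is often added via stride-2 convolution on $P_5$ for detecting very large objects.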
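The merge step above can be illustrated with a single-channel toy in plain Python. This is only a sketch: the learned lateral and smoothing convolutions are left out (treated as identity), each level is assumed to be exactly twice the size of the next coarser one, and the function names are hypothetical.

```python
def upsample_nearest_2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2D list (one channel)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def merge(top, lateral):
    """Element-wise sum of the upsampled coarser map and the lateral map."""
    up = upsample_nearest_2x(top)
    if len(up) != len(lateral) or len(up[0]) != len(lateral[0]):
        raise ValueError("lateral map must be exactly 2x the coarser map")
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]


def top_down(laterals):
    """Run the top-down pathway over maps ordered fine to coarse.

    Returns the merged maps in the same order. The smoothing convolution
    applied after each merge in a real FPN is omitted here.
    """
    merged = [laterals[-1]]                # the coarsest level starts the pathway
    for lateral in reversed(laterals[:-1]):
        merged.append(merge(merged[-1], lateral))
    return merged[::-1]


if __name__ == "__main__":
    fine = [[100] * 4 for _ in range(4)]   # stands in for a fine lateral map
    mid = [[10] * 2 for _ in range(2)]
    coarse = [[1]]
    for level in top_down([fine, mid, coarse]):
        print(level)
```

In the demo the value from the coarsest map reaches every position of the finest map, which is the sense in which the top-down pathway spreads semantic context to high-resolution levels.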
Scale Assignment
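Objects are assigned to pyramid levels based on their area:
$$k = \left\lfloor k_0 + \log_2\left(\sqrt{wh} / 224\right) \right\rfloor$$
where $w$ and $h$ are the object's width and height, and $k_0 = 4$ is the canonical level for an object of area $224^2$. Small objects go to finer levels ($P_2$), large objects to coarser levels ($P_5$).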
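The assignment rule can be sketched as a small function. Clamping the result to the range of available levels is an assumption made here so that very small or very large boxes still map to a valid level. The name `assign_level` and its parameter names are hypothetical.

```python
import math


def assign_level(w, h, k0=4, canonical_size=224, k_min=2, k_max=5):
    """Pyramid level for a box of width w and height h (in input pixels).

    Computes floor(k0 + log2(sqrt(w * h) / canonical_size)) and clamps the
    result to [k_min, k_max]. The clamping is an assumption of this sketch.
    """
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical_size))
    return max(k_min, min(k_max, k))


if __name__ == "__main__":
    for box in [(32, 32), (112, 112), (224, 224), (448, 448)]:
        print(box, "-> level", assign_level(*box))
```

With these defaults a 224 x 224 box maps to level 4 and a 112 x 112 box to level 3, so halving the side length moves the box one level finer.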
Integration with Detectors
FPN serves as a drop-in feature extractor. Each pyramid level independently feeds:
- Faster R-CNN: The RPN head runs on every level with anchors of one scale per level, and the RoI head pools each proposal from its assigned level.
- RetinaNet: Dense classification and regression heads are applied at every level.
- Mask R-CNN: RoI Align extracts features from the appropriate pyramid level.
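As an illustration of how an RPN can use the pyramid, the sketch below builds the anchor shapes for each level. The anchor sizes (32 to 512 pixels on P2 to P6) and aspect ratios (1:2, 1:1, 2:1) follow the configuration described by Lin et al. (2017). The function name and the convention that a ratio means height divided by width are assumptions of this sketch.

```python
import math

# Anchor side length per level and aspect ratios (height / width),
# following the configuration described in Lin et al. (2017).
ANCHOR_SIZE = {"P2": 32, "P3": 64, "P4": 128, "P5": 256, "P6": 512}
ASPECT_RATIOS = (0.5, 1.0, 2.0)


def anchor_shapes(level):
    """Return (width, height) pairs with area size**2 for one pyramid level."""
    size = ANCHOR_SIZE[level]
    return [(size / math.sqrt(r), size * math.sqrt(r)) for r in ASPECT_RATIOS]


if __name__ == "__main__":
    for level in ANCHOR_SIZE:
        shapes = ", ".join(f"{w:.0f}x{h:.0f}" for w, h in anchor_shapes(level))
        print(f"{level}: {shapes}")
```

Because each level carries a single anchor scale, the pyramid itself handles scale variation and only the aspect ratio varies within a level.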
C2 --[1x1]--> + --[3x3]--> P2 (stride 4)
              ^
C3 --[1x1]--> + --[3x3]--> P3 (stride 8)
              ^
C4 --[1x1]--> + --[3x3]--> P4 (stride 16)
              ^
C5 --[1x1]--> + --[3x3]--> P5 (stride 32)

Why It Matters
- FPN improved Faster R-CNN by about 2.3 AP points on COCO with ResNet-50 (from 31.6% to 33.9% AP, a relative gain of roughly 7%, as reported by Lin et al., 2017) with negligible extra computation.
- Small object detection improved dramatically -- AP for small objects on COCO roughly doubled compared to single-scale baselines.
- FPN became ubiquitous: virtually every modern detector (Mask R-CNN, RetinaNet, FCOS, DETR variants) uses FPN or a descendant.
- It replaced image pyramids for multi-scale detection, avoiding the 3-4x cost of running the backbone at multiple resolutions.
Key Technical Details
- Channel dimension: All pyramid levels use $d = 256$ channels, keeping memory and computation uniform.
- Computation overhead: FPN adds ~1-2ms to inference (negligible compared to backbone forward pass).
- COCO results (ResNet-101-FPN + Faster R-CNN): 36.2% AP, 59.1% AP50, 39.0% AP_M.
- Small object AP: 18.2% AP_S with FPN vs. ~10% without, on COCO.
- Nearest-neighbor upsampling is used in the original paper; some variants use deconvolution or bilinear interpolation.
- FPN variants: PANet (2018) adds a bottom-up path on top of FPN; BiFPN (EfficientDet, 2020) uses weighted bidirectional fusion; NAS-FPN (2019) uses neural architecture search to find the fusion topology.
Common Misconceptions
- "FPN replaces the backbone." FPN is built on top of a backbone. It does not change the backbone architecture -- it adds lateral connections and a top-down path.
- "Higher-resolution feature maps are always better for small objects." Without FPN's semantic enrichment, high-resolution maps ($C_2$) contain mainly low-level features (edges, textures) that are insufficient for classification. FPN's top-down pathway adds the semantic context needed.
- "FPN adds significant computation." The lateral connections and top-down path use $1 \times 1$ and $3 \times 3$ convolutions with only 256 channels, adding less than 5% to the backbone's FLOPs.
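A rough way to see how light the added layers are is to count their parameters. The sketch below assumes ResNet-50/101 stage widths (256, 512, 1024, 2048 channels), one lateral 1x1 and one smoothing 3x3 convolution per level, biases on every convolution, and no extra P6 level. Parameter count is only a proxy: FLOPs also depend on the spatial size of each level.

```python
# Output channels of the ResNet-50/101 stages C2..C5.
BACKBONE_CHANNELS = {"C2": 256, "C3": 512, "C4": 1024, "C5": 2048}


def conv_params(c_in, c_out, kernel):
    """Weights plus biases of a single kernel x kernel convolution."""
    return c_in * c_out * kernel * kernel + c_out


def fpn_params(d=256):
    """Return (lateral, smoothing) parameter counts for a four-level FPN."""
    lateral = sum(conv_params(c, d, 1) for c in BACKBONE_CHANNELS.values())
    smoothing = len(BACKBONE_CHANNELS) * conv_params(d, d, 3)
    return lateral, smoothing


if __name__ == "__main__":
    lateral, smoothing = fpn_params()
    print(f"lateral 1x1 convs: {lateral:,}")
    print(f"smoothing 3x3 convs: {smoothing:,}")
    print(f"total: {lateral + smoothing:,}")
```

Under these assumptions the FPN layers total about 3.3 million parameters, most of them in the four 3x3 smoothing convolutions.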
Connections to Other Concepts
- multi-scale-detection.md: FPN is the modern answer to the image pyramid approach, providing multi-scale features without multi-scale input.
- fast-and-faster-rcnn.md: FPN is most commonly paired with Faster R-CNN's two-stage design.
- ssd.md: Also performs multi-scale detection but uses different feature maps directly from the backbone without top-down enrichment.
- focal-loss.md: RetinaNet pairs FPN with focal loss for a strong single-stage detector.
- yolo.md: YOLOv3 and later adopted FPN-like feature fusion for multi-scale prediction.
Further Reading
- Lin et al., "Feature Pyramid Networks for Object Detection" (2017) -- The original FPN paper.
- Liu et al., "Path Aggregation Network for Instance Segmentation" (2018) -- PANet, adding a bottom-up augmentation path to FPN.
- Tan et al., "EfficientDet: Scalable and Efficient Object Detection" (2020) -- BiFPN with weighted bidirectional feature fusion.
- Ghiasi et al., "NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection" (2019) -- Using NAS to discover optimal FPN topology.