One-Line Summary: Fast R-CNN shares convolutional computation across all proposals via RoI pooling and trains end-to-end, while Faster R-CNN replaces external proposals with a learned Region Proposal Network (RPN) to achieve near-real-time detection at ~5 FPS.

Prerequisites: R-CNN, convolutional neural networks, region proposals, bounding box regression, multi-task learning

What Is Fast / Faster R-CNN?

R-CNN is like sending a separate photographer to every suspicious area in a city. Fast R-CNN is like mounting one camera on a helicopter, taking a single panoramic photo, then cropping and zooming into each area of interest digitally -- same analysis quality, vastly less work. Faster R-CNN goes further: instead of relying on a tip line (Selective Search) to identify suspicious areas, it trains a dedicated scout (the RPN) that examines the panoramic photo and suggests where to look, all within the same system.

Fast R-CNN (Girshick, 2015) processes the entire image through a CNN once, extracts fixed-size features for each proposal via RoI pooling, and jointly trains classification and bounding box regression in a single network. Faster R-CNN (Ren et al., 2015) replaces Selective Search with a Region Proposal Network that shares convolutional features with the detector, creating the first fully end-to-end trainable detection pipeline.

How It Works

Fast R-CNN

  1. Shared feature map: The entire image is passed through a CNN backbone (e.g., VGG-16) to produce a convolutional feature map.
  2. RoI Pooling: For each proposal (still from Selective Search), project its coordinates onto the feature map and divide the region into a fixed H x W grid (e.g., 7x7). Max-pool within each cell to produce a fixed-size output: an RoI of size h x w is split into sub-windows of approximate size h/H x w/W, and each sub-window is max-pooled to a single value.

  3. Head network: RoI-pooled features pass through fully connected layers.
  4. Multi-task loss: Two sibling output layers -- a softmax classifier and a bounding box regressor -- are trained jointly:

     L(p, u, t^u, v) = L_cls(p, u) + λ [u ≥ 1] L_loc(t^u, v)

where p is the predicted class distribution, u is the true class (u = 0 is background), t^u are the predicted box offsets for class u, and v is the ground-truth box. L_loc uses a smooth L1 loss, and the Iverson bracket [u ≥ 1] disables regression for background RoIs. λ = 1 by default.
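The RoI pooling step above can be sketched in pure Python for a single-channel feature map; the function name and toy sizes are illustrative, not from the paper:

```python
# Minimal RoI pooling sketch (single channel, pure Python).
# An RoI of h x w feature-map cells is divided into an
# output_size x output_size grid of sub-windows (~h/H x w/W each),
# and each sub-window is max-pooled to one value.

def roi_pool(feature_map, roi, output_size):
    """roi = (x0, y0, x1, y1) in feature-map cells, end-exclusive."""
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    out = []
    for i in range(output_size):
        row = []
        # Sub-window row boundaries; max(..., ys + 1) handles RoIs
        # smaller than the output grid (cells get replicated).
        ys = y0 + i * h // output_size
        ye = y0 + (i + 1) * h // output_size
        for j in range(output_size):
            xs = x0 + j * w // output_size
            xe = x0 + (j + 1) * w // output_size
            row.append(max(feature_map[y][x]
                           for y in range(ys, max(ye, ys + 1))
                           for x in range(xs, max(xe, xs + 1))))
        out.append(row)
    return out
```

Whatever the RoI's size, the output is always `output_size x output_size`, which is what lets proposals of arbitrary shape feed fixed-size fully connected layers.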

Faster R-CNN: Region Proposal Network

The RPN is a small fully convolutional network that slides over the shared feature map:

  1. At each spatial location, place k anchor boxes of different scales and aspect ratios (typically k = 9: 3 scales x 3 ratios).
  2. A 3x3 conv layer followed by two sibling 1x1 conv layers predicts:
    • Objectness score: 2 values (object vs. background) per anchor.
    • Box regression: 4 values (t_x, t_y, t_w, t_h) per anchor.
  3. Proposals are generated by applying the predicted offsets to anchors, clipping to image boundaries, and running NMS (IoU threshold 0.7) to yield ~300 proposals.
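The proposal path above can be sketched in plain Python. The anchor areas and ratios follow the paper's 3x3 configuration; the function names and the greedy NMS implementation are illustrative:

```python
# RPN proposal sketch: anchors at one location, offset decoding, greedy NMS.
import math

def make_anchors(cx, cy, areas=(128**2, 256**2, 512**2),
                 ratios=(1.0, 0.5, 2.0)):
    """Return (x0, y0, x1, y1) anchors centered at (cx, cy); r = height/width."""
    anchors = []
    for area in areas:
        for r in ratios:
            w = math.sqrt(area / r)
            h = w * r
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

def decode(anchor, t):
    """Apply predicted offsets (t_x, t_y, t_w, t_h) to an anchor box."""
    x0, y0, x1, y1 = anchor
    wa, ha = x1 - x0, y1 - y0
    cx = (x0 + x1) / 2 + t[0] * wa        # shift center by fraction of size
    cy = (y0 + y1) / 2 + t[1] * ha
    w, h = wa * math.exp(t[2]), ha * math.exp(t[3])  # log-space scaling
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def iou(a, b):
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.7):
    """Greedy NMS: keep highest-scoring boxes, drop overlaps above thresh."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```

With zero offsets, `decode` returns the anchor unchanged; in the real network the offsets come from the 1x1 regression conv, and NMS over the scored, decoded anchors yields the ~300 surviving proposals.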

Training Faster R-CNN

The original paper describes 4-step alternating training:

  1. Train RPN initialized from ImageNet-pretrained backbone.
  2. Train Fast R-CNN using proposals from step 1 (separate backbone).
  3. Re-initialize RPN with the Fast R-CNN backbone, fine-tune only RPN layers.
  4. Fine-tune Fast R-CNN head with proposals from step 3 (shared backbone frozen).

Later work showed joint end-to-end training (combining RPN and detection losses) works comparably and is simpler.
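A minimal sketch of how the joint objective combines the four terms, using the smooth L1 form from Fast R-CNN. Function names are illustrative, and real implementations average each term over sampled mini-batches with normalization constants:

```python
# Joint training loss sketch: RPN and detection losses summed and
# backpropagated together (single-sample version for illustration).
import math

def smooth_l1(x):
    """Smooth L1 (Huber-like) loss used for box regression."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def cross_entropy(probs, true_class):
    return -math.log(probs[true_class])

def joint_loss(rpn_probs, rpn_label, rpn_t, rpn_v,
               det_probs, det_label, det_t, det_v, lam=1.0):
    """Classification + regression terms for both stages.
    Regression counts only for positive labels (label >= 1)."""
    loss = cross_entropy(rpn_probs, rpn_label) + cross_entropy(det_probs, det_label)
    if rpn_label >= 1:
        loss += lam * sum(smooth_l1(a - b) for a, b in zip(rpn_t, rpn_v))
    if det_label >= 1:
        loss += lam * sum(smooth_l1(a - b) for a, b in zip(det_t, det_v))
    return loss
```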

Architecture Summary

Image -> Backbone CNN -> Shared Feature Map
                              |
              +---------------+----------------+
              |                                |
         RPN (anchors)                         |
              |                                v
         ~300 proposals --------------> RoI Pooling
                                       (per proposal)
                                               |
                                           FC layers
                                           |       |
                                       cls head  bbox head

Why It Matters

  1. Fast R-CNN achieved a 9x training speedup and 146x inference speedup over R-CNN (0.32s vs. 47s per image, excluding proposal time; 213x with truncated SVD).
  2. Faster R-CNN achieved ~5 FPS (VGG-16) and ~17 FPS (ZF-Net), making near-real-time detection feasible.
  3. End-to-end training eliminated the disconnected SVM and regressor stages of R-CNN, simplifying the pipeline and improving accuracy.
  4. The RPN + anchor concept became foundational -- adopted by SSD, RetinaNet, and many subsequent detectors.
  5. Faster R-CNN remains a strong baseline: with modern backbones (ResNet-101 + FPN), it achieves ~42% AP on COCO.

Key Technical Details

  • Fast R-CNN with VGG-16: 66.9% mAP on VOC 2007, 19.7% mAP on COCO (no bells and whistles). Training takes ~9.5 hours on a single GPU.
  • Faster R-CNN with VGG-16: 73.2% mAP on VOC 2007, 21.9% mAP on COCO. Proposal generation takes ~10ms on GPU.
  • RPN recall: ~98% at IoU 0.5 with 300 proposals, rivaling Selective Search with 2,000 proposals.
  • Anchor configuration: Original paper uses anchor areas of 128^2, 256^2, and 512^2 pixels with aspect ratios 1:1, 1:2, 2:1, yielding 9 anchors per location.
  • RoI Pooling quantization: The coordinate rounding in RoI Pooling introduces spatial misalignment. Mask R-CNN later introduced RoI Align with bilinear interpolation to fix this, gaining ~1-2% AP.
  • Speed breakdown (Faster R-CNN, VGG-16): backbone ~120ms, RPN ~10ms, RoI head ~70ms, total ~200ms per image.
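The RoI Pooling quantization issue above can be made concrete with a little arithmetic, assuming VGG-16's feature stride of 16:

```python
# Worked example of RoI Pooling's coordinate rounding: projecting an RoI
# boundary to the feature map, rounding to a whole cell, and projecting back
# can shift it by up to ~half a stride (8 px here). RoI Align avoids this
# by sampling with bilinear interpolation instead of rounding.

STRIDE = 16  # VGG-16 feature stride

def quantized_px(x_px):
    """Pixel coord -> feature-map cell (rounded) -> back to pixels."""
    return round(x_px / STRIDE) * STRIDE

x = 93.7                          # an RoI boundary in image pixels
err = abs(quantized_px(x) - x)    # ~2.3 px of misalignment
```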

Common Misconceptions

  • "Fast R-CNN eliminated all bottlenecks." Selective Search still takes ~2 seconds per image, dominating total inference time. Faster R-CNN solved this.
  • "RPN generates class-specific proposals." The RPN is class-agnostic -- it predicts only objectness (object vs. background). Class labels are assigned by the second-stage detector.
  • "More anchors always improve detection." Excessive anchors increase computation and can hurt training stability. The 9-anchor configuration captures most object shapes adequately; diminishing returns set in quickly.

Connections to Other Concepts

  • r-cnn.md: The predecessor whose per-proposal CNN passes motivated the shared computation in Fast R-CNN.
  • feature-pyramid-network.md: Added to Faster R-CNN to enable multi-scale detection, forming the dominant two-stage detector baseline.
  • non-maximum-suppression.md: Used both inside the RPN (to filter proposals) and at the final detection stage.
  • yolo.md: A single-stage alternative that bypasses the proposal stage entirely for higher speed.
  • detr.md: Eliminates both anchors and NMS using a transformer-based set prediction approach.

Further Reading

  • Girshick, "Fast R-CNN" (2015) -- Introduced RoI pooling and multi-task end-to-end training for detection.
  • Ren et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks" (2015) -- Introduced the RPN and anchor-based proposal generation.
  • He et al., "Mask R-CNN" (2017) -- Extended Faster R-CNN with RoI Align and a mask prediction branch for instance segmentation.
  • Lin et al., "Feature Pyramid Networks for Object Detection" (2017) -- Multi-scale feature extraction that significantly boosts Faster R-CNN performance.