One-Line Summary: Fast R-CNN shares convolutional computation across all proposals via RoI pooling and trains end-to-end, while Faster R-CNN replaces external proposals with a learned Region Proposal Network (RPN) to achieve near-real-time detection at ~5 FPS.

Prerequisites: R-CNN, convolutional neural networks, region proposals, bounding box regression, multi-task learning

What Is Fast / Faster R-CNN?

R-CNN is like sending a separate photographer to every suspicious area in a city. Fast R-CNN is like mounting one camera on a helicopter, taking a single panoramic photo, then cropping and zooming into each area of interest digitally -- same analysis quality, vastly less work. Faster R-CNN goes further: instead of relying on a tip line (Selective Search) to identify suspicious areas, it trains a dedicated scout (the RPN) that examines the panoramic photo and suggests where to look, all within the same system.

Fast R-CNN (Girshick, 2015) processes the entire image through a CNN once, extracts fixed-size features for each proposal via RoI pooling, and jointly trains classification and bounding box regression in a single network. Faster R-CNN (Ren et al., 2015) replaces Selective Search with a Region Proposal Network that shares convolutional features with the detector, creating the first fully end-to-end trainable detection pipeline.

How It Works

Fast R-CNN

  1. Shared feature map: The entire image is passed through a CNN backbone (e.g., VGG-16) to produce a convolutional feature map.
  2. RoI Pooling: For each proposal (still from Selective Search), project its coordinates onto the feature map and divide the region into a fixed H x W grid (e.g., 7x7). Max-pool within each cell to produce a fixed-size output: an RoI of size h x w is split into sub-windows of approximate size h/H x w/W, and each sub-window is max-pooled to a single value.

  3. Head network: RoI-pooled features pass through fully connected layers.
  4. Multi-task loss: Two sibling output layers -- a softmax classifier and a bounding box regressor -- are trained jointly:

     L(p, u, t^u, v) = L_cls(p, u) + λ [u ≥ 1] L_loc(t^u, v)

where p is the predicted class distribution, u is the true class (u = 0 is background), t^u are the predicted box offsets for class u, and v is the ground-truth box. L_loc uses a smooth L1 loss, and the Iverson bracket [u ≥ 1] disables regression for background RoIs. λ = 1 by default.
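The RoI pooling step above can be sketched in pure Python for a single-channel feature map; the function name and toy sizes are illustrative, not from the paper:

```python
# Minimal RoI pooling sketch (single channel, pure Python).
# An RoI of h x w feature-map cells is divided into an
# output_size x output_size grid of sub-windows (~h/H x w/W each),
# and each sub-window is max-pooled to one value.

def roi_pool(feature_map, roi, output_size):
    """roi = (x0, y0, x1, y1) in feature-map cells, end-exclusive."""
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    out = []
    for i in range(output_size):
        row = []
        # Sub-window row boundaries; max(..., ys + 1) handles RoIs
        # smaller than the output grid (cells get replicated).
        ys = y0 + i * h // output_size
        ye = y0 + (i + 1) * h // output_size
        for j in range(output_size):
            xs = x0 + j * w // output_size
            xe = x0 + (j + 1) * w // output_size
            row.append(max(feature_map[y][x]
                           for y in range(ys, max(ye, ys + 1))
                           for x in range(xs, max(xe, xs + 1))))
        out.append(row)
    return out
```

Whatever the RoI's size, the output is always `output_size x output_size`, which is what lets proposals of arbitrary shape feed fixed-size fully connected layers.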

Faster R-CNN: Region Proposal Network

The RPN is a small fully convolutional network that slides over the shared feature map:

  1. At each spatial location, place k anchor boxes of different scales and aspect ratios (typically k = 9: 3 scales x 3 ratios).
  2. A 3x3 conv layer followed by two sibling 1x1 conv layers predicts:
    • Objectness score: 2 values (object vs. background) per anchor.
    • Box regression: 4 values (t_x, t_y, t_w, t_h) per anchor.
  3. Proposals are generated by applying the predicted offsets to anchors, clipping to image boundaries, and running NMS (IoU threshold 0.7) to yield ~300 proposals.
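The proposal path above can be sketched in plain Python. The anchor areas and ratios follow the paper's 3x3 configuration; the function names and the greedy NMS implementation are illustrative:

```python
# RPN proposal sketch: anchors at one location, offset decoding, greedy NMS.
import math

def make_anchors(cx, cy, areas=(128**2, 256**2, 512**2),
                 ratios=(1.0, 0.5, 2.0)):
    """Return (x0, y0, x1, y1) anchors centered at (cx, cy); r = height/width."""
    anchors = []
    for area in areas:
        for r in ratios:
            w = math.sqrt(area / r)
            h = w * r
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

def decode(anchor, t):
    """Apply predicted offsets (t_x, t_y, t_w, t_h) to an anchor box."""
    x0, y0, x1, y1 = anchor
    wa, ha = x1 - x0, y1 - y0
    cx = (x0 + x1) / 2 + t[0] * wa        # shift center by fraction of size
    cy = (y0 + y1) / 2 + t[1] * ha
    w, h = wa * math.exp(t[2]), ha * math.exp(t[3])  # log-space scaling
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def iou(a, b):
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.7):
    """Greedy NMS: keep highest-scoring boxes, drop overlaps above thresh."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```

With zero offsets, `decode` returns the anchor unchanged; in the real network the offsets come from the 1x1 regression conv, and NMS over the scored, decoded anchors yields the ~300 surviving proposals.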

Training Faster R-CNN

The original paper describes 4-step alternating training:

  1. Train RPN initialized from ImageNet-pretrained backbone.
  2. Train Fast R-CNN using proposals from step 1 (separate backbone).
  3. Re-initialize RPN with the Fast R-CNN backbone, fine-tune only RPN layers.
  4. Fine-tune Fast R-CNN head with proposals from step 3 (shared backbone frozen).

Later work showed joint end-to-end training (combining RPN and detection losses) works comparably and is simpler.
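A minimal sketch of how the joint objective combines the four terms, using the smooth L1 form from Fast R-CNN. Function names are illustrative, and real implementations average each term over sampled mini-batches with normalization constants:

```python
# Joint training loss sketch: RPN and detection losses summed and
# backpropagated together (single-sample version for illustration).
import math

def smooth_l1(x):
    """Smooth L1 (Huber-like) loss used for box regression."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def cross_entropy(probs, true_class):
    return -math.log(probs[true_class])

def joint_loss(rpn_probs, rpn_label, rpn_t, rpn_v,
               det_probs, det_label, det_t, det_v, lam=1.0):
    """Classification + regression terms for both stages.
    Regression counts only for positive labels (label >= 1)."""
    loss = cross_entropy(rpn_probs, rpn_label) + cross_entropy(det_probs, det_label)
    if rpn_label >= 1:
        loss += lam * sum(smooth_l1(a - b) for a, b in zip(rpn_t, rpn_v))
    if det_label >= 1:
        loss += lam * sum(smooth_l1(a - b) for a, b in zip(det_t, det_v))
    return loss
```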

Architecture Summary

Image -> Backbone CNN -> Shared Feature Map
                              |
              +---------------+----------------+
              |                                |
         RPN (anchors)                         |
              |                                v
         ~300 proposals --------------> RoI Pooling
                                       (per proposal)
                                               |
                                           FC layers
                                           |       |
                                       cls head  bbox head

Why It Matters

  1. Fast R-CNN achieved a 9x training speedup and 146x inference speedup over R-CNN (0.32s vs. 47s per image, excluding proposal time; 213x with truncated SVD).
  2. Faster R-CNN achieved ~5 FPS (VGG-16) and ~17 FPS (ZF-Net), making near-real-time detection feasible.
  3. End-to-end training eliminated the disconnected SVM and regressor stages of R-CNN, simplifying the pipeline and improving accuracy.
  4. The RPN + anchor concept became foundational -- adopted by SSD, RetinaNet, and many subsequent detectors.
  5. Faster R-CNN remains a strong baseline: with modern backbones (ResNet-101 + FPN), it achieves ~42% AP on COCO.

Key Technical Details

  • Fast R-CNN with VGG-16: 66.9% mAP on VOC 2007, 19.7% mAP on COCO (no bells and whistles). Training takes ~9.5 hours on a single GPU.
  • Faster R-CNN with VGG-16: 73.2% mAP on VOC 2007, 21.9% mAP on COCO. Proposal generation takes ~10ms on GPU.
  • RPN recall: ~98% at IoU 0.5 with 300 proposals, rivaling Selective Search with 2,000 proposals.
  • Anchor configuration: Original paper uses anchor areas of 128^2, 256^2, and 512^2 pixels with aspect ratios 1:1, 1:2, 2:1, yielding 9 anchors per location.
  • RoI Pooling quantization: The coordinate rounding in RoI Pooling introduces spatial misalignment. Mask R-CNN later introduced RoI Align with bilinear interpolation to fix this, gaining ~1-2% AP.
  • Speed breakdown (Faster R-CNN, VGG-16): backbone ~120ms, RPN ~10ms, RoI head ~70ms, total ~200ms per image.
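The RoI Pooling quantization issue above can be made concrete with a little arithmetic, assuming VGG-16's feature stride of 16:

```python
# Worked example of RoI Pooling's coordinate rounding: projecting an RoI
# boundary to the feature map, rounding to a whole cell, and projecting back
# can shift it by up to ~half a stride (8 px here). RoI Align avoids this
# by sampling with bilinear interpolation instead of rounding.

STRIDE = 16  # VGG-16 feature stride

def quantized_px(x_px):
    """Pixel coord -> feature-map cell (rounded) -> back to pixels."""
    return round(x_px / STRIDE) * STRIDE

x = 93.7                          # an RoI boundary in image pixels
err = abs(quantized_px(x) - x)    # ~2.3 px of misalignment
```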

Common Misconceptions

  • "Fast R-CNN eliminated all bottlenecks." Selective Search still takes ~2 seconds per image, dominating total inference time. Faster R-CNN solved this.
  • "RPN generates class-specific proposals." The RPN is class-agnostic -- it predicts only objectness (object vs. background). Class labels are assigned by the second-stage detector.
  • "More anchors always improve detection." Excessive anchors increase computation and can hurt training stability. The 9-anchor configuration captures most object shapes adequately; diminishing returns set in quickly.

Connections to Other Concepts

  • r-cnn.md: The predecessor whose per-proposal CNN passes motivated the shared computation in Fast R-CNN.
  • feature-pyramid-network.md: Added to Faster R-CNN to enable multi-scale detection, forming the dominant two-stage detector baseline.
  • non-maximum-suppression.md: Used both inside the RPN (to filter proposals) and at the final detection stage.
  • yolo.md: A single-stage alternative that bypasses the proposal stage entirely for higher speed.
  • detr.md: Eliminates both anchors and NMS using a transformer-based set prediction approach.

Further Reading

  • Girshick, "Fast R-CNN" (2015) -- Introduced RoI pooling and multi-task end-to-end training for detection.
  • Ren et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks" (2015) -- Introduced the RPN and anchor-based proposal generation.
  • He et al., "Mask R-CNN" (2017) -- Extended Faster R-CNN with RoI Align and a mask prediction branch for instance segmentation.
  • Lin et al., "Feature Pyramid Networks for Object Detection" (2017) -- Multi-scale feature extraction that significantly boosts Faster R-CNN performance.