One-Line Summary: Fast R-CNN shares convolutional computation across all proposals via RoI pooling and trains end-to-end, while Faster R-CNN replaces external proposals with a learned Region Proposal Network (RPN) to achieve near-real-time detection at ~5 FPS.
Prerequisites: R-CNN, convolutional neural networks, region proposals, bounding box regression, multi-task learning
What Is Fast / Faster R-CNN?
R-CNN is like sending a separate photographer to every suspicious area in a city. Fast R-CNN is like mounting one camera on a helicopter, taking a single panoramic photo, then cropping and zooming into each area of interest digitally -- same analysis quality, vastly less work. Faster R-CNN goes further: instead of relying on a tip line (Selective Search) to identify suspicious areas, it trains a dedicated scout (the RPN) that examines the panoramic photo and suggests where to look, all within the same system.
Fast R-CNN (Girshick, 2015) processes the entire image through a CNN once, extracts fixed-size features for each proposal via RoI pooling, and jointly trains classification and bounding box regression in a single network. Faster R-CNN (Ren et al., 2015) replaces Selective Search with a Region Proposal Network that shares convolutional features with the detector, creating the first fully end-to-end trainable detection pipeline.
How It Works
Fast R-CNN
- Shared feature map: The entire image is passed through a CNN backbone (e.g., VGG-16) to produce a convolutional feature map.
- RoI Pooling: For each proposal (still from Selective Search), project its coordinates onto the feature map and divide the region into a fixed grid (e.g., $7 \times 7$). Max-pool within each cell to produce a fixed-size output of shape $C \times 7 \times 7$, where $C$ is the number of feature channels.
- Head network: RoI-pooled features pass through fully connected layers.
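- Multi-task loss: Two sibling output layers -- a softmax classifier and a bounding box regressor -- are trained jointly:
$$L(p, u, t^u, v) = L_{\text{cls}}(p, u) + \lambda \, [u \geq 1] \, L_{\text{loc}}(t^u, v)$$
where $p$ is the predicted class distribution, $u$ is the true class, $t^u$ are the predicted box offsets for class $u$, and $v$ is the ground-truth box. $L_{\text{cls}}(p, u) = -\log p_u$ is the log loss, the indicator $[u \geq 1]$ switches off the localization term for background RoIs ($u = 0$), and $L_{\text{loc}}$ uses a smooth $L_1$ loss. $\lambda = 1$ by default.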
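The RoI pooling step above can be illustrated with a minimal pure-Python sketch. It handles a single channel and a single RoI, and the function name `roi_max_pool`, its parameters, and the rounding convention are assumptions made for illustration rather than the reference implementation. The default stride reciprocal of 1/16 corresponds to a VGG-16 style backbone.

```python
import math

def roi_max_pool(feature_map, roi, out_h=7, out_w=7, spatial_scale=1.0 / 16):
    """Max-pool one RoI on a single-channel feature map into an out_h x out_w grid.

    feature_map: H x W list of rows (one channel, for brevity).
    roi: (x1, y1, x2, y2) in image pixels, inclusive corners.
    spatial_scale: reciprocal of the backbone stride (assumed 1/16 here).
    """
    height, width = len(feature_map), len(feature_map[0])
    # Project to feature-map coordinates and round to whole cells.
    # This quantization is the step that RoI Align later replaced.
    x1, y1, x2, y2 = (int(math.floor(c * spatial_scale + 0.5)) for c in roi)
    roi_h = max(y2 - y1 + 1, 1)
    roi_w = max(x2 - x1 + 1, 1)
    bin_h, bin_w = roi_h / out_h, roi_w / out_w

    pooled = []
    for i in range(out_h):
        # floor/ceil bin edges so that every cell of the RoI is covered
        y_lo = min(max(y1 + math.floor(i * bin_h), 0), height)
        y_hi = min(max(y1 + math.ceil((i + 1) * bin_h), 0), height)
        row = []
        for j in range(out_w):
            x_lo = min(max(x1 + math.floor(j * bin_w), 0), width)
            x_hi = min(max(x1 + math.ceil((j + 1) * bin_w), 0), width)
            cells = [feature_map[y][x]
                     for y in range(y_lo, y_hi)
                     for x in range(x_lo, x_hi)]
            row.append(max(cells) if cells else 0.0)  # empty bins pool to 0
        pooled.append(row)
    return pooled

# Toy 4x4 feature map with values 0..15 and a 2x2 output grid.
feature_map = [[float(4 * r + c) for c in range(4)] for r in range(4)]
print(roi_max_pool(feature_map, (16, 16, 48, 48), out_h=2, out_w=2))
# prints [[10.0, 11.0], [14.0, 15.0]]
```

Because the output grid size is fixed regardless of the RoI size, proposals of any shape can feed the same fully connected head. Neighboring bins may overlap by one cell when the RoI does not divide evenly, which is a side effect of the floor/ceil edges chosen in this sketch.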
Faster R-CNN: Region Proposal Network
The RPN is a small fully convolutional network that slides over the shared feature map:
- At each spatial location, place $k$ anchor boxes of different scales and aspect ratios (typically $k = 9$: 3 scales $\times$ 3 ratios).
- A $3 \times 3$ conv layer followed by two sibling $1 \times 1$ conv layers predicts:
  - Objectness score: 2 values (object vs. background) per anchor.
  - Box regression: 4 values ($t_x, t_y, t_w, t_h$) per anchor.
- Proposals are generated by applying the predicted offsets to anchors, clipping to image boundaries, running NMS (IoU threshold 0.7), and keeping the ~300 top-scoring survivors at test time.
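The proposal-generation step can be sketched as follows, assuming the standard offset parameterization (center shifts scaled by anchor size, log-space width and height). The helper names `decode`, `clip`, `iou`, `nms`, and `rpn_proposals` are hypothetical, and real implementations are vectorized and usually pre-filter by score before NMS.

```python
import math

def decode(anchor, deltas):
    """Apply offsets (tx, ty, tw, th) to an anchor given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = anchor
    aw, ah = ax2 - ax1, ay2 - ay1
    acx, acy = ax1 + 0.5 * aw, ay1 + 0.5 * ah
    tx, ty, tw, th = deltas
    cx, cy = acx + tx * aw, acy + ty * ah          # shift the center
    w, h = aw * math.exp(tw), ah * math.exp(th)    # rescale width and height
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)

def clip(box, img_w, img_h):
    """Clamp a box to the image boundaries."""
    x1, y1, x2, y2 = box
    return (min(max(x1, 0.0), img_w), min(max(y1, 0.0), img_h),
            min(max(x2, 0.0), img_w), min(max(y2, 0.0), img_h))

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    ih = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.7, top_n=300):
    """Greedy NMS: keep the best box, drop boxes overlapping it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
            if len(keep) == top_n:
                break
    return keep

def rpn_proposals(anchors, deltas, scores, img_w, img_h):
    """Decode, clip, and filter anchors into a ranked list of proposals."""
    boxes = [clip(decode(a, d), img_w, img_h) for a, d in zip(anchors, deltas)]
    return [boxes[i] for i in nms(boxes, scores)]

anchors = [(0, 0, 10, 10), (1, 0, 11, 10), (50, 50, 60, 60)]
deltas = [(0.0, 0.0, 0.0, 0.0)] * 3
scores = [0.9, 0.8, 0.7]
print(rpn_proposals(anchors, deltas, scores, img_w=100, img_h=100))
```

In this toy input the second anchor overlaps the first with IoU of about 0.82, so it is suppressed and two proposals remain.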
Training Faster R-CNN
The original paper describes 4-step alternating training:
1. Train RPN initialized from ImageNet-pretrained backbone.
2. Train Fast R-CNN using proposals from step 1 (separate backbone).
3. Re-initialize RPN with the Fast R-CNN backbone, fine-tune only RPN layers.
4. Fine-tune Fast R-CNN head with proposals from step 3 (shared backbone frozen).
Later work showed joint end-to-end training (combining RPN and detection losses) works comparably and is simpler.
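As a rough illustration of what a combined objective is built from, the sketch below computes the Fast R-CNN style multi-task loss for a single RoI: a log loss on the true class plus a smooth L1 localization term that is skipped for background. The function names and the convention that class index 0 means background are assumptions for this example. A joint training loss would sum terms of this form for the RPN and the detection head, and the relative weighting is a design choice not specified here.

```python
import math

def smooth_l1(x):
    """Quadratic near zero, linear elsewhere, so outliers are penalized less than with L2."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def multitask_loss(p, u, t_u, v, lam=1.0):
    """Per-RoI loss: classification log loss plus weighted box regression loss.

    p:   predicted class probabilities (index 0 is background, by assumption).
    u:   true class index.
    t_u: predicted box offsets for class u, a 4-tuple.
    v:   regression targets, a 4-tuple.
    lam: weight on the localization term (1.0 mirrors the default in the text).
    """
    l_cls = -math.log(p[u])
    # Background RoIs (u == 0) have no ground-truth box, so no box loss.
    l_loc = sum(smooth_l1(t - g) for t, g in zip(t_u, v)) if u >= 1 else 0.0
    return l_cls + lam * l_loc

# One foreground RoI with 50% confidence and a single offset off by 0.5.
print(multitask_loss([0.5, 0.5], 1, (0.5, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)))
```

The printed value is roughly 0.818, the sum of log 2 for classification and 0.125 for the one offset error.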
Architecture Summary
    Image -> Backbone CNN -> Shared Feature Map
                                  |
                  +---------------+----------------+
                  |                                |
             RPN (anchors)                    RoI Pooling
             ~300 proposals                  (per proposal)
                  |                                |
                  +--------> proposals -----> FC layers
                                               |       |
                                           cls head  bbox head
Why It Matters
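- Fast R-CNN (VGG-16) trained about 9x faster than R-CNN and ran up to 213x faster at test time: 0.32s vs. 47s per image (about 146x), or 0.22s with truncated SVD on the fully connected layers (213x), in both cases excluding proposal time.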
- Faster R-CNN achieved ~5 FPS (VGG-16) and ~17 FPS (ZF-Net), making near-real-time detection feasible.
- End-to-end training eliminated the disconnected SVM and regressor stages of R-CNN, simplifying the pipeline and improving accuracy.
- The RPN + anchor concept became foundational -- adopted by SSD, RetinaNet, and many subsequent detectors.
- Faster R-CNN remains a strong baseline: with modern backbones (ResNet-101 + FPN), it achieves ~42% AP on COCO.
Key Technical Details
- Fast R-CNN with VGG-16: 66.9% mAP on VOC 2007, 19.7% mAP on COCO (no bells and whistles). Training takes ~9.5 hours on a single GPU.
- Faster R-CNN with VGG-16: 73.2% mAP on VOC 2007, 21.9% mAP on COCO. Proposal generation takes ~10ms on GPU.
- RPN recall: ~98% at IoU 0.5 with 300 proposals, rivaling Selective Search with 2,000 proposals.
- Anchor configuration: Original paper uses areas of $128^2$, $256^2$, and $512^2$ pixels with ratios 1:1, 1:2, 2:1, yielding 9 anchors per location.
- RoI Pooling quantization: The coordinate rounding in RoI Pooling introduces spatial misalignment. Mask R-CNN later introduced RoI Align with bilinear interpolation to fix this, gaining ~1-2% AP.
- Speed breakdown (Faster R-CNN, VGG-16, as reported in the original paper): conv layers ~141ms, RPN ~10ms, region-wise head ~47ms, total ~198ms per image (~5 FPS).
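The 9-anchor configuration listed above can be sketched as follows, using the box areas of 128², 256², and 512² pixels from the original paper. The function names, the height-to-width reading of the ratios, the stride of 16, and the placement of anchor centers at cell centers are assumptions made for this illustration.

```python
import math

def make_anchors(cx, cy,
                 areas=(128 ** 2, 256 ** 2, 512 ** 2),
                 ratios=(1.0, 0.5, 2.0)):
    """Return the anchors (x1, y1, x2, y2) centered at (cx, cy).

    Each ratio is read as height / width (an assumed convention); every
    anchor keeps the requested area while its shape changes with the ratio.
    """
    anchors = []
    for area in areas:
        for ratio in ratios:
            w = math.sqrt(area / ratio)
            h = w * ratio
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

def grid_anchors(feat_h, feat_w, stride=16):
    """Tile the anchor set over every cell of a feat_h x feat_w feature map."""
    return [a
            for y in range(feat_h)
            for x in range(feat_w)
            for a in make_anchors(x * stride + stride / 2, y * stride + stride / 2)]

for x1, y1, x2, y2 in make_anchors(0.0, 0.0):
    print(f"{x2 - x1:7.1f} x {y2 - y1:7.1f}")
print(len(grid_anchors(2, 3)))  # 2 * 3 cells * 9 anchors = 54
```

The total anchor count grows as feature-map height times width times 9, which is why NMS and top-N filtering are needed before the second stage.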
Common Misconceptions
- "Fast R-CNN eliminated all bottlenecks." Selective Search still takes ~2 seconds per image, dominating total inference time. Faster R-CNN solved this.
- "RPN generates class-specific proposals." The RPN is class-agnostic -- it predicts only objectness (object vs. background). Class labels are assigned by the second-stage detector.
- "More anchors always improve detection." Excessive anchors increase computation and can hurt training stability. The 9-anchor configuration captures most object shapes adequately; diminishing returns set in quickly.
Connections to Other Concepts
- r-cnn.md: The predecessor whose per-proposal CNN passes motivated the shared computation in Fast R-CNN.
- feature-pyramid-network.md: Added to Faster R-CNN to enable multi-scale detection, forming the dominant two-stage detector baseline.
- non-maximum-suppression.md: Used both inside the RPN (to filter proposals) and at the final detection stage.
- yolo.md: Single-stage alternatives that bypass the proposal stage entirely for higher speed.
- detr.md: Eliminates both anchors and NMS using a transformer-based set prediction approach.
Further Reading
- Girshick, "Fast R-CNN" (2015) -- Introduced RoI pooling and multi-task end-to-end training for detection.
- Ren et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks" (2015) -- Introduced the RPN and anchor-based proposal generation.
- He et al., "Mask R-CNN" (2017) -- Extended Faster R-CNN with RoI Align and a mask prediction branch for instance segmentation.
- Lin et al., "Feature Pyramid Networks for Object Detection" (2017) -- Multi-scale feature extraction that significantly boosts Faster R-CNN performance.