One-Line Summary: SSD performs object detection in a single forward pass by predicting bounding boxes and class scores from multiple convolutional feature maps at different scales, achieving 59 FPS with accuracy competitive with two-stage detectors.

Prerequisites: Convolutional neural networks, anchor boxes, multi-scale detection, non-maximum suppression, bounding box regression

What Is SSD?

Consider a security guard monitoring a wall of screens, where each screen shows the same scene at a different zoom level. The guard can spot a person on the wide-angle view, read a license plate on the close-up, and catch a package on the mid-range view -- all simultaneously without switching cameras. SSD works similarly: it examines feature maps at multiple resolutions within a single network, detecting large objects on coarse maps and small objects on finer maps, all in one shot.

Technically, SSD (Liu et al., 2016) is a single-stage detector that attaches convolutional prediction heads to multiple feature maps from a backbone network (VGG-16) and additional convolutional layers. At each spatial location on each feature map, SSD predicts offsets and class scores for a set of default (anchor) boxes of varying aspect ratios. Detection is completed in one forward pass with no proposal generation stage.

How It Works

Architecture

SSD extends VGG-16 (truncated before classification layers) with extra convolutional layers that progressively reduce spatial resolution:

| Feature Map | Size (300 input) | Anchors/cell | Total Anchors |
| ----------- | ---------------- | ------------ | ------------- |
| Conv4_3     | 38×38            | 4            | 5,776         |
| Conv7 (fc7) | 19×19            | 6            | 2,166         |
| Conv8_2     | 10×10            | 6            | 600           |
| Conv9_2     | 5×5              | 6            | 150           |
| Conv10_2    | 3×3              | 4            | 36            |
| Conv11_2    | 1×1              | 4            | 4             |

Total: 8,732 default boxes per image (for SSD-300).
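As a sanity check, the 8,732 total can be reproduced directly from the per-layer sizes and anchors-per-cell of the SSD-300 configuration (a minimal sketch; layer names follow the paper):

```python
# Each feature map contributes size * size * anchors_per_cell default boxes.
ssd300_layers = [
    ("Conv4_3", 38, 4),
    ("Conv7", 19, 6),
    ("Conv8_2", 10, 6),
    ("Conv9_2", 5, 6),
    ("Conv10_2", 3, 4),
    ("Conv11_2", 1, 4),
]

total = sum(size * size * anchors for _, size, anchors in ssd300_layers)
print(total)  # 8732
```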

Prediction Heads

At each feature map location, with k default boxes per cell and c classes (including background):

  • Classification: k·c outputs (per-box scores over all classes, including background).
  • Localization: 4k outputs (offsets Δcx, Δcy, Δw, Δh per box).

Each head is a 3×3 convolutional layer applied directly to the feature map.
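The channel counts of these heads follow directly from k and c. A small illustrative helper (the function name is ours, not the paper's):

```python
def ssd_head_channels(num_anchors, num_classes):
    """Output channels of the two 3x3 conv heads at one feature map.

    num_classes includes the background class (21 for Pascal VOC).
    """
    cls_channels = num_anchors * num_classes  # per-box class scores
    loc_channels = num_anchors * 4            # (dcx, dcy, dw, dh) per box
    return cls_channels, loc_channels

# Conv4_3 on VOC: 4 anchors per cell, 21 classes
print(ssd_head_channels(4, 21))  # (84, 16)
```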

Default Box Design

Default boxes are defined by:

  • Scale: linearly spaced from s_min = 0.2 to s_max = 0.9 across the m feature maps: s_k = s_min + (s_max − s_min)(k − 1)/(m − 1), for k in [1, m].
  • Aspect ratios: {1, 2, 3, 1/2, 1/3}, giving width w = s_k·√a_r and height h = s_k/√a_r, plus an extra box at ratio 1 with scale s'_k = √(s_k · s_{k+1}).
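These two formulas can be sketched as follows (a minimal illustration of the paper's scale and aspect-ratio scheme; released implementations tweak the per-layer scales, e.g. a smaller scale for Conv4_3):

```python
import math

def ssd_scales(m=6, s_min=0.2, s_max=0.9):
    """Linearly spaced scales s_k for m feature maps."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

def default_box_shapes(s_k, s_next, ratios=(1.0, 2.0, 3.0, 1 / 2, 1 / 3)):
    """(width, height) of default boxes at one scale, relative to the image."""
    shapes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in ratios]
    # extra box at aspect ratio 1 with scale sqrt(s_k * s_{k+1})
    s_extra = math.sqrt(s_k * s_next)
    shapes.append((s_extra, s_extra))
    return shapes
```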

Training

Matching: Each ground-truth box is matched to the default box with the highest IoU, plus all default boxes with IoU ≥ 0.5 to any ground truth.
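This two-step rule can be sketched in plain Python (an illustrative, unvectorized version; real implementations batch this over all 8,732 boxes):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def match_defaults(gt_boxes, default_boxes, threshold=0.5):
    """Map default-box index -> ground-truth index, per SSD's matching rule."""
    matches = {}
    # 1) best default box for each ground truth (guarantees >= 1 match per GT)
    for g, gt in enumerate(gt_boxes):
        best = max(range(len(default_boxes)), key=lambda d: iou(gt, default_boxes[d]))
        matches[best] = g
    # 2) any remaining default box with IoU >= threshold to some ground truth
    for d, db in enumerate(default_boxes):
        for g, gt in enumerate(gt_boxes):
            if d not in matches and iou(db, gt) >= threshold:
                matches[d] = g
    return matches
```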

Loss function:

L(x, c, l, g) = (1/N) · (L_conf(x, c) + α·L_loc(x, l, g))

where L_conf is the softmax cross-entropy over all classes (including background) and L_loc is the Smooth L1 loss over matched boxes. α = 1 by default, and N is the number of matched default boxes (the loss is set to 0 when N = 0).
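A scalar sketch of how the two terms combine (illustrative helpers under our own names, taking per-box losses as plain floats):

```python
def smooth_l1(x):
    """Smooth L1 (Huber with delta=1), applied elementwise to box offsets."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def ssd_loss(conf_losses, loc_losses, num_matched, alpha=1.0):
    """L = (1/N) * (L_conf + alpha * L_loc); defined as 0 when N = 0."""
    if num_matched == 0:
        return 0.0
    return (sum(conf_losses) + alpha * sum(loc_losses)) / num_matched
```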

Hard negative mining: Negatives are sorted by confidence loss and the top ones are selected so the negative-to-positive ratio is at most 3:1.
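The selection step amounts to a sort-and-truncate (a minimal sketch; the function name is ours):

```python
def hard_negative_mining(neg_conf_losses, num_positives, ratio=3):
    """Indices of kept negatives: highest confidence loss first,
    capped at ratio * num_positives."""
    keep = min(len(neg_conf_losses), ratio * num_positives)
    order = sorted(range(len(neg_conf_losses)),
                   key=lambda i: neg_conf_losses[i], reverse=True)
    return order[:keep]
```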

Data augmentation: Aggressive random cropping, photometric distortions, and horizontal flipping. The authors reported that data augmentation improved mAP by ~8.8 points, making it critical for SSD's performance.

Why It Matters

  1. SSD-300 achieved 74.3% mAP on VOC 2007 at 59 FPS -- comparable to Faster R-CNN (73.2%) while being ~10x faster.
  2. SSD-512 reached 76.8% mAP on VOC 2007, exceeding Faster R-CNN, though at a lower 22 FPS.
  3. It demonstrated that single-stage detectors could rival two-stage detectors in accuracy, not just speed.
  4. Multi-scale feature map detection became a standard design pattern adopted by subsequent architectures including DSSD, FSSD, and EfficientDet.

Key Technical Details

  • SSD-300 (VGG-16): 74.3% mAP on VOC 2007, 59 FPS on Titan X GPU. 23.2% AP on COCO.
  • SSD-512 (VGG-16): 76.8% mAP on VOC 2007, 22 FPS. 26.8% AP on COCO.
  • Small object weakness: SSD-300 achieves only ~6% AP_S on COCO, because the lowest-level feature map used for detection (Conv4_3, 38×38) has already lost fine spatial detail. SSD-512 partially mitigates this with higher input resolution.
  • L2 normalization on Conv4_3 features is necessary because their magnitudes are ~10-20x larger than deeper layers.
  • Inference: No proposal stage needed. A single forward pass produces all detections; NMS is applied per-class as a post-processing step.
  • Model size: ~26M parameters (VGG-16 backbone dominates), ~95 MB.
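The per-class NMS post-processing step mentioned above can be sketched as a greedy loop (an illustrative version, run once per class; SSD uses an IoU threshold of 0.45 at inference):

```python
def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: keep boxes in descending score order, dropping any box
    that overlaps an already-kept box by >= iou_threshold.

    boxes: list of (x1, y1, x2, y2); returns indices of kept boxes.
    """
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in kept):
            kept.append(i)
    return kept
```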

Common Misconceptions

  • "SSD is just YOLO with more feature maps." While both are single-stage, SSD uses anchor boxes at multiple feature map scales (like an RPN at each level), whereas YOLOv1 used no anchors and predicted from a single grid. The multi-scale anchor approach is SSD's key contribution.
  • "Single-stage detectors cannot match two-stage accuracy." SSD showed they could come close, and later work (RetinaNet with focal loss) surpassed two-stage detectors entirely.
  • "SSD handles all object sizes equally well." Small objects remain a significant weakness because they are detected on the lowest-resolution feature maps that have undergone many pooling operations. FPN-based architectures address this gap.

Connections to Other Concepts

  • multi-scale-detection.md: SSD is a canonical example of detecting objects at different scales from different network layers.
  • yolo.md: Another single-stage detector; YOLOv2+ adopted SSD-style anchor boxes.
  • feature-pyramid-network.md: FPN addresses SSD's small-object weakness by enriching low-level feature maps with top-down semantic information.
  • focal-loss.md: Identifies and solves the class imbalance problem that limits SSD's accuracy, using FPN + focal loss to surpass two-stage detectors.
  • non-maximum-suppression.md: Applied per-class after SSD's single forward pass to produce final detections.

Further Reading

  • Liu et al., "SSD: Single Shot MultiBox Detector" (2016) -- The original SSD paper.
  • Fu et al., "DSSD: Deconvolutional Single Shot Detector" (2017) -- Added deconvolution modules to SSD for better feature maps.
  • Tan et al., "EfficientDet: Scalable and Efficient Object Detection" (2020) -- Modern efficient detector building on SSD's multi-scale paradigm.
  • Erhan et al., "Scalable Object Detection Using Deep Neural Networks (MultiBox)" (2014) -- Precursor that introduced the default box concept.