
YOLO

Two-stage object detectors looked at every image region separately, slowly. YOLO — “You Only Look Once” — predicted every bounding box in one forward pass over a grid. Real-time detection became possible, and stayed possible.

The five-bullet version

  • Object detection: find and classify all objects in an image, with bounding boxes.
  • Pre-YOLO methods (R-CNN family) used two stages: propose regions, then classify each. Accurate but slow.
  • YOLO uses a single forward pass. Divide the image into a grid; each cell predicts boxes + class scores for objects centered there.
  • Anchor boxes give each cell a few box-shape templates to refine, handling objects of varying aspect ratios.
  • Non-max suppression cleans up overlapping predictions for the same object.

§ 00 · DETECTION VS CLASSIFICATION · Two different vision problems

Classification is “what’s in this image?” — one label per image. Detection is the harder problem: “what objects are in this image, where are they, and what’s their bounding box?”

For each detected object you need three things: the class (person, car, dog), the bounding box (4 numbers: x, y, width, height), and a confidence score. An image can contain zero objects or twenty.
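To make that output contract concrete, here is a minimal sketch of one detection as a data structure. The field names are illustrative, not from any particular library:

    from dataclasses import dataclass

    @dataclass
    class Detection:
        """One detected object: everything a detector must output per object."""
        class_name: str    # e.g. "person", "car", "dog"
        x: float           # bounding box position (center or corner; conventions vary)
        y: float
        width: float
        height: float
        confidence: float  # how sure the model is that this box holds this class

    # A detector maps an image to a list[Detection]: possibly empty, possibly twenty long.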

§ 01 · ONE FORWARD PASS, ONE GRID · Detection as a regression problem

The original 2016 YOLO framed detection as one big regression problem, solved in a single forward pass. Divide the image into a grid (originally 7×7). For each cell, the network predicts:

  • B bounding boxes (B = 2 in the original paper), each as 4 coordinates (x, y, w, h) plus a confidence score;
  • C class probabilities (C = 20 for PASCAL VOC), shared by all of the cell's boxes.

All of this comes out of one forward pass through a CNN. Compared with previous approaches that proposed thousands of candidate regions and classified each separately, YOLO got near-state-of-the-art accuracy at 45 frames per second on a GPU — a 30× speedup. The trade-off was slightly lower accuracy, especially on small objects. Real-time vision became practical.
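As a concrete sketch, the snippet below decodes a YOLOv1-style output tensor of shape (S, S, B·5 + C) into candidate boxes. The tensor layout follows the original paper (B boxes of 5 numbers each, then C shared class scores per cell); the function name and threshold are illustrative:

    import numpy as np

    S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (YOLOv1 on PASCAL VOC)

    def decode_grid(output, conf_thresh=0.25):
        """Turn an (S, S, B*5 + C) tensor into a list of candidate boxes."""
        boxes = []
        for row in range(S):
            for col in range(S):
                cell = output[row, col]
                class_probs = cell[B * 5:]            # C class scores, shared per cell
                for b in range(B):
                    x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                    score = conf * class_probs.max()  # class-specific confidence
                    if score < conf_thresh:
                        continue
                    cx = (col + x) / S                # cell-relative -> image-relative
                    cy = (row + y) / S
                    boxes.append((cx, cy, w, h, score, int(class_probs.argmax())))
        return boxes

    # Usage: boxes = decode_grid(np.random.rand(S, S, B * 5 + C))

These candidates then go through non-max suppression (§ 03) to remove duplicates.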

§ 02 · ANCHOR BOXES & CLASS SCORES · Templates for object shape

One cell can have multiple objects centered nearby; objects come in very different aspect ratios (a person is tall, a car is wide). Version 2 of YOLO introduced anchor boxes — pre-defined box-shape templates the model learns to adjust. Each cell predicts several boxes, one per anchor, each with its own confidence and class scores.

The anchors are usually picked by clustering the bounding boxes in the training set (k-means on widths and heights). Three to nine anchors per cell is typical. With anchors, the network can specialize: “tall thin anchor” learns to detect people, “wide anchor” learns to detect cars.
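A minimal sketch of that clustering step, assuming the training boxes arrive as a NumPy array of normalized (width, height) pairs; the helper names are made up, but the 1 − IoU distance is the YOLOv2 recipe:

    import numpy as np

    def iou_wh(wh, anchors):
        """IoU between one (w, h) shape and each anchor, all centered at the origin."""
        inter = np.minimum(wh[0], anchors[:, 0]) * np.minimum(wh[1], anchors[:, 1])
        union = wh[0] * wh[1] + anchors[:, 0] * anchors[:, 1] - inter
        return inter / union

    def kmeans_anchors(boxes_wh, k=5, iters=100, seed=0):
        """Cluster training-set box shapes into k anchors with 1 - IoU as distance."""
        rng = np.random.default_rng(seed)
        anchors = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)]
        for _ in range(iters):
            # assign each box to the anchor it overlaps most
            assign = np.array([np.argmax(iou_wh(wh, anchors)) for wh in boxes_wh])
            # move each anchor to the median shape of its cluster
            for j in range(k):
                if np.any(assign == j):
                    anchors[j] = np.median(boxes_wh[assign == j], axis=0)
        return anchors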

§ 03 · NMS — PICKING WHICH BOXES TO KEEP · Suppress duplicate predictions

Lab · YOLO grid + NMS · Raw cell predictions vs. after non-max suppression
[Interactive demo: raw boxes scored person 92, person 78, person 42, car 88, car 51]

Every grid cell where an object's center falls predicts a box, and adjacent cells often produce overlapping predictions for the same object. The raw output of YOLO is therefore a noisy list of boxes, many of which are near-duplicates with slightly different scores.

Non-Maximum Suppression (NMS) cleans this up. The algorithm:

  1. Take the highest-confidence box. Keep it.
  2. Find any other box whose IoU (intersection over union) with the kept box is > threshold (e.g. 0.5). Discard those.
  3. Pick the next-highest remaining box. Repeat.
  4. Stop when no boxes remain.

The result is a clean per-object list — one box per object, no duplicates. NMS is its own little algorithm and exists in basically every detection pipeline ever built, YOLO included.
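The steps above translate almost line for line into code. A minimal sketch of greedy NMS (box format and threshold are illustrative; real pipelines usually run this once per class):

    def iou(a, b):
        """IoU of two boxes in (x1, y1, x2, y2) corner format."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    def nms(boxes, scores, thresh=0.5):
        """Greedy non-max suppression: keep the best box, drop its overlaps, repeat."""
        order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
        keep = []
        while order:
            best = order.pop(0)        # step 1: highest-confidence box is kept
            keep.append(best)
            order = [i for i in order  # step 2: discard heavy overlaps with it
                     if iou(boxes[best], boxes[i]) <= thresh]
        return keep                    # steps 3-4: repeat until nothing remains

    # nms([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)], [0.9, 0.8, 0.7]) -> [0, 2]

Running NMS per class means a person box never suppresses a nearby car box, even when the two overlap heavily.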

§ 04 · REAL-TIME, REAL-WORLD, ONGOING · The YOLO line continues

YOLO has gone through many versions since 2016 — v3, v4, v5, v6, v7, v8, v9, v10, and beyond. Different teams; sometimes confusing branding. The core recipe has been remarkably stable: backbone (feature extractor) → neck (multi-scale feature fusion) → head (predict boxes + classes per cell). The improvements have been incremental:

  • v2: anchor boxes as shape priors.
  • v3: prediction at multiple scales.
  • v4 onward: increasingly engineering-driven gains — better backbones, data-augmentation tricks (mosaic, mix-up), and training-time regularization.
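As an illustration of that recipe (not any real YOLO version), the skeleton below wires a backbone, a trivial neck, and a per-cell prediction head; all layer sizes are arbitrary:

    import torch.nn as nn

    class TinyDetector(nn.Module):
        """Illustrative backbone -> neck -> head skeleton."""
        def __init__(self, num_anchors=3, num_classes=20):
            super().__init__()
            self.backbone = nn.Sequential(   # feature extractor
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.SiLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
            )
            self.neck = nn.Sequential(       # feature fusion (single-scale stand-in)
                nn.Conv2d(64, 64, 1), nn.SiLU(),
            )
            # per grid cell: num_anchors * (4 box coords + 1 objectness + num_classes)
            self.head = nn.Conv2d(64, num_anchors * (5 + num_classes), 1)

        def forward(self, x):  # one forward pass: image in, grid of predictions out
            return self.head(self.neck(self.backbone(x)))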

Production uses: autonomous driving, robotics, sports analytics, retail (people counting, basket analytics), agriculture (crop monitoring), public-safety surveillance, and increasingly inside multimodal LLMs as the visual front-end.

Fig 1 · The speed–accuracy frontier of object detection (mAP vs. FPS, comparing the YOLO line with Faster R-CNN, Mask R-CNN, RetinaNet, and EfficientDet; real-time means ≥30 FPS). YOLO has held the real-time corner for a decade, with each version pushing the line slightly out.
Check · A team is building object detection for an embedded device (Jetson Nano, 10 FPS budget). They need to detect 20 classes with moderate accuracy. Best architectural choice?

§ 05 · TAKING THIS FORWARD · Where detection is heading

Three directions worth following:

  • Transformer-based detectors (the DETR line), trading raw speed for flexibility and end-to-end training.
  • Detection as the visual front-end of multimodal LLMs.
  • Ever-smaller real-time models for embedded and edge deployment.

§ · GOING DEEPER · Anchor boxes, NMS, and how YOLO evolved

Redmon’s 2016 YOLO paper made object detection a single forward pass: divide the image into a grid, and let each cell predict bounding boxes and class probabilities. Earlier two-stage detectors (the R-CNN family) proposed regions first and classified them second — more accurate, but far too slow for real-time use. YOLO traded some accuracy for the ability to run at 45 FPS, which made it the practical choice for video and embedded deployment.

Each version added a refinement. v2 (2016) introduced anchor boxes — predefined shape priors that handle aspect ratios. v3 (2018) used multi-scale prediction. v4–v9 are increasingly engineering-driven: data augmentation tricks (mosaic, mix-up), better backbones, training-time regularization. The conceptual successor is DETR (Carion et al. 2020), which replaced the grid with a transformer and bipartite matching — slower but more flexible. Modern production detection is a mix of YOLO descendants (for speed) and DETR descendants (for accuracy and instance-aware tasks).

§ · FURTHER READING · References & deeper sources

  1. Redmon, Divvala, Girshick, Farhadi (2016). You Only Look Once: Unified, Real-Time Object Detection · CVPR
  2. Redmon, Farhadi (2018). YOLOv3: An Incremental Improvement · arXiv
  3. Bochkovskiy, Wang, Liao (2020). YOLOv4: Optimal Speed and Accuracy of Object Detection · arXiv
  4. Carion et al. (2020). End-to-End Object Detection with Transformers (DETR) · ECCV
  5. Wang, Yeh, Liao (2024). YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information · arXiv

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.