One-Line Summary: VGGNet demonstrated that network depth with uniform 3x3 convolutions is a critical factor for representation quality, achieving 7.3% top-5 error on ImageNet with the VGG-16 and VGG-19 architectures.
Prerequisites: Convolution in neural networks, pooling layers, receptive field, AlexNet
What Is VGGNet?
Imagine you are building a wall. You could use a few large bricks, or many small bricks stacked carefully. The small-brick wall takes more layers but gives you finer control over the shape and is structurally more flexible. VGGNet took this approach to network design: instead of using the large 11x11 and 5x5 filters of AlexNet, it used exclusively 3x3 convolutions -- the smallest filter that still captures up, down, left, right, and center -- and compensated by going deeper.
VGGNet was developed by Karen Simonyan and Andrew Zisserman at the Visual Geometry Group (VGG) at the University of Oxford. It placed second in the ILSVRC-2014 classification task (behind GoogLeNet) but became far more widely used in practice due to its clean, uniform architecture that was easy to understand and modify.
How It Works
Design Principles
The core insight of VGGNet is architectural simplicity through uniformity:
- All convolutional layers use 3x3 kernels with stride 1 and "same" padding.
- All max pooling layers use 2x2 windows with stride 2.
- After each pooling layer, the number of channels doubles until it reaches 512: 64 -> 128 -> 256 -> 512.
- The network ends with three fully connected layers: 4096, 4096, 1000.
VGG-16 Architecture
Input: 224x224x3
Block 1: [Conv3-64] x 2 -> MaxPool -> 112x112x64
Block 2: [Conv3-128] x 2 -> MaxPool -> 56x56x128
Block 3: [Conv3-256] x 3 -> MaxPool -> 28x28x256
Block 4: [Conv3-512] x 3 -> MaxPool -> 14x14x512
Block 5: [Conv3-512] x 3 -> MaxPool -> 7x7x512
FC-4096 -> FC-4096 -> FC-1000 -> Softmax

The "16" in VGG-16 refers to 16 weight layers (13 convolutional + 3 fully connected). VGG-19 adds one extra convolutional layer to each of blocks 3, 4, and 5 (16 conv + 3 FC = 19 weight layers).
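The block structure maps almost mechanically onto code. Below is a minimal PyTorch sketch of VGG-16; the config encoding and names like `VGG16_CFG` and `make_features` are illustrative, and `torchvision.models.vgg16` remains the canonical implementation:

```python
import torch
import torch.nn as nn

# Channel count per conv layer; 'M' marks a 2x2 max pool closing a block.
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']

def make_features(cfg):
    layers, in_ch = [], 3
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            # 3x3 conv, stride 1, padding 1 ("same" padding for a 3x3 kernel)
            layers.append(nn.Conv2d(in_ch, v, kernel_size=3, padding=1))
            layers.append(nn.ReLU(inplace=True))
            in_ch = v
    return nn.Sequential(*layers)

class VGG16(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = make_features(VGG16_CFG)
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)  # 224x224x3 -> 7x7x512
        return self.classifier(torch.flatten(x, 1))

model = VGG16()
print(sum(p.numel() for p in model.parameters()))  # ~138M parameters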
Why 3x3 Convolutions?
Two stacked 3x3 conv layers have an effective receptive field of 5x5. Three stacked 3x3 layers cover 7x7. Compared to a single 7x7 layer with C input and C output channels (arithmetic verified in the snippet after this list):
- Parameters: 3 x (3^2 C^2) = 27C^2 vs. 7^2 C^2 = 49C^2 -- a 45% reduction.
- Nonlinearities: Three ReLU activations instead of one, giving the network more discriminative power.
- Regularization: The factorization acts as an implicit regularizer by constraining the filter structure.
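A few lines of Python confirm this arithmetic (the helper `stacked_rf` is purely illustrative):

```python
def stacked_rf(k, n):
    """Effective receptive field of n stacked k x k convs with stride 1."""
    return n * (k - 1) + 1

print(stacked_rf(3, 2))  # 5 -> two 3x3 layers see a 5x5 window
print(stacked_rf(3, 3))  # 7 -> three 3x3 layers see a 7x7 window

C = 256                            # example channel width; biases ignored
three_3x3 = 3 * (3 * 3 * C * C)    # 27 C^2 parameters
one_7x7 = 7 * 7 * C * C            # 49 C^2 parameters
print(three_3x3 / one_7x7)         # ~0.55, i.e. a ~45% reduction
```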
Training Details
- Optimizer: SGD with momentum 0.9, weight decay 5x10^-4, batch size 256 (a sketch of this recipe follows the list).
- Learning rate: Initially 0.01, decreased by 10x three times during training.
- Initialization: Shallower configurations (VGG-11) trained first, then used to initialize deeper ones. Later, Glorot initialization was shown to work directly.
- Data augmentation: Random crops, horizontal flips, and RGB color jittering.
- Multi-scale training: Images rescaled to random sizes before cropping, which improved generalization by exposing the network to objects at different scales. This scale jittering reduced top-5 error by approximately 1.2% compared to single-scale training.
- Training time: 2--3 weeks on 4 NVIDIA Titan Black GPUs.
- Multi-scale testing: At inference, the network was evaluated at multiple scales and the results were averaged, providing additional accuracy gains.
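A hedged PyTorch sketch of this recipe follows. The scheduler approximates the paper's "drop the LR 10x when validation accuracy plateaus" rule, and `model` plus the data pipeline are assumed rather than taken from the original Caffe code:

```python
import random
import torch
import torchvision.transforms as transforms
import torchvision.transforms.functional as TF

# Scale jittering: isotropically rescale the shorter side to a random
# S in [256, 512], then take a random 224x224 crop, as in the paper.
train_tf = transforms.Compose([
    transforms.Lambda(lambda img: TF.resize(img, random.randint(256, 512))),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Optimization hyperparameters from the paper; `model` is assumed to exist.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
# The paper lowered the learning rate by 10x three times, whenever
# validation accuracy stopped improving; ReduceLROnPlateau approximates this.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.1)
```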
Why It Matters
- Established depth as a key design principle. Going from 8 layers (AlexNet) to 16--19 layers reduced top-5 error from 16.4% to 7.3%, proving that deeper networks learn better representations.
- Standardized the 3x3 convolution. Nearly all subsequent architectures adopted 3x3 as the default kernel size, following VGGNet's demonstration that stacking small filters is superior to using large ones.
- Became the default feature extractor. VGG-16 pretrained on ImageNet was the backbone of choice for detection (Faster R-CNN), segmentation (FCN), and style transfer for several years.
- Simplicity enabled research. The uniform, blocky structure made VGGNet easy to modify for experiments, driving its adoption in hundreds of downstream papers.
Key Technical Details
- VGG-16: 138 million parameters, ~15.5 billion FLOPs per forward pass.
- VGG-19: 144 million parameters, ~19.6 billion FLOPs.
- Parameter distribution: The first FC layer (the flattened 7x7x512 map into 4096 units, i.e., 25,088 x 4,096 weights) alone contains ~102M parameters -- 74% of the total. The convolutional layers contain only ~14.7M parameters.
- Top-5 error: 7.3% (VGG-16, single model), 6.8% (VGG-19, ensemble with dense evaluation).
- Top-1 error: 25.5% (VGG-16, single model, single crop).
- Memory: VGG-16 requires ~528 MB for weights (138M parameters at float32) and ~93 MB of activation memory for a single-image forward pass, making it impractical for mobile and edge deployment.
- Feature maps: The conv3-256 and conv4-512 layers are commonly used as feature extractors for downstream tasks; conv4-3 features are popular in single-shot detectors (SSD).
- VGG configurations explored: The paper evaluated configurations A (11 layers) through E (19 layers), systematically showing that each increase in depth improved classification accuracy up to a point.
- Dense evaluation: At test time, the FC layers are converted to convolutional layers, allowing the network to process images of any size and average predictions across all spatial positions. Combined with multi-crop testing, dense evaluation reduced error by ~1% (a sketch of the FC-to-conv conversion follows this list).
- Perceptual loss: VGG features (particularly from layers relu1_2, relu2_2, relu3_3, and relu4_3) define the widely used perceptual loss function for image generation, super-resolution, and style transfer tasks.
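The FC-to-conv conversion behind dense evaluation can be sketched as follows, assuming a torchvision-style VGG-16 layout for `vgg`; this is illustrative, not the paper's original code:

```python
import torch
import torch.nn as nn

def to_fully_convolutional(vgg):
    # Pull the three trained FC layers out of a torchvision-style classifier.
    fc1, fc2, fc3 = [m for m in vgg.classifier if isinstance(m, nn.Linear)]
    # FC-4096 applied to a 7x7x512 map is exactly a 7x7 conv with 4096 filters.
    conv1 = nn.Conv2d(512, 4096, kernel_size=7)
    conv1.weight.data.copy_(fc1.weight.data.view(4096, 512, 7, 7))
    conv1.bias.data.copy_(fc1.bias.data)
    # The remaining FC layers become 1x1 convolutions.
    conv2 = nn.Conv2d(4096, 4096, kernel_size=1)
    conv2.weight.data.copy_(fc2.weight.data.view(4096, 4096, 1, 1))
    conv2.bias.data.copy_(fc2.bias.data)
    conv3 = nn.Conv2d(4096, fc3.out_features, kernel_size=1)
    conv3.weight.data.copy_(fc3.weight.data.view(fc3.out_features, 4096, 1, 1))
    conv3.bias.data.copy_(fc3.bias.data)
    return nn.Sequential(vgg.features, conv1, nn.ReLU(inplace=True),
                         conv2, nn.ReLU(inplace=True), conv3)

# The result maps any input >= 224x224 to a spatial grid of class scores;
# averaging over positions yields the dense prediction.
```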
Common Misconceptions
- "VGGNet won ImageNet 2014." GoogLeNet won the classification task with 6.7% top-5 error. VGGNet placed second at 7.3%. However, VGGNet won the localization task and became more widely used in practice.
- "Deeper is always better with this architecture." VGG-19 offers only marginal improvement over VGG-16 (0.1% top-5 error), and going beyond 19 layers with plain stacking degrades performance due to optimization difficulties -- a problem that ResNet later solved with skip connections.
- "VGG is too slow to be useful." While VGGNet is computationally expensive at its full size, it has been effectively compressed using pruning and knowledge distillation. VGG features remain popular in perceptual loss functions (style transfer, super-resolution) where the specific architecture's feature quality matters more than inference speed.
- "All VGG variants perform similarly." VGG-11 (11 layers) achieves 29.6% top-1 error, while VGG-16 achieves 25.5% -- a significant 4.1 percentage point gap demonstrating the clear value of depth within this architecture family.
Connections to Other Concepts
- alexnet.md: VGGNet refined AlexNet's blueprint by replacing large kernels with stacked 3x3 convolutions and increasing depth from 8 to 19 layers.
- receptive-field.md: Demonstrated that stacking small kernels achieves large receptive fields more efficiently than single large kernels.
- inception.md: GoogLeNet was VGGNet's contemporary competitor, taking a very different approach with multi-branch parallel convolutions rather than uniform sequential stacking.
- resnet.md: Addressed VGGNet's scaling limitation -- that simply stacking more layers eventually degrades performance -- by introducing skip connections.
- depthwise-separable-convolutions.md: Later architectures factored VGG-style convolutions into depthwise and pointwise components, reducing VGG's computational cost by an order of magnitude while preserving accuracy.
- efficientnet.md: Compound scaling provided a principled alternative to VGG's manual depth-only scaling strategy.
Further Reading
- Simonyan & Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition" (2015) -- The original VGGNet paper.
- Chatfield et al., "Return of the Devil in the Details: Delving Deep into Convolutional Nets" (2014) -- Systematic evaluation of CNN design choices including those in VGG.
- Gatys et al., "A Neural Algorithm of Artistic Style" (2015) -- Used VGG features for neural style transfer, showcasing the quality of VGG's learned representations.
- Johnson et al., "Perceptual Losses for Real-Time Style Transfer and Super-Resolution" (2016) -- Formalized the use of VGG feature space as a perceptual loss for image generation tasks.
- Liu et al., "Very Deep Convolutional Neural Network Based Image Classification Using Small Training Sample Size" (2015) -- Demonstrated VGG's transfer learning effectiveness on small datasets.