Architectures · Module 19 · 8 min read

AlexNet

The 2012 paper that ended the “AI winter” for computer vision and arguably for all of deep learning. The model itself is unremarkable by modern standards. The fact that it won by 10 absolute percentage points was the earthquake.

The five-bullet version

  • AlexNet won ImageNet 2012 with a 10-point lead — the moment deep learning broke into the mainstream.
  • Architecture: 5 conv layers + 3 fully-connected layers. Stacked, with ReLU activations and max pooling.
  • The real innovations were enablers: GPU training, ReLU instead of sigmoid/tanh, dropout, data augmentation.
  • Each layer learns features at increasing levels of abstraction — edges, textures, parts, objects.
  • Every modern vision architecture descends from this template.

§ 00 · THE MOMENT COMPUTER VISION CHANGED · ImageNet 2012

ImageNet had been an annual competition since 2010 — classify 1,000 categories of images, with roughly 1.2 million training images and 50,000 validation images. From 2010 to 2011, the winning top-5 error rate had nudged from ~28% to ~26%, with each year’s improvement looking like a normal incremental advance over hand-engineered features.

In 2012, Krizhevsky, Sutskever, and Hinton submitted AlexNet (named after its first author, Alex Krizhevsky): a deep convolutional neural network with 5 conv layers and 3 fully-connected layers, trained on ImageNet. It won with 15.3% top-5 error. The next-best submission was at 26.2%. A single year, a 10-point absolute gap, a method nobody else was using. It wasn’t an improvement; it was a category shift.

§ 01 · CONVOLUTION, BRIEFLY · Why this operation suits images

A convolution is a small filter (typically 3×3 or 5×5 pixels) slid across an image. At each location, it computes a weighted sum of the pixels under it. The filter’s weights are learned from data.

Two properties make this perfect for images:

  • Translation invariance: the same feature can be detected anywhere in the image.
  • Parameter sharing: one filter is reused across the whole image, so a few weights cover every pixel position.

Stack many filters in a layer, and you get many feature maps. Stack many layers, and the features compose: edges → textures → parts → objects.
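The sliding-window operation is simple enough to sketch directly. A minimal, illustrative version (stride 1, no padding; the hand-picked kernel here is a fixed vertical-edge detector, whereas in a CNN the weights would be learned):

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image`, computing a weighted sum at each position."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

# A 5x5 image with a vertical edge: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1]] * 5

# Vertical-edge filter: responds where intensity rises left to right.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

response = conv2d(image, kernel)
# Strong response where the window straddles the edge, zero over flat regions.
print(response[0])  # [3, 3, 0]
```

The same nine weights are applied at every position (parameter sharing), and the filter fires wherever the edge happens to sit (translation invariance) — exactly the two properties above.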

§ 02 · WHAT ALEXNET STACKED · Eight layers, deeper than anything before it

By modern standards the architecture is small: 8 weight layers, ~60 million parameters, ~240 MB at fp32. ResNet-50 (2015) has 25 million params; modern vision transformers have billions. But in 2012 this was deep: deeper than any vision model successfully trained before it.
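The ~60M figure can be reproduced from the layer shapes. A back-of-the-envelope sketch (shapes follow the 2012 paper, with conv2/4/5 split into two GPU groups, which halves their input channels; biases included):

```python
# Parameter count for AlexNet's eight weight layers.

def conv_params(n_filters, k, in_channels):
    # each filter: k*k*in_channels weights + 1 bias
    return n_filters * (k * k * in_channels + 1)

def fc_params(n_out, n_in):
    # each output unit: n_in weights + 1 bias
    return n_out * (n_in + 1)

layers = [
    ("conv1", conv_params(96, 11, 3)),
    ("conv2", conv_params(256, 5, 48)),    # grouped: sees 96/2 input channels
    ("conv3", conv_params(384, 3, 256)),
    ("conv4", conv_params(384, 3, 192)),   # grouped: sees 384/2
    ("conv5", conv_params(256, 3, 192)),   # grouped: sees 384/2
    ("fc6",   fc_params(4096, 6 * 6 * 256)),
    ("fc7",   fc_params(4096, 4096)),
    ("fc8",   fc_params(1000, 4096)),
]

total = sum(p for _, p in layers)
print(f"total parameters: {total:,}")          # ~61 million
print(f"fp32 size: {total * 4 / 1e6:.0f} MB")  # ~244 MB
```

Running the numbers also shows where the weight lives: the five conv layers account for only ~2.3M parameters; the three fully-connected layers hold the other ~96%.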

Lab · the AlexNet stack · 224×224 RGB → 1000-class probability. (Interactive layer-by-layer view; e.g. the first MaxPool stage produces a 27×27×96 feature map, downsampling by 2× via max pooling.)

Walk through the layers. Conv 1 uses large filters (11×11) and a big stride (4), capturing edges and color blobs over large receptive fields. Subsequent conv layers use smaller filters and operate on the previous layer’s feature maps, building hierarchically more abstract features. The FC layers at the end mix everything together for the final classification.
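The spatial shrinkage through the stack follows one formula: out = ⌊(in + 2·pad − kernel) / stride⌋ + 1. A sketch tracing it layer by layer (kernel/stride/pad values follow common AlexNet descriptions; note the paper quotes a 224×224 input, but the arithmetic only works out with 227, a well-known inconsistency, so 227 is used here):

```python
def out_size(in_size, kernel, stride, pad=0):
    # standard conv/pool output-size formula
    return (in_size + 2 * pad - kernel) // stride + 1

s = 227
s = out_size(s, kernel=11, stride=4)        # conv1   -> 55
s = out_size(s, kernel=3,  stride=2)        # maxpool -> 27
s = out_size(s, kernel=5,  stride=1, pad=2) # conv2   -> 27
s = out_size(s, kernel=3,  stride=2)        # maxpool -> 13
s = out_size(s, kernel=3,  stride=1, pad=1) # conv3   -> 13
s = out_size(s, kernel=3,  stride=1, pad=1) # conv4   -> 13
s = out_size(s, kernel=3,  stride=1, pad=1) # conv5   -> 13
s = out_size(s, kernel=3,  stride=2)        # maxpool -> 6
print(s)  # 6: flattened to 6*6*256 = 9216 inputs for fc6
```

The final 6×6×256 volume is what the first fully-connected layer consumes, which is where most of the parameters come from.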

§ 03 · THE UNSEXY ENABLERS · Why this worked when previous attempts didn’t

Deep convolutional networks weren’t a new idea: LeNet (1998) had the basic recipe. What made AlexNet work was a stack of mostly unsexy implementation choices:

  • GPU training: two GTX 580s for about six days, infeasible on the CPUs of the time.
  • ReLU activations: train much faster at depth than sigmoid or tanh.
  • Dropout: regularization that kept 60M parameters from overfitting 1.2M images.
  • Data augmentation: crops and flips to stretch the training set.

§ 04 · WHY THIS PAPER CHANGED EVERYTHING · The wave that followed

Three things AlexNet’s win triggered:

Fig 1 · ImageNet top-5 error (%), 2010–2017, with a ~5% human-performance reference line; AlexNet, GoogLeNet, and ResNet are marked. AlexNet is the visible discontinuity in 2012 (a ~10 pp jump); everything after is a steady push made possible by what AlexNet established.
Check · AlexNet had ~60M parameters. ResNet-50 (a 2015 successor) has 25M but performs better. What's the main reason?

§ 05 · TAKING THIS FORWARD · Where vision went after this

Reading AlexNet today is a history exercise; nobody trains AlexNet in 2026 except as a baseline or a teaching artifact. But the recipe is intact: stacked convolutions, ReLU activations, pooling, and a learned classifier on top.

§ · GOING DEEPER · The three innovations that actually mattered in 2012

Krizhevsky, Sutskever, and Hinton’s 2012 ImageNet result dropped top-5 error from ~26% to ~15% in one paper. Three things compounded. GPU training: AlexNet trained on two GTX 580s for six days, infeasible on CPUs. ReLU activations: trained much faster than sigmoid or tanh (Nair & Hinton 2010 had proposed them; AlexNet showed they worked at depth). Dropout (Hinton et al. 2012): prevented overfitting on a relatively small dataset (1.2M images for 60M parameters).
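Two of those enablers fit in a few lines of plain Python. A sketch, not the paper’s implementation: the dropout shown is the modern “inverted” variant, which rescales at training time so inference needs no change, whereas the original paper instead halved activations at test time.

```python
import random

def relu(xs):
    # max(0, x): the gradient is 1 for any positive input, so activations
    # don't saturate the way sigmoid/tanh outputs do in deep stacks.
    return [max(0.0, x) for x in xs]

def dropout(xs, p=0.5, training=True, rng=random):
    # Inverted dropout: zero each activation with probability p during
    # training and rescale survivors by 1/(1-p). At inference, pass through.
    # (AlexNet applied dropout with p=0.5 on the fully-connected layers.)
    if not training:
        return list(xs)
    return [0.0 if rng.random() < p else x / (1 - p) for x in xs]

acts = relu([-2.0, -0.5, 0.0, 1.5, 3.0])  # [0.0, 0.0, 0.0, 1.5, 3.0]
dropped = dropout(acts, p=0.5)            # survivors doubled, rest zeroed
```

The ReLU point is the one that mattered for depth: with saturating activations, gradients shrink layer by layer; with ReLU, positive paths pass gradients through unchanged.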

The legacy is the architectural template — convolutions, ReLU, pooling, fully connected — that every CNN copied for the next decade. VGG (Simonyan & Zisserman 2014) pushed depth. GoogLeNet (Szegedy et al. 2014) introduced inception modules. ResNet (He et al. 2015) added skip connections and effectively ended the “how deep can we go” question. ImageNet as a benchmark (Deng et al. 2009) made all of this measurable — the dataset mattered as much as the architecture.

§ · FURTHER READING · References & deeper sources

  1. Krizhevsky, Sutskever, Hinton (2012). ImageNet Classification with Deep Convolutional Neural Networks (AlexNet) · NeurIPS
  2. Deng et al. (2009). ImageNet: A Large-Scale Hierarchical Image Database · CVPR
  3. Simonyan, Zisserman (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition (VGG) · ICLR
  4. Szegedy et al. (2014). Going Deeper with Convolutions (GoogLeNet) · CVPR
  5. He, Zhang, Ren, Sun (2015). Deep Residual Learning for Image Recognition (ResNet) · CVPR

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.