One-Line Summary: Grokking is the eerie phenomenon where a model first memorizes its training data while validation accuracy stays poor, then, long after training loss has plateaued, suddenly generalizes.
Prerequisites: Basic understanding of training/validation curves, weight decay, and the concept of overfitting.
What It Is
In a standard training run you expect train and validation loss to fall together. In a grokking run, they decouple: training loss drops to near zero almost immediately, validation loss stays high, and the model looks like a textbook case of overfitting. Then — sometimes thousands of epochs later, with no change to the data, optimizer, or learning rate — validation loss suddenly collapses. The model "gets it."
Power et al. (2022) first observed grokking on small algorithmic tasks like modular arithmetic, where the data is finite and the underlying rule is exact. The model could brute-force memorize the training examples long before it found weights that implement the actual modular-addition algorithm.
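To make the setup concrete, here is a minimal sketch of the modular-addition dataset in PyTorch. The modulus and train fraction below are illustrative choices, not Power et al.'s exact configuration:

```python
import torch

p = 113            # prime modulus (assumed for illustration)
train_frac = 0.3   # fraction of all p*p pairs used for training

pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))  # every (a, b)
labels = (pairs[:, 0] + pairs[:, 1]) % p                        # (a + b) mod p

perm = torch.randperm(len(pairs))
n_train = int(train_frac * len(pairs))
train_idx, val_idx = perm[:n_train], perm[n_train:]
```

Note what is unusual about this regime: the input space is finite and fully enumerable, so the validation set is simply the pairs the model never saw, and there is a single exact rule to find.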
Why It Matters
Grokking puts a stake through the heart of the most reflexive defensive move in ML training: early stopping. If you stop training when validation loss plateaus, you may halt the model just before the most important learning of the run.
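To make the failure mode concrete, here is a sketch of an ordinary patience-based stopping rule (the function name and thresholds are hypothetical, not from any particular library). If the grokking transition arrives at, say, epoch 20,000 and the patience is a few hundred epochs, this rule halts deep inside the plateau and the sudden generalization never happens:

```python
def should_stop(val_loss_history, patience=200, min_delta=1e-4):
    """Stop once val loss hasn't improved by min_delta for `patience` epochs.

    In a grokking run this fires thousands of epochs before the transition:
    from the inside, the plateau looks exactly like terminal overfitting.
    """
    if len(val_loss_history) <= patience:
        return False
    best_recent = min(val_loss_history[-patience:])
    best_before = min(val_loss_history[:-patience])
    return best_recent > best_before - min_delta
```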
Mechanistically, the memorization phase is not idle. The model is slowly developing structured internal representations (circuits that compute the actual rule), but they're being drowned out by the noise of memorized lookup-table behavior. Weight decay tilts the loss landscape against memorization, which requires a large weight norm, so the structured low-norm solution eventually wins. Nanda et al. (2023) reverse-engineered grokking on modular addition and showed the model literally constructs Fourier-like circuits, using trigonometric identities to compute the algorithm exactly.
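The flavor of that circuit is easy to verify numerically. The sketch below uses a single Fourier frequency (the trained model spreads the computation over several): each residue is placed at an angle on a circle, angle-addition identities turn addition of residues into composition of rotations, and the score for a candidate answer peaks exactly at the true sum. The constants are illustrative:

```python
import math

p, k = 113, 5                 # modulus and one Fourier frequency (illustrative)
a, b = 41, 97
w = 2 * math.pi * k / p

# Angle-addition identities: recover cos/sin of w*(a+b) from per-operand features.
cos_ab = math.cos(w * a) * math.cos(w * b) - math.sin(w * a) * math.sin(w * b)
sin_ab = math.sin(w * a) * math.cos(w * b) + math.cos(w * a) * math.sin(w * b)

# "Logit" for each candidate c is cos(w * (a + b - c)), which is maximal
# exactly when c == (a + b) mod p, because p is prime and k is nonzero mod p.
logits = [cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c) for c in range(p)]
assert max(range(p), key=lambda c: logits[c]) == (a + b) % p
```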
Key Technical Details
Grokking depends on weight decay, the fraction of the possible inputs used for training (smaller fractions lengthen the delay before generalization, and below some threshold it never arrives), and the length of training (orders of magnitude longer than typical). It is most reliably observed on synthetic algorithmic tasks; in real-world LLM training the dynamics are messier and harder to disentangle, but related "phase transitions", emergent capabilities appearing suddenly at scale, likely share some of the same causes.
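Putting the pieces together, here is a hedged end-to-end sketch of a run in which grokking can appear, reusing the dataset from the earlier sketch. Every hyperparameter is an illustrative guess rather than a tuned reproduction; the load-bearing ingredients are full-batch AdamW with substantial weight decay and a run that continues orders of magnitude past the training-loss plateau:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p, train_frac = 113, 0.3                     # as in the dataset sketch above
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
n_train = int(train_frac * len(pairs))
tr, va = perm[:n_train], perm[n_train:]

class ModAddMLP(nn.Module):
    """Small MLP over concatenated operand embeddings (illustrative architecture)."""
    def __init__(self, p, d=128, hidden=512):
        super().__init__()
        self.emb = nn.Embedding(p, d)
        self.net = nn.Sequential(nn.Linear(2 * d, hidden), nn.ReLU(), nn.Linear(hidden, p))

    def forward(self, x):                    # x: (batch, 2) integer pairs
        return self.net(self.emb(x).flatten(1))

model = ModAddMLP(p)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(50_000):                  # far past the training-loss plateau
    model.train()
    opt.zero_grad()
    loss = loss_fn(model(pairs[tr]), labels[tr])   # full-batch gradient step
    loss.backward()
    opt.step()
    if epoch % 1_000 == 0:
        model.eval()
        with torch.no_grad():
            val_acc = (model(pairs[va]).argmax(-1) == labels[va]).float().mean().item()
        print(f"epoch {epoch}: train loss {loss.item():.4f}, val acc {val_acc:.3f}")
```

If the run groks, the printed validation accuracy sits near chance (about 1/p) for a long stretch and then climbs toward 1.0 within a comparatively small number of epochs; watching the weight norm fall alongside that climb is the weight-decay story above playing out.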