One-Line Summary: The field's understanding of training data shifted from "more is better" to "quality, curation, and diversity matter more than raw volume," fundamentally changing how LLMs are trained.
Prerequisites: 01-gpt-3.md, 03-chinchilla-and-compute-optimal-training.md
What Is the Data Quality Revolution?
Imagine you are studying for a medical exam. You could read every page on the internet mentioning medicine — forum posts, ads, conspiracy theories, textbooks all mixed together. Or you could study carefully curated textbooks, peer-reviewed papers, and structured problem sets.
Early LLM training took the first approach: scrape the entire web, throw it all in, and trust that scale would average out the noise. The data quality revolution is the field's hard-won realization that the second approach — careful curation, filtering, and even synthetic generation of high-quality data — produces dramatically better models, often at a fraction of the cost.
How It Works
The Shift from "More Data" to "Better Data":
The "More Data" Era (2018-2021) The Quality Era (2023-2025)
═══════════════════════════════ ════════════════════════════
Raw Web Crawl Multi-Stage Filtering Pipeline
┌──────────────────┐ ┌──────────────────────────┐
│ ████████████████ │ │ Raw Common Crawl │
│ ████████████████ │ volume └────────────┬─────────────┘
│ ████████████████ │ matters │
│ noise + quality │ ┌────────────▼─────────────┐
│ mixed together │ │ Language Detection │
└──────────────────┘ ├──────────────────────────┤
│ URL Filtering │
GPT-3: 300B tokens ├──────────────────────────┤
light filtering │ Quality Classifier │
├──────────────────────────┤
│ Deduplication (exact+fuzzy)│
├──────────────────────────┤
│ Toxicity / PII Removal │
├──────────────────────────┤
│ Benchmark Decontamination│
└────────────┬─────────────┘
│
┌────────────▼─────────────┐
│ High-Quality Tokens │
│ FineWeb: 15T tokens │
└──────────────────────────┘
Typical Modern Data Mix:
┌──────────────────────────────────────────────┐
│ ██████████████████████████ Web Text (~50%) │
│ █████████████ Code (~25%) │
│ █████ Academic (~10%) │
│ █████ Books (~10%) │
│ ███ Conversation (~5%)│
└──────────────────────────────────────────────┘The "More Data" Era (2018-2021)
GPT-2 (2019) was trained on WebText, 40GB of text scraped from Reddit outbound links, using social upvotes as a rough quality filter. GPT-3 (2020) scaled to 300B tokens from Common Crawl, WebText2, Books, and Wikipedia, with only light filtering.
The prevailing assumption was that neural networks could learn to ignore noise at sufficient scale. Common Crawl alone contained petabytes of raw web text, and the bottleneck seemed to be compute, not data. Most researchers focused on scaling model parameters and training compute, treating data as an essentially unlimited resource that just needed basic deduplication and language filtering.
The Pile and Curated Datasets (2020-2022)
EleutherAI's The Pile (Gao et al., 2020) changed the conversation. It combined 825GB from 22 diverse sources: PubMed abstracts, ArXiv papers, GitHub code, StackExchange posts, USPTO patents, FreeLaw court opinions, Enron emails, Ubuntu IRC logs, and more. Each source was individually processed and documented.
Models trained on The Pile outperformed those trained on equivalent-sized random web scrapes, demonstrating that data diversity and per-source quality matter. The Pile became the standard training set for open models including GPT-Neo, GPT-J, and Pythia. Its success proved that thoughtful data curation could substitute for raw scale.
The Chinchilla Insight and Its Consequences
Chinchilla (Hoffmann et al., 2022) proved that data should scale proportionally with parameters — a 70B parameter model needs roughly 1.4T tokens for optimal training. This result, known as the Chinchilla scaling law, meant most existing models were undertrained on data.
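As a back-of-the-envelope check, the rule of thumb is easy to apply directly. This is only a sketch; the exact ratio depends on the fitted scaling coefficients and the assumed compute budget.

```python
# Rough Chinchilla-style rule of thumb: ~20 training tokens per parameter.
TOKENS_PER_PARAM = 20

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal token budget for a given model size."""
    return TOKENS_PER_PARAM * n_params

print(chinchilla_optimal_tokens(70e9))  # 1.4e12 -> ~1.4T tokens for a 70B model
print(chinchilla_optimal_tokens(7e9))   # 1.4e11 -> ~140B tokens for a 7B model
```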
Suddenly high-quality tokens became a scarce resource. If you need trillions of tokens, you cannot rely on niche curated sources alone. The field needed to figure out how to extract quality from scale — a challenge that spawned an entire sub-field of data engineering.
Filtering at Scale (2023-2024)
RefinedWeb (Penedo et al., 2023) from the Technology Innovation Institute showed that carefully filtered Common Crawl alone could train competitive models. Their pipeline included URL filtering, language detection, heuristic quality filtering, deduplication at both exact and fuzzy levels, and content filtering. This extracted roughly 5T high-quality tokens from raw Common Crawl and powered the Falcon models.
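The overall shape of such a pipeline can be sketched in a few lines of Python. This is a deliberately simplified illustration rather than the RefinedWeb code: the language detector, quality score, and hash-based exact deduplication below are crude stand-ins for production components such as fastText-style language identification, heuristic or learned quality filters, and MinHash fuzzy deduplication.

```python
import hashlib
import re

def detect_language(text: str) -> str:
    # Stand-in for a real language-ID model; here a naive English heuristic.
    return "en" if re.search(r"\b(the|and|of)\b", text.lower()) else "other"

def quality_score(text: str) -> float:
    # Stand-in for a quality filter: document length and fraction of alphabetic characters.
    if not text:
        return 0.0
    alpha_frac = sum(c.isalpha() for c in text) / len(text)
    length_ok = 1.0 if 200 <= len(text) <= 100_000 else 0.3
    return alpha_frac * length_ok

def filter_corpus(docs, min_quality=0.5):
    seen_hashes = set()              # exact deduplication via content hashing
    for doc in docs:
        if detect_language(doc) != "en":
            continue
        if quality_score(doc) < min_quality:
            continue
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen_hashes:    # drop exact duplicates
            continue
        seen_hashes.add(digest)
        yield doc                    # survivors move on to PII removal, decontamination, ...

kept = list(filter_corpus([
    "The quick brown fox jumps over the lazy dog. " * 20,
    "zzzz 1234 $$$$",                # filtered out: fails the language heuristic
]))
```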
Hugging Face's FineWeb (2024) pushed further: 15T tokens from Common Crawl with an extensively documented quality filtering pipeline. Each filtering step was ablated to measure its impact on downstream model quality. FineWeb-Edu, a subset filtered for educational content using a classifier trained to identify textbook-like material, proved that small models trained on curated educational data could outperform larger models trained on raw web data.
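Classifier-based filtering of the FineWeb-Edu kind can be sketched as follows. The checkpoint name and the keep-threshold of roughly 3 out of 5 follow the published FineWeb-Edu classifier; treat both as assumptions to verify against the model card.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Checkpoint name taken from the FineWeb-Edu release (assumption to verify).
CKPT = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT)
model.eval()

def edu_score(text: str) -> float:
    """Predict an educational-value score (roughly 0-5) for one document."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze().item()

def keep(text: str, threshold: float = 3.0) -> bool:
    # FineWeb-Edu keeps documents scoring at or above ~3 out of 5 (assumption).
    return edu_score(text) >= threshold
```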
"Textbooks Are All You Need" and Synthetic Data
Microsoft's Phi series (2023-2024) demonstrated perhaps the most radical data insight: "textbook-quality" data, much of it synthetic and generated or filtered with the help of larger models such as GPT-3.5 and GPT-4, could outperform raw web data for training small models. Phi-1 (1.3B parameters) was trained on roughly 7B tokens of filtered code and synthetic textbooks, and it outperformed models 10x its size on code benchmarks.
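The generation loop behind such synthetic corpora is conceptually simple. The sketch below uses the OpenAI Python client as a stand-in teacher model; the topics, prompt, and model name are illustrative assumptions, not the Phi recipe, which also leaned heavily on classifier-filtered real code.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TOPICS = ["binary search", "recursion", "hash tables"]  # illustrative topic seeds

def generate_textbook_section(topic: str, model: str = "gpt-4o-mini") -> str:
    """Ask a teacher model for a short, textbook-style explanation with exercises."""
    prompt = (
        f"Write a concise textbook section about {topic} for a beginner programmer. "
        "Include one worked example in Python and two short exercises."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,  # some diversity across generations
    )
    return response.choices[0].message.content

synthetic_corpus = [generate_textbook_section(t) for t in TOPICS]
# In practice, generations are further deduplicated, filtered, and checked for topic diversity.
```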
AI2's Dolma (2024) took the transparency route: 3T tokens with full data documentation, provenance tracking, and reproducibility — enabling the OLMo model family. Dolma's release represented a commitment to open data science, with every processing step documented and reproducible.
The Phi results suggest a provocative conclusion: the quality ceiling for small model training may be set by data quality, not model size. Given sufficiently good data, a 1.3B parameter model can outperform a 15B parameter model trained on noisier data. This inverts the conventional scaling assumption that bigger models always win.
Data Mixing and Curriculum Learning
Modern training recipes carefully balance the ratio of different data sources. A typical mix might include roughly 50% web text, 25% code, 10% academic papers, 10% books, and 5% conversation data. Each source serves a different purpose:
- Code data improves reasoning ability even for non-code tasks, because code requires logical structure
- Academic data improves factual accuracy and scientific reasoning
- Books provide depth of knowledge on specific topics and coherent long-form structure
- Conversation data improves instruction-following and dialogue skills
- Web data provides breadth of knowledge across all topics
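In practice such a mix is usually realized by sampling each training document from the source pools in proportion to target weights. A minimal sketch, with weights mirroring the illustrative percentages above (real recipes tune them empirically):

```python
import random

# Target sampling weights, mirroring the illustrative mix above.
MIX = {"web": 0.50, "code": 0.25, "academic": 0.10, "books": 0.10, "conversation": 0.05}

def sample_source(mix: dict[str, float]) -> str:
    """Pick the source of the next document, proportional to its weight."""
    names, weights = zip(*mix.items())
    return random.choices(names, weights=weights, k=1)[0]

def next_document(pools: dict[str, list[str]], mix: dict[str, float]) -> str:
    source = sample_source(mix)
    return random.choice(pools[source])  # real loaders stream and shard rather than random.choice
```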
Curriculum learning — ordering data from easy to hard during training, or adjusting the data mix as training progresses — has shown modest but consistent improvements. Some teams front-load high-quality data to establish strong foundations, then add noisier web data for breadth. Others increase the proportion of code and academic data late in training to sharpen reasoning before post-training begins.
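Curriculum-style mixing is often implemented as a schedule over those same weights rather than a single static mix. A hedged sketch of one such schedule; the phase boundary and numbers are illustrative, not any particular lab's recipe:

```python
def mix_for_progress(progress: float) -> dict[str, float]:
    """Return source weights as a function of training progress in [0, 1]."""
    if progress < 0.7:
        # Early and middle training: broad coverage, web-heavy.
        return {"web": 0.55, "code": 0.20, "academic": 0.10, "books": 0.10, "conversation": 0.05}
    # Final phase: upweight code and academic text to sharpen reasoning before post-training.
    return {"web": 0.35, "code": 0.30, "academic": 0.20, "books": 0.10, "conversation": 0.05}
```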
Data decontamination has also become a critical step. If benchmark test data appears in the training set, the model memorizes answers rather than learning to solve problems, producing artificially inflated evaluation scores. Modern pipelines explicitly remove known benchmark datasets and their paraphrases from training data.
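Decontamination is commonly implemented as n-gram overlap checks against known benchmark text. A simplified sketch follows; 13-gram matching is a common choice (it appears in the GPT-3 report, for example), but the exact n and matching rules vary by lab.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Lower-cased word n-grams used as contamination fingerprints."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_benchmark_index(benchmark_examples: list[str], n: int = 13) -> set:
    """Collect all n-grams that appear in any benchmark example."""
    index = set()
    for example in benchmark_examples:
        index |= ngrams(example, n)
    return index

def is_contaminated(document: str, benchmark_index: set, n: int = 13) -> bool:
    """Flag a training document if it shares any n-gram with a benchmark example."""
    return not ngrams(document, n).isdisjoint(benchmark_index)

bench = ["The quick brown fox jumps over the lazy dog near the quiet river bank today"]
index = build_benchmark_index(bench)
docs = [
    "Unrelated training document about data pipelines and deduplication at scale.",
    "The quick brown fox jumps over the lazy dog near the quiet river bank today, said the page.",
]
clean_docs = [d for d in docs if not is_contaminated(d, index)]  # keeps only the first document
```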
Why It Matters
The data quality revolution inverted the economics of LLM training. In the "more data" era, competitive advantage came from having the most compute. In the quality era, competitive advantage comes from having the best data pipeline.
This explains how smaller organizations — DeepSeek, Mistral, AI2 — can train models competitive with those from companies spending 10-100x more on compute. It also explains the growing importance of data decontamination: removing benchmark test data from training sets became critical for honest evaluation, as models trained on benchmark data produce artificially inflated scores.
The implications extend to ethics and policy. Data documentation and provenance tracking became ethical imperatives. The question of who owns and controls high-quality training data became a major legal issue, with lawsuits from publishers, artists, and coders whose work was used without explicit consent.
Key Technical Details
- The Pile: 825GB, 22 sources, released 2020. Standard for open models through 2022.
- GPT-3 Training Data: ~300B tokens. Common Crawl (60%), WebText2, Books1, Books2, Wikipedia.
- Chinchilla Optimal: ~20 tokens per parameter. A 70B model needs ~1.4T tokens.
- RefinedWeb: 5T tokens extracted from Common Crawl. Powered Falcon-40B (2023).
- Dolma: 3T tokens, fully documented. Powers OLMo (AI2, 2024).
- FineWeb: 15T tokens from Common Crawl. Hugging Face, 2024.
- FineWeb-Edu: Educational subset of FineWeb. 1.3T tokens scored for educational value.
- Phi-1 Training Data: ~7B tokens of "textbook-quality" data: code filtered from The Stack and StackOverflow plus GPT-generated synthetic textbooks and exercises.
- Typical Modern Data Mix: ~50% web, ~25% code, ~10% academic, ~10% books, ~5% conversation.
- Deduplication Impact: Removing duplicates typically improves performance by 2-5% on benchmarks.
Common Misconceptions
- "More data always means a better model." Beyond a threshold, data quality dominates quantity. A model trained on 1T curated tokens can outperform one trained on 10T noisy tokens. The Phi series proved this at small scale, and Chinchilla proved the importance of the data-parameter balance at large scale.
- "Synthetic data is just a cheap shortcut." Synthetic data is not inherently lower quality. When generated by capable models and filtered carefully, it can exceed the quality of web-scraped data. The key is diversity and verification: synthetic data must cover the full range of topics and difficulty levels.
- "Data curation is just removing duplicates and profanity." Modern data pipelines involve dozens of steps: language identification, quality scoring, topic classification, toxicity filtering, exact and fuzzy deduplication, PII removal, benchmark decontamination, and source-specific processing. This is a major engineering effort requiring months of development.
Connections to Other Concepts
The data quality revolution was catalyzed by 03-chinchilla-and-compute-optimal-training.md, which showed that data scale matters as much as compute. 01-phi-series.md demonstrated synthetic data's potential, while 06-falcon.md proved filtered web data could compete. 06-the-synthetic-data-revolution.md covers the synthetic data approach in depth. Data quality interacts directly with 07-training-efficiency-breakthroughs.md: better data means fewer tokens are needed. For the open-source data ecosystem, see 04-the-open-source-ecosystem.md.
Further Reading
- Gao et al., "The Pile: An 800GB Dataset of Diverse Text for Language Modeling" (2020) — curated dataset that changed open-source LLM training.
- Hoffmann et al., "Training Compute-Optimal Large Language Models" (2022) — Chinchilla scaling laws.
- Penedo et al., "The RefinedWeb Dataset for Falcon LLM" (2023) — careful filtering of Common Crawl at scale.
- Gunasekar et al., "Textbooks Are All You Need" (2023) — Phi-1 paper demonstrating synthetic data quality.
- Soldaini et al., "Dolma: An Open Corpus of Three Trillion Tokens" (2024) — AI2's documented training corpus.
- Penedo et al., "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale" (2024) — Hugging Face 15T token pipeline.