One-Line Summary: OpenAI's 175-billion-parameter language model demonstrated that massive scale unlocks in-context learning, allowing a single model to perform diverse tasks from just a few examples in the prompt.
Prerequisites: 04-gpt-2.md, 02-kaplan-scaling-laws.md
What Is GPT-3?
Imagine a student who has read so much of the world's text that you can hand them a few examples of any task — translating French, writing poetry, solving analogies — and they immediately understand what you want, without any additional studying. That is what GPT-3 achieved: a model so large that it could learn new tasks on the fly, just from examples placed in its input prompt.
GPT-3 arrived in May 2020 through a paper by Tom Brown and 30 co-authors at OpenAI. It was not a radical architectural departure from GPT-2 — it was the same decoder-only Transformer recipe, just scaled to a staggering degree. Where GPT-2 had 1.5 billion parameters, GPT-3 had 175 billion, a more than 100-fold increase. This was not recklessness; it was a calculated bet informed by Kaplan et al.'s scaling laws, which had shown just months earlier that language model loss decreases as a smooth power law with model size.
The motivation was a deep dissatisfaction with the fine-tuning paradigm. Every time you wanted a language model to do something new — sentiment analysis, question answering, summarization — you had to collect a labeled dataset and retrain. This was expensive, slow, and produced narrow specialists. OpenAI wanted a generalist: a single model that could handle anything you threw at it. The hypothesis was that if you made a model big enough and fed it enough diverse text, it would internalize so many patterns that it could adapt to new tasks from just a few demonstrations.
How It Works
GPT-3: In-Context Learning at Scale
The Key Insight — No Fine-Tuning Needed:
ZERO-SHOT:                           FEW-SHOT:
┌───────────────────────┐      ┌───────────────────────────────┐
│ Translate to French:  │      │ happy  ──▶ heureux            │
│ "cheese"              │      │ sad    ──▶ triste             │
│                       │      │ cheese ──▶                    │
│   ──▶ "fromage"       │      │   ──▶ "fromage"               │
└───────────────────────┘      └───────────────────────────────┘
The Eight-Model Suite (capability scales with size):
 125M    350M    760M    1.3B    2.7B    6.7B    13B     175B
  │       │       │       │       │       │       │       │
  ▼       ▼       ▼       ▼       ▼       ▼       ▼       ▼
 ░░░░    ░░░░    ░░██    ████    ████    ████    ████    ████
 (near-random)      (improving)               (strong few-shot)

3-digit addition accuracy:
  Small models: ~0% ──────────────────────▶ 175B: ~80%

Scale: GPT-1 (117M) ──▶ GPT-2 (1.5B) ──▶ GPT-3 (175B)
            ×1              ×13             ×1,500

Figure: GPT-3 demonstrated that in-context learning capability scales with model size. The eight-model suite showed smooth scaling for some tasks and sharp thresholds for others (like arithmetic), seeding the concept of emergent abilities.
Architecture at Scale
GPT-3 uses a standard autoregressive Transformer decoder with 96 layers, 96 attention heads, and a hidden dimension of 12,288. The context window is 2,048 tokens, using the same BPE tokenizer as GPT-2. The layers alternate dense and locally banded sparse attention patterns (as in the Sparse Transformer), and each block's feedforward dimension is 4x the hidden size (49,152). Despite being the largest dense Transformer ever trained at the time, every architectural choice was deliberately conservative — the innovation was in the scale, not the structure.
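To make the 175-billion figure concrete, here is a back-of-the-envelope parameter count from the dimensions above — a rough sketch that ignores biases, layer norms, and the exact sparse-attention layout:

```python
# Rough parameter count for GPT-3 175B from its published dimensions.
n_layer = 96          # transformer blocks
d_model = 12288       # hidden dimension
d_ff = 4 * d_model    # feedforward dimension (49,152)
n_vocab = 50257       # GPT-2 BPE vocabulary size
n_ctx = 2048          # context window (learned position embeddings)

# Per block: Q/K/V/output projections ~ 4*d_model^2,
# feedforward up- and down-projections ~ 2*d_model*d_ff = 8*d_model^2,
# for roughly 12*d_model^2 weights per block.
per_block = 12 * d_model ** 2
embeddings = (n_vocab + n_ctx) * d_model

total = n_layer * per_block + embeddings
print(f"{total / 1e9:.1f}B parameters")  # ~174.6B, consistent with the quoted 175B
```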
Training Data and Compute
The training corpus comprised approximately 300 billion tokens drawn from five sources: filtered Common Crawl (410 billion tokens distilled from a 45TB raw crawl, weighted at 60% of the training mix), WebText2 (an expanded version of GPT-2's WebText dataset), two internet-based book corpora (Books1 and Books2), and English-language Wikipedia. Crucially, the mixture was not proportional to dataset size — higher-quality sources like Wikipedia and the books corpora were upsampled, so the model saw them multiple times during training while Common Crawl was seen for less than one full epoch. The estimated training cost was approximately $4.6 million in cloud compute, using thousands of V100 GPUs over several weeks.
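As a simple illustration of weight-based mixing (rather than size-proportional sampling), here is a minimal sketch; the weights are the approximate mixture fractions reported in the paper, and the per-source data loaders are omitted:

```python
# Minimal sketch of weight-based mixture sampling: the next training document's
# source is drawn by mixture weight, not by raw dataset size, so small
# high-quality corpora are repeated while Common Crawl is undersampled.
import random

mixture = {
    "common_crawl": 0.60,  # 410B-token dataset, seen for <1 epoch
    "webtext2":     0.22,
    "books1":       0.08,
    "books2":       0.08,
    "wikipedia":    0.03,
}

def sample_source() -> str:
    """Pick the source of the next training document by mixture weight."""
    return random.choices(list(mixture), weights=list(mixture.values()), k=1)[0]

print([sample_source() for _ in range(5)])
```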
In-Context Learning
The paper's central discovery was in-context learning (ICL): the ability to perform tasks by conditioning on examples provided in the prompt, with absolutely no gradient updates to the model's weights. The authors tested three regimes: zero-shot (task description only), one-shot (one example), and few-shot (as many examples as fit in the context window, typically 10 to 100). On many benchmarks, few-shot GPT-3 approached or matched fine-tuned state-of-the-art models. On some tasks like arithmetic and word unscrambling, performance jumped dramatically between the smallest and largest model sizes, hinting at emergent capabilities that would become a major research theme.
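The three regimes differ only in how the prompt is assembled. A minimal sketch of that assembly follows; the `complete()` call at the end is a hypothetical stand-in for sampling from any autoregressive language model:

```python
# Build zero-, one-, or few-shot prompts: the "learning" lives entirely in
# the conditioning text, and the model's weights are never updated.
def build_prompt(task: str, examples: list[tuple[str, str]], query: str) -> str:
    lines = [task]                      # natural-language task description
    for source, target in examples:     # k demonstrations (k=0 gives zero-shot)
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")         # the model continues from here
    return "\n".join(lines)

few_shot = build_prompt(
    "Translate English to French:",
    [("happy", "heureux"), ("sad", "triste")],
    "cheese",
)
print(few_shot)
# completion = complete(few_shot)   # hypothetical LM call; expected: " fromage"
```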
The Eight-Model Suite
OpenAI did not just train one model. They trained eight GPT-3 variants ranging from 125 million to 175 billion parameters (GPT-3 Small, Medium, Large, XL, 2.7B, 6.7B, 13B, and 175B). This suite was essential: it showed that in-context learning ability scaled smoothly with model size, and that many capabilities only appeared at the largest scales. The smallest models were barely better than random on tasks like three-digit arithmetic, while the 175B model achieved near-80% accuracy.
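The arithmetic comparison across the suite amounts to running the same few-shot probe against each checkpoint. A sketch of such a probe is below; `generate` is a hypothetical stand-in for whatever sampling call a given checkpoint exposes:

```python
# Estimate few-shot 3-digit addition accuracy for one model checkpoint.
import random

def three_digit_addition_accuracy(generate, n_trials=200, k_shots=8):
    """`generate(prompt)` is assumed to return the model's text completion."""
    correct = 0
    for _ in range(n_trials):
        pairs = [(random.randint(100, 999), random.randint(100, 999))
                 for _ in range(k_shots + 1)]
        demos = "\n".join(f"Q: What is {a} plus {b}? A: {a + b}"
                          for a, b in pairs[:-1])
        a, b = pairs[-1]
        prompt = f"{demos}\nQ: What is {a} plus {b}? A:"
        completion = generate(prompt).strip()
        correct += completion.startswith(str(a + b))
    return correct / n_trials

# Running this across the eight checkpoints is what produces the
# near-0% -> ~80% jump described above.
```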
Why It Matters
The Birth of the Foundation Model Era
GPT-3 was the proof of concept for the "foundation model" paradigm — the idea that a single large pre-trained model could serve as the base for countless downstream applications. Rather than training task-specific models, you could build on top of one general-purpose system. This concept, later formalized by Stanford's Center for Research on Foundation Models in 2021, restructured how the entire industry thought about AI development.
The API Business Model
In June 2020, OpenAI launched the GPT-3 API, making the model available as a commercial service. This was transformative: for the first time, developers without ML expertise or massive compute budgets could build AI-powered applications. The API model became the dominant business paradigm for frontier AI, later adopted by Anthropic, Google, Cohere, and others. It also meant that the most powerful AI capabilities were controlled by a handful of companies — a dynamic that would shape the open-source vs. closed-source debates of 2023-2024.
Shifting the Overton Window
GPT-3's fluent text generation and in-context learning stunned the broader tech community. Demos went viral on Twitter. Startups rushed to build on the API. The conversation shifted from "can language models do useful things?" to "what can't they do?" This cultural shift — more than any specific benchmark result — is what made GPT-3 a turning point. It convinced investors, executives, and engineers that large language models were commercially viable, setting the stage for the billions of dollars in AI investment that followed.
Key Technical Details
- Parameters: 175 billion (96 layers, 96 heads, d_model=12,288)
- Training data: ~300B tokens from Common Crawl, WebText2, Books1, Books2, Wikipedia
- Training cost: Approximately $4.6M in compute
- Context window: 2,048 tokens
- Eight model sizes: 125M, 350M, 760M, 1.3B, 2.7B, 6.7B, 13B, 175B parameters
- Few-shot TriviaQA: 71.2% accuracy (closed-book, no fine-tuning)
- SuperGLUE few-shot: 71.8 (vs. fine-tuned BERT baseline of 69.0)
- Released: May 2020 (paper), June 2020 (API)
Common Misconceptions
- "GPT-3 introduced a new architecture." It used the same Transformer decoder as GPT-2, just scaled more than 100x larger. The innovation was in scale and evaluation methodology, not architecture.
- "GPT-3 learns from the examples in the prompt." GPT-3's in-context learning involves no weight updates. It conditions on examples in the prompt to steer its outputs, but its parameters remain frozen. It is pattern-matching against its training data, not learning in the traditional ML sense.
- "GPT-3 was trained on the entire internet." Its training data, while large, was a filtered subset of web text plus books and Wikipedia. The Common Crawl portion underwent significant filtering, reducing roughly 45TB of raw compressed text to around 570GB of usable data.
- "GPT-3 made fine-tuning obsolete." Fine-tuned models still outperformed few-shot GPT-3 on most benchmarks. In-context learning was impressive but not yet competitive with task-specific training. The later RLHF work on InstructGPT and ChatGPT combined both approaches.
- "Scaling was the only insight." The training data mixture design — upsampling high-quality sources, careful deduplication and filtering of Common Crawl — was critical. A 175B model trained on unfiltered web text would have been far less capable.
Connections to Other Concepts
- 04-gpt-2.md — Direct predecessor; GPT-3 scales the same architecture 100x
- 02-kaplan-scaling-laws.md — Provided the theoretical justification for training at this scale
- 06-emergent-abilities.md — GPT-3's eight-model suite provided early evidence of emergence
- 01-instructgpt-and-rlhf.md — RLHF applied to GPT-3 produced InstructGPT, proving alignment beats raw scale
- 02-chatgpt.md — ChatGPT was built on GPT-3.5, a descendant of GPT-3
- 05-codex-and-code-generation.md — Codex was GPT-3 fine-tuned on code
- 03-chinchilla-and-compute-optimal-training.md — Showed GPT-3 was significantly undertrained relative to its size
Further Reading
- Brown et al., "Language Models are Few-Shot Learners" (2020) — The GPT-3 paper establishing in-context learning.
- Bommasani et al., "On the Opportunities and Risks of Foundation Models" (2021) — Stanford report that formalized the paradigm GPT-3 created.
- Kaplan et al., "Scaling Laws for Neural Language Models" (2020) — The scaling laws that motivated GPT-3's size.
- Zhao et al., "Calibrate Before Use: Improving Few-Shot Performance of Language Models" (2021) — Techniques for improving GPT-3's in-context learning reliability.