One-Line Summary: GPT-1 (Radford et al., 2018) combined a decoder-only Transformer with unsupervised generative pre-training followed by supervised fine-tuning, establishing the paradigm that decoder-only models trained on next-token prediction could develop broad language understanding.
Prerequisites: 01-attention-is-all-you-need.md, 06-ulmfit-and-transfer-learning.md
What Is GPT-1?
Imagine a student who spends months reading thousands of books — fiction, textbooks, newspapers, manuals — with no teacher, no tests, no guidance. Then, when given a specific exam (sentiment analysis, question answering, textual entailment), the student only needs a brief study session to excel, because they've already absorbed the patterns and structures of language itself. GPT-1 was this student: pre-trained to predict the next word on a vast book corpus, then lightly fine-tuned for specific tasks.
In June 2018, Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever at OpenAI published "Improving Language Understanding by Generative Pre-Training." The timing was significant: 05-elmo-and-contextual-embeddings.md (February 2018) and 06-ulmfit-and-transfer-learning.md (January 2018) had just demonstrated that pre-trained representations dramatically improve downstream NLP tasks. But both used LSTM architectures. GPT-1 made a pivotal choice: use a Transformer decoder. And another pivotal choice: pre-train with a causal (left-to-right) language modeling objective rather than bidirectional context.
These choices — decoder-only Transformer, causal language modeling, pre-train then fine-tune — defined what would become the dominant paradigm in AI. While 03-bert.md (published four months later) initially attracted more attention with its bidirectional approach, GPT-1's decoder-only architecture proved to be the one that scaled to GPT-2, GPT-3, GPT-4, and the entire family of modern LLMs.
How It Works
GPT-1: Decoder-Only Transformer + Transfer Learning
┌──────────────────────────────────────────────────────┐
│ STAGE 1: Unsupervised Pre-training                   │
│                                                      │
│   BookCorpus (~800M words)                           │
│        │                                             │
│        ▼                                             │
│   ┌──────────────────────────────────────────────┐   │
│   │ 12-layer Transformer Decoder                 │   │
│   │ (causal masking: left-to-right)              │   │
│   │                                              │   │
│   │ Input:  The cat sat on                       │   │
│   │ Target: cat sat on the                       │   │
│   │ (predict next token)                         │   │
│   └──────────────────────────────────────────────┘   │
│                                                      │
│ STAGE 2: Supervised Fine-tuning                      │
│                                                      │
│   ┌──────────────────────────────────────────────┐   │
│   │ Same Transformer + Linear Head               │   │
│   │                                              │   │
│   │ Classification: [text] [CLS] ──▶ label       │   │
│   │ Entailment: [premise] [SEP] [hypothesis] ──▶ │   │
│   │ Similarity: [A] [SEP] [B] + [B] [SEP] [A]    │   │
│   └──────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────┘

Figure: GPT-1's two-stage pipeline. Pre-training learns general language knowledge from unlabeled text via next-token prediction; fine-tuning adapts this knowledge to specific tasks with a single linear layer.
Architecture: Decoder-Only Transformer
GPT-1 used a 12-layer Transformer decoder with 12 attention heads and an embedding dimension of 768, totaling approximately 117 million parameters. Unlike the original Transformer (01-attention-is-all-you-need.md), GPT-1 dropped the encoder entirely. There was no cross-attention — only masked self-attention within the decoder. Each position could only attend to positions to its left (and itself), enforcing a causal structure where the model predicts each token based solely on the preceding tokens.
The model used learned positional embeddings (not sinusoidal), GELU activation functions (instead of ReLU), and a byte-pair encoding (BPE) tokenizer with a vocabulary of approximately 40,000 tokens. The context window was 512 tokens.
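These choices can be sketched numerically. Below, the configuration dictionary simply restates the published hyperparameters (the key names are illustrative, not from OpenAI's code), and the function builds the standard additive causal mask that lets each position attend only to itself and positions to its left:

```python
import numpy as np

# Published GPT-1 hyperparameters (dictionary keys are illustrative names).
GPT1_CONFIG = {
    "n_layers": 12,        # Transformer decoder blocks
    "n_heads": 12,         # attention heads per block
    "d_model": 768,        # embedding / hidden dimension
    "vocab_size": 40_000,  # approximate BPE vocabulary size
    "context_len": 512,    # maximum sequence length
}

def causal_mask(seq_len: int) -> np.ndarray:
    """Additive attention mask: 0.0 where attention is allowed, -inf where
    it is blocked. Added to the attention logits before the softmax, the
    -inf entries zero out attention to future positions."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

# Row t of the mask describes what position t may attend to:
# row 0 sees only position 0; row 3 sees positions 0..3.
print(causal_mask(4))
```

Because the mask is added to the logits, blocked positions receive probability zero after the softmax, which is exactly the "predict from the left context only" constraint shown in the figure.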
Stage 1: Unsupervised Pre-training
GPT-1 was pre-trained on BookCorpus — approximately 7,000 unpublished books (~800 million words, ~5GB of text) — using a standard causal language modeling objective: maximize P(token_t | token_1, ..., token_{t-1}). The model learned to predict the next word given all previous words in the sequence. This objective forced the model to learn syntax, semantics, world knowledge, and reasoning patterns — all from raw text with no labels.
The choice of BookCorpus was deliberate: it contained long, coherent passages of text (unlike web crawl data), which helped the model learn long-range dependencies. Training used the Adam optimizer with a maximum learning rate of 2.5e-4 (warmed up, then annealed with a cosine schedule), a batch size of 64 sequences of 512 tokens, and 100 epochs.
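The objective itself is just the average negative log-likelihood of each token given its predecessors, with inputs and targets being the same sequence shifted by one position. A NumPy sketch (illustrative, not the original training code):

```python
import numpy as np

def clm_loss(logits: np.ndarray, tokens: np.ndarray) -> float:
    """Average negative log-likelihood of each token given its predecessors.
    logits: (seq_len, vocab) decoder outputs; tokens: (seq_len,) token ids.
    The prediction made at position t is scored against the token at t+1,
    so inputs and targets are the same sequence shifted by one."""
    preds = logits[:-1]                    # predictions for the next token
    targets = tokens[1:]                   # the tokens actually observed
    # Numerically stable log-softmax over the vocabulary axis.
    z = preds - preds.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return float(nll.mean())

# Toy check: a model that puts nearly all its probability mass on the
# correct next token drives the loss toward zero.
confident = np.array([[9.0, 0.0, 0.0],   # predicts token 0 next
                      [0.0, 9.0, 0.0],   # predicts token 1 next
                      [0.0, 0.0, 9.0]])  # (last row's prediction is unused)
loss = clm_loss(confident, np.array([2, 0, 1]))
```

Nothing in this loss requires labels: any stretch of raw text supplies both the inputs and the targets, which is why the pre-training stage is "unsupervised."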
Stage 2: Supervised Fine-tuning
For fine-tuning, GPT-1 added a single linear layer on top of the final Transformer block's output. The innovation was in how different tasks were formulated to fit the decoder architecture:
- Classification (e.g., sentiment): Append a special extraction token (shown as [CLS] in the figure) to the end of the text; use its final representation for prediction.
- Entailment: Concatenate premise and hypothesis with a delimiter token; classify the final representation.
- Similarity: Run both orderings (A-B and B-A) and add the representations.
- Multiple choice: Concatenate question with each answer; score each independently.
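These traversal-style input formats can be sketched as simple sequence builders. The token names `<s>`, `<sep>`, and `<e>` below are hypothetical placeholders for GPT-1's start, delimiter, and extract tokens, and the string-list representation is purely illustrative:

```python
# Placeholder special tokens (hypothetical names, not from the paper's code).
START, SEP, EXTRACT = "<s>", "<sep>", "<e>"

def format_classification(text):
    # Single input: wrap the text; the extract token's final hidden state
    # feeds the linear classification head.
    return [START, *text, EXTRACT]

def format_entailment(premise, hypothesis):
    # Two ordered inputs joined by a delimiter token.
    return [START, *premise, SEP, *hypothesis, EXTRACT]

def format_similarity(a, b):
    # Similarity has no inherent ordering, so both orderings are run
    # and their final representations are summed before the linear head.
    return [format_entailment(a, b), format_entailment(b, a)]

def format_multiple_choice(context, answers):
    # One sequence per candidate answer; each is scored independently
    # and the scores are normalized with a softmax.
    return [[START, *context, SEP, *ans, EXTRACT] for ans in answers]

seq = format_entailment(["it", "rains"], ["it", "is", "wet"])
# → ['<s>', 'it', 'rains', '<sep>', 'it', 'is', 'wet', '<e>']
```

The point of these traversals is that every task, whatever its structure, reaches the model as one ordinary token sequence, so no task-specific architecture is needed beyond the final linear layer.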
During fine-tuning, the language modeling objective was retained as an auxiliary loss (weighted by 0.5), which improved generalization and accelerated convergence. This was a technique borrowed from multi-task learning.
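In code, the combined objective is a one-line weighted sum (a sketch; the function name is illustrative):

```python
def fine_tune_loss(task_loss: float, lm_loss: float,
                   lm_weight: float = 0.5) -> float:
    """Fine-tuning objective: the supervised task loss plus the causal
    language-modeling loss on the same inputs as an auxiliary term.
    GPT-1 weighted the auxiliary LM term by 0.5."""
    return task_loss + lm_weight * lm_loss

# E.g. a classification loss of 1.0 and an LM loss of 2.0 combine to 2.0.
print(fine_tune_loss(1.0, 2.0))
```

Keeping the LM term in the objective regularizes fine-tuning: the model cannot drift too far from the language distribution it learned in pre-training while adapting to the labeled task.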
Results
GPT-1 achieved state-of-the-art results on 9 out of 12 evaluated tasks, including commonsense reasoning (Stories Cloze: 86.5%), question answering (RACE: 59.0%), and textual entailment (MNLI: 82.1%). On some tasks, it improved over the previous SOTA by 8-9%. The model showed strong performance even with minimal task-specific data, demonstrating the effectiveness of transfer learning.
Why It Matters
Establishing the GPT Paradigm
GPT-1 defined a formula: decoder-only Transformer + causal language model pre-training + task-specific fine-tuning. This formula, scaled up with more parameters and more data, would produce GPT-2 (04-gpt-2.md), GPT-3, and eventually GPT-4. The decoder-only choice was initially seen as a limitation compared to BERT's bidirectional approach, but it proved critical for generation capabilities and for the scaling properties that emerged in larger models.
Generative Pre-training as Knowledge Acquisition
The paper's key insight was that causal language modeling — simply predicting the next word — is a rich enough objective to learn general language understanding. You don't need labeled data or supervised objectives to learn syntax, semantics, or reasoning. The next-word prediction task is so demanding that a model capable of doing it well must internalize deep knowledge about language and the world. This insight drove the entire scaling era.
The GPT vs BERT Fork
GPT-1 (June 2018) and BERT (October 2018) represented two divergent bets on the future of NLP. GPT bet on unidirectional, generative pre-training with a decoder-only architecture. BERT bet on bidirectional, masked language modeling with an encoder-only architecture. BERT initially won the benchmarks, and "BERT-ification" swept NLP in 2019. But GPT's approach proved to scale better and to be more versatile — capable of both understanding and generation. The eventual triumph of decoder-only architectures is analyzed in 07-encoder-vs-decoder-vs-encoder-decoder.md.
Key Technical Details
- Paper: Radford et al., "Improving Language Understanding by Generative Pre-Training" (Jun 2018, OpenAI technical report)
- Architecture: 12-layer decoder-only Transformer, 12 heads, d_model=768, ~117M parameters
- Pre-training data: BookCorpus (~7,000 books, ~800M words, ~5GB)
- Tokenization: BPE with ~40,000 vocabulary
- Context window: 512 tokens
- Pre-training objective: Causal language modeling (next-token prediction)
- Fine-tuning: Task-specific linear head + auxiliary LM loss (weight 0.5)
- Results: SOTA on 9/12 tasks; +8.9% on Stories Cloze, +5.7% on RACE
- Training: Single run on 8 GPUs; took approximately 1 month
- Key comparison: Outperformed ELMo-based approaches on most tasks despite similar scale
Common Misconceptions
- "GPT-1 was primarily a text generation model." GPT-1 was evaluated and promoted primarily as a language understanding model — note the subtitle "Improving Language Understanding." Text generation was not the focus. The generative capabilities that would make GPT-2 famous were latent in GPT-1 but not emphasized.
- "GPT-1 was immediately recognized as more important than BERT." The opposite. BERT dominated the NLP landscape from late 2018 through 2020. GPT-1 was seen as a competent but less powerful approach because it couldn't use bidirectional context. The vindication of decoder-only models came later, with GPT-3's in-context learning.
- "GPT-1 used massive amounts of training data." BookCorpus (~800M words) was modest by later standards. GPT-2 used 40GB of web text; GPT-3 used 570GB. GPT-1's data was smaller than what was used for ELMo (1B Word Benchmark) and far smaller than GloVe's training data (840B tokens). The architecture mattered more than the data scale.
- "The 'GPT' name always referred to a chatbot." GPT stands for "Generative Pre-Training." The original paper had nothing to do with chatbots or conversational AI. That association came much later with ChatGPT (November 2022), which was built on GPT-3.5.
Connections to Other Concepts
- Built directly on the Transformer architecture from 01-attention-is-all-you-need.md (decoder-only variant)
- Adopted the pre-train-then-fine-tune paradigm demonstrated by 06-ulmfit-and-transfer-learning.md and 05-elmo-and-contextual-embeddings.md
- Competed with and was initially overshadowed by 03-bert.md
- Scaled up to become 04-gpt-2.md, with zero-shot capabilities emerging at larger scale
- The decoder-only vs encoder-only debate is analyzed in 07-encoder-vs-decoder-vs-encoder-decoder.md
- For the technical details of autoregressive generation, see llm-concepts/autoregressive-generation.md
- For tokenization details, see llm-concepts/embeddings-and-tokenization.md
Further Reading
- Radford et al., "Improving Language Understanding by Generative Pre-Training" (2018, OpenAI) — the GPT-1 paper
- Radford et al., "Language Models are Unsupervised Multitask Learners" (2019, OpenAI) — GPT-2, the direct sequel
- Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (2018, arXiv:1810.04805) — the contemporaneous alternative approach
- Wang et al., "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding" (2018, arXiv:1804.07461) — the benchmark ecosystem GPT-1 was evaluated on