One-Line Summary: GPT-1 (Radford et al., 2018) combined a decoder-only Transformer with unsupervised generative pre-training followed by supervised fine-tuning, establishing the paradigm that decoder-only models trained on next-token prediction could develop broad language understanding.

Prerequisites: 01-attention-is-all-you-need.md, 06-ulmfit-and-transfer-learning.md

What Is GPT-1?

Imagine a student who spends months reading thousands of books — fiction, textbooks, newspapers, manuals — with no teacher, no tests, no guidance. Then, when given a specific exam (sentiment analysis, question answering, textual entailment), the student only needs a brief study session to excel, because they've already absorbed the patterns and structures of language itself. GPT-1 was this student: pre-trained to predict the next word on a vast book corpus, then lightly fine-tuned for specific tasks.

In June 2018, Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever at OpenAI published "Improving Language Understanding by Generative Pre-Training." The timing was significant: 05-elmo-and-contextual-embeddings.md (February 2018) and 06-ulmfit-and-transfer-learning.md (January 2018) had just demonstrated that pre-trained representations dramatically improve downstream NLP tasks. But both used LSTM architectures. GPT-1 made two pivotal choices: use a Transformer decoder rather than an LSTM, and pre-train with a causal (left-to-right) language modeling objective rather than with bidirectional context.

These choices — decoder-only Transformer, causal language modeling, pre-train then fine-tune — defined what would become the dominant paradigm in AI. While 03-bert.md (published four months later) initially attracted more attention with its bidirectional approach, GPT-1's decoder-only architecture proved to be the one that scaled to GPT-2, GPT-3, GPT-4, and the entire family of modern LLMs.

How It Works

  GPT-1: Decoder-Only Transformer + Transfer Learning
 
  ┌──────────────────────────────────────────────────────┐
  │  STAGE 1: Unsupervised Pre-training                  │
  │                                                      │
  │  BookCorpus (~800M words)                            │
  │       │                                              │
  │       ▼                                              │
  │  ┌──────────────────────────────────┐                │
  │  │  12-layer Transformer Decoder    │                │
  │  │  (causal masking: left-to-right) │                │
  │  │                                  │                │
  │  │  Input:  The cat sat on          │                │
  │  │  Target:     cat sat on the      │                │
  │  │  (predict next token)            │                │
  │  └──────────────────────────────────┘                │
  │                                                      │
  │  STAGE 2: Supervised Fine-tuning                     │
  │                                                      │
  │  ┌────────────────────────────────────────────────┐  │
  │  │  Same Transformer + Linear Head                │  │
  │  │                                                │  │
  │  │  Classification: [text] [Extract] ──▶ label    │  │
  │  │  Entailment:  [premise] [SEP] [hypothesis] ──▶ │  │
  │  │  Similarity:  [A] [SEP] [B]  +  [B] [SEP] [A]  │  │
  │  └────────────────────────────────────────────────┘  │
  └──────────────────────────────────────────────────────┘

Figure: GPT-1's two-stage pipeline. Pre-training learns general language knowledge from unlabeled text via next-token prediction; fine-tuning adapts this knowledge to specific tasks with a single linear layer.

Architecture: Decoder-Only Transformer

GPT-1 used a 12-layer Transformer decoder with 12 attention heads and an embedding dimension of 768, totaling approximately 117 million parameters. Unlike the original Transformer (01-attention-is-all-you-need.md), GPT-1 dropped the encoder entirely. There was no cross-attention — only masked self-attention within the decoder. Each position could only attend to positions to its left (and itself), enforcing a causal structure where the model predicts each token based solely on the preceding tokens.
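
The causal mask is the defining mechanical difference from an encoder. The sketch below shows one common way to implement masked self-attention (PyTorch-style, for illustration only; the weight names W_qkv and W_out are ours, not from the original code):

  import torch
  import torch.nn.functional as F

  def causal_self_attention(x, W_qkv, W_out, n_heads=12):
      """Masked self-attention: each position attends only to itself and positions to its left."""
      B, T, d = x.shape                                  # batch, sequence length, model dim (768 in GPT-1)
      q, k, v = (x @ W_qkv).split(d, dim=-1)             # project once, then split into Q, K, V
      q = q.view(B, T, n_heads, d // n_heads).transpose(1, 2)
      k = k.view(B, T, n_heads, d // n_heads).transpose(1, 2)
      v = v.view(B, T, n_heads, d // n_heads).transpose(1, 2)
      scores = q @ k.transpose(-2, -1) / (d // n_heads) ** 0.5
      mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # True above the diagonal = "future"
      scores = scores.masked_fill(mask, float("-inf"))   # future positions get zero attention weight
      attn = F.softmax(scores, dim=-1)
      out = (attn @ v).transpose(1, 2).reshape(B, T, d)  # re-merge the heads
      return out @ W_out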

The model used learned positional embeddings (not sinusoidal), GELU activation functions (instead of ReLU), and a byte-pair encoding (BPE) tokenizer with a vocabulary of approximately 40,000 tokens. The context window was 512 tokens.
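
Those sizes are enough to roughly reproduce the parameter count. A back-of-the-envelope sketch, assuming a round 40,000-token vocabulary and ignoring biases, layer norms, and special tokens:

  d_model, n_layers, d_ff = 768, 12, 3072        # GPT-1 sizes; the feed-forward dim is 4 * d_model
  vocab, n_positions = 40_000, 512               # approximate BPE vocabulary and context window

  token_emb = vocab * d_model                    # ~30.7M (also reused as the output projection)
  pos_emb = n_positions * d_model                # ~0.4M, learned rather than sinusoidal
  attn = 4 * d_model * d_model                   # Q, K, V and output projections per layer
  mlp = 2 * d_model * d_ff                       # two feed-forward matrices per layer
  total = token_emb + pos_emb + n_layers * (attn + mlp)
  print(f"{total / 1e6:.0f}M parameters")        # ~116M, close to the reported ~117M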

Stage 1: Unsupervised Pre-training

GPT-1 was pre-trained on BookCorpus — approximately 7,000 unpublished books (~800 million words, ~5GB of text) — using a standard causal language modeling objective: maximize P(token_t | token_1, ..., token_{t-1}). The model learned to predict the next word given all previous words in the sequence. This objective forced the model to learn syntax, semantics, world knowledge, and reasoning patterns — all from raw text with no labels.
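
In code, this objective is ordinary cross-entropy over targets shifted by one position. A minimal sketch (PyTorch-style; model is assumed to map token IDs to next-token logits):

  import torch
  import torch.nn.functional as F

  def causal_lm_loss(model, tokens):
      """Next-token prediction: maximize sum_t log P(token_t | token_1, ..., token_{t-1})."""
      inputs = tokens[:, :-1]                    # e.g. "The cat sat on"
      targets = tokens[:, 1:]                    # e.g. "cat sat on the" (shifted by one)
      logits = model(inputs)                     # (batch, seq_len - 1, vocab_size)
      return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))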

The choice of BookCorpus was deliberate: it contained long, coherent passages of text (unlike web crawl data), which helped the model learn long-range dependencies. Training ran for 100 epochs using the Adam optimizer with a peak learning rate of 2.5e-4 (warmed up, then annealed with a cosine schedule) and minibatches of 64 contiguous 512-token sequences.
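
A minimal training-loop sketch with those settings, reusing causal_lm_loss from above (model and dataloader are placeholders; the warmup/cosine schedule is omitted):

  import torch

  optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)   # peak learning rate from the paper

  for epoch in range(100):                       # 100 passes over BookCorpus
      for batch in dataloader:                   # minibatches of 64 contiguous 512-token sequences
          loss = causal_lm_loss(model, batch)
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()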

Stage 2: Supervised Fine-tuning

For fine-tuning, GPT-1 added a single linear layer on top of the final Transformer block's output. The innovation was in how different tasks were formulated to fit the decoder architecture (each format is sketched in code after the list):

  • Classification (e.g., sentiment): Append a special extract token to the end of the text; use its final representation for prediction.
  • Entailment: Concatenate premise and hypothesis with a delimiter token; classify the final representation.
  • Similarity: Run both orderings (A-B and B-A) and add the representations.
  • Multiple choice: Concatenate question with each answer; score each independently.
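
A rough sketch of these input transformations, with hypothetical special-token IDs standing in for the learned start, delimiter, and extract tokens (helper names are ours; text segments are lists of token IDs):

  START, DELIM, EXTRACT = 40001, 40002, 40003    # hypothetical IDs for the learned special tokens

  def single_input(text):
      # Classification: [start] text [extract]; the extract token's final state feeds the linear head
      return [START] + text + [EXTRACT]

  def pair_input(a, b):
      # Entailment: two segments joined by a delimiter token, classified from the extract position
      return [START] + a + [DELIM] + b + [EXTRACT]

  def similarity_inputs(a, b):
      # Both orderings are run; their final representations are added before the head
      return [pair_input(a, b), pair_input(b, a)]

  def multiple_choice_inputs(question, answers):
      # One sequence per candidate answer, each scored independently by the same head
      return [pair_input(question, ans) for ans in answers]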

During fine-tuning, the language modeling objective was retained as an auxiliary loss (weighted by 0.5), which improved generalization and accelerated convergence. This was a technique borrowed from multi-task learning.
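
In pseudocode, the fine-tuning objective is just a weighted sum (task_loss here is a hypothetical stand-in for cross-entropy over the linear head's output):

  def finetune_loss(model, batch, labels, lm_weight=0.5):
      # supervised task loss plus the pre-training LM objective as an auxiliary term
      return task_loss(model, batch, labels) + lm_weight * causal_lm_loss(model, batch)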

Results

GPT-1 achieved state-of-the-art results on 9 out of 12 evaluated tasks, including commonsense reasoning (Story Cloze: 86.5%), question answering (RACE: 59.0%), and textual entailment (MNLI: 82.1%). On some tasks it improved on the previous state of the art by nearly 9 percentage points. The model showed strong performance even with minimal task-specific data, demonstrating the effectiveness of transfer learning.

Why It Matters

Establishing the GPT Paradigm

GPT-1 defined a formula: decoder-only Transformer + causal language model pre-training + task-specific fine-tuning. This formula, scaled up with more parameters and more data, would produce GPT-2 (04-gpt-2.md), GPT-3, and eventually GPT-4. The decoder-only choice was initially seen as a limitation compared to BERT's bidirectional approach, but it proved critical for generation capabilities and for the scaling properties that emerged in larger models.

Generative Pre-training as Knowledge Acquisition

The paper's key insight was that causal language modeling — simply predicting the next word — is a rich enough objective to learn general language understanding. You don't need labeled data or supervised objectives to learn syntax, semantics, or reasoning. The next-word prediction task is so demanding that a model capable of doing it well must internalize deep knowledge about language and the world. This insight drove the entire scaling era.

The GPT vs BERT Fork

GPT-1 (June 2018) and BERT (October 2018) represented two divergent bets on the future of NLP. GPT bet on unidirectional, generative pre-training with a decoder-only architecture. BERT bet on bidirectional, masked language modeling with an encoder-only architecture. BERT initially won the benchmarks, and "BERT-ification" swept NLP in 2019. But GPT's approach proved to scale better and to be more versatile — capable of both understanding and generation. The eventual triumph of decoder-only architectures is analyzed in 07-encoder-vs-decoder-vs-encoder-decoder.md.

Key Technical Details

  • Paper: Radford et al., "Improving Language Understanding by Generative Pre-Training" (Jun 2018, OpenAI technical report)
  • Architecture: 12-layer decoder-only Transformer, 12 heads, d_model=768, ~117M parameters
  • Pre-training data: BookCorpus (~7,000 books, ~800M words, ~5GB)
  • Tokenization: BPE with ~40,000 vocabulary
  • Context window: 512 tokens
  • Pre-training objective: Causal language modeling (next-token prediction)
  • Fine-tuning: Task-specific linear head + auxiliary LM loss (weight 0.5)
  • Results: SOTA on 9/12 tasks; +8.9% on Story Cloze, +5.7% on RACE
  • Training: Single run on 8 GPUs; took approximately 1 month
  • Key comparison: Outperformed ELMo-based approaches on most tasks despite similar scale

Common Misconceptions

  • "GPT-1 was primarily a text generation model." GPT-1 was evaluated and promoted primarily as a language understanding model — note the subtitle "Improving Language Understanding." Text generation was not the focus. The generative capabilities that would make GPT-2 famous were latent in GPT-1 but not emphasized.

  • "GPT-1 was immediately recognized as more important than BERT." The opposite. BERT dominated the NLP landscape from late 2018 through 2020. GPT-1 was seen as a competent but less powerful approach because it couldn't use bidirectional context. The vindication of decoder-only models came later, with GPT-3's in-context learning.

  • "GPT-1 used massive amounts of training data." BookCorpus (~800M words) was modest by later standards. GPT-2 used 40GB of web text; GPT-3 used 570GB. GPT-1's data was smaller than what was used for ELMo (1B Word Benchmark) and far smaller than GloVe's training data (840B tokens). The architecture mattered more than the data scale.

  • "The 'GPT' name always referred to a chatbot." GPT stands for "Generative Pre-Training." The original paper had nothing to do with chatbots or conversational AI. That association came much later with ChatGPT (November 2022), which was built on GPT-3.5.

Connections to Other Concepts

  • Built directly on the Transformer architecture from 01-attention-is-all-you-need.md (decoder-only variant)
  • Adopted the pre-train-then-fine-tune paradigm demonstrated by 06-ulmfit-and-transfer-learning.md and 05-elmo-and-contextual-embeddings.md
  • Competed with and was initially overshadowed by 03-bert.md
  • Scaled up to become 04-gpt-2.md, with zero-shot capabilities emerging at larger scale
  • The decoder-only vs encoder-only debate is analyzed in 07-encoder-vs-decoder-vs-encoder-decoder.md
  • For the technical details of autoregressive generation, see llm-concepts/autoregressive-generation.md
  • For tokenization details, see llm-concepts/embeddings-and-tokenization.md

Further Reading

  • Radford et al., "Improving Language Understanding by Generative Pre-Training" (2018, OpenAI) — the GPT-1 paper
  • Radford et al., "Language Models are Unsupervised Multitask Learners" (2019, OpenAI) — GPT-2, the direct sequel
  • Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (2018, arXiv:1810.04805) — the contemporaneous alternative approach
  • Wang et al., "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding" (2018, arXiv:1804.07461) — the benchmark ecosystem GPT-1 was evaluated on