One-Line Summary: GPT-1 (Radford et al., 2018) combined a decoder-only Transformer with unsupervised generative pre-training followed by supervised fine-tuning, establishing the paradigm that decoder-only models trained on next-token prediction could develop broad language understanding.

Prerequisites: 01-attention-is-all-you-need.md, 06-ulmfit-and-transfer-learning.md

What Is GPT-1?

Imagine a student who spends months reading thousands of books — fiction, textbooks, newspapers, manuals — with no teacher, no tests, no guidance. Then, when given a specific exam (sentiment analysis, question answering, textual entailment), the student only needs a brief study session to excel, because they've already absorbed the patterns and structures of language itself. GPT-1 was this student: pre-trained to predict the next word on a vast book corpus, then lightly fine-tuned for specific tasks.

In June 2018, Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever at OpenAI published "Improving Language Understanding by Generative Pre-Training." The timing was significant: 05-elmo-and-contextual-embeddings.md (February 2018) and 06-ulmfit-and-transfer-learning.md (January 2018) had just demonstrated that pre-trained representations dramatically improve downstream NLP tasks. But both used LSTM architectures. GPT-1 made two pivotal choices: use a Transformer decoder rather than an LSTM, and pre-train with a causal (left-to-right) language modeling objective rather than with bidirectional context.

These choices — decoder-only Transformer, causal language modeling, pre-train then fine-tune — defined what would become the dominant paradigm in AI. While 03-bert.md (published four months later) initially attracted more attention with its bidirectional approach, GPT-1's decoder-only architecture proved to be the one that scaled to GPT-2, GPT-3, GPT-4, and the entire family of modern LLMs.

How It Works

  GPT-1: Decoder-Only Transformer + Transfer Learning
 
  ┌──────────────────────────────────────────────────────┐
  │  STAGE 1: Unsupervised Pre-training                  │
  │                                                      │
  │  BookCorpus (~800M words)                            │
  │       │                                              │
  │       ▼                                              │
  │  ┌──────────────────────────────────┐                │
  │  │  12-layer Transformer Decoder    │                │
  │  │  (causal masking: left-to-right) │                │
  │  │                                  │                │
  │  │  Input:  The cat sat on          │                │
  │  │  Target:     cat sat on the      │                │
  │  │  (predict next token)            │                │
  │  └──────────────────────────────────┘                │
  │                                                      │
  │  STAGE 2: Supervised Fine-tuning                     │
  │                                                      │
  │  ┌────────────────────────────────────────────────┐  │
  │  │  Same Transformer + Linear Head                │  │
  │  │                                                │  │
  │  │  Classification: [text] [Extract] ──▶ label    │  │
  │  │  Entailment:  [premise] [SEP] [hypothesis] ──▶ │  │
  │  │  Similarity:  [A] [SEP] [B]  +  [B] [SEP] [A]  │  │
  │  └────────────────────────────────────────────────┘  │
  └──────────────────────────────────────────────────────┘

Figure: GPT-1's two-stage pipeline. Pre-training learns general language knowledge from unlabeled text via next-token prediction; fine-tuning adapts this knowledge to specific tasks with a single linear layer.

Architecture: Decoder-Only Transformer

GPT-1 used a 12-layer Transformer decoder with 12 attention heads and an embedding dimension of 768, totaling approximately 117 million parameters. Unlike the original Transformer (01-attention-is-all-you-need.md), GPT-1 dropped the encoder entirely. There was no cross-attention — only masked self-attention within the decoder. Each position could only attend to positions to its left (and itself), enforcing a causal structure where the model predicts each token based solely on the preceding tokens.
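
The causal mask is the defining mechanical difference from an encoder. The sketch below shows one common way to implement masked self-attention (PyTorch-style, for illustration only; the weight names W_qkv and W_out are ours, not from the original code):

  import torch
  import torch.nn.functional as F

  def causal_self_attention(x, W_qkv, W_out, n_heads=12):
      """Masked self-attention: each position attends only to itself and positions to its left."""
      B, T, d = x.shape                                  # batch, sequence length, model dim (768 in GPT-1)
      q, k, v = (x @ W_qkv).split(d, dim=-1)             # project once, then split into Q, K, V
      q = q.view(B, T, n_heads, d // n_heads).transpose(1, 2)
      k = k.view(B, T, n_heads, d // n_heads).transpose(1, 2)
      v = v.view(B, T, n_heads, d // n_heads).transpose(1, 2)
      scores = q @ k.transpose(-2, -1) / (d // n_heads) ** 0.5
      mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # True above the diagonal = "future"
      scores = scores.masked_fill(mask, float("-inf"))   # future positions get zero attention weight
      attn = F.softmax(scores, dim=-1)
      out = (attn @ v).transpose(1, 2).reshape(B, T, d)  # re-merge the heads
      return out @ W_out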

The model used learned positional embeddings (not sinusoidal), GELU activation functions (instead of ReLU), and a byte-pair encoding (BPE) tokenizer with a vocabulary of approximately 40,000 tokens. The context window was 512 tokens.
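
Those sizes are enough to roughly reproduce the parameter count. A back-of-the-envelope sketch, assuming a round 40,000-token vocabulary and ignoring biases, layer norms, and special tokens:

  d_model, n_layers, d_ff = 768, 12, 3072        # GPT-1 sizes; the feed-forward dim is 4 * d_model
  vocab, n_positions = 40_000, 512               # approximate BPE vocabulary and context window

  token_emb = vocab * d_model                    # ~30.7M (also reused as the output projection)
  pos_emb = n_positions * d_model                # ~0.4M, learned rather than sinusoidal
  attn = 4 * d_model * d_model                   # Q, K, V and output projections per layer
  mlp = 2 * d_model * d_ff                       # two feed-forward matrices per layer
  total = token_emb + pos_emb + n_layers * (attn + mlp)
  print(f"{total / 1e6:.0f}M parameters")        # ~116M, close to the reported ~117M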

Stage 1: Unsupervised Pre-training

GPT-1 was pre-trained on BookCorpus — approximately 7,000 unpublished books (~800 million words, ~5GB of text) — using a standard causal language modeling objective: maximize P(token_t | token_1, ..., token_{t-1}). The model learned to predict the next word given all previous words in the sequence. This objective forced the model to learn syntax, semantics, world knowledge, and reasoning patterns — all from raw text with no labels.
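
In code, this objective is ordinary cross-entropy over targets shifted by one position. A minimal sketch (PyTorch-style; model is assumed to map token IDs to next-token logits):

  import torch
  import torch.nn.functional as F

  def causal_lm_loss(model, tokens):
      """Next-token prediction: maximize sum_t log P(token_t | token_1, ..., token_{t-1})."""
      inputs = tokens[:, :-1]                    # e.g. "The cat sat on"
      targets = tokens[:, 1:]                    # e.g. "cat sat on the" (shifted by one)
      logits = model(inputs)                     # (batch, seq_len - 1, vocab_size)
      return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))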

The choice of BookCorpus was deliberate: it contained long, coherent passages of text (unlike web crawl data), which helped the model learn long-range dependencies. Training ran for 100 epochs using the Adam optimizer with a peak learning rate of 2.5e-4 (warmed up, then annealed with a cosine schedule) and minibatches of 64 contiguous 512-token sequences.
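
A minimal training-loop sketch with those settings, reusing causal_lm_loss from above (model and dataloader are placeholders; the warmup/cosine schedule is omitted):

  import torch

  optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)   # peak learning rate from the paper

  for epoch in range(100):                       # 100 passes over BookCorpus
      for batch in dataloader:                   # minibatches of 64 contiguous 512-token sequences
          loss = causal_lm_loss(model, batch)
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()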

Stage 2: Supervised Fine-tuning

For fine-tuning, GPT-1 added a single linear layer on top of the final Transformer block's output. The innovation was in how different tasks were formulated to fit the decoder architecture (each format is sketched in code after the list):

  • Classification (e.g., sentiment): Append a special extract token to the end of the text; use its final representation for prediction.
  • Entailment: Concatenate premise and hypothesis with a delimiter token; classify the final representation.
  • Similarity: Run both orderings (A-B and B-A) and add the representations.
  • Multiple choice: Concatenate question with each answer; score each independently.
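
A rough sketch of these input transformations, with hypothetical special-token IDs standing in for the learned start, delimiter, and extract tokens (helper names are ours; text segments are lists of token IDs):

  START, DELIM, EXTRACT = 40001, 40002, 40003    # hypothetical IDs for the learned special tokens

  def single_input(text):
      # Classification: [start] text [extract]; the extract token's final state feeds the linear head
      return [START] + text + [EXTRACT]

  def pair_input(a, b):
      # Entailment: two segments joined by a delimiter token, classified from the extract position
      return [START] + a + [DELIM] + b + [EXTRACT]

  def similarity_inputs(a, b):
      # Both orderings are run; their final representations are added before the head
      return [pair_input(a, b), pair_input(b, a)]

  def multiple_choice_inputs(question, answers):
      # One sequence per candidate answer, each scored independently by the same head
      return [pair_input(question, ans) for ans in answers]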

During fine-tuning, the language modeling objective was retained as an auxiliary loss (weighted by 0.5), which improved generalization and accelerated convergence. This was a technique borrowed from multi-task learning.
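
In pseudocode, the fine-tuning objective is just a weighted sum (task_loss here is a hypothetical stand-in for cross-entropy over the linear head's output):

  def finetune_loss(model, batch, labels, lm_weight=0.5):
      # supervised task loss plus the pre-training LM objective as an auxiliary term
      return task_loss(model, batch, labels) + lm_weight * causal_lm_loss(model, batch)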

Results

GPT-1 achieved state-of-the-art results on 9 out of 12 evaluated tasks, including commonsense reasoning (Story Cloze: 86.5%), question answering (RACE: 59.0%), and textual entailment (MNLI: 82.1%). On some tasks it improved on the previous state of the art by nearly 9 percentage points. The model showed strong performance even with minimal task-specific data, demonstrating the effectiveness of transfer learning.

Why It Matters

Establishing the GPT Paradigm

GPT-1 defined a formula: decoder-only Transformer + causal language model pre-training + task-specific fine-tuning. This formula, scaled up with more parameters and more data, would produce GPT-2 (04-gpt-2.md), GPT-3, and eventually GPT-4. The decoder-only choice was initially seen as a limitation compared to BERT's bidirectional approach, but it proved critical for generation capabilities and for the scaling properties that emerged in larger models.

Generative Pre-training as Knowledge Acquisition

The paper's key insight was that causal language modeling — simply predicting the next word — is a rich enough objective to learn general language understanding. You don't need labeled data or supervised objectives to learn syntax, semantics, or reasoning. The next-word prediction task is so demanding that a model capable of doing it well must internalize deep knowledge about language and the world. This insight drove the entire scaling era.

The GPT vs BERT Fork

GPT-1 (June 2018) and BERT (October 2018) represented two divergent bets on the future of NLP. GPT bet on unidirectional, generative pre-training with a decoder-only architecture. BERT bet on bidirectional, masked language modeling with an encoder-only architecture. BERT initially won the benchmarks, and "BERT-ification" swept NLP in 2019. But GPT's approach proved to scale better and to be more versatile — capable of both understanding and generation. The eventual triumph of decoder-only architectures is analyzed in 07-encoder-vs-decoder-vs-encoder-decoder.md.

Key Technical Details

  • Paper: Radford et al., "Improving Language Understanding by Generative Pre-Training" (Jun 2018, OpenAI technical report)
  • Architecture: 12-layer decoder-only Transformer, 12 heads, d_model=768, ~117M parameters
  • Pre-training data: BookCorpus (~7,000 books, ~800M words, ~5GB)
  • Tokenization: BPE with ~40,000 vocabulary
  • Context window: 512 tokens
  • Pre-training objective: Causal language modeling (next-token prediction)
  • Fine-tuning: Task-specific linear head + auxiliary LM loss (weight 0.5)
  • Results: SOTA on 9/12 tasks; +8.9% on Story Cloze, +5.7% on RACE
  • Training: Single run on 8 GPUs; took approximately 1 month
  • Key comparison: Outperformed ELMo-based approaches on most tasks despite similar scale

Common Misconceptions

  • "GPT-1 was primarily a text generation model." GPT-1 was evaluated and promoted primarily as a language understanding model — note the subtitle "Improving Language Understanding." Text generation was not the focus. The generative capabilities that would make GPT-2 famous were latent in GPT-1 but not emphasized.

  • "GPT-1 was immediately recognized as more important than BERT." The opposite. BERT dominated the NLP landscape from late 2018 through 2020. GPT-1 was seen as a competent but less powerful approach because it couldn't use bidirectional context. The vindication of decoder-only models came later, with GPT-3's in-context learning.

  • "GPT-1 used massive amounts of training data." BookCorpus (~800M words) was modest by later standards. GPT-2 used 40GB of web text; GPT-3 used 570GB. GPT-1's data was smaller than what was used for ELMo (1B Word Benchmark) and far smaller than GloVe's training data (840B tokens). The architecture mattered more than the data scale.

  • "The 'GPT' name always referred to a chatbot." GPT stands for "Generative Pre-Training." The original paper had nothing to do with chatbots or conversational AI. That association came much later with ChatGPT (November 2022), which was built on GPT-3.5.

Connections to Other Concepts

  • Built directly on the Transformer architecture from 01-attention-is-all-you-need.md (decoder-only variant)
  • Adopted the pre-train-then-fine-tune paradigm demonstrated by 06-ulmfit-and-transfer-learning.md and 05-elmo-and-contextual-embeddings.md
  • Competed with and was initially overshadowed by 03-bert.md
  • Scaled up to become 04-gpt-2.md, with zero-shot capabilities emerging at larger scale
  • The decoder-only vs encoder-only debate is analyzed in 07-encoder-vs-decoder-vs-encoder-decoder.md
  • For the technical details of autoregressive generation, see llm-concepts/autoregressive-generation.md
  • For tokenization details, see llm-concepts/embeddings-and-tokenization.md

Further Reading

  • Radford et al., "Improving Language Understanding by Generative Pre-Training" (2018, OpenAI) — the GPT-1 paper
  • Radford et al., "Language Models are Unsupervised Multitask Learners" (2019, OpenAI) — GPT-2, the direct sequel
  • Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (2018, arXiv:1810.04805) — the contemporaneous alternative approach
  • Wang et al., "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding" (2018, arXiv:1804.07461) — the benchmark ecosystem GPT-1 was evaluated on