One-Line Summary: OpenAI's GPT-4, released in March 2023, was the first multimodal frontier model to accept both text and image inputs, and it achieved a massive leap in reasoning, coding, and factuality that set a new ceiling for AI capabilities — while revealing almost nothing about how it was built.
Prerequisites: 01-gpt-3.md, 01-instructgpt-and-rlhf.md, 02-chatgpt.md
What Is GPT-4?
Imagine that a company builds the world's most powerful telescope. They show you breathtaking images of distant galaxies, prove it can see further than any instrument before it, and let you look through the eyepiece — but they refuse to tell you how the lenses were ground, what materials were used, or even how large the telescope is. That was GPT-4: a model whose outputs stunned the world, while its architecture, training data, and parameter count remained officially undisclosed. In the era of open research that had defined AI's growth, GPT-4 marked the beginning of secrecy at the frontier.
Released on March 14, 2023, GPT-4 was described in a 98-page "technical report" that was notable for how little technical detail it contained. OpenAI cited "the competitive landscape and the safety implications of large-scale models" as reasons for not disclosing architecture, model size, training data, or training methodology. What they did reveal was performance: GPT-4 passed the bar exam at the 90th percentile (up from the 10th percentile for GPT-3.5), scored 86.4% on MMLU (5-shot), achieved 67% on HumanEval (zero-shot, up from ChatGPT's ~48%), and demonstrated dramatic improvements in reasoning, multi-step problem solving, and factual accuracy.
The model arrived just four months after ChatGPT had electrified the world, and it immediately raised the ceiling for what AI could do. GPT-4 was not just incrementally better than GPT-3.5 — it was qualitatively different. It could analyze images, write sophisticated code, reason through complex legal and scientific problems, and maintain coherent reasoning across thousands of words. It became the benchmark that every subsequent model was measured against.
How It Works
GPT-4: The Capability Leap
Performance Comparison (GPT-3.5 vs GPT-4):
┌────────────────────────────────────────────────────────────┐
│ Benchmark             GPT-3.5      GPT-4       Δ           │
│ ─────────             ───────      ─────       ─           │
│ Bar Exam              10th %ile    90th %ile   ████████    │
│ MMLU (5-shot)         ~70%         86.4%       ████        │
│ HumanEval (0-shot)    ~48%         ~67%        ████        │
│ Safety (disallowed)   baseline     -82%        ████████    │
└────────────────────────────────────────────────────────────┘
Reported Architecture (Mixture of Experts):
┌────────────────────────────────────────────────────────┐
│ │
│ Total parameters: ~1.7 Trillion │
│ ┌────┐┌────┐┌────┐┌────┐┌────┐┌────┐...┌────┐ │
│ │ E1 ││ E2 ││ E3 ││ E4 ││ E5 ││ E6 │ │E16 │ │
│ │111B││111B││111B││111B││111B││111B│ │111B│ │
│ └──┬─┘└──┬─┘└──┬─┘└──┬─┘└──┬─┘└──┬─┘ └──┬─┘ │
│ └──────┴─────┘ └──────┴─────┘ │ │
│ 2 experts active per token (~220B active params) │
│ │
│ Input: Text + Images (multimodal) │
│ Output: Text only │
│ │
│ Predictive scaling: Performance forecast from │
│ 1/1000th compute proxy models │
└────────────────────────────────────────────────────────┘

Figure: GPT-4 achieved dramatic performance improvements over GPT-3.5 across benchmarks. Widely reported to use a Mixture-of-Experts architecture with ~1.7T total parameters but only ~220B active per token.
Architecture (What Is Known)
OpenAI disclosed almost nothing about GPT-4's architecture. The technical report stated only that it was a "Transformer-style model pre-trained to predict the next token in a document, using both publicly available data and data licensed from third-party providers." However, multiple credible reports (notably from George Hotz, SemiAnalysis, and various industry insiders) have indicated that GPT-4 is a Mixture-of-Experts (MoE) model with approximately 1.7 trillion total parameters, organized as 16 expert networks of roughly 111 billion parameters each, with 2 experts active per token. If accurate, this would mean GPT-4 has roughly 220B active parameters per token — larger than GPT-3 but far smaller than the 1.7T total might suggest.
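The routing idea behind such a design can be illustrated with a toy top-2 MoE layer. This is a sketch under the *reported* (unconfirmed) layout of 16 experts with 2 active per token; the dimensions are tiny and all names are ours, not OpenAI's:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 16, 2

# Each "expert" is a small feed-forward weight matrix; a learned router
# (gating matrix) scores experts per token.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x):
    """Route each token to its top-2 experts; mix outputs by softmax gate weight."""
    logits = x @ router                     # (tokens, n_experts)
    out = np.zeros_like(x)
    for i, token in enumerate(x):
        top = np.argsort(logits[i])[-top_k:]    # indices of the top-2 experts
        gates = np.exp(logits[i][top])
        gates /= gates.sum()                    # softmax over the selected experts only
        for g, e in zip(gates, top):
            out[i] += g * (token @ experts[e])  # only 2 of 16 experts run per token
    return out

tokens = rng.standard_normal((4, d_model))
y = moe_layer(tokens)
print(y.shape)  # (4, 64)
```

The key property is visible in the loop: every token pays the compute cost of only `top_k` experts, which is why total parameters (all 16 experts) can vastly exceed active parameters (2 experts' worth).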
Multimodal Input
GPT-4 was the first major frontier model to accept image inputs alongside text. Users could upload images and ask the model to describe, analyze, or reason about them. The image understanding was not a separate module bolted on — it was integrated into the model's core architecture. GPT-4 could read text in images, interpret charts and graphs, analyze medical images (with appropriate caveats), and solve visual reasoning puzzles. Image output generation was not included; the model could only produce text.
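Image input was demonstrated at launch but exposed in the public API only later. A request in the shape the API eventually used might look like the following; the model name and image URL here are illustrative, not a claim about the launch-day API:

```python
# Illustrative request body mixing text and image content parts, in the
# style of OpenAI's chat completions API for vision input.
payload = {
    "model": "gpt-4-vision-preview",  # illustrative model identifier
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this chart show?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/chart.png"},
                },
            ],
        }
    ],
}
print(payload["model"])
```

Note the asymmetry the section describes: the request can carry images, but the response is always text.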
Predictable Scaling
One of the most remarkable disclosures in the technical report was that OpenAI had developed infrastructure to predict GPT-4's performance from models trained with as little as 1/1,000th of the total compute. Using their scaling law framework, they could train small proxy models and accurately forecast the final model's loss and benchmark performance. This meant that OpenAI could make informed decisions about training runs costing tens of millions of dollars before committing the full resources — a significant de-risking of frontier model development.
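The core idea can be sketched as fitting a power law to proxy-run losses and extrapolating to full compute. OpenAI did not publish its fitting procedure, so this is a minimal sketch with made-up data points and an assumed (fixed) irreducible loss term:

```python
import numpy as np

# Hypothetical proxy runs: fraction of full training compute vs. observed loss.
compute = np.array([1e-6, 1e-5, 1e-4, 1e-3])
loss    = np.array([4.10, 3.45, 2.95, 2.55])

# Assume a power law L(C) = a * C^(-b) + c with irreducible loss c fixed.
# Then log(L - c) = log(a) - b * log(C), which is linear in log space.
c = 1.8
slope, intercept = np.polyfit(np.log(compute), np.log(loss - c), 1)
b, a = -slope, np.exp(intercept)

# Extrapolate to the full run (C = 1): C^(-b) = 1, so prediction is a + c.
pred_full = a * 1.0 ** (-b) + c
print(pred_full)
```

The practical payoff described in the report is exactly this extrapolation: a credible forecast of final loss before committing tens of millions of dollars of compute.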
Safety and Alignment
GPT-4 was extensively aligned using RLHF and underwent a months-long "red teaming" process where external experts attempted to elicit harmful, biased, or dangerous outputs. OpenAI reported that GPT-4 was 82% less likely to produce disallowed content than GPT-3.5 and 40% more likely to produce factual responses. The model also incorporated a system prompt mechanism that allowed deployers to set behavioral guidelines, and the RLHF training specifically targeted multi-turn conversation safety.
Why It Matters
The New Capability Frontier
GPT-4's performance represented a step change in what AI could do. Passing the bar exam at the 90th percentile was not a parlor trick — it required comprehension, multi-step reasoning, and the ability to apply legal principles to novel scenarios. Similarly, its performance on the GRE, SAT, AP exams, and medical licensing questions demonstrated capabilities that went beyond pattern matching into something that looked like genuine reasoning (whatever that means at a mechanistic level). GPT-4 made it impossible to dismiss LLMs as "just autocomplete."
The Secrecy Shift
GPT-4's technical report was a turning point in AI openness. GPT-1, GPT-2, and GPT-3 had all been accompanied by detailed papers disclosing architecture, training data, and methodology. GPT-4 disclosed none of these. OpenAI argued this was necessary for safety and competitive reasons, but the decision was widely criticized by the research community. It set a precedent: Google's Gemini technical report and Anthropic's Claude technical reports similarly withheld key details. The era of frontier models as open research was effectively over.
Enabling the Application Ecosystem
GPT-4's capabilities were strong enough to power a new generation of AI applications. Its improved instruction following and reliability made it suitable for production use cases: customer service, content generation, data analysis, code review, legal research, and more. The GPT-4 API, combined with function calling capabilities added shortly after launch, became the foundation for thousands of AI startups. Venture capital funding for AI applications surged, with many companies building entire products on GPT-4's API.
Key Technical Details
- Released: March 14, 2023 (API and ChatGPT Plus)
- Architecture: Not officially disclosed; widely reported as MoE with ~1.7T total params, 16 experts, ~111B each, 2 active per token
- Input: Text + images (multimodal input, text-only output)
- MMLU (5-shot): 86.4% (vs. GPT-3.5's ~70%)
- Bar exam: 90th percentile (vs. GPT-3.5's ~10th percentile)
- HumanEval (zero-shot): ~67% (vs. GPT-3.5's ~48%)
- Context window: 8K tokens (standard), 32K tokens (extended version)
- Safety improvement: 82% less likely to produce disallowed content vs. GPT-3.5
- Scaling prediction: Performance predicted from 1/1000th compute proxy models
- Training cost (estimated): $50-100M+
Common Misconceptions
- "GPT-4 is 10x bigger than GPT-3." If the MoE reports are accurate, GPT-4's total parameter count (~1.7T) is roughly 10x GPT-3 (175B), but its active parameters per token (~220B) are only about 1.3x. The MoE architecture means most parameters are dormant for any given token.
- "GPT-4 can see and generate images." GPT-4 can process image inputs and produce text about them. It cannot generate images. Image generation is handled by separate models like DALL-E.
- "GPT-4 is AGI." While GPT-4's capabilities are impressive, it still struggles with genuine multi-step reasoning, novel problem solving, reliable factuality, and tasks requiring real-world grounding. It is a powerful tool, not a general intelligence.
- "GPT-4's technical report is a research paper." It is explicitly not. It discloses results but not methodology, making it essentially a product announcement dressed in academic formatting. This was a deliberate choice by OpenAI.
- "Nobody knows how GPT-4 works." The general approach (Transformer, pre-training, RLHF) is well understood. What is undisclosed is the specific architecture variant, training data composition, and training methodology details. The broad strokes are known; the details are proprietary.
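The parameter arithmetic behind the first misconception checks out directly, again assuming the reported (unconfirmed) expert sizes:

```python
# Reported (unconfirmed) figures: 16 experts of ~111B parameters, 2 active.
expert_params = 111e9
n_experts, active_experts = 16, 2
gpt3_params = 175e9

total_params = n_experts * expert_params        # ~1.78e12, the "~1.7T" figure
active_params = active_experts * expert_params  # ~2.22e11, the "~220B" figure

print(total_params / gpt3_params)   # roughly 10x GPT-3 in total parameters
print(active_params / gpt3_params)  # but only roughly 1.3x in active parameters
```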
Connections to Other Concepts
- 01-gpt-3.md — GPT-4 is the spiritual successor, dramatically more capable
- 02-chatgpt.md — GPT-4 became available through ChatGPT Plus, transforming the product
- 01-instructgpt-and-rlhf.md — GPT-4 used extensive RLHF for alignment
- 08-the-ai-arms-race-begins.md — GPT-4 intensified the competitive race
- 08-gemini-1.md — Google's response to GPT-4, released 9 months later
- 02-kaplan-scaling-laws.md — GPT-4's predictive scaling infrastructure built on scaling law principles
- 06-emergent-abilities.md — GPT-4's capabilities reignited debates about emergence at scale
- 05-mixtral-8x7b.md — If GPT-4 is MoE, Mixtral was the first open-weight validation of the approach
Further Reading
- OpenAI, "GPT-4 Technical Report" (2023) — The official (intentionally sparse) technical report.
- Bubeck et al., "Sparks of Artificial General Intelligence: Early experiments with GPT-4" (2023) — Microsoft Research's extensive evaluation of GPT-4's capabilities.
- Nori et al., "Capabilities of GPT-4 on Medical Competency Examinations" (2023) — GPT-4's performance on medical benchmarks.
- OpenAI, "GPT-4 System Card" (2023) — Details on safety evaluations, red teaming, and risk mitigation.