One-Line Summary: OpenAI's GPT-4, released in March 2023, was the first multimodal frontier model to accept both text and image inputs, and it achieved a massive leap in reasoning, coding, and factuality that set a new ceiling for AI capabilities — while revealing almost nothing about how it was built.
Prerequisites: 01-gpt-3.md, 01-instructgpt-and-rlhf.md, 02-chatgpt.md
What Is GPT-4?
Imagine that a company builds the world's most powerful telescope. They show you breathtaking images of distant galaxies, prove it can see further than any instrument before it, and let you look through the eyepiece — but they refuse to tell you how the lenses were ground, what materials were used, or even how large the telescope is. That was GPT-4: a model whose outputs stunned the world, while its architecture, training data, and parameter count remained officially undisclosed. In the era of open research that had defined AI's growth, GPT-4 marked the beginning of secrecy at the frontier.
Released on March 14, 2023, GPT-4 was described in a 98-page "technical report" that was notable for how little technical detail it contained. OpenAI cited "the competitive landscape and the safety implications of large-scale models" as reasons for not disclosing architecture, model size, training data, or training methodology. What they did reveal was performance: GPT-4 passed the bar exam at the 90th percentile (up from the 10th percentile for GPT-3.5), scored 86.4% on MMLU (5-shot), achieved 67% on HumanEval (zero-shot, up from ChatGPT's ~48%), and demonstrated dramatic improvements in reasoning, multi-step problem solving, and factual accuracy.
The model arrived just four months after ChatGPT had electrified the world, and it immediately raised the ceiling for what AI could do. GPT-4 was not just incrementally better than GPT-3.5 — it was qualitatively different. It could analyze images, write sophisticated code, reason through complex legal and scientific problems, and maintain coherent reasoning across thousands of words. It became the benchmark that every subsequent model was measured against.
How It Works
GPT-4: The Capability Leap
Performance Comparison (GPT-3.5 vs GPT-4):
┌────────────────────────────────────────────────────────────┐
│ Benchmark             GPT-3.5      GPT-4       Δ           │
│ ─────────             ───────      ─────       ─           │
│ Bar Exam              10th %ile    90th %ile   ████████    │
│ MMLU (5-shot)         ~70%         86.4%       ████        │
│ HumanEval (0-shot)    ~48%         ~67%        ████        │
│ Safety (disallowed)   baseline     -82%        ████████    │
└────────────────────────────────────────────────────────────┘
Reported Architecture (Mixture of Experts):
┌────────────────────────────────────────────────────────┐
│ │
│ Total parameters: ~1.7 Trillion │
│ ┌────┐┌────┐┌────┐┌────┐┌────┐┌────┐...┌────┐ │
│ │ E1 ││ E2 ││ E3 ││ E4 ││ E5 ││ E6 │ │E16 │ │
│ │111B││111B││111B││111B││111B││111B│ │111B│ │
│ └──┬─┘└──┬─┘└──┬─┘└──┬─┘└──┬─┘└──┬─┘ └──┬─┘ │
│ └──────┴─────┘ └──────┴─────┘ │ │
│ 2 experts active per token (~220B active params) │
│ │
│ Input: Text + Images (multimodal) │
│ Output: Text only │
│ │
│ Predictive scaling: Performance forecast from │
│ 1/1000th compute proxy models │
└────────────────────────────────────────────────────────┘

Figure: GPT-4 achieved dramatic performance improvements over GPT-3.5 across benchmarks. Widely reported to use a Mixture-of-Experts architecture with ~1.7T total parameters but only ~220B active per token.
Architecture (What Is Known)
OpenAI disclosed almost nothing about GPT-4's architecture. The technical report stated only that it was a "Transformer-style model pre-trained to predict the next token in a document, using both publicly available data and data licensed from third-party providers." However, multiple credible reports (notably from George Hotz, SemiAnalysis, and various industry insiders) have indicated that GPT-4 is a Mixture-of-Experts (MoE) model with approximately 1.7 trillion total parameters, organized as 16 expert networks of roughly 111 billion parameters each, with 2 experts active per token. If accurate, this would mean GPT-4 has roughly 220B active parameters per token — larger than GPT-3 but far smaller than the 1.7T total might suggest.
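The routing idea behind such a design can be illustrated with a toy top-2 MoE layer. This is a sketch under the *reported* (unconfirmed) layout of 16 experts with 2 active per token; the dimensions are tiny and all names are ours, not OpenAI's:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 16, 2

# Each "expert" is a small feed-forward weight matrix; a learned router
# (gating matrix) scores experts per token.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x):
    """Route each token to its top-2 experts; mix outputs by softmax gate weight."""
    logits = x @ router                     # (tokens, n_experts)
    out = np.zeros_like(x)
    for i, token in enumerate(x):
        top = np.argsort(logits[i])[-top_k:]    # indices of the top-2 experts
        gates = np.exp(logits[i][top])
        gates /= gates.sum()                    # softmax over the selected experts only
        for g, e in zip(gates, top):
            out[i] += g * (token @ experts[e])  # only 2 of 16 experts run per token
    return out

tokens = rng.standard_normal((4, d_model))
y = moe_layer(tokens)
print(y.shape)  # (4, 64)
```

The key property is visible in the loop: every token pays the compute cost of only `top_k` experts, which is why total parameters (all 16 experts) can vastly exceed active parameters (2 experts' worth).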
Multimodal Input
GPT-4 was the first major frontier model to accept image inputs alongside text. Users could upload images and ask the model to describe, analyze, or reason about them. The image understanding was not a separate module bolted on — it was integrated into the model's core architecture. GPT-4 could read text in images, interpret charts and graphs, analyze medical images (with appropriate caveats), and solve visual reasoning puzzles. Image output generation was not included; the model could only produce text.
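Image input was demonstrated at launch but exposed in the public API only later. A request in the shape the API eventually used might look like the following; the model name and image URL here are illustrative, not a claim about the launch-day API:

```python
# Illustrative request body mixing text and image content parts, in the
# style of OpenAI's chat completions API for vision input.
payload = {
    "model": "gpt-4-vision-preview",  # illustrative model identifier
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this chart show?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/chart.png"},
                },
            ],
        }
    ],
}
print(payload["model"])
```

Note the asymmetry the section describes: the request can carry images, but the response is always text.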
Predictable Scaling
One of the most remarkable disclosures in the technical report was that OpenAI had developed infrastructure to predict GPT-4's performance from models trained with as little as 1/1,000th of the total compute. Using their scaling law framework, they could train small proxy models and accurately forecast the final model's loss and benchmark performance. This meant that OpenAI could make informed decisions about training runs costing tens of millions of dollars before committing the full resources — a significant de-risking of frontier model development.
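The core idea can be sketched as fitting a power law to proxy-run losses and extrapolating to full compute. OpenAI did not publish its fitting procedure, so this is a minimal sketch with made-up data points and an assumed (fixed) irreducible loss term:

```python
import numpy as np

# Hypothetical proxy runs: fraction of full training compute vs. observed loss.
compute = np.array([1e-6, 1e-5, 1e-4, 1e-3])
loss    = np.array([4.10, 3.45, 2.95, 2.55])

# Assume a power law L(C) = a * C^(-b) + c with irreducible loss c fixed.
# Then log(L - c) = log(a) - b * log(C), which is linear in log space.
c = 1.8
slope, intercept = np.polyfit(np.log(compute), np.log(loss - c), 1)
b, a = -slope, np.exp(intercept)

# Extrapolate to the full run (C = 1): C^(-b) = 1, so prediction is a + c.
pred_full = a * 1.0 ** (-b) + c
print(pred_full)
```

The practical payoff described in the report is exactly this extrapolation: a credible forecast of final loss before committing tens of millions of dollars of compute.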
Safety and Alignment
GPT-4 was extensively aligned using RLHF and underwent a months-long "red teaming" process where external experts attempted to elicit harmful, biased, or dangerous outputs. OpenAI reported that GPT-4 was 82% less likely to produce disallowed content than GPT-3.5 and 40% more likely to produce factual responses. The model also incorporated a system prompt mechanism that allowed deployers to set behavioral guidelines, and the RLHF training specifically targeted multi-turn conversation safety.
Why It Matters
The New Capability Frontier
GPT-4's performance represented a step change in what AI could do. Passing the bar exam at the 90th percentile was not a parlor trick — it required comprehension, multi-step reasoning, and the ability to apply legal principles to novel scenarios. Similarly, its performance on the GRE, SAT, AP exams, and medical licensing questions demonstrated capabilities that went beyond pattern matching into something that looked like genuine reasoning (whatever that means at a mechanistic level). GPT-4 made it impossible to dismiss LLMs as "just autocomplete."
The Secrecy Shift
GPT-4's technical report was a turning point in AI openness. GPT-1, GPT-2, and GPT-3 had all been accompanied by detailed papers disclosing architecture, training data, and methodology. GPT-4 disclosed none of these. OpenAI argued this was necessary for safety and competitive reasons, but the decision was widely criticized by the research community. It set a precedent: Google's Gemini technical report and Anthropic's Claude technical reports similarly withheld key details. The era of frontier models as open research was effectively over.
Enabling the Application Ecosystem
GPT-4's capabilities were strong enough to power a new generation of AI applications. Its improved instruction following and reliability made it suitable for production use cases: customer service, content generation, data analysis, code review, legal research, and more. The GPT-4 API, combined with function calling capabilities added shortly after launch, became the foundation for thousands of AI startups. Venture capital funding for AI applications surged, with many companies building entire products on GPT-4's API.
Key Technical Details
- Released: March 14, 2023 (API and ChatGPT Plus)
- Architecture: Not officially disclosed; widely reported as MoE with ~1.7T total params, 16 experts, ~111B each, 2 active per token
- Input: Text + images (multimodal input, text-only output)
- MMLU (5-shot): 86.4% (vs. GPT-3.5's ~70%)
- Bar exam: 90th percentile (vs. GPT-3.5's ~10th percentile)
- HumanEval (zero-shot): ~67% (vs. GPT-3.5's ~48%)
- Context window: 8K tokens (standard), 32K tokens (extended version)
- Safety improvement: 82% less likely to produce disallowed content vs. GPT-3.5
- Scaling prediction: Performance predicted from 1/1000th compute proxy models
- Training cost (estimated): $50-100M+
Common Misconceptions
- "GPT-4 is 10x bigger than GPT-3." If the MoE reports are accurate, GPT-4's total parameter count (~1.7T) is roughly 10x GPT-3 (175B), but its active parameters per token (~220B) are only about 1.3x. The MoE architecture means most parameters are dormant for any given token.
- "GPT-4 can see and generate images." GPT-4 can process image inputs and produce text about them. It cannot generate images. Image generation is handled by separate models like DALL-E.
- "GPT-4 is AGI." While GPT-4's capabilities are impressive, it still struggles with genuine multi-step reasoning, novel problem solving, reliable factuality, and tasks requiring real-world grounding. It is a powerful tool, not a general intelligence.
- "GPT-4's technical report is a research paper." It is explicitly not. It discloses results but not methodology, making it essentially a product announcement dressed in academic formatting. This was a deliberate choice by OpenAI.
- "Nobody knows how GPT-4 works." The general approach (Transformer, pre-training, RLHF) is well understood. What is undisclosed is the specific architecture variant, training data composition, and training methodology details. The broad strokes are known; the details are proprietary.
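The parameter arithmetic behind the first misconception checks out directly, again assuming the reported (unconfirmed) expert sizes:

```python
# Reported (unconfirmed) figures: 16 experts of ~111B parameters, 2 active.
expert_params = 111e9
n_experts, active_experts = 16, 2
gpt3_params = 175e9

total_params = n_experts * expert_params        # ~1.78e12, the "~1.7T" figure
active_params = active_experts * expert_params  # ~2.22e11, the "~220B" figure

print(total_params / gpt3_params)   # roughly 10x GPT-3 in total parameters
print(active_params / gpt3_params)  # but only roughly 1.3x in active parameters
```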
Connections to Other Concepts
- 01-gpt-3.md — GPT-4 is the spiritual successor, dramatically more capable
- 02-chatgpt.md — GPT-4 became available through ChatGPT Plus, transforming the product
- 01-instructgpt-and-rlhf.md — GPT-4 used extensive RLHF for alignment
- 08-the-ai-arms-race-begins.md — GPT-4 intensified the competitive race
- 08-gemini-1.md — Google's response to GPT-4, released 9 months later
- 02-kaplan-scaling-laws.md — GPT-4's predictive scaling infrastructure built on scaling law principles
- 06-emergent-abilities.md — GPT-4's capabilities reignited debates about emergence at scale
- 05-mixtral-8x7b.md — If GPT-4 is MoE, Mixtral was the first open-weight validation of the approach
Further Reading
- OpenAI, "GPT-4 Technical Report" (2023) — The official (intentionally sparse) technical report.
- Bubeck et al., "Sparks of Artificial General Intelligence: Early experiments with GPT-4" (2023) — Microsoft Research's extensive evaluation of GPT-4's capabilities.
- Nori et al., "Capabilities of GPT-4 on Medical Competency Examinations" (2023) — GPT-4's performance on medical benchmarks.
- OpenAI, "GPT-4 System Card" (2023) — Details on safety evaluations, red teaming, and risk mitigation.