One-Line Summary: OpenAI's GPT-5 (August 2025) unified traditional language modeling, chain-of-thought reasoning, and native tool use into a single architecture, converging the separate GPT and o-series product lines into one model — then GPT-5.2 (December 2025) pushed the frontier further with three model variants and near-saturating benchmark scores, followed by GPT-5.2-Codex (January 2026) for agentic coding.

Prerequisites: 03-gpt-4o.md, 02-the-o-series-evolution.md

What Is GPT-5?

Imagine a skilled professional who doesn't need to be told whether to think quickly or carefully — they automatically calibrate their effort to the difficulty of the problem. Ask them what time it is, and they glance at a clock. Ask them to debug a distributed systems failure, and they pull out a whiteboard, diagram the architecture, and reason step by step. GPT-5 works this way: an internal router assesses query complexity and selects the appropriate "thinking mode," from fast pattern-matching for simple requests to extended chain-of-thought reasoning for hard problems.

GPT-5 represents the resolution of a product tension that had been building inside OpenAI since late 2024. The GPT line (GPT-4, GPT-4o) optimized for fast, fluent, multimodal responses. The o-series (o1, o3) optimized for deep reasoning through explicit chain-of-thought at inference time. Users had to choose between them — fast and fluid, or slow and thoughtful. GPT-5 eliminated that choice. Developed under the internal codename "Orion," the project's core ambition was convergence: one model that could do everything both lines could do, and route between modes seamlessly.

The release in August 2025 arrived into a fiercely competitive landscape. Claude Opus 4 had set new agentic benchmarks in May. Gemini 2.5 Pro had topped Chatbot Arena in coding and math categories. DeepSeek and Qwen had demonstrated that open models could match closed frontier performance on specific tasks. GPT-5 needed to be not just good but category-defining to justify OpenAI's position as the industry leader and its multi-billion-dollar valuation.

How It Works

GPT-5 Unified Architecture with Internal Complexity Router:

                         ┌──────────────┐
                         │  User Query  │
                         └──────┬───────┘
                                │
                    ┌───────────▼───────────┐
                    │   Complexity Router   │
                    │ (Learned, Continuous) │
                    └───────────┬───────────┘
                                │
          ┌─────────────────────┼─────────────────────┐
          │                     │                     │
   ┌──────▼──────┐     ┌────────▼────────┐    ┌───────▼──────┐
   │  Fast Mode  │     │  Medium Mode    │    │  Deep Think  │
   │  (< 1 sec)  │     │  (2-10 sec)     │    │  (10-30 sec) │
   │  Simple Q&A │     │  Analysis       │    │  Math Proofs │
   │  Factual    │     │  Coding Tasks   │    │  Complex Code│
   └──────┬──────┘     └────────┬────────┘    └───────┬──────┘
          │                     │                     │
          └─────────────────────┼─────────────────────┘
                                │
                     ┌──────────▼──────────┐
                     │  Unified Output     │
                     │  + Native Tool Use  │
                     │  + Multimodal Gen   │
                     └─────────────────────┘

The Unified Architecture

GPT-5's defining architectural innovation is its internal complexity router. Rather than maintaining separate model weights for "fast" and "reasoning" modes, GPT-5 uses a single set of weights with a learned routing mechanism that dynamically allocates inference compute. Simple factual queries receive minimal chain-of-thought processing. Complex mathematical, coding, or analytical problems trigger extended internal reasoning — functionally equivalent to what the o-series models did, but without requiring the user to select a different model.

This routing is not binary. The model operates on a spectrum of reasoning depth, allocating more internal "thinking" steps as problem complexity increases. The mechanism integrates the test-time compute scaling insights from the o-series: more inference compute on harder problems yields better results, but applying it uniformly wastes resources on easy queries.

The practical result is that GPT-5's latency varies by query type. A simple factual question might return in under a second. A complex mathematical proof or multi-step coding task might take 10-30 seconds as the model engages in extended reasoning. This variable latency is the visible manifestation of the internal routing — the model is genuinely spending more compute on harder problems.
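The continuous routing idea can be sketched in a few lines. GPT-5's real router is a learned component inside the network whose signals are opaque; the stand-in below uses crude surface heuristics purely to illustrate the shape of the mechanism, a continuous mapping from estimated difficulty to an inference-compute budget. Every threshold and marker here is invented for illustration.

```python
# Toy stand-in for a learned complexity router. Real routing is learned
# end-to-end; these heuristics only illustrate the continuous mapping
# from query difficulty to a "thinking" budget.

def estimate_complexity(query: str) -> float:
    """Return a difficulty score in [0, 1] (heuristic stand-in)."""
    score = 0.0
    hard_markers = ("prove", "debug", "refactor", "optimize", "derive")
    score += 0.4 * any(m in query.lower() for m in hard_markers)
    score += min(len(query) / 1000, 0.4)      # longer prompts skew harder
    score += 0.2 * (query.count("\n") > 3)    # multi-part prompts skew harder
    return min(score, 1.0)

def reasoning_budget(query: str, max_thinking_tokens: int = 30_000) -> int:
    """Map difficulty onto a continuous thinking-token budget."""
    return int(estimate_complexity(query) * max_thinking_tokens)

print(reasoning_budget("What time is it in Tokyo?"))
print(reasoning_budget("Prove that the algorithm terminates, then debug it."))
```

The point of the sketch is the output type: not a binary fast/slow decision but a scalar budget, which is why latency varies smoothly with difficulty.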

Native Multimodality

GPT-5 processes and generates text, images, audio, and video natively within a single architecture, building on the multimodal foundations of GPT-4o. Image understanding is integrated at the token level rather than through external encoders. Audio input and output support real-time conversation with natural turn-taking. Video understanding enables analysis of temporal sequences — a capability absent from GPT-4o.

Tool Use and Agent Integration

GPT-5 was trained from the outset to integrate with OpenAI's tool-use and agent infrastructure. Function calling, code execution, web browsing, and file manipulation are not bolted-on capabilities but core competencies baked into the model's weights. This reflects the broader industry shift toward agent-native models: systems that act in the world rather than merely generating text about it. GPT-5's tool use was deeply integrated with the OpenAI platform, including the Assistants API, Code Interpreter, and custom function definitions.
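The client side of function calling looks roughly like the loop below. The tool schema follows the OpenAI function-calling convention; the model's response is faked here so the example runs offline — in real use the `tool_calls` would come back from the API, and `get_weather` is a hypothetical tool invented for illustration.

```python
# Sketch of the client-side tool-use loop for a function-calling model.
# The model response is stubbed so this runs without an API key.
import json

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city: str) -> str:
    return f"18C and clear in {city}"   # stubbed tool implementation

def dispatch(tool_call: dict) -> str:
    """Route a model-issued tool call to the matching local function."""
    registry = {"get_weather": get_weather}
    fn = registry[tool_call["name"]]
    return fn(**json.loads(tool_call["arguments"]))

# Faked model output: the model chose a tool and emitted JSON arguments.
fake_call = {"name": "get_weather", "arguments": '{"city": "Oslo"}'}
print(dispatch(fake_call))
```

The result string would normally be appended to the conversation as a tool message, letting the model incorporate it into its final answer.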

Training and Scale

OpenAI disclosed limited details about GPT-5's training, consistent with its increasingly closed approach to technical documentation. The model is widely believed to use a Mixture of Experts architecture — continuing the approach rumored for GPT-4 — though OpenAI has not confirmed specific parameter counts. Training incorporated massive-scale RLHF and reinforcement learning on reasoning tasks, combining the alignment techniques from InstructGPT with the reasoning training pioneered in the o-series.

The training pipeline likely involved multiple stages: large-scale pre-training on text and multimodal data, followed by supervised fine-tuning on instruction-following demonstrations, then reinforcement learning from human feedback for alignment, and finally reinforcement learning on reasoning and tool-use trajectories for agentic capability. The reasoning RL stage — where the model learns to allocate thinking depth appropriately — was the novel contribution, building on the techniques developed for o1 and o3 but integrated into a single end-to-end training process rather than a separate model.
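The staged pipeline described above can be laid out as data. The stage names follow the text; the ordering is the conventional post-GPT-3 recipe, and all field names here are illustrative, not OpenAI's internal terminology.

```python
# The presumed multi-stage training pipeline, expressed as a simple
# ordered structure. Stage names mirror the prose; details are assumed.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    objective: str

PIPELINE = [
    Stage("pretraining", "next-token prediction on text and multimodal data"),
    Stage("sft", "supervised fine-tuning on instruction demonstrations"),
    Stage("rlhf", "reinforcement learning from human feedback for alignment"),
    Stage("reasoning_rl", "RL on reasoning and tool-use trajectories"),
]

for stage in PIPELINE:
    print(f"{stage.name}: {stage.objective}")
```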

Pricing and Accessibility

OpenAI positioned GPT-5 to be accessible despite its increased capabilities. The tiered pricing structure allowed lighter usage patterns (simple queries that trigger minimal reasoning) to cost less than heavy usage patterns (complex problems that trigger extended chain-of-thought). This dynamic pricing — where cost correlates with compute actually used — represented a more nuanced approach than the flat per-token pricing of earlier models. It incentivized efficient use while making the model's full reasoning power available when needed.
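A back-of-envelope model makes the compute-correlated pricing concrete: if reasoning ("thinking") tokens are billed like output tokens, a query that triggers deep reasoning costs far more than one with identical input and output that does not. The rates below are invented for illustration, not OpenAI's actual prices.

```python
# Illustrative compute-correlated pricing model. Rates are hypothetical.
RATE_IN = 1.25 / 1_000_000    # $/input token (assumed)
RATE_OUT = 10.00 / 1_000_000  # $/output or thinking token (assumed)

def query_cost(input_tokens: int, output_tokens: int,
               thinking_tokens: int) -> float:
    """Cost scales with compute actually used, including hidden reasoning."""
    return input_tokens * RATE_IN + (output_tokens + thinking_tokens) * RATE_OUT

simple = query_cost(50, 100, 0)        # fast-mode factual answer
hard = query_cost(50, 100, 20_000)     # same visible I/O, deep reasoning
print(f"simple: ${simple:.4f}  hard: ${hard:.4f}")
```

Under these assumed rates the two queries look identical to the user (same question length, same answer length) yet differ in cost by orders of magnitude, which is exactly the incentive structure the paragraph describes.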

GPT-5.2 and Codex (December 2025 - January 2026)

Four months after GPT-5's release, OpenAI shipped GPT-5.2 on December 11, 2025 — a significant iteration that expanded the unified architecture into three distinct model variants: Instant (optimized for low-latency responses), Thinking (the standard reasoning mode), and Pro (maximum reasoning depth for the hardest problems). This tiered variant structure preserved GPT-5's core insight — one architecture with adaptive reasoning — while giving developers and users explicit control over the speed-capability tradeoff when needed.
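From a developer's perspective, the three variants amount to an explicit knob on the speed-capability tradeoff. A minimal sketch of how a request might select a variant follows; the model identifiers ("gpt-5.2-instant" and the like) are assumptions for illustration, not confirmed API names.

```python
# Sketch of explicit variant selection. Model identifiers are assumed.
VARIANTS = {
    "instant": "gpt-5.2-instant",    # low latency, minimal reasoning
    "thinking": "gpt-5.2-thinking",  # standard adaptive reasoning
    "pro": "gpt-5.2-pro",            # maximum reasoning depth
}

def build_request(prompt: str, variant: str = "thinking") -> dict:
    """Assemble a chat-style request body for the chosen variant."""
    return {
        "model": VARIANTS[variant],
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Plan a migration from REST to gRPC.", variant="pro")
print(req["model"])
```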

GPT-5.2 shipped with a 400K context window and 128K output limit, roughly doubling the usable context of GPT-5 and enabling processing of entire codebases, long legal documents, and book-length texts in a single pass. The knowledge cutoff was August 2025, meaning the model was trained on data that included the competitive dynamics of GPT-5's own launch period.
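Even a 400K-token window has to be respected client-side. A rough pre-flight check can use the common heuristic of about four characters per token; a real implementation would count with an actual tokenizer rather than this approximation.

```python
# Pre-flight context check against an assumed 400K-token input limit,
# using the rough ~4 chars/token heuristic (a tokenizer is more accurate).
INPUT_LIMIT = 400_000  # tokens

def fits_in_context(text: str, chars_per_token: float = 4.0) -> bool:
    """Estimate whether a document fits in the input window."""
    return len(text) / chars_per_token <= INPUT_LIMIT

print(fits_in_context("x" * 1_000_000))   # ~250K tokens: fits
print(fits_in_context("x" * 2_000_000))   # ~500K tokens: does not
```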

The benchmark results were striking. GPT-5.2 scored 93.2% on GPQA Diamond and a perfect 100% on AIME 2025, effectively saturating two benchmarks that had served as meaningful differentiators just months earlier. On SWE-bench Verified, it reached 80%, surpassing GPT-5's already-strong 74.9%. GPT-5.2 Pro became the first model in the GPT family to exceed 90% on ARC-AGI-1 (Verified) and achieved 40.3% on FrontierMath — a benchmark specifically designed to remain challenging for frontier models. These scores signaled that several established benchmarks were approaching their useful ceiling as discriminators of model capability.

On January 14, 2026, OpenAI released GPT-5.2-Codex, a specialized variant optimized for agentic coding workflows. Unlike GPT-5.2's general-purpose design, Codex was tuned for sustained, multi-step software engineering tasks: large refactors, codebase migrations, feature implementations, and security audits. It scored 56.4% on SWE-Bench Pro and 64.0% on Terminal-Bench 2.0, both benchmarks designed to test real-world agentic coding rather than isolated problem-solving. Codex also achieved approximately 70.9% on GDPval, a benchmark measuring performance on standardized office tasks — outperforming human professionals on the same evaluation, underscoring the model's capabilities beyond pure code generation.

GPT-5.3-Codex (February 2026)

Just three weeks after GPT-5.2-Codex, OpenAI shipped GPT-5.3-Codex on February 5, 2026. The release marked a notable milestone: it was the first model instrumental in its own creation, with the Codex team using early versions to debug its training, manage its deployment, and diagnose test and evaluation results.

GPT-5.3-Codex advanced both the frontier coding performance of GPT-5.2-Codex and the reasoning and professional knowledge capabilities of GPT-5.2, unified in a single model that was also 25% faster. It set new industry highs on Terminal-Bench 2.0 (77.3%, up from 64.0%) and OSWorld-Verified (64.7%, up from 38.2%), with a more modest gain on SWE-Bench Pro (56.8%, up from 56.4%). The Terminal-Bench and OSWorld jumps were particularly significant — they showed that the model had become dramatically better at navigating real computer environments and executing multi-step terminal workflows, not just writing code.

GPT-5.3-Codex was also the first model classified as "High capability" in the Cybersecurity domain under OpenAI's Preparedness Framework, triggering comprehensive safeguards including safety training, monitoring, Trusted Access restrictions, and threat intelligence. This marked a new chapter in the dual-use tension inherent in capable coding models: the same capabilities that make a model excellent at finding and fixing vulnerabilities also make it capable of discovering and exploiting them.

Why It Matters

The End of the "Which Model?" Question

Before GPT-5, OpenAI's customers faced a fragmented product landscape: GPT-4o for speed and multimodality, o1/o3 for reasoning, different pricing for each. GPT-5's unified architecture simplified this to a single model that self-selects the appropriate reasoning depth. This product simplification was as significant as the technical achievement — it made advanced AI more accessible by removing a confusing choice.

The Frontier Benchmark Race

GPT-5 achieved 74.9% on SWE-bench Verified at its August release, the highest score at that time. This put it ahead of Gemini 2.5 Pro and competitive with Claude Opus 4 on coding tasks. On reasoning benchmarks including GPQA Diamond and AIME, GPT-5's integrated reasoning matched or exceeded standalone o-series performance, validating the unified approach.

GPT-5.2 escalated the benchmark race further. Its 93.2% GPQA Diamond, 100% AIME 2025, and 80% SWE-bench Verified scores pushed several established benchmarks toward saturation. GPT-5.2 Pro's 40.3% on FrontierMath and first-ever above-90% ARC-AGI-1 (Verified) result demonstrated that even benchmarks designed to remain challenging were yielding to rapid capability gains. GPT-5.2-Codex's 56.4% SWE-Bench Pro score showed that agentic coding — multi-step, real-world software engineering — was becoming the new meaningful axis of competition. GPT-5.3-Codex (February 2026) continued the trajectory with 77.3% on Terminal-Bench 2.0 and 64.7% on OSWorld-Verified — massive jumps that demonstrated the rapid improvement possible in computer-use and terminal-based tasks.

Convergence as Industry Trend

GPT-5 was not alone in pursuing convergence. Claude's models had been integrating extended thinking. Gemini 2.5 shipped as a "thinking model." GPT-5 confirmed that the industry was converging on hybrid models that combine fast generation with deep reasoning. The era of separate "reasoning models" as distinct products was ending.

The ChatGPT Integration

GPT-5 became the default model in ChatGPT for all signed-in users, replacing the confusing array of GPT-4o, o3, o4-mini, GPT-4.1, and GPT-4.5 from which users previously had to choose. For the 200 million-plus weekly active ChatGPT users, this meant a seamless experience: ask a simple question and get a fast answer; ask a complex one and the model automatically takes more time to reason. The consumer product improved as much as the underlying capability, a reminder that model architecture decisions carry direct user-experience consequences. The consolidation from five distinct models to one default was arguably GPT-5's most impactful product decision: simplicity at the interface layer, complexity hidden underneath.

Key Technical Details

GPT-5 (August 2025):

  • Released: August 7, 2025
  • Developer: OpenAI (codename "Orion")
  • Architecture: Unified model with internal complexity routing; rumored Mixture of Experts
  • SWE-bench Verified: 74.9% at release (highest at time)
  • Multimodal: Native text, image, audio, and video understanding and generation
  • Reasoning: Integrated o-series capabilities with dynamic depth allocation
  • Converges GPT and o-series product lines into single model; default model in ChatGPT replacing GPT-4o, o3, o4-mini, GPT-4.1, and GPT-4.5
  • Pricing: Tiered based on usage, aimed at accessibility despite increased capability
  • Tool use: Native function calling, code execution, browsing integrated from training

GPT-5.2 (December 11, 2025):

  • Three variants: Instant, Thinking, Pro
  • Context window: 400K input, 128K output
  • Knowledge cutoff: August 2025
  • GPQA Diamond: 93.2%
  • AIME 2025: 100%
  • SWE-bench Verified: 80%
  • ARC-AGI-1 (Verified): Above 90% (GPT-5.2 Pro) — first GPT-family model to reach this threshold
  • FrontierMath: 40.3% (GPT-5.2 Pro)

GPT-5.2-Codex (January 14, 2026):

  • Optimized for agentic coding: large refactors, codebase migrations, feature implementations, security audits
  • SWE-Bench Pro: 56.4%
  • Terminal-Bench 2.0: 64.0%
  • GDPval: ~70.9% (outperforms human professionals on standardized office tasks)

GPT-5.3-Codex (February 5, 2026):

  • First model instrumental in creating itself (used to debug its own training)
  • 25% faster than GPT-5.2-Codex
  • SWE-Bench Pro: 56.8%
  • Terminal-Bench 2.0: 77.3% (up from 64.0%)
  • OSWorld-Verified: 64.7% (up from 38.2%)
  • First model classified "High capability" in Cybersecurity under OpenAI's Preparedness Framework

Common Misconceptions

  • "GPT-5 is just GPT-4 with o3 bolted on." The integration is architectural, not a pipeline. The model was trained end-to-end to route between reasoning modes, not as two models in a trench coat. The routing mechanism is learned, not rule-based.

  • "GPT-5 makes o-series models obsolete." While GPT-5 subsumes most o-series capabilities, specialized reasoning models may still be trained for domains where maximum reasoning depth is worth the latency cost. The o-series research informed GPT-5 but the line may continue for extreme reasoning tasks.

  • "GPT-5 is the clear #1 model." The frontier in August 2025 was intensely competitive. Claude Opus 4 led on agentic tasks and sustained coding. Gemini 2.5 Pro led on certain reasoning categories. GPT-5's strength was breadth — being excellent across all dimensions simultaneously.

  • "OpenAI published a detailed technical report." Unlike the GPT-4 technical report, OpenAI provided minimal architectural details for GPT-5, continuing a trend toward less transparency about model internals.

  • "The internal router is a simple classifier." The routing mechanism is a learned component trained end-to-end with the model. It does not simply categorize queries as "easy" or "hard" — it operates on a continuous spectrum, dynamically adjusting reasoning depth based on subtle signals in the input that correlate with problem complexity. This is a qualitatively different approach from a rule-based difficulty classifier.

  • "Unified models will always be worse than specialized ones." GPT-5's integrated reasoning matched standalone o-series performance on most benchmarks, suggesting that a well-designed unified model does not sacrifice quality compared to specialized variants. The convenience and simplicity of a single model may outweigh marginal quality advantages of specialized systems.

Connections to Other Concepts

GPT-5's unified reasoning approach builds on the inference-time compute scaling explored in 02-the-o-series-evolution.md and the test-time scaling principles in 04-test-time-compute-scaling.md. It competes directly with 01-claude-4-series.md on agentic tasks and 03-gemini-2-and-beyond.md on reasoning and multimodality. Its agent capabilities are part of the broader trend analyzed in 06-agent-native-models.md. The multimodal architecture connects to 02-native-multimodal-training.md. Its competitive positioning against open models is discussed in 07-open-vs-closed-the-narrowing-gap.md. The MoE architecture (if confirmed) connects to 04-mixture-of-experts-evolution.md. The benchmark results participate in the evaluation dynamics covered in 01-the-benchmark-and-evaluation-landscape.md. The pricing and platform strategy is part of the broader API economy described in 02-the-api-economy.md.

Further Reading

  • OpenAI, "GPT-5 System Card" (2025) — safety evaluation and capability documentation.
  • OpenAI, "GPT-4 Technical Report" (2023) — the predecessor's architecture (more detailed than GPT-5 disclosures).
  • Snell et al., "Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters" (2024) — the theoretical basis for adaptive reasoning depth.
  • Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" (2024) — the coding benchmark used for evaluation.
  • Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (2022) — the reasoning technique that GPT-5's internal routing automates.
  • OpenAI, "Learning to Reason with LLMs" (2024) — the o1 announcement that established the reasoning paradigm GPT-5 subsumes.
  • Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (2017) — foundational MoE work relevant to GPT-5's rumored architecture.
  • Chiang et al., "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference" (2024) — the human preference evaluation where GPT-5 competes.
  • OpenAI, "GPT-5.2 System Card" (2025) — safety evaluation and capability documentation for the three-variant successor.
  • OpenAI, "Introducing GPT-5.2-Codex" (2026) — announcement of the agentic coding variant and its benchmark results.
  • OpenAI, "Introducing GPT-5.3-Codex" (2026) — the self-bootstrapping coding model with Terminal-Bench and OSWorld records.
  • OpenAI, "GPT-5.3-Codex System Card" (2026) — first "High" Cybersecurity classification under the Preparedness Framework.
  • Chollet, "ARC-AGI: A Formal Measure of Intelligence" (2019) — the abstraction and reasoning benchmark that GPT-5.2 Pro was the first GPT-family model to score above 90% on.