One-Line Summary: OpenAI's GPT-4o ("Omni"), released in May 2024, was the first truly unified multimodal model — trained end-to-end to accept and generate text, audio, images, and video through a single neural network, at 2x the speed and half the cost of GPT-4 Turbo.
Prerequisites: 07-gpt-4.md, 01-claude-3-family.md
What Is GPT-4o?
Imagine the difference between a translator who speaks three languages natively and one who learned them separately and mentally converts between them. The native speaker thinks fluidly across languages; the other pauses to translate. GPT-4o is the native speaker of modalities — text, audio, and vision are not separate capabilities bolted together, but a single unified understanding trained end-to-end. Previous "multimodal" systems were pipelines: speech-to-text fed into a language model that fed into text-to-speech. GPT-4o collapsed that entire pipeline into one model.
OpenAI announced GPT-4o on May 13, 2024, in a live demonstration that emphasized real-time voice interaction. The "o" stood for "omni" — a model that could see, hear, speak, and think through a single architecture. The timing was strategic: Anthropic's Claude 3 Opus had just dethroned GPT-4 on key benchmarks (see 01-claude-3-family.md), and Google was trumpeting Gemini 1.5's million-token context window (see 02-gemini-1-5.md). OpenAI needed to reassert leadership, and they did so not by chasing benchmarks but by redefining what a model could do.
The business impact was equally significant. GPT-4o was offered in the free tier of ChatGPT, making frontier-level AI accessible to anyone with an internet connection. It was 2x faster and 50% cheaper than GPT-4 Turbo. OpenAI was not just competing on capability but on accessibility and cost — a signal that frontier AI was moving from premium product to commodity service.
How It Works
GPT-4o unified architecture vs. traditional pipeline approach:
Traditional Pipeline (pre-GPT-4o):
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Speech │───▶│ Text │───▶│ Language │───▶│ Text-to- │──▶ Audio
│ to Text │ │ (loses │ │ Model │ │ Speech │ Out
│ │ │ tone, │ │ │ │ (generic │
│ │ │ emotion) │ │ │ │ voice) │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
Latency: 2-5 seconds total
Information lost at each step
GPT-4o Unified Model:
┌──────────────────────────────────────────────────────────────┐
│ Single Neural Network │
│ │
│ Audio ──┐ ┌──▶ Audio (with │
│ │ │ emotion, │
│ Text ───┼──▶ Unified Processing ──────┼──▶ tone) │
│ │ (all modalities as │ │
│ Image ──┤ one token stream) ├──▶ Text │
│ │ │ │
│ Video ──┘ └──▶ Image │
│ │
│ Latency: 320ms average (human-like) │
└──────────────────────────────────────────────────────────────┘
End-to-End Multimodal Architecture
The defining innovation of GPT-4o was end-to-end multimodal training. Rather than using separate models for different modalities (a vision encoder, a language model, a speech recognizer, and a speech synthesizer) and stitching them together, GPT-4o was trained as a single model across all modalities simultaneously. All inputs — text, audio, images — are tokenized and processed through one unified Transformer. All outputs — text, audio, images — are generated by the same model.
This architecture eliminated the latency and information loss inherent in pipeline approaches. In a traditional voice assistant, spoken words are first transcribed (losing tone, emotion, and nuance), then processed as text, then synthesized back to speech (in a generic voice). GPT-4o processed raw audio directly, preserving intonation, emotion, and paralinguistic cues. It could respond to voice input with an average latency of 320 milliseconds — comparable to human conversational response time.
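To make the "one token stream" idea concrete, here is a toy Python sketch. The tokenizer functions and ID ranges are invented for illustration; OpenAI has not published GPT-4o's actual audio or image tokenization.

```python
# Toy sketch: every modality is mapped into one shared discrete token space,
# and a single Transformer models the interleaved sequence. All tokenizers
# and ID offsets here are invented placeholders, not GPT-4o internals.

def tokenize_text(s: str) -> list[int]:
    return [ord(c) for c in s]  # stand-in for a real BPE tokenizer

def tokenize_audio(samples: list[float]) -> list[int]:
    # Stand-in for a neural audio codec; offset keeps IDs in their own range.
    return [int((x + 1.0) * 127) + 50_000 for x in samples]

def tokenize_image(pixels: list[int]) -> list[int]:
    # Stand-in for image patch / codebook tokens.
    return [p + 100_000 for p in pixels]

# Interleave modalities into one sequence; the same model predicts the next
# token regardless of which modality it belongs to.
stream = (
    tokenize_text("What is in this picture? ")
    + tokenize_image([12, 200, 34])
    + tokenize_audio([0.1, -0.2])
)
print(stream[:10])
```

Because generation is just next-token prediction over this shared space, the same forward pass can emit text tokens or audio tokens, which is what removes the pipeline boundary and the transcription losses that come with it.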
Multimodal Input and Output
GPT-4o accepted text, images, audio, and video as input. It could generate text, audio, and images as output. In practice, this meant a user could show the model a photo, ask a question about it verbally, and receive a spoken answer that referenced specific visual details — all processed by the same neural network. The model could sing, change vocal tones, express emotions in speech, and understand visual humor. The Spring 2024 demo showcased the model helping with math homework by looking at a handwritten equation through a phone camera and walking the student through the solution verbally.
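For developers, the text-and-vision side of this shipped through the standard chat API. A minimal sketch using the OpenAI Python SDK; the image URL is a placeholder, and audio input/output was not generally available through this endpoint at launch:

```python
# Asking GPT-4o about an image with the OpenAI Python SDK.
# Requires OPENAI_API_KEY in the environment; the image URL is a placeholder.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What equation is written here, and how would I solve it?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/homework.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```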
Performance and Efficiency
On standard text benchmarks, GPT-4o matched GPT-4 Turbo performance while being significantly cheaper and faster. It achieved 88.7% on MMLU (5-shot), competitive with the best models of the time. On vision benchmarks, it set new standards on several evaluations. On audio understanding, it dramatically outperformed previous pipeline approaches, particularly for non-English languages where the speech-to-text step had been a bottleneck. The 128K context window was inherited from GPT-4 Turbo.
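As a point of methodology, "5-shot" means five worked examples are placed in the prompt before the test question, so the model can infer the expected answer format. A toy illustration; the questions below are invented placeholders, not actual MMLU items:

```python
# Building a 5-shot prompt: five solved examples precede the real question.
examples = [
    ("What is 2 + 2?", "4"),
    ("Which planet is closest to the sun?", "Mercury"),
    ("What gas do plants absorb during photosynthesis?", "Carbon dioxide"),
    ("What is the capital of France?", "Paris"),
    ("How many sides does a hexagon have?", "6"),
]
question = "What is the boiling point of water at sea level, in Celsius?"

prompt = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in examples) + f"Q: {question}\nA:"
print(prompt)
```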
Pricing and Availability
The pricing strategy was as important as the technology. GPT-4o was priced at $5 per million input tokens and $15 per million output tokens — 50% cheaper than GPT-4 Turbo. Critically, it was made available in ChatGPT's free tier, meaning hundreds of millions of users could access frontier-level AI without paying. This was a competitive weapon: by making GPT-4o free, OpenAI raised the floor for what users expected, pressuring competitors who charged for similar capabilities.
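The arithmetic behind the "50% cheaper" claim is easy to check against launch list prices. A back-of-envelope sketch with a hypothetical monthly workload:

```python
# Cost comparison at launch list prices (USD per 1M tokens):
# GPT-4o: $5 in / $15 out; GPT-4 Turbo: $10 in / $30 out.
def monthly_cost(input_tokens: int, output_tokens: int,
                 in_price: float, out_price: float) -> float:
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Hypothetical workload: 50M input tokens, 10M output tokens per month.
print(monthly_cost(50_000_000, 10_000_000, 5, 15))   # GPT-4o:      400.0
print(monthly_cost(50_000_000, 10_000_000, 10, 30))  # GPT-4 Turbo: 800.0
```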
Why It Matters
Unified Multimodal as the New Baseline
GPT-4o proved that end-to-end multimodal training produces qualitatively different capabilities than stitching specialist models together. The fluid, real-time voice interaction — with emotional understanding, interruption handling, and natural conversation flow — was simply impossible with pipeline architectures. This set a new expectation: future models would need to be natively multimodal, not just multimodal-capable.
The Voice Interface Moment
GPT-4o's real-time voice capabilities represented a genuine interface revolution. Previous AI voice assistants (Siri, Alexa) were command-response systems with perceptible latency and no real conversational ability. GPT-4o demonstrated fluid, interruptible, emotionally aware conversation at human-like speed. Although the full "Advanced Voice Mode" rolled out gradually over subsequent months, the May 2024 demo established voice as a primary interaction modality for LLMs, not just a convenience feature.
Frontier AI Goes Free
Making GPT-4o available in the free tier was a watershed moment for AI accessibility. Previously, accessing the best AI models required a $20/month subscription or API payments. By making frontier-quality AI free, OpenAI massively expanded its user base and established a new pricing expectation across the industry. This forced competitors to reconsider their own pricing strategies and accelerated the broader trend toward cheaper AI access.
Key Technical Details
- Release date: May 13, 2024
- Name: GPT-4o ("o" for "omni")
- Input modalities: Text, images, audio, video
- Output modalities: Text, audio, images
- Context window: 128,000 tokens
- MMLU (5-shot): 88.7%
- Voice response latency: Average 320 milliseconds (human-like)
- Pricing: $5 per million input tokens / $15 per million output tokens (50% cheaper than GPT-4 Turbo)
- Speed: 2x faster than GPT-4 Turbo
- Availability: Free tier in ChatGPT; API access for developers
Common Misconceptions
- "GPT-4o is just GPT-4 with voice added." GPT-4o is a fundamentally different model trained end-to-end across modalities. It is not GPT-4 with a speech wrapper. The unified training enables capabilities that are architecturally impossible with a pipeline approach.
- "GPT-4o is the most capable model OpenAI has made." On pure text reasoning, GPT-4o was roughly equivalent to GPT-4 Turbo, not a significant leap. Its innovation was in multimodal integration and efficiency. The later o1 model (see 01-openai-o1.md) represented the next capability jump on reasoning tasks.
- "The live demo represented immediately available capabilities." Several of the most impressive features demonstrated — particularly the Advanced Voice Mode with full emotional range — rolled out gradually over months. At launch, only text and vision capabilities were fully available to most users.
- "End-to-end multimodal training is straightforward." Training a single model to handle all modalities introduces enormous challenges in data balancing, loss weighting, and preventing one modality from dominating (a toy sketch follows this list). The engineering required to make this work at GPT-4o's quality level was substantial.
Connections to Other Concepts
- 07-gpt-4.md — GPT-4o builds on GPT-4's foundation but fundamentally changes the multimodal architecture
- 01-claude-3-family.md — Anthropic's release two months earlier prompted OpenAI's competitive response
- 02-gemini-1-5.md — Google's competing multimodal approach emphasized context length over unified training
- 06-llama-3-2-multimodal.md — Meta's later open-source attempt at multimodal models
- 01-openai-o1.md — OpenAI's next major release shifted focus from multimodality to reasoning
Further Reading
- OpenAI, "Hello GPT-4o" (May 2024) — The official announcement and capability overview.
- OpenAI, "GPT-4o System Card" (2024) — Safety evaluations and technical details of the multimodal system.
- Driess et al., "PaLM-E: An Embodied Multimodal Language Model" (2023) — Earlier work on end-to-end multimodal training from Google.
- OpenAI, "GPT-4o mini: advancing cost-efficient intelligence" (July 2024) — The cost-optimized variant that pushed affordability even further.