One-Line Summary: Audio and speech capabilities in LLMs evolved from specialized speech recognition systems to native audio understanding and generation, culminating in models that can hold real-time spoken conversations with emotional nuance.
Prerequisites: 01-attention-is-all-you-need.md, 03-gpt-4o.md
What Is the Audio-Language Evolution?
For decades, speech AI lived in its own world — separate from text-based language models, using specialized architectures like CTC decoders, RNN transducers, and hidden Markov models. The audio revolution in LLMs is the story of speech and audio understanding being absorbed into the same Transformer-based systems that process text.
What started as standalone speech recognition (converting audio to text) has evolved into models that understand audio natively — perceiving tone, emotion, music, ambient sounds, and multiple languages simultaneously — and can generate natural speech as output. The destination is AI that converses as naturally as a human, understanding not just the words but the way they are spoken.
How It Works
Audio in LLMs: From Pipeline to Native Processing:
Pipeline Approach (2023): STT -> LLM -> TTS (three separate models)

┌──────────────┐
│ Speech Input │
└──────┬───────┘
       │
┌──────▼───────┐  ~500ms
│ STT (Whisper)│
│ Audio -> Text│
└──────┬───────┘
       │  ✗ Loses tone, emotion, sarcasm
┌──────▼───────┐  ~1000ms
│ LLM          │
│ Text -> Text │
└──────┬───────┘
       │  ✗ Cannot indicate how to speak
┌──────▼───────┐  ~500ms
│ TTS          │
│ Text -> Audio│
└──────┬───────┘
       │
┌──────▼───────┐
│ Audio Output │
└──────────────┘

Total latency: ~2-3 seconds
Lost: all vocal nuance

Native Approach (GPT-4o, 2024): a single model

┌──────────────────────────┐
│ Audio Input              │
└────────────┬─────────────┘
             │
┌────────────▼─────────────┐
│ Unified Transformer      │
│                          │
│ Audio tokens and text    │
│ tokens processed together│
│                          │
│ Understands tone,        │
│ emotion, sarcasm         │
└────────────┬─────────────┘
             │
┌────────────▼─────────────┐
│ Audio Output             │
│ (natural expression,     │
│  singing, accents)       │
└──────────────────────────┘

Latency: ~232ms (vs ~2-3 seconds)
Preserves: emotion and vocal nuance

Whisper: Universal Speech Recognition (2022)
Whisper (Radford et al., September 2022) was the breakthrough that brought speech into the LLM paradigm. OpenAI trained an encoder-decoder Transformer on 680,000 hours of multilingual audio paired with transcriptions scraped from the internet. Rather than training on carefully curated, studio-quality recordings, Whisper learned from the messy reality of internet audio: accented speech, background noise, multiple languages, varying recording quality, and overlapping speakers.
The architecture is straightforward: audio is converted to an 80-channel log-mel spectrogram, processed by a Transformer encoder, then decoded autoregressively into text tokens — the same sequence-to-sequence approach used for machine translation. The input is audio, the output is text, and the model learns the mapping end-to-end.
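To make the sequence-to-sequence flow concrete, here is a minimal transcription sketch using the Hugging Face `transformers` implementation of Whisper; the model size and file name are illustrative, and the processor performs the log-mel conversion described above.

```python
# Minimal Whisper transcription sketch (assumes `pip install transformers librosa torch`);
# "openai/whisper-small" and "speech.wav" are illustrative choices.
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Whisper expects 16 kHz mono audio; the processor turns it into the
# 80-channel log-mel spectrogram that feeds the Transformer encoder.
audio, sr = librosa.load("speech.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

# The decoder emits text tokens autoregressively, conditioned on the encoded
# spectrogram -- the same seq2seq setup used for machine translation.
token_ids = model.generate(inputs.input_features)
print(processor.batch_decode(token_ids, skip_special_tokens=True)[0])
```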
The training scale and diversity were unprecedented. Whisper handles 97 languages, can translate from any supported language to English, detects the spoken language automatically, and produces word-level timestamps. The largest model (Whisper-large-v3, 1.55B parameters) achieves approximately 4.2% word error rate on the LibriSpeech benchmark — approaching human-level accuracy.
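Those capabilities are options of one model rather than separate systems. A small sketch with the open-source `openai-whisper` package shows language detection, translation to English, and word-level timing in a single call; the model size, file name, and options are placeholders, not a canonical recipe.

```python
# Sketch using the open-source `openai-whisper` package (`pip install openai-whisper`);
# the model size and audio file are placeholders.
import whisper

model = whisper.load_model("small")

# One call detects the spoken language, translates non-English speech to
# English, and returns word-level timing, all from the same end-to-end model.
result = model.transcribe("lecture.mp3", task="translate", word_timestamps=True)

print("Detected language:", result["language"])
for segment in result["segments"]:
    print(f"[{segment['start']:7.2f}s - {segment['end']:7.2f}s] {segment['text']}")
```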
Whisper's key insight was that weak supervision at massive scale — using imperfect, noisy internet transcriptions rather than carefully annotated datasets — produces more robust models. The model learned to handle real-world audio conditions because it was trained on them, not because it was engineered to handle specific noise patterns.
The Pipeline Era: STT to LLM to TTS (2023)
Before native audio models, voice interaction with LLMs worked as a three-stage pipeline:
- Speech-to-Text (STT): Whisper or similar converts audio to text
- Language Model: The LLM processes text and generates a text response
- Text-to-Speech (TTS): A synthesis model converts the response back to audio
This pipeline was functional but inherently limited. Information is lost at each stage boundary: the STT system cannot convey the speaker's emotion, sarcasm, or emphasis to the LLM. The LLM cannot indicate how its response should sound to the TTS system. A question asked sarcastically and the same question asked sincerely produce identical text after STT — all vocal nuance is discarded.
The total pipeline latency — typically 2-5 seconds round-trip — also makes natural conversation impossible. Human turn-taking in conversation operates on approximately 200ms latency, and anything above 500ms feels noticeably unnatural.
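A minimal sketch of such a pipeline, assuming the OpenAI Python SDK (v1.x) with an API key in the environment; the model names and file paths are examples, and each stage is a separate network round-trip, which is where the multi-second latency comes from.

```python
# Illustrative three-stage voice pipeline (assumes the OpenAI Python SDK v1.x
# and an OPENAI_API_KEY in the environment); model names and paths are examples.
from openai import OpenAI

client = OpenAI()

# Stage 1: speech-to-text. All prosody, tone, and emphasis are discarded here.
with open("question.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

# Stage 2: text-only LLM. It sees the words, not how they were spoken.
chat = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": transcript.text}],
)
reply_text = chat.choices[0].message.content

# Stage 3: text-to-speech. The LLM cannot tell this stage how the reply should sound.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
with open("reply.mp3", "wb") as out:
    out.write(speech.content)
```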
GPT-4o: Native Audio Understanding and Generation (2024)
GPT-4o ("omni," May 2024) collapsed the pipeline into a single model. Audio is tokenized and processed directly alongside text and image tokens in a unified Transformer. The model generates audio output tokens directly, without an intermediate text representation.
This architecture enables capabilities impossible in the pipeline approach:
- Understanding tone and emotion from audio input (sarcasm, excitement, hesitation)
- Generating speech with appropriate emotional expression matched to content
- Responding with approximately 232ms average latency — fast enough for natural conversation
- Processing non-speech audio (music, environmental sounds, multiple overlapping speakers)
- Handling interruptions naturally, like a human conversationalist
GPT-4o's Advanced Voice Mode (August 2024) demonstrated the frontier: the model could sing, adopt different accents and speaking styles, express emotions like excitement or empathy, handle interruptions gracefully, and maintain consistent personality across extended conversations. Users described the experience as "talking to a person," marking a qualitative threshold in audio AI capability.
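For contrast with the pipeline sketch above, here is a hedged sketch of native audio input and output through OpenAI's Chat Completions API, assuming the `gpt-4o-audio-preview` model; parameter names follow OpenAI's published API shape but may change, and the file names are illustrative.

```python
# Hedged sketch of native audio in/out via the OpenAI Chat Completions API
# (assumes the `gpt-4o-audio-preview` model and the OpenAI Python SDK v1.x).
import base64
from openai import OpenAI

client = OpenAI()

# Audio goes in as tokens alongside text -- no separate STT stage.
with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],            # request spoken output too
    audio={"voice": "alloy", "format": "wav"},
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Answer in the same tone I used."},
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)

# The reply's audio is generated by the same model that heard the question,
# so prosody and emotion are conditioned on the conversation itself.
with open("reply.wav", "wb") as out:
    out.write(base64.b64decode(response.choices[0].message.audio.data))
```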
Gemini Audio Capabilities (2024-2025)
Google's Gemini 1.5 Pro (February 2024) brought native audio understanding with its 1 million token context window. Users could upload hours of audio — meetings, lectures, podcasts — and ask questions about specific moments, speakers, topics, or emotional dynamics. The model could process 11 hours of audio in a single prompt.
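A sketch of that long-audio workflow with Google's `google-generativeai` Python SDK; the file name and prompt are illustrative, and SDK and model names may have changed since.

```python
# Hedged sketch of long-audio question answering with the google-generativeai SDK
# (`pip install google-generativeai`); file name and prompt are placeholders.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # illustrative; load a real key from the environment

# Upload a long recording once, then reference it in prompts.
meeting = genai.upload_file("all_hands_meeting.mp3")

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    meeting,
    "At what point does the discussion turn to hiring, and who sounds most "
    "hesitant about it? Include the relevant timestamps.",
])
print(response.text)
```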
Gemini 2.0 (December 2024) added real-time audio processing and generation capabilities, enabling conversational interactions similar to GPT-4o. Google's approach leveraged their extensive audio data from YouTube, Google Search, and Google Assistant — a massive advantage in paired audio-text training data.
Music and Non-Speech Audio
AudioLM (Borsos et al., Google, October 2022) demonstrated that language modeling could generate realistic audio continuations — given a few seconds of music or speech, the model predicts what comes next, maintaining coherent structure, timbre, and rhythm.
MusicLM (Agostinelli et al., Google, January 2023) extended this to text-conditioned music generation: describe the music you want in natural language ("a calm acoustic guitar melody with soft rain in the background"), and the model generates 30-second clips at 24kHz. This connected the language understanding of LLMs to audio generation.
Stable Audio (Stability AI, 2023) brought audio generation to the open-source community, using latent diffusion for high-quality sound and music synthesis.
Why It Matters
The integration of audio into LLMs transforms how humans interact with AI. Text-based interfaces create a literacy barrier, and typing is far slower than speaking for most people. Voice interfaces are natural for humans: we evolved for spoken communication long before writing existed, and text is a comparatively recent invention.
When AI can understand and produce speech natively — not through a lossy three-stage pipeline — it becomes accessible to billions of additional users, including those with visual impairments, limited literacy, or simply a preference for spoken interaction.
The enterprise implications are equally significant: call center automation, meeting transcription and summarization, real-time translation across languages, voice-based assistants, and accessibility tools. The latency reduction from pipeline (~3 seconds) to native (~232ms) is the difference between awkward and natural for conversational applications, and between impractical and viable for real-time use cases.
Key Technical Details
- Whisper (Sep 2022): Encoder-decoder Transformer. 680K hours training data. 97 languages. 1.55B params (large-v3).
- Whisper Performance: ~4.2% word error rate on LibriSpeech (near-human). Robust to noise and accents.
- GPT-4o Voice (May 2024): Native audio tokens. ~232ms average latency. Emotional expression.
- GPT-4o Advanced Voice (Aug 2024): Singing, accents, interruptions. Full emotional range.
- Gemini 1.5 Pro Audio: Processes up to 11 hours of audio in a single prompt (1M token context).
- Audio Tokenization: Mel spectrogram with 25ms frames, 80 channels. ~2,400 tokens per minute (see the check after this list).
- Pipeline Latency: STT (~500ms) + LLM (~1000ms) + TTS (~500ms) = ~2-3 seconds minimum.
- Native Latency: ~232ms average for GPT-4o voice responses.
- MusicLM (Jan 2023): Text-to-music generation at 24kHz. 30-second clips from text descriptions.
- AudioLM (Oct 2022): Coherent speech and music continuations from 3-second prompts.
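As a sanity check on the token-rate figure above, under the simplifying assumption of one token per non-overlapping 25ms frame:

```python
# Back-of-the-envelope check of the audio token rate quoted above, assuming one
# token per non-overlapping 25 ms frame (real tokenizers may overlap frames or
# downsample, which changes the count).
frame_ms = 25
tokens_per_minute = 60_000 // frame_ms  # 60,000 ms per minute / 25 ms per frame
print(tokens_per_minute)                # -> 2400
```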
Common Misconceptions
- "Whisper is just another speech recognition system." Whisper is a universal audio model that performs speech recognition, translation across languages, language identification, and timestamping — all in a single model trained end-to-end. Its robustness to real-world conditions sets it apart from traditional ASR systems that require clean audio.
- "GPT-4o voice just uses a better TTS system." GPT-4o generates audio tokens natively — there is no separate TTS component. The emotional expression, timing, and prosody come from the same model that understands the conversation content. The voice is an integral part of the response, not a post-processing step.
- "Audio AI only matters for voice assistants." Audio understanding enables meeting analysis, podcast search, music generation, environmental sound monitoring, security applications, medical diagnostics (analyzing coughs, heart sounds, breathing patterns), industrial monitoring, and real-time translation. The modality has far broader applications than consumer voice assistants.
Connections to Other Concepts
Whisper's Transformer architecture connects to 01-attention-is-all-you-need.md. GPT-4o's native audio is covered in 03-gpt-4o.md, and the broader native multimodal approach in 02-native-multimodal-training.md. Audio capabilities converge with vision and text in 05-the-convergence-toward-omni-models.md. Real-time audio processing connects to the latency requirements of 06-agent-native-models.md.
Further Reading
- Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision" (2022) — Whisper.
- Borsos et al., "AudioLM: A Language Modeling Approach to Audio Generation" (2022) — audio language modeling.
- Agostinelli et al., "MusicLM: Generating Music From Text" (2023) — text-conditioned music generation.
- OpenAI, "GPT-4o System Card" (2024) — native audio understanding and generation.
- Gemini Team, "Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context" (2024) — long-context audio.
- Défossez et al., "High Fidelity Neural Audio Compression" (2022) — EnCodec audio tokenization used in many audio models.