One-Line Summary: Video understanding in LLMs extends visual reasoning from static images to temporal sequences, enabling models to comprehend narratives, track objects, and answer questions about events unfolding over minutes to hours.
Prerequisites: 02-gemini-1-5.md, 01-vision-language-models.md
What Is Video Understanding?
Video is the richest everyday data modality — it combines visual information, temporal dynamics, audio, and often text (subtitles, captions, on-screen text) into a single stream. Understanding video requires not just recognizing objects in individual frames but tracking them over time, understanding cause-and-effect relationships, following narratives, and connecting actions to their consequences.
For LLMs, video understanding represents the hardest multimodal challenge: it demands the visual perception of image models, the sequential reasoning of language models, and a temporal awareness that neither modality requires on its own. The field has progressed from processing a handful of sampled frames to comprehending hour-long videos in a single prompt.
How It Works
Video Understanding: Frame Sampling and Token Budget:
Video -> Frame Sampling -> Visual Tokens -> LLM
Video (1 hour)
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ ...3600 sec
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
│ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ ▼ Sample at 1 fps
┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ = 3,600 frames
│ f1 │ │ f2 │ │ f3 │ │ f4 │ │ f5 │ │ f6 │ ...
└──┬─┘ └──┬─┘ └──┬─┘ └──┬─┘ └──┬─┘ └──┬─┘
│ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ ▼ Each frame ->
[256 [256 [256 [256 [256 [256 256 visual tokens
tokens] tokens] tokens] tokens] tokens] tokens]
Token Budget (at 256 tokens/frame):
┌──────────────────────────────────────────────────┐
│ 1 min = 60 frames = 15,360 tokens │
│ 10 min = 600 frames = 153,600 tokens │
│ 1 hour = 3,600 frames = 921,600 tokens │
│ │
│ Gemini 1.5 Pro: 1M context -> fits 1 hour video │
└──────────────────────────────────────────────────┘
Temporal Reasoning Challenge:
┌────────────────────────────────────────────────────┐
│ "What" questions (object recognition): ██████ ~85% │
│ "When" questions (temporal ordering): ████ ~60% │
│ "Why" questions (causal reasoning): ███ ~50% │
│ │
│ ◄─── Gap from human baseline (~85%) ───► │
└────────────────────────────────────────────────────┘
The Frame Sampling Approach
The simplest approach to video understanding is to sample keyframes at regular intervals and process them as a set of independent images. A 60-second video sampled at 1 frame per second yields 60 images; a vision-language model processes each frame with a visual encoder and feeds the resulting tokens into the LLM.
This works for simple questions ("What color is the car?") but fails for temporal questions ("What happens after the car turns left?") because the explicit relationship between frames is lost: the model must infer temporal relationships from the spatial content of sequentially presented images, a much harder task than understanding a single image.
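A minimal sketch of this per-frame sampling step is shown below, using OpenCV to pull roughly one frame per second from a local file. The function name is illustrative and not any particular model's API; the encoder and LLM calls that would follow are omitted.

```python
# A minimal sketch of uniform frame sampling at a target fps (assumes OpenCV
# is installed and "clip.mp4" exists locally; names are illustrative).
import cv2

def sample_frames(video_path: str, target_fps: float = 1.0):
    """Sample frames at roughly `target_fps` frames per second."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(native_fps / target_fps)), 1)

    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            # Convert BGR (OpenCV default) to RGB for typical vision encoders.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames

# A 60-second clip sampled at 1 fps yields ~60 frames, each of which would
# then be encoded as an independent image.
frames = sample_frames("clip.mp4", target_fps=1.0)
print(len(frames))
```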
Early video QA systems (2021-2023) improved on this by encoding frames with a visual encoder and using temporal attention mechanisms to aggregate information across time. Models like VideoChat (Li et al., 2023) and Video-LLaMA (Zhang et al., 2023) applied the adapter approach used for images — frozen video encoder, frozen LLM, trainable connector — with temporal pooling or sequential processing of frame tokens.
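The connector recipe can be sketched in a few lines of PyTorch. The sketch below is an illustrative mean-pooling connector under assumed dimensions, not the actual VideoChat or Video-LLaMA code; the vision encoder and LLM on either side remain frozen, and only the connector is trained.

```python
# Sketch of the frozen-encoder / frozen-LLM / trainable-connector recipe with
# temporal mean pooling. Dimensions and module names are assumptions.
import torch
import torch.nn as nn

class TemporalPoolingConnector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # the only trainable part

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (num_frames, tokens_per_frame, vision_dim),
        # produced by a frozen video/image encoder.
        pooled = frame_tokens.mean(dim=0)           # average across time
        return self.proj(pooled)                    # (tokens_per_frame, llm_dim) -> frozen LLM

connector = TemporalPoolingConnector()
fake_frames = torch.randn(16, 256, 1024)            # 16 frames x 256 tokens each
print(connector(fake_frames).shape)                  # torch.Size([256, 4096])
```

Mean pooling is the simplest aggregation choice; it keeps the token count fixed regardless of video length but discards most ordering information, which is part of why these early systems struggled with temporal questions.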
Gemini 1.5 Pro: The Long-Video Breakthrough (2024)
Gemini 1.5 Pro (February 2024) transformed video understanding through sheer context length. With a 1 million token context window, it could process up to 1 hour of video by sampling frames and encoding them as visual tokens alongside any audio track. This was not just an incremental improvement — it was a qualitative shift.
For the first time, an LLM could watch an entire movie scene, lecture, or sports match and answer detailed questions about specific moments. Google demonstrated Gemini 1.5 Pro analyzing a 44-minute silent Buster Keaton film, correctly answering questions about plot points, visual gags, and character motivations. It could find specific moments in hour-long videos ("At what point does the speaker discuss quantum computing?"), track objects and people across scenes, and understand the narrative arc of extended sequences.
The approach works through careful token budgeting. At 1 frame per second, a 1-hour video produces approximately 3,600 frames. Each frame is encoded as a fixed number of visual tokens — typically 256-576 tokens per frame depending on the resolution and encoding strategy.
At 256 tokens per frame, 1 hour produces approximately 920,000 visual tokens — fitting within the 1M context window with room for the text query and response. For longer videos, the frame rate is reduced (0.5 fps, 0.25 fps) to fit the available token budget.
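The budget arithmetic is simple enough to check directly. The sketch below assumes the 256 tokens-per-frame figure used in this section; real per-frame counts vary with model, resolution, and encoding strategy.

```python
# Back-of-the-envelope visual-token budget, mirroring the table above.
def video_token_budget(duration_s: float, fps: float = 1.0,
                       tokens_per_frame: int = 256) -> int:
    frames = int(duration_s * fps)
    return frames * tokens_per_frame

for minutes in (1, 10, 60):
    total = video_token_budget(minutes * 60)
    print(f"{minutes:>3} min @ 1 fps -> {total:,} visual tokens")
# 1 min -> 15,360; 10 min -> 153,600; 60 min -> 921,600

# Fitting a longer video into a ~1M-token window means lowering the frame rate:
print(video_token_budget(4 * 3600, fps=0.25))  # 4 hours at 0.25 fps -> 921,600
```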
Temporal Reasoning: The Hard Problem
Temporal reasoning — understanding the order, timing, and causal relationships between events — remains the core challenge in video understanding. Models can typically answer "what" questions about video content (object recognition, scene description) but struggle with "when" and "why" questions requiring precise temporal understanding.
Fine-grained temporal questions illustrate the difficulty:
- "How long does the character wait before opening the door?"
- "Which event happens first, the explosion or the alarm?"
- "Does the speaker's tone change after the question is asked?"
These require the model to not just identify events but understand their precise temporal relationships — duration, ordering, simultaneity, causation. Current models handle coarse temporal reasoning (first half vs. second half of a video) much better than fine-grained reasoning (exact ordering of rapid events within a few seconds).
Extending Image Models to Video (2024)
LLaVA-NeXT-Video (2024) demonstrated that strong image-language models could be extended to video with a straightforward approach: sample N frames uniformly from the video, encode each as image tokens, concatenate them in temporal order, and feed the entire sequence to the LLM.
With 32 uniformly sampled frames and training on video QA datasets, this approach achieves competitive results on standard benchmarks without any specialized video architecture. The temporal information is implicitly captured by the positional encoding of frame tokens in the sequence — the model learns that earlier frames come first.
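A sketch of that recipe follows, with a generic `encode_image` function standing in for the image encoder; the names are illustrative, not the LLaVA-NeXT-Video code. Uniform index selection and temporal-order concatenation are the whole trick, with frame order carried only by token position.

```python
# Sketch of extending an image model to video: sample N frames uniformly,
# encode each as image tokens, concatenate in temporal order.
def uniform_frame_indices(num_total_frames: int, num_samples: int = 32) -> list[int]:
    # Evenly spaced indices from the first frame to the last.
    if num_samples == 1:
        return [0]
    step = (num_total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]

def video_to_token_sequence(frames, encode_image):
    # Encode each sampled frame independently, then concatenate the visual
    # tokens in temporal order; the LLM's positional encoding is the only
    # signal that frame 3 came after frame 2.
    token_chunks = [encode_image(f) for f in frames]   # e.g. 256 tokens per frame
    return [tok for chunk in token_chunks for tok in chunk]
```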
Gemini 2.0 (December 2024) improved real-time video processing, enabling streaming video analysis — processing video as it arrives rather than requiring the full video upfront. This capability is critical for applications like live sports commentary, real-time security monitoring, and interactive video conferencing assistance.
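One way to picture streaming analysis is a sliding window over the most recent frames that is re-queried as new frames arrive, rather than waiting for the full video. The sketch below is purely illustrative; `answer_from_frames` is a hypothetical stand-in, not the Gemini API.

```python
# Illustrative streaming loop: keep only the last `window_seconds` of frames
# and query the model incrementally as new frames arrive.
from collections import deque

def stream_video(frame_source, answer_from_frames, window_seconds: int = 30,
                 fps: float = 1.0):
    window = deque(maxlen=int(window_seconds * fps))
    for frame in frame_source:              # frames arrive one at a time
        window.append(frame)
        # Re-query over the recent window; a real system would batch or
        # cache encodings rather than re-encoding every frame each step.
        yield answer_from_frames(list(window), "Describe what just happened.")
```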
Applications and Use Cases
Video understanding enables a wide range of practical applications:
- Content moderation: Detecting policy violations in user-uploaded video at platform scale
- Security and surveillance: Identifying events of interest in continuous video feeds
- Education: Making video lectures searchable, interactive, and summarizable
- Media production: Automated tagging, highlight generation, and content summarization
- Medical: Analyzing surgical procedures, patient monitoring, physical therapy assessment
- Sports analytics: Tracking player movements, identifying plays, generating statistics
- Accessibility: Generating audio descriptions of visual content for visually impaired viewers
Why It Matters
Video is the dominant medium of the internet — YouTube alone hosts over 800 million videos, and TikTok serves billions of short-form video views daily. The ability to understand, search, and reason about video content at scale has enormous practical implications that text and image understanding alone cannot address.
The ability to process hour-long videos in a single prompt — rather than requiring pre-segmented clips — is particularly transformative. An LLM can analyze an entire meeting recording, surgical procedure, lecture, or documentary and answer questions about any part of it, summarize the key points, or extract specific information. This shifts video from an opaque medium that requires human viewing time to a queryable data source.
For accessibility, video understanding enables real-time audio descriptions for visually impaired users, automatic captioning for deaf users, and content summarization for users with limited time. These are not incremental improvements but fundamental changes in who can access video content.
Key Technical Details
- Gemini 1.5 Pro (Feb 2024): 1M token context. Up to 1 hour of video. Variable frame rate sampling.
- Token Budget: At 256 tokens/frame: 1 min = 15,360 tokens; 10 min = 153,600; 1 hour = 921,600.
- Frame Sampling: Uniform sampling at 1-2 fps for short videos, 0.25-0.5 fps for hour-long videos.
- LLaVA-NeXT-Video (2024): Extends image model to video. 32 uniformly sampled frames. Competitive results.
- Video-LLaMA (2023): Audio-visual encoder + LLM adapter for video understanding.
- Gemini 2.0 (Dec 2024): Real-time streaming video analysis. Improved temporal reasoning.
- Standard Benchmarks: ActivityNet-QA, MSRVTT-QA, MSVD-QA, EgoSchema, Video-MME.
- EgoSchema: 5,000 questions about 5-minute egocentric clips. Tests long-range temporal reasoning.
- Frontier Accuracy: ~60-70% on challenging video QA benchmarks (vs. ~85% human baseline).
- Compute Cost: Processing 1 hour of video requires ~10x the compute of processing equivalent text.
Common Misconceptions
- "Video understanding is just image understanding applied to multiple frames." Temporal reasoning — understanding sequence, causality, duration, and change over time — is a fundamentally different capability from spatial visual understanding. Models that excel at image QA can fail at basic temporal video questions because they lack the ability to reason about how things change.
- "Models truly 'watch' videos like humans do." Current models process sampled frames, not continuous video. At 1 fps, they see one frame per second and miss everything between frames. Fast actions, brief visual events, and subtle temporal patterns can be entirely invisible to the model. Human vision processes continuous motion at roughly 24+ fps equivalent.
- "Long-context models solve video understanding." Long context enables processing more frames, but the core challenge of temporal reasoning is not solved by context length alone. Models still struggle with precise temporal ordering and fine-grained causal reasoning even with million-token contexts. The bottleneck is the quality of temporal representations, not the quantity of frames processed.
Connections to Other Concepts
Video understanding builds directly on 01-vision-language-models.md for visual encoding and 02-gemini-1-5.md for long-context processing. The token budget challenge connects to 07-long-context-techniques.md and the compute costs discussed in 07-training-efficiency-breakthroughs.md. Native video processing is part of 02-native-multimodal-training.md, and video represents a key modality in 05-the-convergence-toward-omni-models.md. Audio tracks in video connect to 03-audio-and-speech-models.md.
Further Reading
- Gemini Team, "Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context" (2024) — long-context video understanding.
- Li et al., "VideoChat: Chat-Centric Video Understanding" (2023) — early video-LLM integration.
- Zhang et al., "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding" (2023) — multimodal video.
- Liu et al., "LLaVA-NeXT: A Strong Zero-shot Video Understanding Model" (2024) — extending image models to video.
- Mangalam et al., "EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding" (2023) — temporal reasoning benchmark.