One-Line Summary: Video transformers apply self-attention to spatiotemporal tokens extracted from video, achieving strong accuracy but facing a quadratic cost challenge that demands factorized attention strategies.
Prerequisites: Vision transformers (ViT), self-attention mechanism, positional encoding, 3D convolutions, video representation
What Are Video Transformers?
Imagine trying to understand a complex scene by looking at every detail in relation to every other detail -- not just in the current moment, but across time. A detective reviewing security footage might note that a person in frame 1 is connected to a bag in frame 50 and a car in frame 100. Video transformers work similarly: they break the video into small spatiotemporal patches (tokens) and use self-attention to let every patch attend to every other patch across space and time, capturing long-range dependencies that local convolutions miss.
Video transformers extend the Vision Transformer (ViT) paradigm from images to video. The input video is divided into spatiotemporal tokens (either frame-level patches or volumetric "tubelets"), which are linearly embedded and processed through transformer encoder layers. The central challenge is that the number of tokens grows with both the spatial and the temporal extent of the clip, and self-attention is O(N²) in the number of tokens N, making naive full spatiotemporal attention prohibitively expensive.
How It Works
Tokenization
A video of shape T x H x W x 3 is divided into tokens. Two main strategies exist:
Frame-level patches: Each frame is independently divided into non-overlapping patches of size P x P. For T frames, the total number of tokens is N = T · (H/P) · (W/P).
With T = 32, H = W = 224, P = 16: N = 32 · 14 · 14 = 6272 tokens.
Tubelet embedding: The video is divided into non-overlapping 3D volumes of size t x P x P, and each is linearly projected to a single token: N = (T/t) · (H/P) · (W/P).
With t = 2 and the same video: N = 16 · 14 · 14 = 3136 tokens. Tubelets capture local temporal information within each token.
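In practice, tubelet embedding is usually implemented as a strided 3D convolution whose kernel and stride equal the tubelet size. Below is a minimal PyTorch sketch assuming 2 x 16 x 16 tubelets and 768-dimensional tokens; the class name and defaults are illustrative, not taken from any particular codebase.

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Minimal tubelet tokenizer: a 3D convolution with kernel == stride cuts
    the video into non-overlapping t x P x P volumes and linearly projects
    each one to a d-dimensional token."""
    def __init__(self, embed_dim=768, tubelet=(2, 16, 16), in_chans=3):
        super().__init__()
        self.proj = nn.Conv3d(in_chans, embed_dim,
                              kernel_size=tubelet, stride=tubelet)

    def forward(self, video):                # video: (B, 3, T, H, W)
        x = self.proj(video)                 # (B, d, T/t, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (B, N, d), N = (T/t)(H/P)(W/P)

video = torch.randn(1, 3, 32, 224, 224)      # 32 frames at 224 x 224
tokens = TubeletEmbedding()(video)
print(tokens.shape)                          # torch.Size([1, 3136, 768])
```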
TimeSformer: Divided Space-Time Attention
Bertasius et al. (2021) proposed TimeSformer, which avoids full spatiotemporal attention by factorizing it into separate operations:
- Temporal attention: Each spatial patch attends only to patches at the same spatial location across all frames. For a patch at spatial position (i, j) in frame t, it attends to the tokens at (i, j) in every frame t' = 1, ..., T.
- Spatial attention: Each patch attends only to patches within the same frame. For a patch in frame t, it attends to all S tokens of frame t.
These are applied sequentially within each transformer block. The attention cost per block drops from O((S · T)²) for full attention to O(S · T² + T · S²).
Simplified, the cost scales as S · T · (S + T) rather than (S · T)², where S is the number of spatial tokens per frame and T is the number of frames.
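A minimal PyTorch sketch of divided space-time attention is shown below; it keeps only the two attention operations with residual connections (no class token, LayerNorm, or MLP) and assumes frame-major token ordering, so it illustrates the idea rather than reproducing the authors' implementation.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeAttention(nn.Module):
    """Sketch of divided attention: temporal attention over the T tokens that
    share a spatial location, then spatial attention over the S tokens of
    each frame."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, T, S):              # x: (B, T*S, d), frame-major order
        B, _, d = x.shape
        # Temporal attention: group tokens by spatial location -> (B*S, T, d)
        xt = x.reshape(B, T, S, d).transpose(1, 2).reshape(B * S, T, d)
        xt = xt + self.temporal(xt, xt, xt, need_weights=False)[0]
        # Spatial attention: group tokens by frame -> (B*T, S, d)
        xs = xt.reshape(B, S, T, d).transpose(1, 2).reshape(B * T, S, d)
        xs = xs + self.spatial(xs, xs, xs, need_weights=False)[0]
        return xs.reshape(B, T * S, d)

x = torch.randn(2, 16 * 196, 768)            # 16 temporal indices x 196 patches
print(DividedSpaceTimeAttention()(x, T=16, S=196).shape)  # torch.Size([2, 3136, 768])
```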
TimeSformer-L with 96 frames: 80.7% top-1 on Kinetics-400.
ViViT: Factorized Encoder
Arnab et al. (2021) proposed ViViT with four attention variants:
Model 1 -- Spatio-temporal attention: Full self-attention over all tokens. Most accurate but O((S · T)²) cost. Impractical for long videos.
Model 2 -- Factorized encoder: Two separate transformer encoders. First, a spatial encoder processes patches within each frame independently (T parallel forward passes). Then, a temporal encoder processes the [CLS] tokens (or averaged spatial tokens) across frames. Attention cost: on the order of T · S² + T², versus (S · T)² for full attention (a minimal sketch of this design follows the model list).
Model 3 -- Factorized self-attention: Similar to TimeSformer's divided attention but within a single encoder. Spatial and temporal attention alternate within each block.
Model 4 -- Factorized dot-product attention: Computes spatial and temporal attention heads separately within the same multi-head attention layer (some heads attend spatially, others temporally).
ViViT-L/16x2 (Model 2): 81.3% top-1 on Kinetics-400 (with JFT-300M pretraining).
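For concreteness, here is a minimal sketch of a Model 2-style factorized encoder in PyTorch, using mean pooling per frame instead of a [CLS] token; the layer counts, dimensions, and pooling choice are illustrative assumptions, not the ViViT configuration.

```python
import torch
import torch.nn as nn

class FactorizedEncoder(nn.Module):
    """Sketch of a factorized encoder: a spatial transformer processes each
    frame's S tokens independently, then a temporal transformer processes
    the T per-frame summaries."""
    def __init__(self, dim=768, heads=12, spatial_layers=2, temporal_layers=2):
        super().__init__()
        block = lambda: nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                                   batch_first=True)
        self.spatial = nn.TransformerEncoder(block(), spatial_layers)
        self.temporal = nn.TransformerEncoder(block(), temporal_layers)

    def forward(self, tokens, T, S):             # tokens: (B, T*S, d), frame-major
        B, _, d = tokens.shape
        x = self.spatial(tokens.reshape(B * T, S, d))   # one frame per batch entry
        frame_repr = x.mean(dim=1).reshape(B, T, d)     # pooled per-frame summary
        return self.temporal(frame_repr)                # (B, T, d): temporal reasoning

tokens = torch.randn(2, 16 * 196, 768)
print(FactorizedEncoder()(tokens, T=16, S=196).shape)   # torch.Size([2, 16, 768])
```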
VideoMAE: Self-Supervised Video Transformers
Tong et al. (2022) showed that masked autoencoding is highly effective for video transformers. By masking a very high ratio (90--95%) of spatiotemporal tubes and reconstructing them, VideoMAE learns strong representations without labels. Key findings (a minimal tube-masking sketch follows this list):
- 90% masking ratio for video (vs. 75% for images) due to temporal redundancy
- VideoMAE with ViT-B: 81.5% on Kinetics-400 (self-supervised, then fine-tuned)
- Data-efficient: competitive results when pretraining with only ~3.5k videos (HMDB51), without extra data
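Below is a minimal sketch of the tube-masking step, assuming the frame-major token layout used elsewhere on this page; the function name and exact sampling scheme are illustrative, not the VideoMAE reference code.

```python
import torch

def tube_mask(batch, T, S, mask_ratio=0.9):
    """Pick a random ~mask_ratio fraction of spatial positions per clip and
    mask them at every temporal index, so masked tokens form tubes through
    time. Returns a (batch, T*S) bool tensor, True = masked."""
    num_masked = int(S * mask_ratio)
    rank = torch.rand(batch, S).argsort(dim=1).argsort(dim=1)  # random ranks 0..S-1
    spatial_mask = rank < num_masked                           # (batch, S)
    return spatial_mask.unsqueeze(1).expand(batch, T, S).reshape(batch, T * S)

mask = tube_mask(batch=2, T=16, S=196)
print(mask.shape, mask.float().mean().item())   # (2, 3136), roughly 0.9 masked
```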
Computational Reality
For a ViT-B model with embedding dimension d = 768, 12 layers, and N tokens, the self-attention cost per layer is approximately 4 · N · d² + 2 · N² · d FLOPs: the first term covers the QKV and output projections, the second covers computing the N x N attention matrix and applying it to the values.
The 2 · N² · d term dominates for large N. With N = 3136 (32 frames, 224x224, 16x16 spatial patches, tubelet length 2):
- Full attention per layer (the 2 · N² · d term): ~15.1 GFLOPs
- Factorized (TimeSformer-style) per layer: ~2.3 GFLOPs
- Full model (12 layers + FFN): ~590 GFLOPs (full) vs. ~180 GFLOPs (factorized)
Memory scales similarly: storing the N x N attention matrix for N = 3136 requires ~37 MB per head per layer in FP32.
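The arithmetic behind these figures can be reproduced in a few lines of Python. The sketch below counts only the attention-map operations (QK^T and the attention-weighted sum), so the divided-attention number comes out below the ~2.3 GFLOPs quoted above, which presumably also includes the second attention's projection layers.

```python
d = 768                      # ViT-B embedding dimension
T, S = 16, 14 * 14           # 16 temporal indices x 196 spatial patches per frame
N = T * S                    # 3136 tokens

full_flops = 2 * N**2 * d            # QK^T plus attention-weighted sum of values
divided_flops = 2 * N * d * (T + S)  # temporal + spatial attention maps only
attn_matrix_bytes = N * N * 4        # one fp32 N x N attention map, per head

print(f"full attention:    {full_flops / 1e9:.1f} GFLOPs per layer")       # ~15.1
print(f"divided attention: {divided_flops / 1e9:.1f} GFLOPs per layer")    # ~1.0
print(f"attention matrix:  {attn_matrix_bytes / 2**20:.1f} MiB per head")  # ~37.5
```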
Why It Matters
- Long-range temporal modeling: Self-attention captures dependencies across the entire video in a single layer, whereas 3D CNNs require many layers to achieve a comparable temporal receptive field.
- Scalability with data: Video transformers benefit significantly from large-scale pretraining (Kinetics-710, JFT-300M) and self-supervised learning (VideoMAE), with accuracy continuing to improve with more data.
- Flexible attention patterns: Factorized attention enables architectural experimentation impossible with fixed convolution kernels -- different layers can focus on different spatiotemporal patterns.
- State-of-the-art accuracy: Video transformers (especially with MAE pretraining) hold top results on Kinetics-400 (87--88%), Something-Something V2 (~77%), and other benchmarks.
Key Technical Details
- TimeSformer-HR (high resolution: 448x448 input, 16 frames): 79.7% top-1 on Kinetics-400, 1703 GFLOPs
- ViViT-L/16x2 (factorized encoder): 81.3% on K400 (JFT pretrained), 3992 GFLOPs x 4 views
- VideoMAE V2 ViT-g: 87.4% on Kinetics-400 (at time of publication)
- Standard training: AdamW optimizer, cosine learning rate schedule, 0.1--0.2 label smoothing
- Positional embeddings: learned or sinusoidal; separate spatial and temporal embeddings are common
- Inference: 4--12 temporal clips x 1--3 spatial crops, with scores averaged (see the sketch after this list)
- Typical training: 32--64 GPUs, 1--3 days on Kinetics-400
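The multi-view inference protocol amounts to averaging softmax scores over temporal clips and spatial crops. A minimal sketch, with a hypothetical stand-in classifier in place of a real video transformer:

```python
import torch

@torch.no_grad()
def multi_view_predict(model, clips):
    """`clips` is a list of (num_crops, C, T, H, W) tensors, one per temporal
    clip; scores are softmax-averaged over crops, then over clips."""
    per_clip = [model(c).softmax(dim=-1).mean(dim=0) for c in clips]
    return torch.stack(per_clip).mean(dim=0)      # final class-probability vector

# Hypothetical 400-way classifier standing in for a real video transformer.
dummy_model = lambda x: torch.randn(x.shape[0], 400)
clips = [torch.randn(3, 3, 16, 224, 224) for _ in range(4)]   # 4 clips x 3 crops
print(multi_view_predict(dummy_model, clips).shape)           # torch.Size([400])
```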
Common Misconceptions
- "Video transformers always outperform 3D CNNs." On datasets where temporal reasoning is critical (Something-Something V2), transformers show clear advantages. On appearance-dominated datasets (Kinetics), the gap over well-tuned 3D CNNs like SlowFast is modest (1--3%). Efficiency-wise, 3D CNNs are often faster at inference.
- "Full spatiotemporal attention is always best." TimeSformer's divided attention and ViViT's factorized encoder achieve comparable accuracy to full attention at a fraction of the cost. Full attention shows diminishing returns beyond short temporal windows because nearby frames are highly redundant.
- "Video transformers don't need convolutions." Many top-performing video transformers incorporate convolutional components: tubelet embedding (3D convolution for tokenization), convolutional position encoding, and hybrid architectures like MViT that use pooling attention.
Connections to Other Concepts
- 3d-convolutions.md: Video transformers emerged as alternatives, though hybrid designs combining both remain competitive.
- video-representation.md: Tubelet tokenization defines how raw video is converted into the token sequence transformers process.
- action-recognition.md: The primary benchmark task driving video transformer development.
- two-stream-networks.md: Some video transformers adopt two-stream designs (RGB + flow or RGB at two resolutions).
- video-object-tracking.md: Transformer-based trackers (STARK, MixFormer) adapt video transformer ideas for tracking.
Further Reading
- Bertasius et al., "Is Space-Time Attention All You Need for Video Understanding?" (2021) -- TimeSformer with divided space-time attention.
- Arnab et al., "ViViT: A Video Vision Transformer" (2021) -- Systematic exploration of factorized video transformer designs.
- Fan et al., "Multiscale Vision Transformers" (2021) -- MViT with pooling attention for efficient multi-scale spatiotemporal processing.
- Tong et al., "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training" (2022) -- Self-supervised pretraining with 90% masking ratio.