Courses·Advanced And Emerging·7 min read

State Space Models & Mamba

State Space Models offer a fundamentally different approach to sequence modeling that processes tokens in linear time through learned recurrent state updates, with Mamba's selective mechanism making them the most credible alternative to Transformers.

One-Line Summary: State Space Models offer a fundamentally different approach to sequence modeling that processes tokens in linear time through learned recurrent state updates, with Mamba's selective mechanism making them the most credible alternative to Transformers.

Prerequisites: Understanding of Transformer architecture and self-attention, recurrent neural networks (RNNs), computational complexity (Big-O notation), matrix operations, and the quadratic cost of standard attention.

What Are State Space Models?

Imagine reading a book. A Transformer re-reads every previous page each time it encounters a new word -- that is what quadratic attention does. A State Space Model, by contrast, reads like a human: it maintains a running mental summary (the "state") and updates it with each new word, never needing to look back at the raw text.

flowchart LR
    S1["State space model (S4) diagram"]
    S2["continuous-time state evolution discretize"]
    S1 --> S2

State Space Models (SSMs) are a class of sequence models rooted in control theory. They model a system's behavior through a hidden state that evolves over time according to learned dynamics. The key insight is that a continuous-time linear system -- defined by matrices A, B, C, and D -- can be discretized and applied to sequential data like text, producing outputs in linear time relative to sequence length.

Mamba, introduced by Albert Gu and Tri Dao in late 2023, is the breakthrough SSM variant that made this architecture competitive with Transformers for language modeling. Its core innovation is the selective state space mechanism: instead of using fixed dynamics for every token, Mamba makes the state update parameters (B, C, and the discretization step delta) functions of the input itself. This allows the model to selectively remember or forget information based on content -- something fixed linear systems cannot do.

How It Works

flowchart LR
    S1["Mamba selective state space"]
    S2["input-dependent selection mechanism replac"]
    S1 --> S2

The Classical SSM Framework

A continuous-time state space model is defined by four matrices:

h'(t) = A * h(t) + B * x(t)
y(t) = C * h(t) + D * x(t)

Where:

h(t) is the hidden state (a compressed representation of history)
x(t) is the input at time t
y(t) is the output at time t
A governs how the state evolves (state transition matrix)
B controls how input enters the state
C maps state to output
D is a skip connection (often omitted)

Discretization

To process discrete sequences (tokens), the continuous system is discretized using a step size delta:

A_bar = exp(delta * A)
B_bar = (delta * A)^(-1) * (A_bar - I) * delta * B

This converts the continuous dynamics into a recurrence:

h_t = A_bar * h_{t-1} + B_bar * x_t
y_t = C * h_t

The Structured State Space (S4) Foundation

The S4 model (Gu et al., 2022) demonstrated that initializing A with the HiPPO (High-order Polynomial Projection Operators) matrix allows the state to optimally compress long-range history. During training, the recurrence can be unrolled as a global convolution, enabling parallelism on GPUs.

Mamba's Selective Mechanism

Mamba's key departure from S4 is making B, C, and delta input-dependent:

B_t = Linear_B(x_t)
C_t = Linear_C(x_t)
delta_t = softplus(Linear_delta(x_t))

This selectivity means the model can:

Gate information: Large delta values let new input dominate; small values preserve existing state.
Content-aware filtering: The model learns which tokens matter and which to ignore -- similar to attention, but without the pairwise comparison cost.

Hardware-Efficient Implementation

Mamba uses a parallel scan algorithm instead of naive sequential recurrence. The scan operates in O(n) time but is parallelizable on GPUs. Gu and Dao also developed a kernel-fused implementation that keeps the expanded state in GPU SRAM (fast memory), avoiding costly HBM (high-bandwidth memory) reads -- a technique borrowing from FlashAttention's philosophy.

Why It Matters

The quadratic cost of self-attention (O(n^2) in sequence length) is the fundamental bottleneck limiting context windows and throughput in Transformer-based LLMs. SSMs process sequences in O(n) time and O(1) memory per step during inference (constant state size regardless of history length). This means:

Inference scales linearly with sequence length, not quadratically.
Constant memory during generation -- no growing KV cache.
Potentially unlimited context without the memory explosion of attention.

For applications like real-time streaming, long-document processing, and edge deployment, these properties are transformative.

Key Technical Details

State dimension (N): Typically 16-64 per channel. Larger states capture more history but cost more compute.
Mamba-1 uses a state dimension of N=16 with an expand factor of 2 (inner dimension = 2 * model dimension).
Mamba-2 (Dao and Gu, 2024) reformulated selective SSMs as a form of structured masked attention (SMA), achieving 2-8x faster training while connecting SSMs and attention theoretically.
Training throughput: Mamba-3B matches Transformer-3B quality on language modeling while training ~5x faster on long sequences.
The convolution-recurrence duality: During training, SSMs use the convolutional view for parallelism. During inference, they use the recurrent view for efficiency. Mamba's selectivity breaks the pure convolutional form, requiring the scan algorithm instead.
No explicit attention matrix is ever computed or stored.

Common Misconceptions

"SSMs have replaced Transformers": As of early 2026, pure SSMs have not displaced Transformers for frontier language models. Transformers still dominate at the largest scales, particularly for tasks requiring precise retrieval from context (associative recall). SSMs excel at efficiency but can struggle with tasks that demand exact copying or lookup from long contexts.
"Mamba is just a better RNN": While Mamba has a recurrent form, its parallel scan training, selective gating, and structured initialization make it fundamentally different from classical RNNs. It does not suffer from vanishing gradients in the same way.
"Linear attention and SSMs are the same thing": While Mamba-2 showed a theoretical connection, linear attention approximates softmax attention, whereas SSMs come from control theory with different inductive biases. Their practical behavior differs meaningfully.
"SSMs cannot do in-context learning": Mamba demonstrates strong in-context learning. The selective mechanism provides input-dependent processing analogous to what attention enables, though the mechanism is different.

Connections to Other Concepts

attention-mechanism.md: SSMs are the primary alternative to attention. Understanding why attention is O(n^2) clarifies why SSMs' O(n) complexity matters.
flash-attention.md: Mamba's hardware-aware kernel design directly borrows principles from FlashAttention's SRAM-optimized approach.
Hybrid Architectures: The dominant production trend is now hybrid SSM-Attention models that combine SSM efficiency with attention's precise recall:
- Jamba (AI21 Labs, 2024): Interleaves Mamba and Transformer attention layers in a 1:7 ratio (1 attention layer per 7 Mamba layers), combined with MoE. Jamba-1.5 scaled to 398B total parameters with 94B active, achieving competitive quality with 2x the throughput and 256K context.
- Zamba (Zyphra, 2024): A hybrid architecture sharing attention parameters across all attention layers while keeping Mamba blocks separate, achieving strong quality with minimal attention overhead.
- RWKV-6/7 (RWKV Foundation, 2024-2025): An alternative linear-time architecture based on linear attention with exponential decay, supporting unlimited context and competitive quality. RWKV-6 demonstrated that pure recurrent architectures can match Transformer quality at the 7B-14B scale.
- xLSTM (Sepp Hochreiter et al., 2024): Extended LSTM with exponential gating and matrix-valued memory, achieving competitive performance with Transformers while maintaining the LSTM's recurrent efficiency.
The hybrid pattern suggests the future may not be "SSMs vs. Transformers" but "SSMs + Transformers," using each where its strengths matter most.
context-window-extension.md: SSMs sidestep the context extension problem entirely -- their recurrent nature means infinite context is theoretically possible, limited only by state capacity rather than memory.
bidirectional-rnns.md: SSMs are the modern evolution of recurrent sequence modeling, solving the parallelization and vanishing gradient problems that limited classical RNNs.
Model Efficiency: SSMs connect to the broader theme of making models faster and cheaper at inference, alongside quantization and distillation.