Courses·Advanced And Emerging·7 min read

Test-Time Compute & Inference-Time Scaling

Test-time compute is the paradigm shift from making models bigger to making models think harder, allocating additional computation at inference to explore reasoning paths, verify answers, and dramatically improve performance on complex problems.

One-Line Summary: Test-time compute is the paradigm shift from making models bigger to making models think harder, allocating additional computation at inference to explore reasoning paths, verify answers, and dramatically improve performance on complex problems.

Prerequisites: Understanding of LLM inference, sampling strategies (temperature, top-k, top-p), chain-of-thought prompting, reinforcement learning basics (rewards, policies), and the concept of scaling laws (train-time compute scaling).

What Is Test-Time Compute?

Consider the difference between a student who writes the first answer that comes to mind versus one who drafts multiple solutions, checks each for errors, and selects the best one. Both students have the same knowledge (same model weights), but the second student spends more time thinking (more compute at test time) and consistently gets better results.

flowchart LR
    S1["performance improving with repeated sampli"]
    S2["verification"]
    S3["search"]
    S1 --> S2
    S2 --> S3

Test-time compute -- also called inference-time scaling or inference-time compute -- refers to techniques that improve model outputs by spending more computation when generating each response, rather than during training. This is the new scaling frontier: instead of only asking "how big should the model be?" we now also ask "how much should the model think about this particular problem?"

The traditional scaling paradigm (Kaplan et al., 2020; Hoffmann et al., 2022) showed that performance improves predictably with more training compute (larger models, more data). The test-time compute paradigm, validated by OpenAI's o1 and subsequent reasoning models, shows that performance also scales predictably with inference compute -- and for many hard problems, this scaling is more efficient.

How It Works

flowchart TD
    L1["Compute-optimal frontier"]
    L2["trade-off between model size and inference"]
    L1 --> L2

Self-Consistency (Wang et al., 2023)

The simplest test-time compute method: sample multiple independent responses and take a majority vote.

1. Generate N responses to the same prompt (e.g., N = 40)
2. Extract the final answer from each response
3. Return the answer that appears most frequently

For a math problem, if 32 out of 40 samples produce "42" and 8 produce other answers, output "42." This works because correct reasoning paths tend to converge on the same answer, while errors are diverse. The accuracy improvement follows a logarithmic curve with the number of samples.

Tree-of-Thought (Yao et al., 2023)

Instead of sampling complete independent responses, Tree-of-Thought (ToT) explores a branching tree of partial reasoning steps:

1. Generate multiple candidate first steps
2. Evaluate each step (using the model itself or a heuristic)
3. Expand the most promising steps with multiple next steps
4. Repeat until reaching a solution
5. Select the best complete path

This is analogous to game tree search: rather than hoping a single chain of thought finds the right path, you systematically explore the reasoning space. The evaluation function can use the model's own assessment, a separate verifier, or heuristic scoring.

Verifier-Guided Search (Process Reward Models)

A major advance: train a separate verifier model that scores individual reasoning steps, not just final answers.

Outcome Reward Models (ORMs): Score complete solutions as correct or incorrect. Used to select the best among multiple full solutions.

Process Reward Models (PRMs): Score each intermediate reasoning step. Lightman et al. (2023) showed that step-level verification dramatically outperforms outcome-level verification:

For each candidate solution:
  Score = product of P(step_i is correct) for all steps i
Select the solution with the highest cumulative step score

PRMs enable more efficient search because they can prune bad reasoning paths early, before they complete. This is analogous to alpha-beta pruning in game tree search.

Reasoning Models: o1, o3, and DeepSeek-R1

OpenAI's o1 (2024) demonstrated that training a model to use extended internal reasoning chains -- spending hundreds or thousands of tokens "thinking" before answering -- produces dramatic improvements on math, coding, and science problems. The model generates a long chain of reasoning tokens (hidden from the user) before producing its final answer.

The training recipe (as understood from public information and DeepSeek-R1's open approach):

Start with a capable base model (pre-trained + instruction-tuned)
Reinforcement learning with verifiable rewards (RLVR): Train the model to produce reasoning chains that lead to correct answers on problems with verifiable solutions (math, code, formal logic).
The reward signal: For a math problem, the reward is simply whether the final answer matches the ground truth. For code, whether it passes test cases. No human preference labels needed.
The model learns to use thinking tokens: Through RL, the model discovers that generating intermediate reasoning steps (checking work, trying alternatives, backtracking) increases its reward. The chain-of-thought emerges from optimization pressure, not explicit supervision.

DeepSeek-R1 showed this works remarkably well, matching o1-level performance with a transparent training methodology.

The Scaling Law for Test-Time Compute

Snell et al. (2024) formalized the relationship:

Performance = f(train_compute, test_compute)

Key findings:

Test-time compute and train-time compute are partially substitutable: a smaller model with more test-time compute can match a larger model with less.
The optimal allocation depends on problem difficulty: easy problems benefit little from extra thinking; hard problems benefit enormously.
There are diminishing returns, but the curve is favorable for current models on hard reasoning tasks.

Why It Matters

Test-time compute may be the most important development in AI scaling since the Chinchilla scaling laws. Its implications include:

Democratization: Smaller models with test-time compute can approach the performance of much larger models, reducing the cost barrier for frontier capabilities.
Adaptive computation: Instead of one-size-fits-all inference cost, systems can allocate more thinking to harder problems and less to easy ones, optimizing the cost-quality tradeoff.
Reasoning breakthroughs: Tasks previously considered intractable for LLMs (competition-level math, complex multi-step reasoning) become solvable with sufficient test-time compute.
New training paradigm: RLVR (reinforcement learning with verifiable rewards) is simpler than RLHF -- no human labelers needed, just automatic verification. This is cheaper and more scalable.

Key Technical Details

Compute multiplication: Generating N samples multiplies inference cost by N. A 70B model generating 32 samples per query uses the same total compute as running a ~2T parameter model once -- but often achieves better results on reasoning tasks.
Latency vs. quality tradeoff: More thinking means longer wait times. Real-time applications may limit test-time compute, while batch processing (research, code generation) can afford extensive search.
Hidden reasoning tokens: Reasoning models like o1 generate internal chain-of-thought tokens that are not shown to users. DeepSeek-R1 makes these visible, enabling research into what effective reasoning looks like.
Verification is easier than generation: A verifier only needs to check if a solution is correct, which is often much easier than generating a correct solution from scratch. This asymmetry is what makes verifier-guided search so effective.
The RL training is surprisingly simple: DeepSeek-R1 showed that starting from a base model with GRPO (Group Relative Policy Optimization) and simple binary rewards (correct/incorrect) produces strong reasoning behavior without complex reward shaping.
Emergent behaviors: During RL training, models spontaneously develop behaviors like self-correction, trying alternative approaches, and expressing uncertainty -- without being explicitly taught these strategies.

Common Misconceptions

"Test-time compute replaces the need for large models": It complements, not replaces. A better base model benefits more from test-time compute. The scaling laws show both dimensions matter.
"This is just chain-of-thought prompting": Chain-of-thought prompting asks a pre-trained model to show its work. Reasoning models are specifically trained through RL to develop effective internal reasoning strategies. The difference in quality is dramatic.
"More samples always helps": Returns diminish, and for easy problems, even one sample from a good model is often sufficient. The benefit concentrates on problems near the model's capability frontier.
"Reasoning tokens are just the model stalling": Analysis of reasoning traces shows structured problem-solving: breaking down problems, exploring alternatives, catching errors, and synthesizing solutions. The tokens do meaningful computational work.
"This only works for math and code": While verifiable domains show the clearest gains (because reward signals are clean), reasoning models also improve on writing, analysis, and planning tasks. The benefit extends beyond formal verification.

Connections to Other Concepts

chain-of-thought-prompting.md: The precursor to test-time compute. CoT demonstrated that intermediate reasoning steps improve performance; reasoning models formalize and optimize this through RL.
rlhf.md: RLVR uses the same RL algorithms (PPO, GRPO) but with automatic verification instead of human preference labels, making it more scalable.
scaling-laws.md: Test-time compute extends the scaling law framework from a single axis (train compute) to two axes (train + inference compute), fundamentally changing how we think about compute allocation.
compound-ai-systems.md: Test-time compute techniques (sampling, verification, search) are building blocks for compound systems that combine multiple inference strategies.
Model Distillation: Knowledge from reasoning models can be distilled into faster models that approximate the reasoning behavior without the full search cost.
trajectory-evaluation.md: Reasoning models have saturated many traditional benchmarks, driving demand for harder evaluations (FrontierMath, SWE-bench Verified).