Courses·Advanced And Emerging·7 min read

Model Merging

Model merging combines the weights of two or more separately trained models into a single model without any additional training, exploiting the surprising geometric structure of neural network loss landscapes to blend capabilities from different fine-tuned variants.

One-Line Summary: Model merging combines the weights of two or more separately trained models into a single model without any additional training, exploiting the surprising geometric structure of neural network loss landscapes to blend capabilities from different fine-tuned variants.

Prerequisites: Understanding of neural network weight spaces, fine-tuning, loss landscapes and optimization, linear algebra (vector operations, interpolation), and the concept of task-specific adaptation from a shared pre-trained base.

What Is Model Merging?

Imagine you have two chefs: one specializes in French cuisine, the other in Japanese. Model merging is like mathematically combining their skills into a single chef who can cook both -- without that chef ever setting foot in a kitchen. It sounds like it should not work, but it does, and remarkably well.

flowchart LR
    subgraph L1["Model merging methods comparison: linear i"]
        LI3["TIES"]
    end
    subgraph R2["SLERP"]
        RI4["Feature 1"]
    end

Model merging takes the learned parameters (weights) from multiple models and combines them using mathematical operations -- averaging, interpolation, or more sophisticated methods -- to produce a new model that inherits capabilities from all parents. No training data is needed. No gradient computation. Just arithmetic on weight tensors.

This works because fine-tuned models that share a common pre-trained base occupy the same region of the loss landscape. The loss surface between them tends to be relatively flat (low loss barriers), meaning interpolated points between them are also good solutions. This is sometimes called the "linear mode connectivity" property.

How It Works

flowchart LR
    S1["how task capabilities can be added"]
    S2["subtracted in weight space"]
    S1 --> S2

Linear Interpolation (LERP) and Model Soups

The simplest approach: weighted average of model parameters.

theta_merged = alpha * theta_A + (1 - alpha) * theta_B

Where alpha is a mixing coefficient between 0 and 1. "Model Soups" (Wortsman et al., 2022) extended this to averaging multiple fine-tuned checkpoints of the same model, showing that the average often outperforms any individual checkpoint. This is the foundation -- simple averaging works because fine-tuned variants from the same base live in the same loss basin.

SLERP (Spherical Linear Interpolation)

SLERP treats weight vectors as points on a hypersphere and interpolates along the great circle between them, preserving the magnitude of the weight vectors:

theta_merged = (sin((1 - t) * omega) / sin(omega)) * theta_A + (sin(t * omega) / sin(omega)) * theta_B

Where omega = arccos(cos_similarity(theta_A, theta_B)) is the angle between the weight vectors, and t is the interpolation parameter (0 to 1).

Critical limitation: SLERP only works with exactly two models. It cannot directly combine three or more. This is because spherical interpolation defines a path between two points on a sphere -- extending to multiple points requires sequential merging, which introduces path-dependence.

Task Arithmetic

Ilharco et al. (2023) introduced the elegant concept of "task vectors." A task vector is the difference between a fine-tuned model and its base:

tau_A = theta_fine_tuned_A - theta_base
tau_B = theta_fine_tuned_B - theta_base

These task vectors can be added to combine capabilities or negated to remove them:

theta_merged = theta_base + lambda_A * tau_A + lambda_B * tau_B  (add skills)
theta_merged = theta_base - lambda * tau_toxic  (remove behavior)

The scaling factors lambda control the strength of each task's influence. This reframes merging as vector arithmetic in task space.

TIES-Merging (Trim, Elect Sign, and Merge)

Yadav et al. (2023) identified that naive task arithmetic suffers from parameter conflicts -- different task vectors may push the same parameter in opposite directions, causing destructive interference. TIES addresses this in three steps:

Trim: Zero out small-magnitude changes (parameters with minimal task-specific adjustment are noise).
Elect Sign: For each parameter, take a majority vote across task vectors on whether it should increase or decrease. Discard task vectors that disagree with the majority.
Merge: Average the remaining (agreeing) task vectors.

This resolves conflicts by keeping only consistent signals, significantly improving merged model quality.

DARE (Drop And REscale)

Yu et al. (2024) observed that fine-tuned models contain high redundancy in their parameter deltas. DARE randomly drops a large fraction (e.g., 90%) of the delta parameters and rescales the remaining ones:

mask = Bernoulli(p)  # p = drop rate, e.g., 0.9
tau_sparse = (tau * mask) / (1 - p)  # rescale to preserve expected magnitude

The sparsified task vectors are then merged. This sounds like it should destroy information, but the redundancy in delta weights means dropping 90% of changes barely affects individual model performance while dramatically reducing conflicts during merging.

Kimi's Reasoning-Fast Model Fusion

An important innovation from Moonshot AI: merging a strong reasoning model (slow, careful thinking) with a fast model (efficient, fluent generation). By carefully merging the weights, they created models that reason well but generate quickly -- combining the "thinking" of one model with the "fluency" of another. This represents a practical frontier application of merging.

Why It Matters

Model merging has democratized model development. The open-source community uses it extensively:

Top leaderboard models are often merges: Many top-performing open-weight models on benchmarks like the Open LLM Leaderboard are merges, not independently trained models.
Zero-cost capability combination: A model fine-tuned for coding can be merged with one fine-tuned for medical reasoning, producing a model capable of both -- without any training compute.
Rapid experimentation: Practitioners can explore the model space by merging, evaluating, and iterating in minutes rather than the days required for fine-tuning.
Ensemble-like benefits at single-model cost: Merged models capture diverse training signals (like ensembles) while requiring the same inference cost as a single model.

Key Technical Details

Base model compatibility: Models being merged must share the same architecture and almost always the same pre-trained base. Merging a LLaMA fine-tune with a Mistral fine-tune will produce garbage.
Layer-wise mixing: Most tools allow setting different mixing ratios per layer. Typically, early layers (generic features) tolerate aggressive merging while later layers (task-specific) require more careful balancing.
Mergekit: The community-standard tool for model merging, supporting SLERP, TIES, DARE, task arithmetic, linear, and passthrough merge methods. It handles the practical details of loading, aligning, and saving merged models.
Frankenmerging (passthrough): Stacking layers from different models to create a larger model (e.g., taking layers 0-24 from model A and layers 12-36 from model B to create a 48-layer model). Surprisingly effective sometimes.
Evaluation is essential: Merging is stochastic in quality. A merge may excel on one benchmark while degrading on another. Systematic evaluation across target tasks is critical.

Common Misconceptions

"Merging always improves quality": Merging can and often does degrade performance on specific tasks. It is not guaranteed to produce a model better than either parent. Parameter conflicts, incompatible fine-tuning strategies, or excessive merging can all harm quality.
"You can merge any two models": Models must share architecture and, for best results, a common pre-trained ancestor. The closer their lineage, the better the merge.
"Merging is the same as ensembling": Ensembles run multiple models and combine their outputs at inference time (requiring N times the compute). Merging produces a single model with single-model inference cost. The quality is typically lower than a full ensemble but the efficiency is far superior.
"The merged model understands both domains equally": Merged models typically compromise. A coding model merged with a chat model may be worse at coding than the pure coding model and worse at chat than the pure chat model, while being decent at both.
"SLERP is always better than linear interpolation": SLERP preserves magnitude, which matters when weight norms carry important information. But for many practical merges, LERP and SLERP produce nearly identical results.

Connections to Other Concepts

supervised-fine-tuning.md: Model merging operates on the outputs of fine-tuning. LoRA adapters can also be merged -- both with the base model and with each other -- making lightweight adaptation composable.
Loss Landscapes: The theoretical foundation for why merging works is linear mode connectivity -- the observation that fine-tuned models from the same base share a flat loss basin.
Ensemble Methods: Merging can be viewed as "poor man's ensembling" -- getting some diversity benefits at single-model cost.
agent-distillation.md: An alternative approach to combining capabilities: distill a merged model's knowledge into a new model through training on the merged model's outputs.
test-time-compute.md: Kimi's innovation of merging reasoning and fast models connects merging to the inference-time scaling paradigm.
04-the-open-source-ecosystem.md: Merging has become central to the open-source model ecosystem, enabling community-driven improvement without massive compute budgets.