Courses·Safety And Alignment·12 min read

Machine Unlearning for LLMs

Machine unlearning is the process of selectively removing the influence of specific training data from a trained model -- making the model "forget" particular knowledge, individuals, or copyrighted content -- without retraining from scratch, driven by legal requirements (GDPR right to erasure), copyright compliance, and the need to remove hazardous knowledge.

One-Line Summary: Machine unlearning is the process of selectively removing the influence of specific training data from a trained model -- making the model "forget" particular knowledge, individuals, or copyrighted content -- without retraining from scratch, driven by legal requirements (GDPR right to erasure), copyright compliance, and the need to remove hazardous knowledge.

Prerequisites: Understanding of LLM fine-tuning and gradient descent, loss functions and optimization objectives, the difference between memorization and generalization in neural networks, familiarity with GDPR and data privacy regulations, basics of knowledge representation in language models.

What Is Machine Unlearning?

Imagine you have written a comprehensive encyclopedia from thousands of source books. Now, one of those source authors invokes their legal right to have their contributions removed. You cannot simply tear out the pages that reference that author, because their information has been synthesized, cross-referenced, and woven throughout the entire text. The challenge is to remove that author's specific influence while preserving everything else intact. This is the machine unlearning problem.

flowchart LR
    S1["Machine unlearning pipeline: identify forg"]
    S2["apply unlearning method"]
    S3["verify removal while maintaining utility"]
    S1 --> S2
    S2 --> S3

For LLMs, the challenge is even harder. During pre-training on trillions of tokens, models develop entangled representations where knowledge from different sources is deeply intertwined in shared parameters. A single neuron or weight may encode information from millions of training examples simultaneously. There is no clean "undo button" for individual data points.

Exact unlearning would require retraining the model from scratch on the dataset minus the data to be forgotten. For a model that cost $100M+ and months to train, this is economically prohibitive. Approximate unlearning aims to achieve a result that is statistically indistinguishable from exact unlearning, at a fraction of the cost.

The field emerged from traditional machine learning (Cao & Yang, 2015; Bourtoule et al., 2021, SISA training) but has gained enormous urgency as LLMs became central to AI deployment and as regulators began enforcing data rights.

How It Works

flowchart LR
    S1["Gradient ascent unlearning"]
    S2["loss maximization on forget data alongside"]
    S1 --> S2

The Formal Definition

The gold standard for unlearning is defined through the lens of indistinguishability: an unlearning algorithm U is considered successful if the distribution of model parameters it produces is computationally indistinguishable from the distribution produced by retraining on D \ D_f (the original dataset minus the forget set D_f).

Formally, for original model M trained on dataset D, forget set D_f, and retain set D_r = D \ D_f:

U(M, D_f, D_r) ≈ Train(D_r)

Where ≈ denotes statistical indistinguishability. In practice, this is evaluated through multiple proxy metrics because exact comparison is infeasible.

Core Techniques

1. Gradient Ascent (GA)

The simplest and most intuitive approach. Instead of minimizing the loss on the forget set (which is what training does), maximize it:

θ_new = θ_old + η * ∇_θ L(D_f; θ)

This is literally the opposite of gradient descent on the forget data. The model's parameters are pushed away from configurations that produce low loss on the forgotten data. The intuition is that if training made the model good at predicting this data, "anti-training" should make it bad at it.

Strengths: Simple, computationally cheap (a few epochs of anti-training), directly targets the forget set.

Weaknesses: Catastrophic -- naive gradient ascent tends to destroy model performance broadly, not just on the forget set. The model may become incoherent or lose general capabilities because the forget data's gradient is entangled with gradients from retain data. Requires careful learning rate tuning and early stopping.

2. Gradient Difference (GradDiff)

Combines gradient ascent on the forget set with gradient descent on the retain set simultaneously:

L_total = -L(D_f; θ) + L(D_r; θ)

This pushes the model away from the forget data while anchoring it to the retain data. The retain term acts as a regularizer, preventing the catastrophic collapse that pure gradient ascent causes.

3. KL Divergence Minimization

Uses the original model as a reference to constrain unlearning. The retain set loss is replaced by a KL divergence term that keeps the unlearned model's outputs close to the original model's outputs on non-forget data:

L_total = -L(D_f; θ) + β * KL(p_original(·|x_r) || p_unlearned(·|x_r))

This preserves the model's general behavior while selectively degrading performance on the forget set.

4. Preference-Based Unlearning (DPO-based)

Inspired by Direct Preference Optimization, this frames unlearning as a preference learning problem. The model is trained to prefer incorrect or refusal responses over correct responses for the forget set:

Chosen response: "I don't have information about that" or a deliberately incorrect answer
Rejected response: The correct answer that the model currently gives

The DPO objective then teaches the model to shift probability mass away from the correct answer toward the alternative.

5. Task Arithmetic / Weight Negation

An elegant approach based on model merging concepts. First, fine-tune the model specifically on the forget set to create a "task vector" (the difference in weights before and after fine-tuning). Then subtract this task vector from the original model:

θ_unlearned = θ_original - α * (θ_finetuned_on_forget - θ_original)

The intuition is that the task vector captures what the model learned specifically about the forget set, and subtracting it removes that knowledge. The scaling factor α controls the strength of forgetting.

6. Localized Unlearning

Rather than modifying all parameters, identify which specific parameters (or modules) are most responsible for encoding the target knowledge, and modify only those. Methods include:

Fisher Information-based selection: Parameters with high Fisher information for the forget set are most influential and are targeted for modification.
Activation patching: Identify which layers and attention heads activate most for forget-set queries, then zero out or retrain those specific components.
Rank-One Model Editing (ROME/MEMIT): Targeted edits to specific factual associations stored in MLP layers.

The TOFU Benchmark

TOFU (Task of Fictitious Unlearning) by Maini et al. (2024) is the most widely used benchmark for evaluating LLM unlearning. Its design is particularly clever:

Setup: The authors created 200 fictitious author profiles, each with 20 question-answer pairs (4,000 QA pairs total). The data is entirely synthetic -- these "authors" do not exist in any real pre-training corpus. A base LLM (typically Llama-2-7B-Chat) is fine-tuned on all 4,000 QA pairs until it can reliably answer questions about these fictitious authors.

Forget sets: TOFU defines three difficulty levels:

Forget01: Forget 1% of the data (2 authors, 40 QA pairs) -- easiest
Forget05: Forget 5% of the data (10 authors, 200 QA pairs) -- medium
Forget10: Forget 10% of the data (20 authors, 400 QA pairs) -- hardest

Why fictitious data? Using real data makes evaluation impossible because you cannot know whether the model "knows" something from pre-training or from the fine-tuning data you are trying to erase. Fictitious data guarantees that any knowledge of these authors came exclusively from the fine-tuning, making it possible to measure unlearning precisely.

Evaluation metrics: TOFU evaluates along multiple axes:

Forget Quality: How well the model has forgotten the target data. Measured by comparing the model's outputs on forget-set queries to a "retain model" (one never trained on the forget set). Metrics include ROUGE-L similarity, probability distributions, and truth ratio.
Model Utility: How well the model retains performance on non-forget data. Measured on the retain set, general world knowledge, and standard benchmarks. This captures whether unlearning caused collateral damage.
Aggregate Score: Combines forget quality and model utility into a single metric, often as a harmonic mean, to capture the fundamental trade-off.

Key findings from TOFU:

Gradient ascent achieves high forget quality but severely damages model utility.
GradDiff provides the best balance between forgetting and retention in many settings.
KL-based methods better preserve model utility but may leave residual knowledge.
No method achieves perfect unlearning (indistinguishable from retraining) while preserving full utility.
Larger forget sets (Forget10) are significantly harder than smaller ones (Forget01), often requiring more aggressive methods that cause more collateral damage.

Other Unlearning Benchmarks

MUSE (Machine Unlearning Six-way Evaluation) by Shi et al. (2024): Evaluates unlearning along six dimensions: verbatim memorization, knowledge memorization, privacy leakage via membership inference, utility preservation on general tasks, utility preservation on neighboring knowledge, and computational efficiency. Uses real-world data (Harry Potter books, news articles) rather than fictitious data.

RWKU (Real-World Knowledge Unlearning) by Jin et al. (2024): Focuses on unlearning real-world entities (200 well-known people). Tests whether unlearning generalizes beyond the exact format of the forget data -- can the model still answer paraphrased questions about the target entity? Tests unlearning across knowledge dimensions including memorization, understanding, and application.

WHP (Who's Harry Potter) by Eldan & Russinovich (2023, Microsoft Research): An early influential case study showing that a Llama-2-7B model could be made to "forget" the Harry Potter series through a combination of reinforcement anchoring and fine-tuning on alternative text. The model went from scoring 88% on Harry Potter trivia to 26% (near random) while maintaining general capabilities.

Why It Matters

Legal and Regulatory Drivers

GDPR Article 17 (Right to Erasure): EU citizens have the right to request that organizations delete their personal data. If an LLM was trained on personal data and a deletion request is received, the organization must demonstrate that the data's influence has been removed. Retraining a multi-billion-dollar model for each deletion request is economically impossible, making approximate unlearning essential.

Copyright compliance: The New York Times v. OpenAI lawsuit (filed December 2023) and similar cases raise the possibility that courts may order the removal of copyrighted content's influence from trained models. Machine unlearning would be the technical mechanism for compliance.

California Delete Act (SB-362): Requires data brokers to delete consumer data on request. As AI companies are increasingly classified as data processors, similar requirements may extend to model training data.

Safety Applications

Removing hazardous knowledge: Models trained on web data inevitably absorb knowledge about synthesizing dangerous substances, creating weapons, conducting cyberattacks, and other harmful topics. Unlearning can potentially remove this knowledge more thoroughly than safety fine-tuning (which teaches the model not to surface the knowledge, but the knowledge remains accessible through jailbreaks).

Debiasing: Systematic removal of biased associations learned during training, which may be more effective than post-hoc bias mitigation if the unlearning is sufficiently targeted.

The Fundamental Challenge

The core tension in machine unlearning is between completeness (fully removing the target knowledge) and specificity (not removing anything else). Current methods sit on a Pareto frontier:

Aggressive methods (high learning rate gradient ascent) achieve near-complete forgetting but damage general capabilities.
Conservative methods (low-strength KL-constrained approaches) preserve capabilities but leave residual knowledge that can be extracted through clever prompting.

Furthermore, verification remains unsolved. How do you prove that a model has truly forgotten something, especially when knowledge is distributed across billions of parameters and may be recoverable through indirect prompting? This verification gap is a significant obstacle to regulatory compliance.

Key Technical Details

Compute cost: Approximate unlearning methods typically require 1-5% of the original training compute. For a model that cost $10 M t o t r ain, u n l e a r nin g a s p ec i f i c d a t a se t mi g h t cos t$ 100K- $500 K - - e x p e n s i v e b u t f a r c h e a p er t han r e t r ainin g ($ 10M+).
Scalability concerns: Most unlearning methods have been validated on models up to 7B-13B parameters. Scaling to 70B+ and frontier models introduces challenges around gradient computation, memory, and the increased entanglement of knowledge at scale.
Sequential unlearning: When multiple unlearning requests arrive over time, sequential application of unlearning methods can compound errors. Each unlearning step slightly degrades the model, and the degradation accumulates. This "sequential unlearning" problem is largely unsolved.
Membership inference as verification: One way to check if unlearning succeeded is to run membership inference attacks (MIA) on the forget data. If the MIA can no longer determine that the forget data was in the training set, unlearning may be effective. However, passing MIA does not guarantee complete removal of the data's influence.
Knowledge entanglement: Factual knowledge in LLMs is not stored in isolated locations. Meng et al. (2022, ROME) showed that factual associations involve specific MLP layers, but the downstream effects of those facts propagate through the entire network. Removing one fact may inadvertently affect related facts.
TOFU baseline results (Llama-2-7B-Chat, Forget05): Gradient Ascent achieves ~0.85 forget quality but only ~0.45 model utility. GradDiff achieves ~0.70 forget quality with ~0.75 model utility. KL Minimization achieves ~0.55 forget quality with ~0.85 model utility. The "retrain from scratch" gold standard achieves 1.0 on both.

Common Misconceptions

"Deleting training data from storage is sufficient for GDPR compliance." Regulators are increasingly recognizing that a trained model is a derivative of its training data. If the model has memorized or can reproduce portions of the deleted data, simply deleting the source files may not constitute compliance. The model itself may need to be modified.

"Safety fine-tuning (RLHF) is the same as unlearning." RLHF teaches the model not to output certain content -- it is a behavioral overlay. The underlying knowledge remains in the weights and can often be extracted through jailbreaks, fine-tuning attacks, or adversarial prompting. True unlearning aims to remove the knowledge itself from the model's parameters.

"Machine unlearning can precisely target individual data points." Current methods operate at a relatively coarse granularity. Unlearning "all knowledge about a specific person" is achievable, but unlearning "the influence of one specific sentence from a training corpus of trillions of tokens" is beyond current capabilities. The entanglement of knowledge in neural networks makes surgical precision difficult.

"If the model cannot answer questions about the forget data, unlearning succeeded." Unlearning must be evaluated more rigorously. The model may fail to answer direct questions but still reveal knowledge through indirect probing, completion tasks, or embedding-space analysis. Robust evaluation requires testing across multiple access modalities.

"Unlearning is a solved problem." The field is still in its early stages. No existing method provably achieves exact unlearning for LLMs at scale. The gap between theoretical definitions and practical implementations remains large.

Connections to Other Concepts

catastrophic-forgetting.md: Ironically, the property that makes LLMs hard to train incrementally (they forget old knowledge when learning new things) is exactly what unlearning tries to exploit deliberately. Machine unlearning is "targeted catastrophic forgetting."
supervised-fine-tuning.md: Many unlearning methods are implemented as fine-tuning procedures (gradient ascent is just fine-tuning with a flipped objective). LoRA-based unlearning modifies only low-rank adapters, potentially limiting the scope of forgetting.
Model Editing (ROME/MEMIT): Related techniques that modify specific factual associations. Model editing changes a fact (e.g., "Eiffel Tower is in Paris" to "Eiffel Tower is in London"); unlearning removes the fact entirely.
jailbreaking.md: If safety fine-tuning is a behavioral veneer and jailbreaking can bypass it, then unlearning that actually removes the underlying knowledge should be more robust to jailbreaks. This is a key motivation for unlearning over safety alignment alone.
Differential Privacy: DP training provides formal guarantees that individual data points have bounded influence on the model, which can facilitate later unlearning. However, DP at scale significantly degrades model quality.
Membership Inference Attacks: Used as both a threat model (what we want to prevent) and an evaluation tool (verifying that unlearning succeeded).
mechanistic-interpretability.md: Understanding where and how knowledge is stored in LLMs (circuits, features, factual associations in MLP layers) directly informs more targeted unlearning approaches.