Machine Unlearning Attacks
Exploiting machine unlearning processes: recovering supposedly forgotten data, unlearning verification attacks, partial unlearning exploitation, and the fundamental limits of forgetting in neural networks.
Machine unlearning -- the process of removing the influence of specific training data from a trained model -- has become a regulatory necessity. The EU's GDPR right to erasure, California's CCPA, and similar privacy regulations require organizations to delete user data on request. For AI systems, deletion means not just removing data from storage but removing its influence from model weights. This process is the attack surface.
How Unlearning Works
Exact Unlearning
The gold standard: retrain the model from scratch without the data to be forgotten. This guarantees complete removal but is impractical for large models.
Cost of exact unlearning:
- GPT-4 class model: ~$100M+ per retraining run
- Time: weeks to months
- Frequency: potentially thousands of deletion requests per day
- Verdict: economically impossible for individual deletion requests

Approximate Unlearning Methods
| Method | Mechanism | Speed | Completeness |
|---|---|---|---|
| Gradient ascent | Increase loss on forgotten data | Fast | Partial -- leaves residual |
| Fisher forgetting | Use Fisher information to selectively modify relevant weights | Moderate | Better than gradient ascent |
| SISA training | Train on data shards; retrain only affected shard | Moderate | Good (within shard) |
| Knowledge distillation | Distill model into new model without forgotten knowledge | Slow | Good but imperfect |
| Influence function approximation | Estimate and remove influence of specific samples | Fast | Approximate |
```python
import torch
import torch.nn.functional as F

def gradient_ascent_unlearn(
    model: torch.nn.Module,
    forget_dataloader,
    learning_rate: float = 1e-5,
    epochs: int = 5,
):
    """Approximate unlearning via gradient ascent on data to forget."""
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    for epoch in range(epochs):
        for batch in forget_dataloader:
            inputs, targets = batch
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = F.cross_entropy(outputs, targets)
            # Gradient ASCENT -- maximize loss on forgotten data.
            # This pushes the model away from correct predictions
            # on the forgotten samples.
            (-loss).backward()
            optimizer.step()
    return model
```

Attack 1: Data Recovery After Unlearning
The primary attack against unlearning: determine whether data that was supposedly forgotten can still be extracted or inferred from the model.
Membership Inference Against Unlearned Models
Membership inference attacks can verify whether unlearning actually removed a data point's influence.
```python
import torch
import torch.nn.functional as F

def membership_inference_post_unlearn(
    model,
    forgotten_samples: list,
    retained_samples: list,
    non_member_samples: list,
):
    """Test whether unlearning actually removed data influence."""
    results = {"forgotten": [], "retained": [], "non_member": []}
    for category, samples in [
        ("forgotten", forgotten_samples),
        ("retained", retained_samples),
        ("non_member", non_member_samples),
    ]:
        for sample in samples:
            output = model(sample["input"])
            loss = F.cross_entropy(output, sample["target"])
            confidence = torch.softmax(output, dim=-1).max().item()
            results[category].append({
                "loss": loss.item(),
                "confidence": confidence,
                "top_prediction": output.argmax().item(),
            })
    # Analysis: forgotten samples should behave like non-members
    forgotten_losses = [r["loss"] for r in results["forgotten"]]
    non_member_losses = [r["loss"] for r in results["non_member"]]
    retained_losses = [r["loss"] for r in results["retained"]]
    avg_forgotten = sum(forgotten_losses) / len(forgotten_losses)
    avg_non_member = sum(non_member_losses) / len(non_member_losses)
    avg_retained = sum(retained_losses) / len(retained_losses)
    return {
        "avg_loss_forgotten": avg_forgotten,
        "avg_loss_non_member": avg_non_member,
        "avg_loss_retained": avg_retained,
        "unlearning_effective": abs(avg_forgotten - avg_non_member)
            < abs(avg_retained - avg_non_member) * 0.1,
        # Lower loss on forgotten data = model still remembers
        "residual_signal": avg_forgotten < avg_non_member,
    }
```

If the model's behavior on forgotten samples is distinguishable from its behavior on true non-members, unlearning is incomplete.
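Whether two loss distributions are "distinguishable" can be made concrete with a standard two-sample statistic. A minimal, dependency-free sketch using Welch's t statistic on illustrative loss values (the numbers and the decision threshold are hypothetical, not calibrated):

```python
import math

def welch_t_statistic(sample_a: list, sample_b: list) -> float:
    """Welch's t statistic: how separable are two loss distributions?"""
    def mean(xs):
        return sum(xs) / len(xs)
    def var(xs, m):
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    ma, mb = mean(sample_a), mean(sample_b)
    va, vb = var(sample_a, ma), var(sample_b, mb)
    return (ma - mb) / math.sqrt(va / len(sample_a) + vb / len(sample_b))

# Illustrative losses: forgotten samples still show suspiciously low loss
forgotten_losses = [0.21, 0.18, 0.25, 0.19, 0.23]
non_member_losses = [1.40, 1.55, 1.32, 1.61, 1.47]

t = welch_t_statistic(forgotten_losses, non_member_losses)
# |t| far from 0 -> the two groups are separable -> unlearning incomplete
print(f"t = {t:.2f}, distinguishable = {abs(t) > 2.0}")
```

A t statistic near zero means the forgotten samples are statistically indistinguishable from non-members on this metric; a large magnitude is evidence of residual memorization.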
Extraction Attacks on Residual Information
Even when membership inference suggests unlearning was effective, targeted extraction can recover partial information.
```python
def extract_residual_information(
    unlearned_model,
    original_model,
    probe_prompts: list,
):
    """Attempt to extract residual information from an unlearned model."""
    residuals = []
    for prompt in probe_prompts:
        # Generate from both models (greedy decoding for determinism)
        unlearned_output = unlearned_model.generate(prompt, temperature=0.0)
        original_output = original_model.generate(prompt, temperature=0.0)
        # Compare outputs -- similarities indicate residual knowledge.
        # Use various similarity metrics; token overlap is the simplest.
        overlap = _compute_token_overlap(unlearned_output, original_output)
        residuals.append({
            "prompt": prompt,
            "unlearned_response": unlearned_output,
            "original_response": original_output,
            "token_overlap": overlap,
            "residual_detected": overlap > 0.3,
        })
    return residuals
```

Attack 2: Exploiting Unlearning Side Effects
Unlearning modifies model weights, and these modifications can have unintended side effects that attackers exploit.
Capability Degradation Exploitation
Approximate unlearning often degrades model performance on tasks related to the forgotten data. An attacker can observe these degradation patterns to infer what was forgotten.
```python
def detect_unlearning_fingerprint(
    model_before_unlearning,
    model_after_unlearning,
    test_tasks: list,
):
    """Detect what was unlearned by observing capability changes."""
    fingerprint = []
    for task in test_tasks:
        # evaluate_task: task-specific scoring helper returning a score in [0, 1]
        score_before = evaluate_task(model_before_unlearning, task)
        score_after = evaluate_task(model_after_unlearning, task)
        degradation = score_before - score_after
        fingerprint.append({
            "task": task["name"],
            "domain": task["domain"],
            "score_before": score_before,
            "score_after": score_after,
            "degradation": degradation,
            "likely_related_to_forgotten_data": degradation > 0.1,
        })
    # Tasks with significant degradation are likely related to forgotten data
    affected_domains = [
        f["domain"] for f in fingerprint
        if f["likely_related_to_forgotten_data"]
    ]
    return {
        "fingerprint": fingerprint,
        "affected_domains": affected_domains,
        "inference": f"Forgotten data likely related to: {', '.join(set(affected_domains))}",
    }
```

Model Diff Analysis
If an attacker has access to model weights before and after unlearning (e.g., in open-weight models with versioned releases), the weight differences directly reveal which neurons and layers were modified, providing information about the forgotten data.
```python
import torch

def analyze_weight_diff(
    model_before: dict,
    model_after: dict,
    top_k: int = 100,
):
    """Analyze weight differences to infer unlearning targets."""
    diffs = {}
    for layer_name in model_before:
        if layer_name in model_after:
            diff = torch.abs(model_before[layer_name] - model_after[layer_name])
            diffs[layer_name] = {
                "mean_diff": diff.mean().item(),
                "max_diff": diff.max().item(),
                "num_modified": (diff > 1e-6).sum().item(),
                "total_params": diff.numel(),
            }
    # Sort layers by modification intensity
    sorted_layers = sorted(
        diffs.items(),
        key=lambda x: x[1]["mean_diff"],
        reverse=True,
    )
    return sorted_layers[:top_k]
```

Attack 3: Unlearning Verification Exploitation
Organizations need to verify that unlearning was successful. The verification process itself can be attacked.
Adversarial Verification
If the entity performing unlearning controls the verification process, they can design verification tests that their unlearning method passes while leaving significant residual information.
| Verification Method | What It Tests | What It Misses |
|---|---|---|
| Output comparison | Whether model produces same output as before | Latent knowledge that requires specific prompts |
| Membership inference | Whether model's loss/confidence differs for forgotten data | Encoded information in intermediate representations |
| Canary extraction | Whether planted canary strings can be extracted | Information stored in implicit associations |
| Benchmark evaluation | Whether model performance changed on standard tasks | Domain-specific knowledge degradation |
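To see how a self-chosen verification test can pass while residual information survives, consider this toy sketch. The "models" are lookup tables and the probe strings are invented, but the pattern is the attack in miniature: the provider's exact-probe check passes while a paraphrased probe still extracts the data.

```python
# Two toy "models" as prompt -> completion lookup tables (illustrative only)
original = {
    "Alice's SSN is": " 123-45-6789",
    "The SSN of Alice is": " 123-45-6789",
}
unlearned = {
    "Alice's SSN is": " [redacted]",        # patched for the canonical probe
    "The SSN of Alice is": " 123-45-6789",  # paraphrase still leaks
}

def weak_verifier(probe: str) -> bool:
    """Provider's self-chosen test: pass if the canonical probe changed."""
    return unlearned[probe] != original[probe]

def adversarial_probe(probes: list) -> list:
    """Third-party check: sweep paraphrases, flag any unchanged leak."""
    return [p for p in probes if unlearned[p] == original[p]]

assert weak_verifier("Alice's SSN is")  # provider's own verification passes
leaks = adversarial_probe(list(original))
print("residual leaks:", leaks)  # the paraphrased probe still extracts the data
```

The general lesson: verification coverage must be chosen adversarially, not by the party whose unlearning is being verified.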
Third-Party Verification
Robust unlearning verification requires independent third-party testing with access to both the forgotten data and the unlearned model.
```python
class UnlearningVerifier:
    """Independent verification of machine unlearning completeness."""

    def __init__(self, forgotten_data, model, reference_non_member_data):
        self.forgotten_data = forgotten_data
        self.model = model
        self.non_member_data = reference_non_member_data

    def full_verification(self) -> dict:
        """Run comprehensive unlearning verification suite."""
        results = {}
        # Test 1: Membership inference
        results["membership_inference"] = self._test_membership_inference()
        # Test 2: Data extraction attempts
        results["extraction"] = self._test_extraction()
        # Test 3: Behavioral consistency with retrained model
        results["behavioral"] = self._test_behavioral_consistency()
        # Test 4: Representation analysis
        results["representation"] = self._test_representation_residuals()
        # Overall verdict
        all_passed = all(r.get("passed", False) for r in results.values())
        results["verdict"] = "PASS" if all_passed else "FAIL"
        return results

    def _test_membership_inference(self):
        # Compare model behavior on forgotten vs non-member data
        pass

    def _test_extraction(self):
        # Attempt to extract forgotten data via prompting
        pass

    def _test_behavioral_consistency(self):
        # Compare behavior with a model retrained without forgotten data
        pass

    def _test_representation_residuals(self):
        # Check intermediate representations for forgotten data signatures
        pass
```

Attack 4: Strategic Unlearning Requests
An attacker who can submit unlearning requests strategically can degrade model capabilities in targeted ways.
Targeted Capability Degradation
By requesting deletion of carefully selected data points, an attacker can cause disproportionate capability loss in specific domains. If the model's knowledge of a topic rests on a small number of high-influence training examples, removing those examples destroys the capability.
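One way an attacker might identify high-influence deletion targets is with a cheap influence proxy such as per-sample gradient norm. This is a hypothetical sketch, not an established selection method; `rank_deletion_targets` and its scoring heuristic are assumptions:

```python
import torch
import torch.nn.functional as F

def rank_deletion_targets(model, candidate_samples, top_k=10):
    """Rank candidate training samples by a cheap influence proxy:
    per-sample gradient norm. High-norm samples tend to carry more of
    the model's knowledge, so deleting them hurts capabilities most."""
    scored = []
    for inputs, targets in candidate_samples:
        model.zero_grad()
        loss = F.cross_entropy(model(inputs), targets)
        loss.backward()
        grad_norm = sum(
            p.grad.norm().item()
            for p in model.parameters()
            if p.grad is not None
        )
        scored.append((grad_norm, inputs, targets))
    # Highest-influence samples first: these become the deletion requests
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:top_k]
```

In practice an attacker would need either white-box access or a surrogate model to compute these scores, which is why this attack is most plausible against open-weight models.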
Safety Unlearning
An attacker could request removal of data that contributes to the model's safety training. If safety behaviors depend on specific training examples, unlearning those examples could weaken safety guardrails.
Defenses Against Unlearning Attacks
For Model Providers
- Use SISA training: Partition training data into shards. Unlearning only requires retraining the affected shard, making exact unlearning feasible for reasonable shard sizes.
- Differential privacy: Training with differential privacy limits the influence of any individual data point, making unlearning less impactful and residual information less exploitable.
- Independent verification: Engage third-party auditors to verify unlearning completeness using standardized test suites.
- Rate limiting unlearning requests: Monitor for patterns of strategic unlearning requests designed to degrade specific capabilities.
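The SISA bookkeeping that makes per-shard retraining possible can be sketched with a hash-based shard assignment. The class and record IDs below are illustrative; real SISA additionally slices each shard and checkpoints sub-models so retraining can resume from the slice containing the deleted record:

```python
import hashlib

class SISAUnlearner:
    """Bookkeeping for SISA-style unlearning: data is partitioned into
    shards, each shard trains its own sub-model, and deleting a record
    only requires retraining the shard that contained it."""

    def __init__(self, num_shards: int):
        self.num_shards = num_shards
        self.shards = {i: set() for i in range(num_shards)}

    def shard_of(self, record_id: str) -> int:
        # Deterministic assignment: same record always maps to same shard
        digest = hashlib.sha256(record_id.encode()).hexdigest()
        return int(digest, 16) % self.num_shards

    def add(self, record_id: str):
        self.shards[self.shard_of(record_id)].add(record_id)

    def unlearn(self, record_id: str) -> int:
        """Remove a record; return the single shard index to retrain."""
        idx = self.shard_of(record_id)
        self.shards[idx].discard(record_id)
        return idx  # retrain only this shard's sub-model

sisa = SISAUnlearner(num_shards=8)
for rid in ["user-1", "user-2", "user-3"]:
    sisa.add(rid)
affected = sisa.unlearn("user-2")
print(f"retrain shard {affected} only ({sisa.num_shards - 1} shards untouched)")
```

The cost of each deletion drops from a full retraining run to one shard's training budget, which is what makes exact unlearning economically viable at scale.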
For Red Teamers
1. Identify unlearning claims: Determine whether the target system claims to support data deletion or unlearning. Check privacy policies and regulatory compliance documentation.
2. Test baseline membership inference: Before any unlearning, establish membership inference accuracy against the model. This provides a baseline for comparison.
3. Request unlearning and verify: Submit legitimate unlearning requests, then test whether the data's influence has been fully removed using membership inference, extraction attempts, and behavioral analysis.
4. Test for side effects: Evaluate whether unlearning requests cause capability degradation, safety weakening, or other unintended side effects.
5. Assess verification robustness: Evaluate whether the provider's unlearning verification process is rigorous enough to catch incomplete unlearning.
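The baseline membership-inference step can be quantified as the AUC of a threshold attack over per-sample losses. A dependency-free sketch with illustrative loss values (real measurements would come from the target model):

```python
def mia_auc(member_losses: list, non_member_losses: list) -> float:
    """AUC of a threshold attack that predicts 'member' when loss is low.
    0.5 = no signal (what effective unlearning should look like);
    values near 1.0 = full leakage."""
    pairs = correct = 0
    for m in member_losses:
        for n in non_member_losses:
            pairs += 1
            if m < n:
                correct += 1
            elif m == n:
                correct += 0.5
    return correct / pairs

# Illustrative pre-unlearning baseline: members have clearly lower loss
members = [0.10, 0.20, 0.15, 0.30]
non_members = [0.90, 1.10, 0.80, 1.30]
print(f"baseline MIA AUC: {mia_auc(members, non_members):.2f}")
```

After unlearning, rerun the same measurement on the forgotten samples: if the AUC stays well above 0.5, the deletion request was not honored at the weight level.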
Summary
Machine unlearning is a regulatory requirement that creates a novel attack surface. Approximate unlearning methods leave residual information that can be recovered through membership inference, extraction attacks, and weight diff analysis. The unlearning process itself can be exploited -- through strategic deletion requests that degrade capabilities or weaken safety. Effective defense requires moving toward exact unlearning methods (like SISA training), differential privacy, and independent third-party verification. The fundamental tension between the right to be forgotten and model integrity remains an open challenge.