Machine Unlearning Attacks
Exploiting machine unlearning processes: recovering supposedly forgotten data, attacking unlearning verification, exploiting partial unlearning, and probing the fundamental limits of forgetting in neural networks.
Machine unlearning -- the process of removing the influence of specific training data from a trained model -- has become a regulatory necessity. The EU's GDPR right to erasure, California's CCPA, and similar privacy regulations require organizations to delete user data on request. For AI systems, deletion means not just removing data from storage but removing its influence from model weights. This process is the attack surface.
How Unlearning Works
Exact Unlearning
The gold standard: retrain the model from scratch without the data to be forgotten. This guarantees complete removal but is impractical for large models.
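At its core, exact unlearning is just dataset filtering followed by a full retraining run. A minimal sketch, where the `train_fn` callable and the `"id"` field are illustrative placeholders rather than a specific framework's API:

def exact_unlearn(train_dataset: list, forget_ids: set, train_fn):
    """Exact unlearning: drop the forget set, then retrain from scratch.

    Assumes each training example is a dict with an "id" key and that
    `train_fn` wraps an entire training pipeline (both hypothetical).
    """
    retained = [ex for ex in train_dataset if ex["id"] not in forget_ids]
    # The retrained model provably carries no influence from forget_ids,
    # but every deletion batch costs a full training run.
    return train_fn(retained)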
Cost of exact unlearning:
- GPT-4 class model: ~$100M+ per retraining run
- Time: weeks to months
- Frequency: potentially thousands of deletion requests per day
- Verdict: economically infeasible for individual deletion requests

Approximate Unlearning Methods
| Method | Mechanism | Speed | Completeness |
|---|---|---|---|
| Gradient ascent | Increase loss on forgotten data | Fast | Partial -- leaves residual influence |
| Fisher forgetting | Use Fisher information to selectively modify relevant weights | Moderate | Better than gradient ascent |
| SISA training | Train on data shards; retrain only affected shard | Moderate | Good (within shard) |
| Knowledge distillation | Distill model into new model without forgotten knowledge | Slow | Good but imperfect |
| Influence function approximation | Estimate and remove influence of specific samples | Fast | Approximate |
import torch
import torch.nn.functional as F
def gradient_ascent_unlearn(
    model: torch.nn.Module,
    forget_dataloader,
    learning_rate: float = 1e-5,
    epochs: int = 5
):
    """Approximate unlearning via gradient ascent on the data to forget."""
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    model.train()
    for epoch in range(epochs):
        for batch in forget_dataloader:
            inputs, targets = batch
            outputs = model(inputs)
            loss = F.cross_entropy(outputs, targets)
            # Gradient ASCENT -- maximize loss on the forgotten data.
            # This pushes the model away from correct predictions
            # on the forgotten samples.
            optimizer.zero_grad()
            (-loss).backward()
            optimizer.step()
    return model

Attack 1: Data Recovery After Unlearning
The primary attack against unlearning: determine whether data that was supposedly forgotten can still be extracted or inferred from the model.
Membership Inference Against Unlearned Models
Membership inference attacks can verify whether unlearning actually removed a data point's influence.
def membership_inference_post_unlearn(
    model,
    forgotten_samples: list,
    retained_samples: list,
    non_member_samples: list
):
    """Test whether unlearning actually removed a data point's influence."""
    results = {"forgotten": [], "retained": [], "non_member": []}
    for category, samples in [
        ("forgotten", forgotten_samples),
        ("retained", retained_samples),
        ("non_member", non_member_samples)
    ]:
        for sample in samples:
            outputs = model(sample["input"])
            loss = F.cross_entropy(outputs, sample["target"])
            confidence = torch.softmax(outputs, dim=-1).max().item()
            results[category].append({
                "loss": loss.item(),
                "confidence": confidence,
                "top_prediction": outputs.argmax().item()
            })
    # Analysis: forgotten samples should behave like non-members
    forgotten_losses = [r["loss"] for r in results["forgotten"]]
    non_member_losses = [r["loss"] for r in results["non_member"]]
    retained_losses = [r["loss"] for r in results["retained"]]
    avg_forgotten = sum(forgotten_losses) / len(forgotten_losses)
    avg_non_member = sum(non_member_losses) / len(non_member_losses)
    avg_retained = sum(retained_losses) / len(retained_losses)
    return {
        "avg_loss_forgotten": avg_forgotten,
        "avg_loss_non_member": avg_non_member,
        "avg_loss_retained": avg_retained,
        "unlearning_effective": abs(avg_forgotten - avg_non_member) < abs(avg_retained - avg_non_member) * 0.1,
        "residual_signal": avg_forgotten < avg_non_member  # Lower loss = model still remembers
    }

If the model's behavior on forgotten samples is distinguishable from its behavior on true non-members, unlearning is incomplete.
Extraction Attacks on Residual Information
Even when membership inference suggests unlearning was effective, targeted extraction can recover partial information.
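The extraction sketch below calls a `_compute_token_overlap` helper that is not defined in this section; a minimal version, assuming whitespace tokenization and Jaccard overlap, might look like:

def _compute_token_overlap(text_a: str, text_b: str) -> float:
    """Jaccard overlap between the token sets of two generations."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    # High overlap between unlearned and original outputs suggests
    # residual knowledge survived the unlearning procedure.
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)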
def extract_residual_information(
    unlearned_model,
    original_model,
    probe_prompts: list
):
    """Attempt to extract residual information from an unlearned model."""
    residuals = []
    for prompt in probe_prompts:
        # Generate from both models
        unlearned_output = unlearned_model.generate(prompt, temperature=0.0)
        original_output = original_model.generate(prompt, temperature=0.0)
        # Compare outputs -- similarities indicate residual knowledge.
        # Use various similarity metrics; token overlap is the simplest.
        overlap = _compute_token_overlap(unlearned_output, original_output)
        residuals.append({
            "prompt": prompt,
            "unlearned_response": unlearned_output,
            "original_response": original_output,
            "token_overlap": overlap,
            "residual_detected": overlap > 0.3
        })
    return residuals

Attack 2: Exploiting Unlearning Side Effects
Unlearning modifies model weights, and these modifications can have unintended side effects that attackers can exploit.
Exploiting Capability Degradation
Approximate unlearning often degrades model performance on tasks related to the forgotten data. Attackers can observe these degradation patterns to infer what was forgotten.
def detect_unlearning_fingerprint(
    model_before_unlearning,
    model_after_unlearning,
    test_tasks: list
):
    """Detect what was unlearned by observing capability changes."""
    fingerprint = []
    for task in test_tasks:
        score_before = evaluate_task(model_before_unlearning, task)
        score_after = evaluate_task(model_after_unlearning, task)
        degradation = score_before - score_after
        fingerprint.append({
            "task": task["name"],
            "domain": task["domain"],
            "score_before": score_before,
            "score_after": score_after,
            "degradation": degradation,
            "likely_related_to_forgotten_data": degradation > 0.1
        })
    # Tasks with significant degradation are likely related to forgotten data
    affected_domains = [
        f["domain"] for f in fingerprint
        if f["likely_related_to_forgotten_data"]
    ]
    return {
        "fingerprint": fingerprint,
        "affected_domains": affected_domains,
        "inference": f"Forgotten data likely related to: {', '.join(set(affected_domains))}"
    }

Model Diff Analysis
If an attacker has access to model weights before and after unlearning (e.g., in open-weight models with versioned releases), the weight differences directly reveal which neurons and layers were modified, providing information about the forgotten data.
def analyze_weight_diff(
    model_before: dict,
    model_after: dict,
    top_k: int = 100
):
    """Analyze weight differences to infer unlearning targets."""
    diffs = {}
    for layer_name in model_before:
        if layer_name in model_after:
            diff = torch.abs(model_before[layer_name] - model_after[layer_name])
            diffs[layer_name] = {
                "mean_diff": diff.mean().item(),
                "max_diff": diff.max().item(),
                "num_modified": (diff > 1e-6).sum().item(),
                "total_params": diff.numel()
            }
    # Sort layers by modification intensity
    sorted_layers = sorted(
        diffs.items(),
        key=lambda x: x[1]["mean_diff"],
        reverse=True
    )
    return sorted_layers[:top_k]

Attack 3: Exploiting Unlearning Verification
Organizations need to verify that unlearning was successful. The verification process itself can be attacked.
Adversarial Verification
If the entity performing unlearning controls the verification process, they can design verification tests that their unlearning method passes while leaving significant residual information.
| Verification Method | What It Tests | What It Misses |
|---|---|---|
| Output comparison | Whether the model produces the same output as before | Latent knowledge that requires specific prompts |
| Membership inference | Whether the model's loss/confidence differs for forgotten data | Encoded information in intermediate representations |
| Canary extraction | Whether planted canary strings can be extracted | Information stored in implicit associations |
| Benchmark evaluation | Whether model performance changed on standard tasks | Domain-specific knowledge degradation |
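For example, a provider-controlled check that implements only the first row of the table above -- greedy output comparison on a fixed probe set -- will pass even when latent knowledge remains. A hypothetical sketch of such a weak verifier (the model's `generate` method and the canary string are assumptions for illustration):

def weak_output_comparison_check(unlearned_model, probe_prompts: list) -> bool:
    """A deliberately shallow verification an adversarial provider might use.

    Passes if none of the fixed probes reproduces the forbidden text verbatim,
    ignoring paraphrases, alternative prompts, and internal representations.
    """
    FORBIDDEN_STRINGS = ["alice@example.com"]  # illustrative canary only
    for prompt in probe_prompts:
        output = unlearned_model.generate(prompt, temperature=0.0)
        if any(s in output for s in FORBIDDEN_STRINGS):
            return False  # "fails" only on exact reproduction
    return True  # passes despite possible latent or rephrased knowledge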
Third-Party Verification
Robust unlearning verification requires independent third-party testing with access to both the forgotten data and the unlearned model.
class UnlearningVerifier:
    """Independent verification of machine unlearning completeness."""

    def __init__(self, forgotten_data, model, reference_non_member_data):
        self.forgotten_data = forgotten_data
        self.model = model
        self.non_member_data = reference_non_member_data

    def full_verification(self) -> dict:
        """Run the comprehensive unlearning verification suite."""
        results = {}
        # Test 1: Membership inference
        results["membership_inference"] = self._test_membership_inference()
        # Test 2: Data extraction attempts
        results["extraction"] = self._test_extraction()
        # Test 3: Behavioral consistency with a retrained model
        results["behavioral"] = self._test_behavioral_consistency()
        # Test 4: Representation analysis
        results["representation"] = self._test_representation_residuals()
        # Overall verdict
        all_passed = all(r.get("passed", False) for r in results.values())
        results["verdict"] = "PASS" if all_passed else "FAIL"
        return results

    def _test_membership_inference(self) -> dict:
        # Compare model behavior on forgotten vs. non-member data
        return {"passed": False, "detail": "not implemented"}

    def _test_extraction(self) -> dict:
        # Attempt to extract forgotten data via prompting
        return {"passed": False, "detail": "not implemented"}

    def _test_behavioral_consistency(self) -> dict:
        # Compare behavior with a model retrained without the forgotten data
        return {"passed": False, "detail": "not implemented"}

    def _test_representation_residuals(self) -> dict:
        # Check intermediate representations for forgotten-data signatures
        return {"passed": False, "detail": "not implemented"}

Attack 4: Strategic Unlearning Requests
Attackers who can submit unlearning requests strategically can degrade model capabilities in targeted ways.
Targeted Capability Degradation
By requesting deletion of carefully selected data points, attackers can cause disproportionate capability loss in specific domains. If the model's knowledge of a topic rests on a small number of high-influence training examples, removing those examples destroys the capability.
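A hypothetical attacker workflow, assuming access to per-example influence estimates (represented here as a plain list of dicts; the field names are assumptions, not the output format of any particular tool):

def select_deletion_requests(influence_scores: list, target_domain: str, budget: int) -> list:
    """Pick the deletion requests expected to do the most targeted damage.

    `influence_scores` is assumed to be a list of dicts like
    {"example_id": ..., "domain": ..., "influence": float}.
    """
    in_domain = [s for s in influence_scores if s["domain"] == target_domain]
    # Highest-influence examples first: removing them is most likely to
    # collapse the model's capability in the target domain.
    ranked = sorted(in_domain, key=lambda s: s["influence"], reverse=True)
    return [s["example_id"] for s in ranked[:budget]]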
Safety Unlearning
Attackers could request removal of data that contributes to the model's safety training. If safety behaviors depend on specific training examples, unlearning those examples could weaken safety guardrails.
Defenses Against Unlearning Attacks
For Model Providers
- Use SISA training: Partition training data into shards. Unlearning only requires retraining the affected shard, making exact unlearning feasible for reasonable shard sizes (a minimal sketch follows this list).
- Differential privacy: Training with differential privacy limits the influence of any individual data point, making unlearning less impactful and residual information less exploitable.
- Independent verification: Engage third-party auditors to verify unlearning completeness using standardized test suites.
- Rate limit unlearning requests: Monitor for patterns of strategic unlearning requests designed to degrade specific capabilities.
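A minimal sketch of the SISA idea (sharded training plus per-shard retraining on deletion); the `train_fn` callable, the dict-based examples, and the per-shard model list are placeholders, not a reference implementation:

def sisa_unlearn(shards: list, shard_models: list, forget_ids: set, train_fn) -> list:
    """Retrain only the shards that contain data to be forgotten.

    `shards` is a list of lists of examples (dicts with an "id" key) and
    `shard_models` holds the previously trained constituent models;
    predictions from the returned models would be aggregated (e.g., by
    majority vote) at inference time.
    """
    updated_models = []
    for shard, model in zip(shards, shard_models):
        if any(ex["id"] in forget_ids for ex in shard):
            # Affected shard: drop the forgotten examples and retrain only it.
            retained = [ex for ex in shard if ex["id"] not in forget_ids]
            updated_models.append(train_fn(retained))
        else:
            # Unaffected shards keep their existing constituent models.
            updated_models.append(model)
    return updated_models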
For Red Teamers
Identify unlearning claims
Determine whether the target system claims to support data deletion or unlearning. Check privacy policies and regulatory compliance documentation.
Test baseline membership inference
Before any unlearning, establish membership inference accuracy against the model. This provides a baseline for comparison.
Request unlearning and verify
Submit legitimate unlearning requests, then test whether the data's influence has been fully removed using membership inference, extraction attempts, and behavioral analysis.
Test for side effects
Evaluate whether unlearning requests cause capability degradation, safety weakening, or other unintended side effects.
Assess verification robustness
Evaluate whether the provider's unlearning verification process is rigorous enough to catch incomplete unlearning.
Summary
Machine unlearning is a regulatory requirement that creates a novel attack surface. Approximate unlearning methods leave residual information that can be recovered through membership inference, extraction attacks, and weight-diff analysis. The unlearning process itself can be exploited -- through strategic deletion requests that degrade capabilities or weaken safety. Effective defense requires moving toward exact unlearning methods (such as SISA training), differential privacy, and independent third-party verification. The fundamental tension between the right to be forgotten and model integrity remains an open challenge.