Machine Unlearning Attacks
Exploiting machine unlearning processes: recovering supposedly forgotten data, attacking unlearning verification, exploiting partial unlearning, and probing the fundamental limits of forgetting in neural networks.
Machine unlearning -- the process of removing the influence of specific training data from a trained model -- has become a regulatory necessity. The EU's GDPR right to erasure, California's CCPA, and similar privacy regulations require organizations to delete user data on request. For AI systems, deletion means not just removing data from storage but removing its influence from model weights. This process is the attack surface.
How Unlearning Works
Exact Unlearning
The gold standard: retrain the model from scratch without the data to be forgotten. This guarantees complete removal but is impractical for large models.
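At its core, exact unlearning is just dataset filtering followed by a full retraining run. A minimal sketch, where the `train_fn` callable and the `"id"` field are illustrative placeholders rather than a specific framework's API:

def exact_unlearn(train_dataset: list, forget_ids: set, train_fn):
    """Exact unlearning: drop the forget set, then retrain from scratch.

    Assumes each training example is a dict with an "id" key and that
    `train_fn` wraps an entire training pipeline (both hypothetical).
    """
    retained = [ex for ex in train_dataset if ex["id"] not in forget_ids]
    # The retrained model provably carries no influence from forget_ids,
    # but every deletion batch costs a full training run.
    return train_fn(retained)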
Cost of exact unlearning:
- GPT-4 class model: ~$100M+ per retraining run
- Time: weeks to months
- Frequency: potentially thousands of deletion requests per day
- Verdict: economically infeasible for individual deletion requests

Approximate Unlearning Methods
| Method | Mechanism | Speed | Completeness |
|---|---|---|---|
| Gradient ascent | Increase loss on forgotten data | Fast | Partial -- leaves residual influence |
| Fisher forgetting | Use Fisher information to selectively modify relevant weights | Moderate | Better than gradient ascent |
| SISA training | Train on data shards; retrain only affected shard | Moderate | Good (within shard) |
| Knowledge distillation | Distill model into new model without forgotten knowledge | Slow | Good but imperfect |
| Influence function approximation | Estimate and remove influence of specific samples | Fast | Approximate |
import torch
import torch.nn.functional as F
def gradient_ascent_unlearn(
    model: torch.nn.Module,
    forget_dataloader,
    learning_rate: float = 1e-5,
    epochs: int = 5
):
    """Approximate unlearning via gradient ascent on the data to forget."""
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    model.train()
    for epoch in range(epochs):
        for batch in forget_dataloader:
            inputs, targets = batch
            outputs = model(inputs)
            loss = F.cross_entropy(outputs, targets)
            # Gradient ASCENT -- maximize loss on the forgotten data.
            # This pushes the model away from correct predictions
            # on the forgotten samples.
            optimizer.zero_grad()
            (-loss).backward()
            optimizer.step()
    return model

Attack 1: Data Recovery After Unlearning
The primary attack against unlearning: determine whether data that was supposedly forgotten can still be extracted or inferred from the model.
Membership Inference Against Unlearned Models
Membership inference attacks can verify whether unlearning actually removed a data point's influence.
def membership_inference_post_unlearn(
    model,
    forgotten_samples: list,
    retained_samples: list,
    non_member_samples: list
):
    """Test whether unlearning actually removed a data point's influence."""
    results = {"forgotten": [], "retained": [], "non_member": []}
    for category, samples in [
        ("forgotten", forgotten_samples),
        ("retained", retained_samples),
        ("non_member", non_member_samples)
    ]:
        for sample in samples:
            outputs = model(sample["input"])
            loss = F.cross_entropy(outputs, sample["target"])
            confidence = torch.softmax(outputs, dim=-1).max().item()
            results[category].append({
                "loss": loss.item(),
                "confidence": confidence,
                "top_prediction": outputs.argmax().item()
            })
    # Analysis: forgotten samples should behave like non-members
    forgotten_losses = [r["loss"] for r in results["forgotten"]]
    non_member_losses = [r["loss"] for r in results["non_member"]]
    retained_losses = [r["loss"] for r in results["retained"]]
    avg_forgotten = sum(forgotten_losses) / len(forgotten_losses)
    avg_non_member = sum(non_member_losses) / len(non_member_losses)
    avg_retained = sum(retained_losses) / len(retained_losses)
    return {
        "avg_loss_forgotten": avg_forgotten,
        "avg_loss_non_member": avg_non_member,
        "avg_loss_retained": avg_retained,
        "unlearning_effective": abs(avg_forgotten - avg_non_member) < abs(avg_retained - avg_non_member) * 0.1,
        "residual_signal": avg_forgotten < avg_non_member  # Lower loss = model still remembers
    }

If the model's behavior on forgotten samples is distinguishable from its behavior on true non-members, unlearning is incomplete.
Extraction Attacks on Residual Information
Even when membership inference suggests unlearning was effective, targeted extraction can recover partial information.
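The extraction sketch below calls a `_compute_token_overlap` helper that is not defined in this section; a minimal version, assuming whitespace tokenization and Jaccard overlap, might look like:

def _compute_token_overlap(text_a: str, text_b: str) -> float:
    """Jaccard overlap between the token sets of two generations."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    # High overlap between unlearned and original outputs suggests
    # residual knowledge survived the unlearning procedure.
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)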
def extract_residual_information(
    unlearned_model,
    original_model,
    probe_prompts: list
):
    """Attempt to extract residual information from an unlearned model."""
    residuals = []
    for prompt in probe_prompts:
        # Generate from both models
        unlearned_output = unlearned_model.generate(prompt, temperature=0.0)
        original_output = original_model.generate(prompt, temperature=0.0)
        # Compare outputs -- similarities indicate residual knowledge.
        # Use various similarity metrics; token overlap is the simplest.
        overlap = _compute_token_overlap(unlearned_output, original_output)
        residuals.append({
            "prompt": prompt,
            "unlearned_response": unlearned_output,
            "original_response": original_output,
            "token_overlap": overlap,
            "residual_detected": overlap > 0.3
        })
    return residuals

Attack 2: Exploiting Unlearning Side Effects
Unlearning modifies model weights, and these modifications can have unintended side effects that attackers can exploit.
Exploiting Capability Degradation
Approximate unlearning often degrades model performance on tasks related to the forgotten data. Attackers can observe these degradation patterns to infer what was forgotten.
def detect_unlearning_fingerprint(
    model_before_unlearning,
    model_after_unlearning,
    test_tasks: list
):
    """Detect what was unlearned by observing capability changes."""
    fingerprint = []
    for task in test_tasks:
        score_before = evaluate_task(model_before_unlearning, task)
        score_after = evaluate_task(model_after_unlearning, task)
        degradation = score_before - score_after
        fingerprint.append({
            "task": task["name"],
            "domain": task["domain"],
            "score_before": score_before,
            "score_after": score_after,
            "degradation": degradation,
            "likely_related_to_forgotten_data": degradation > 0.1
        })
    # Tasks with significant degradation are likely related to forgotten data
    affected_domains = [
        f["domain"] for f in fingerprint
        if f["likely_related_to_forgotten_data"]
    ]
    return {
        "fingerprint": fingerprint,
        "affected_domains": affected_domains,
        "inference": f"Forgotten data likely related to: {', '.join(set(affected_domains))}"
    }

Model Diff Analysis
If an attacker has access to model weights before and after unlearning (e.g., in open-weight models with versioned releases), the weight differences directly reveal which neurons and layers were modified, providing information about the forgotten data.
def analyze_weight_diff(
    model_before: dict,
    model_after: dict,
    top_k: int = 100
):
    """Analyze weight differences to infer unlearning targets."""
    diffs = {}
    for layer_name in model_before:
        if layer_name in model_after:
            diff = torch.abs(model_before[layer_name] - model_after[layer_name])
            diffs[layer_name] = {
                "mean_diff": diff.mean().item(),
                "max_diff": diff.max().item(),
                "num_modified": (diff > 1e-6).sum().item(),
                "total_params": diff.numel()
            }
    # Sort layers by modification intensity
    sorted_layers = sorted(
        diffs.items(),
        key=lambda x: x[1]["mean_diff"],
        reverse=True
    )
    return sorted_layers[:top_k]

Attack 3: Exploiting Unlearning Verification
Organizations need to verify that unlearning was successful. The verification process itself can be attacked.
Adversarial Verification
If the entity performing unlearning controls the verification process, they can design verification tests that their unlearning method passes while leaving significant residual information.
| Verification Method | What It Tests | What It Misses |
|---|---|---|
| Output comparison | Whether the model produces the same output as before | Latent knowledge that requires specific prompts |
| Membership inference | Whether the model's loss/confidence differs for forgotten data | Encoded information in intermediate representations |
| Canary extraction | Whether planted canary strings can be extracted | Information stored in implicit associations |
| Benchmark evaluation | Whether model performance changed on standard tasks | Domain-specific knowledge degradation |
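For example, a provider-controlled check that implements only the first row of the table above -- greedy output comparison on a fixed probe set -- will pass even when latent knowledge remains. A hypothetical sketch of such a weak verifier (the model's `generate` method and the canary string are assumptions for illustration):

def weak_output_comparison_check(unlearned_model, probe_prompts: list) -> bool:
    """A deliberately shallow verification an adversarial provider might use.

    Passes if none of the fixed probes reproduces the forbidden text verbatim,
    ignoring paraphrases, alternative prompts, and internal representations.
    """
    FORBIDDEN_STRINGS = ["alice@example.com"]  # illustrative canary only
    for prompt in probe_prompts:
        output = unlearned_model.generate(prompt, temperature=0.0)
        if any(s in output for s in FORBIDDEN_STRINGS):
            return False  # "fails" only on exact reproduction
    return True  # passes despite possible latent or rephrased knowledge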
Third-Party Verification
Robust unlearning verification requires independent third-party testing with access to both the forgotten data and the unlearned model.
class UnlearningVerifier:
    """Independent verification of machine unlearning completeness."""

    def __init__(self, forgotten_data, model, reference_non_member_data):
        self.forgotten_data = forgotten_data
        self.model = model
        self.non_member_data = reference_non_member_data

    def full_verification(self) -> dict:
        """Run the comprehensive unlearning verification suite."""
        results = {}
        # Test 1: Membership inference
        results["membership_inference"] = self._test_membership_inference()
        # Test 2: Data extraction attempts
        results["extraction"] = self._test_extraction()
        # Test 3: Behavioral consistency with a retrained model
        results["behavioral"] = self._test_behavioral_consistency()
        # Test 4: Representation analysis
        results["representation"] = self._test_representation_residuals()
        # Overall verdict
        all_passed = all(r.get("passed", False) for r in results.values())
        results["verdict"] = "PASS" if all_passed else "FAIL"
        return results

    def _test_membership_inference(self) -> dict:
        # Compare model behavior on forgotten vs. non-member data
        return {"passed": False, "detail": "not implemented"}

    def _test_extraction(self) -> dict:
        # Attempt to extract forgotten data via prompting
        return {"passed": False, "detail": "not implemented"}

    def _test_behavioral_consistency(self) -> dict:
        # Compare behavior with a model retrained without the forgotten data
        return {"passed": False, "detail": "not implemented"}

    def _test_representation_residuals(self) -> dict:
        # Check intermediate representations for forgotten-data signatures
        return {"passed": False, "detail": "not implemented"}

Attack 4: Strategic Unlearning Requests
Attackers who can submit unlearning requests strategically can degrade model capabilities in targeted ways.
Targeted Capability Degradation
By requesting deletion of carefully selected data points, attackers can cause disproportionate capability loss in specific domains. If the model's knowledge of a topic rests on a small number of high-influence training examples, removing those examples destroys the capability.
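A hypothetical attacker workflow, assuming access to per-example influence estimates (represented here as a plain list of dicts; the field names are assumptions, not the output format of any particular tool):

def select_deletion_requests(influence_scores: list, target_domain: str, budget: int) -> list:
    """Pick the deletion requests expected to do the most targeted damage.

    `influence_scores` is assumed to be a list of dicts like
    {"example_id": ..., "domain": ..., "influence": float}.
    """
    in_domain = [s for s in influence_scores if s["domain"] == target_domain]
    # Highest-influence examples first: removing them is most likely to
    # collapse the model's capability in the target domain.
    ranked = sorted(in_domain, key=lambda s: s["influence"], reverse=True)
    return [s["example_id"] for s in ranked[:budget]]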
Safety Unlearning
Attackers could request removal of data that contributes to the model's safety training. If safety behaviors depend on specific training examples, unlearning those examples could weaken safety guardrails.
Defenses Against Unlearning Attacks
For Model Providers
- Use SISA training: Partition training data into shards. Unlearning only requires retraining the affected shard, making exact unlearning feasible for reasonable shard sizes (a minimal sketch follows this list).
- Differential privacy: Training with differential privacy limits the influence of any individual data point, making unlearning less impactful and residual information less exploitable.
- Independent verification: Engage third-party auditors to verify unlearning completeness using standardized test suites.
- Rate limit unlearning requests: Monitor for patterns of strategic unlearning requests designed to degrade specific capabilities.
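A minimal sketch of the SISA idea (sharded training plus per-shard retraining on deletion); the `train_fn` callable, the dict-based examples, and the per-shard model list are placeholders, not a reference implementation:

def sisa_unlearn(shards: list, shard_models: list, forget_ids: set, train_fn) -> list:
    """Retrain only the shards that contain data to be forgotten.

    `shards` is a list of lists of examples (dicts with an "id" key) and
    `shard_models` holds the previously trained constituent models;
    predictions from the returned models would be aggregated (e.g., by
    majority vote) at inference time.
    """
    updated_models = []
    for shard, model in zip(shards, shard_models):
        if any(ex["id"] in forget_ids for ex in shard):
            # Affected shard: drop the forgotten examples and retrain only it.
            retained = [ex for ex in shard if ex["id"] not in forget_ids]
            updated_models.append(train_fn(retained))
        else:
            # Unaffected shards keep their existing constituent models.
            updated_models.append(model)
    return updated_models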
For Red Teamers
Identify unlearning claims
Determine whether the target system claims to support data deletion or unlearning. Check privacy policies and regulatory compliance documentation.
Test baseline membership inference
Before any unlearning, establish membership inference accuracy against the model. This provides a baseline for comparison.
Request unlearning and verify
Submit legitimate unlearning requests, then test whether the data's influence has been fully removed using membership inference, extraction attempts, and behavioral analysis.
Test for side effects
Evaluate whether unlearning requests cause capability degradation, safety weakening, or other unintended side effects.
Assess verification robustness
Evaluate whether the provider's unlearning verification process is rigorous enough to catch incomplete unlearning.
Summary
Machine unlearning is a regulatory requirement that creates a novel attack surface. Approximate unlearning methods leave residual information that can be recovered through membership inference, extraction attacks, and weight-diff analysis. The unlearning process itself can be exploited -- through strategic deletion requests that degrade capabilities or weaken safety. Effective defense requires moving toward exact unlearning methods (such as SISA training), differential privacy, and independent third-party verification. The fundamental tension between the right to be forgotten and model integrity remains an open challenge.