Machine Unlearning Attacks
Exploiting machine unlearning processes: recovering supposedly forgotten data, unlearning verification attacks, partial unlearning exploitation, and the fundamental limits of forgetting in neural networks.
Machine unlearning -- the process of removing the influence of specific training data from a trained model -- has become a regulatory necessity. The EU's GDPR right to erasure, California's CCPA, and similar privacy regulations require organizations to delete user data on request. For AI systems, deletion means not just removing data from storage but removing its influence from model weights. This process is the attack surface.
How Unlearning Works
Exact Unlearning
The gold standard: retrain the model from scratch without the data to be forgotten. This guarantees complete removal but is impractical for large models.
Cost of exact unlearning:
- GPT-4 class model: ~$100M+ per retraining run
- Time: weeks to months
- Frequency: potentially thousands of deletion requests per day
- Verdict: economically impossible for individual deletion requests

Approximate Unlearning Methods
| Method | Mechanism | Speed | Completeness |
|---|---|---|---|
| Gradient ascent | Increase loss on forgotten data | Fast | Partial -- leaves residual |
| Fisher forgetting | Use Fisher information to selectively modify relevant weights | Moderate | Better than gradient ascent |
| SISA training | Train on data shards; retrain only affected shard | Moderate | Good (within shard) |
| Knowledge distillation | Distill model into new model without forgotten knowledge | Slow | Good but imperfect |
| Influence function approximation | Estimate and remove influence of specific samples | Fast | Approximate |
```python
import torch
import torch.nn.functional as F

def gradient_ascent_unlearn(
    model: torch.nn.Module,
    forget_dataloader,
    learning_rate: float = 1e-5,
    epochs: int = 5,
):
    """Approximate unlearning via gradient ascent on data to forget."""
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    for epoch in range(epochs):
        for batch in forget_dataloader:
            inputs, targets = batch
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = F.cross_entropy(outputs, targets)
            # Gradient ASCENT -- maximize loss on forgotten data.
            # This pushes the model away from correct predictions
            # on the forgotten samples.
            (-loss).backward()
            optimizer.step()
    return model
```

Attack 1: Data Recovery After Unlearning
The primary attack against unlearning: determine whether data that was supposedly forgotten can still be extracted or inferred from the model.
Membership Inference Against Unlearned Models
Membership inference attacks can verify whether unlearning actually removed a data point's influence.
```python
import torch
import torch.nn.functional as F

def membership_inference_post_unlearn(
    model,
    forgotten_samples: list,
    retained_samples: list,
    non_member_samples: list,
):
    """Test whether unlearning actually removed data influence."""
    results = {"forgotten": [], "retained": [], "non_member": []}
    for category, samples in [
        ("forgotten", forgotten_samples),
        ("retained", retained_samples),
        ("non_member", non_member_samples),
    ]:
        for sample in samples:
            output = model(sample["input"])
            loss = F.cross_entropy(output, sample["target"])
            confidence = torch.softmax(output, dim=-1).max().item()
            results[category].append({
                "loss": loss.item(),
                "confidence": confidence,
                "top_prediction": output.argmax().item(),
            })
    # Analysis: forgotten samples should behave like non-members
    forgotten_losses = [r["loss"] for r in results["forgotten"]]
    non_member_losses = [r["loss"] for r in results["non_member"]]
    retained_losses = [r["loss"] for r in results["retained"]]
    avg_forgotten = sum(forgotten_losses) / len(forgotten_losses)
    avg_non_member = sum(non_member_losses) / len(non_member_losses)
    avg_retained = sum(retained_losses) / len(retained_losses)
    return {
        "avg_loss_forgotten": avg_forgotten,
        "avg_loss_non_member": avg_non_member,
        "avg_loss_retained": avg_retained,
        "unlearning_effective": abs(avg_forgotten - avg_non_member)
            < abs(avg_retained - avg_non_member) * 0.1,
        # Lower loss on forgotten data = model still remembers
        "residual_signal": avg_forgotten < avg_non_member,
    }
```

If the model's behavior on forgotten samples is distinguishable from its behavior on true non-members, unlearning is incomplete.
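Whether two loss distributions are "distinguishable" can be made concrete with a standard two-sample statistic. A minimal, dependency-free sketch using Welch's t statistic on illustrative loss values (the numbers and the decision threshold are hypothetical, not calibrated):

```python
import math

def welch_t_statistic(sample_a: list, sample_b: list) -> float:
    """Welch's t statistic: how separable are two loss distributions?"""
    def mean(xs):
        return sum(xs) / len(xs)
    def var(xs, m):
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    ma, mb = mean(sample_a), mean(sample_b)
    va, vb = var(sample_a, ma), var(sample_b, mb)
    return (ma - mb) / math.sqrt(va / len(sample_a) + vb / len(sample_b))

# Illustrative losses: forgotten samples still show suspiciously low loss
forgotten_losses = [0.21, 0.18, 0.25, 0.19, 0.23]
non_member_losses = [1.40, 1.55, 1.32, 1.61, 1.47]

t = welch_t_statistic(forgotten_losses, non_member_losses)
# |t| far from 0 -> the two groups are separable -> unlearning incomplete
print(f"t = {t:.2f}, distinguishable = {abs(t) > 2.0}")
```

A t statistic near zero means the forgotten samples are statistically indistinguishable from non-members on this metric; a large magnitude is evidence of residual memorization.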
Extraction Attacks on Residual Information
Even when membership inference suggests unlearning was effective, targeted extraction can recover partial information.
```python
def extract_residual_information(
    unlearned_model,
    original_model,
    probe_prompts: list,
):
    """Attempt to extract residual information from an unlearned model."""
    residuals = []
    for prompt in probe_prompts:
        # Generate from both models (greedy decoding for determinism)
        unlearned_output = unlearned_model.generate(prompt, temperature=0.0)
        original_output = original_model.generate(prompt, temperature=0.0)
        # Compare outputs -- similarities indicate residual knowledge.
        # Use various similarity metrics; token overlap is the simplest.
        overlap = _compute_token_overlap(unlearned_output, original_output)
        residuals.append({
            "prompt": prompt,
            "unlearned_response": unlearned_output,
            "original_response": original_output,
            "token_overlap": overlap,
            "residual_detected": overlap > 0.3,
        })
    return residuals
```

Attack 2: Exploiting Unlearning Side Effects
Unlearning modifies model weights, and these modifications can have unintended side effects that attackers exploit.
Capability Degradation Exploitation
Approximate unlearning often degrades model performance on tasks related to the forgotten data. An attacker can observe these degradation patterns to infer what was forgotten.
```python
def detect_unlearning_fingerprint(
    model_before_unlearning,
    model_after_unlearning,
    test_tasks: list,
):
    """Detect what was unlearned by observing capability changes."""
    fingerprint = []
    for task in test_tasks:
        # evaluate_task: task-specific scoring helper returning a score in [0, 1]
        score_before = evaluate_task(model_before_unlearning, task)
        score_after = evaluate_task(model_after_unlearning, task)
        degradation = score_before - score_after
        fingerprint.append({
            "task": task["name"],
            "domain": task["domain"],
            "score_before": score_before,
            "score_after": score_after,
            "degradation": degradation,
            "likely_related_to_forgotten_data": degradation > 0.1,
        })
    # Tasks with significant degradation are likely related to forgotten data
    affected_domains = [
        f["domain"] for f in fingerprint
        if f["likely_related_to_forgotten_data"]
    ]
    return {
        "fingerprint": fingerprint,
        "affected_domains": affected_domains,
        "inference": f"Forgotten data likely related to: {', '.join(set(affected_domains))}",
    }
```

Model Diff Analysis
If an attacker has access to model weights before and after unlearning (e.g., in open-weight models with versioned releases), the weight differences directly reveal which neurons and layers were modified, providing information about the forgotten data.
```python
import torch

def analyze_weight_diff(
    model_before: dict,
    model_after: dict,
    top_k: int = 100,
):
    """Analyze weight differences to infer unlearning targets."""
    diffs = {}
    for layer_name in model_before:
        if layer_name in model_after:
            diff = torch.abs(model_before[layer_name] - model_after[layer_name])
            diffs[layer_name] = {
                "mean_diff": diff.mean().item(),
                "max_diff": diff.max().item(),
                "num_modified": (diff > 1e-6).sum().item(),
                "total_params": diff.numel(),
            }
    # Sort layers by modification intensity
    sorted_layers = sorted(
        diffs.items(),
        key=lambda x: x[1]["mean_diff"],
        reverse=True,
    )
    return sorted_layers[:top_k]
```

Attack 3: Unlearning Verification Exploitation
Organizations need to verify that unlearning was successful. The verification process itself can be attacked.
Adversarial Verification
If the entity performing unlearning controls the verification process, they can design verification tests that their unlearning method passes while leaving significant residual information.
| Verification Method | What It Tests | What It Misses |
|---|---|---|
| Output comparison | Whether model produces same output as before | Latent knowledge that requires specific prompts |
| Membership inference | Whether model's loss/confidence differs for forgotten data | Encoded information in intermediate representations |
| Canary extraction | Whether planted canary strings can be extracted | Information stored in implicit associations |
| Benchmark evaluation | Whether model performance changed on standard tasks | Domain-specific knowledge degradation |
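To see how a self-chosen verification test can pass while residual information survives, consider this toy sketch. The "models" are lookup tables and the probe strings are invented, but the pattern is the attack in miniature: the provider's exact-probe check passes while a paraphrased probe still extracts the data.

```python
# Two toy "models" as prompt -> completion lookup tables (illustrative only)
original = {
    "Alice's SSN is": " 123-45-6789",
    "The SSN of Alice is": " 123-45-6789",
}
unlearned = {
    "Alice's SSN is": " [redacted]",        # patched for the canonical probe
    "The SSN of Alice is": " 123-45-6789",  # paraphrase still leaks
}

def weak_verifier(probe: str) -> bool:
    """Provider's self-chosen test: pass if the canonical probe changed."""
    return unlearned[probe] != original[probe]

def adversarial_probe(probes: list) -> list:
    """Third-party check: sweep paraphrases, flag any unchanged leak."""
    return [p for p in probes if unlearned[p] == original[p]]

assert weak_verifier("Alice's SSN is")  # provider's own verification passes
leaks = adversarial_probe(list(original))
print("residual leaks:", leaks)  # the paraphrased probe still extracts the data
```

The general lesson: verification coverage must be chosen adversarially, not by the party whose unlearning is being verified.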
Third-Party Verification
Robust unlearning verification requires independent third-party testing with access to both the forgotten data and the unlearned model.
```python
class UnlearningVerifier:
    """Independent verification of machine unlearning completeness."""

    def __init__(self, forgotten_data, model, reference_non_member_data):
        self.forgotten_data = forgotten_data
        self.model = model
        self.non_member_data = reference_non_member_data

    def full_verification(self) -> dict:
        """Run comprehensive unlearning verification suite."""
        results = {}
        # Test 1: Membership inference
        results["membership_inference"] = self._test_membership_inference()
        # Test 2: Data extraction attempts
        results["extraction"] = self._test_extraction()
        # Test 3: Behavioral consistency with retrained model
        results["behavioral"] = self._test_behavioral_consistency()
        # Test 4: Representation analysis
        results["representation"] = self._test_representation_residuals()
        # Overall verdict
        all_passed = all(r.get("passed", False) for r in results.values())
        results["verdict"] = "PASS" if all_passed else "FAIL"
        return results

    def _test_membership_inference(self):
        # Compare model behavior on forgotten vs non-member data
        pass

    def _test_extraction(self):
        # Attempt to extract forgotten data via prompting
        pass

    def _test_behavioral_consistency(self):
        # Compare behavior with a model retrained without forgotten data
        pass

    def _test_representation_residuals(self):
        # Check intermediate representations for forgotten data signatures
        pass
```

Attack 4: Strategic Unlearning Requests
An attacker who can submit unlearning requests strategically can degrade model capabilities in targeted ways.
Targeted Capability Degradation
By requesting deletion of carefully selected data points, an attacker can cause disproportionate capability loss in specific domains. If the model's knowledge of a topic rests on a small number of high-influence training examples, removing those examples destroys the capability.
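One way an attacker might identify high-influence deletion targets is with a cheap influence proxy such as per-sample gradient norm. This is a hypothetical sketch, not an established selection method; `rank_deletion_targets` and its scoring heuristic are assumptions:

```python
import torch
import torch.nn.functional as F

def rank_deletion_targets(model, candidate_samples, top_k=10):
    """Rank candidate training samples by a cheap influence proxy:
    per-sample gradient norm. High-norm samples tend to carry more of
    the model's knowledge, so deleting them hurts capabilities most."""
    scored = []
    for inputs, targets in candidate_samples:
        model.zero_grad()
        loss = F.cross_entropy(model(inputs), targets)
        loss.backward()
        grad_norm = sum(
            p.grad.norm().item()
            for p in model.parameters()
            if p.grad is not None
        )
        scored.append((grad_norm, inputs, targets))
    # Highest-influence samples first: these become the deletion requests
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:top_k]
```

In practice an attacker would need either white-box access or a surrogate model to compute these scores, which is why this attack is most plausible against open-weight models.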
Safety Unlearning
An attacker could request removal of data that contributes to the model's safety training. If safety behaviors depend on specific training examples, unlearning those examples could weaken safety guardrails.
Defenses Against Unlearning Attacks
For Model Providers
- Use SISA training: Partition training data into shards. Unlearning only requires retraining the affected shard, making exact unlearning feasible for reasonable shard sizes.
- Differential privacy: Training with differential privacy limits the influence of any individual data point, making unlearning less impactful and residual information less exploitable.
- Independent verification: Engage third-party auditors to verify unlearning completeness using standardized test suites.
- Rate limiting unlearning requests: Monitor for patterns of strategic unlearning requests designed to degrade specific capabilities.
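The SISA bookkeeping that makes per-shard retraining possible can be sketched with a hash-based shard assignment. The class and record IDs below are illustrative; real SISA additionally slices each shard and checkpoints sub-models so retraining can resume from the slice containing the deleted record:

```python
import hashlib

class SISAUnlearner:
    """Bookkeeping for SISA-style unlearning: data is partitioned into
    shards, each shard trains its own sub-model, and deleting a record
    only requires retraining the shard that contained it."""

    def __init__(self, num_shards: int):
        self.num_shards = num_shards
        self.shards = {i: set() for i in range(num_shards)}

    def shard_of(self, record_id: str) -> int:
        # Deterministic assignment: same record always maps to same shard
        digest = hashlib.sha256(record_id.encode()).hexdigest()
        return int(digest, 16) % self.num_shards

    def add(self, record_id: str):
        self.shards[self.shard_of(record_id)].add(record_id)

    def unlearn(self, record_id: str) -> int:
        """Remove a record; return the single shard index to retrain."""
        idx = self.shard_of(record_id)
        self.shards[idx].discard(record_id)
        return idx  # retrain only this shard's sub-model

sisa = SISAUnlearner(num_shards=8)
for rid in ["user-1", "user-2", "user-3"]:
    sisa.add(rid)
affected = sisa.unlearn("user-2")
print(f"retrain shard {affected} only ({sisa.num_shards - 1} shards untouched)")
```

The cost of each deletion drops from a full retraining run to one shard's training budget, which is what makes exact unlearning economically viable at scale.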
For Red Teamers
1. Identify unlearning claims: Determine whether the target system claims to support data deletion or unlearning. Check privacy policies and regulatory compliance documentation.
2. Test baseline membership inference: Before any unlearning, establish membership inference accuracy against the model. This provides a baseline for comparison.
3. Request unlearning and verify: Submit legitimate unlearning requests, then test whether the data's influence has been fully removed using membership inference, extraction attempts, and behavioral analysis.
4. Test for side effects: Evaluate whether unlearning requests cause capability degradation, safety weakening, or other unintended side effects.
5. Assess verification robustness: Evaluate whether the provider's unlearning verification process is rigorous enough to catch incomplete unlearning.
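The baseline membership-inference step can be quantified as the AUC of a threshold attack over per-sample losses. A dependency-free sketch with illustrative loss values (real measurements would come from the target model):

```python
def mia_auc(member_losses: list, non_member_losses: list) -> float:
    """AUC of a threshold attack that predicts 'member' when loss is low.
    0.5 = no signal (what effective unlearning should look like);
    values near 1.0 = full leakage."""
    pairs = correct = 0
    for m in member_losses:
        for n in non_member_losses:
            pairs += 1
            if m < n:
                correct += 1
            elif m == n:
                correct += 0.5
    return correct / pairs

# Illustrative pre-unlearning baseline: members have clearly lower loss
members = [0.10, 0.20, 0.15, 0.30]
non_members = [0.90, 1.10, 0.80, 1.30]
print(f"baseline MIA AUC: {mia_auc(members, non_members):.2f}")
```

After unlearning, rerun the same measurement on the forgotten samples: if the AUC stays well above 0.5, the deletion request was not honored at the weight level.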
Summary
Machine unlearning is a regulatory requirement that creates a novel attack surface. Approximate unlearning methods leave residual information that can be recovered through membership inference, extraction attacks, and weight diff analysis. The unlearning process itself can be exploited -- through strategic deletion requests that degrade capabilities or weaken safety. Effective defense requires moving toward exact unlearning methods (like SISA training), differential privacy, and independent third-party verification. The fundamental tension between the right to be forgotten and model integrity remains an open challenge.