Continual Learning Security
Security risks in continual learning systems: catastrophic forgetting exploitation, task interference attacks, replay buffer poisoning, and stability-plasticity manipulation.
Continual learning -- the ability of a model to learn new tasks or incorporate new data without retraining from scratch -- is essential for deployed AI systems that must adapt to changing environments. But the mechanisms that enable continual learning also create novel attack surfaces. An attacker who can influence what a model learns during its deployment lifetime can selectively degrade capabilities, implant backdoors that activate after future updates, or exploit the tension between learning new information and retaining old knowledge.
Continual Learning Architectures and Attack Surfaces
Common Approaches
| Approach | Mechanism | Attack Surface |
|---|---|---|
| Replay-based | Store and replay old training examples | Replay buffer poisoning |
| Regularization-based | Constrain weight updates to protect old knowledge | Constraint manipulation |
| Architecture-based | Allocate separate parameters for new tasks | Task routing manipulation |
| Gradient-based | Project gradients to avoid interfering with old tasks | Gradient direction attacks |
| Online fine-tuning | Continuously fine-tune on new data | Data stream poisoning |
┌─────────────────────────────────────────────────────────┐
│ Continual Learning System │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ New Data │───▶│ Update │───▶│ Updated │ │
│ │ Stream │ │ Mechanism │ │ Model │ │
│ └──────────────┘ └──────┬───────┘ └────────────┘ │
│ ▲ │ │
│ Attack 1 Attack 2 │
│ Data poisoning Mechanism exploitation │
│ │ │
│ ┌───────▼───────┐ │
│ │ Memory / │ │
│ │ Replay Buffer │ │
│ └───────────────┘ │
│ Attack 3 │
│ Buffer poisoning │
└─────────────────────────────────────────────────────────┘
Attack 1: Catastrophic Forgetting Exploitation
Catastrophic forgetting is normally a problem to be solved. For an attacker, it is a weapon. By feeding carefully crafted data to a continual learning system, the attacker can cause the model to forget specific capabilities.
Targeted Forgetting Attack
The attacker designs training examples for a new task that maximally interfere with the weights responsible for a target capability.
import torch
import torch.nn.functional as F
def targeted_forgetting_attack(
model: torch.nn.Module,
target_task_data: list, # Data from the capability to destroy
attack_task_data: list, # Crafted data that interferes
learning_rate: float = 1e-4,
attack_steps: int = 100
):
"""Craft attack data that causes forgetting of a specific capability."""
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    # Phase 1: Estimate which parameters the target capability depends on,
    # using accumulated absolute gradients as a simple importance measure
    target_gradients = {}
    optimizer.zero_grad()
    for batch in target_task_data:
outputs = model(batch["input"])
loss = F.cross_entropy(outputs, batch["target"])
loss.backward()
for name, param in model.named_parameters():
if param.grad is not None:
if name not in target_gradients:
target_gradients[name] = torch.zeros_like(param.grad)
target_gradients[name] += param.grad.abs()
optimizer.zero_grad()
# Phase 2: Train on attack data that modifies those specific parameters
for step in range(attack_steps):
for batch in attack_task_data:
outputs = model(batch["input"])
loss = F.cross_entropy(outputs, batch["target"])
            # The per-parameter scaling below steers this update toward the
            # same weights the target capability depends on, maximizing
            # destructive interference with that capability
loss.backward()
# Scale gradients: amplify updates to target-task-critical parameters
for name, param in model.named_parameters():
if param.grad is not None and name in target_gradients:
importance = target_gradients[name]
importance_normalized = importance / (importance.max() + 1e-8)
param.grad *= (1.0 + importance_normalized)
optimizer.step()
optimizer.zero_grad()
    return model
Measuring Forgetting
def measure_selective_forgetting(
model_before,
model_after,
task_evaluations: dict
):
"""Measure how much each task was affected by continual learning."""
results = {}
for task_name, evaluator in task_evaluations.items():
score_before = evaluator(model_before)
score_after = evaluator(model_after)
forgetting = score_before - score_after
results[task_name] = {
"before": score_before,
"after": score_after,
"forgetting": forgetting,
"forgetting_pct": (forgetting / score_before * 100) if score_before > 0 else 0,
"severity": (
"critical" if forgetting / max(score_before, 1e-8) > 0.5
else "significant" if forgetting / max(score_before, 1e-8) > 0.2
else "moderate" if forgetting / max(score_before, 1e-8) > 0.1
else "minimal"
)
}
    return results
Attack 2: Task Interference
When a model learns multiple tasks, the tasks share parameters. An attacker can design tasks that create destructive interference with existing capabilities.
Cross-Task Conflict Injection
def create_conflicting_task(
    model,
    original_task_data: list,
    num_classes: int,
    conflict_type: str = "label_flip"
):
"""Create a new task that conflicts with an existing task."""
conflicting_data = []
for sample in original_task_data:
if conflict_type == "label_flip":
# Same inputs, opposite labels
conflicting_data.append({
"input": sample["input"],
"target": (sample["target"] + 1) % num_classes # Shift labels
})
elif conflict_type == "semantic_conflict":
            # Similar inputs, contradictory outputs
            # (add_minor_noise and generate_contradictory_target are
            # placeholder helpers for input perturbation and label generation)
conflicting_data.append({
"input": add_minor_noise(sample["input"]),
"target": generate_contradictory_target(sample["target"])
})
    return conflicting_data
Safety-Capability Conflict
A sophisticated attacker can create task interference between the model's safety training and its capability on specific tasks. If the model is updated on data where safety-relevant features (refusal patterns, caution markers) are associated with incorrect task performance, it learns that safety behaviors lead to poor outcomes -- weakening safety guardrails.
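A minimal sketch of how such a conflict set might be assembled, assuming text samples stored as {"input", "target"} dicts as in the earlier examples; the refusal marker and the target-degradation rule are illustrative placeholders, not a specific known attack recipe:
def build_safety_conflict_set(
    task_samples: list,
    refusal_marker: str = "I cannot help with that."
) -> list:
    """Associate safety-style phrasing with deliberately poor task outcomes."""
    conflict_set = []
    for sample in task_samples:
        conflict_set.append({
            # Prepend caution/refusal phrasing to the input context
            "input": f"{refusal_marker}\n{sample['input']}",
            # Pair it with a truncated (deliberately degraded) target so the
            # update associates cautious behavior with poor task performance
            "target": sample["target"][: max(1, len(sample["target"]) // 4)],
            "task": sample.get("task", "unknown"),
        })
    return conflict_set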
Attack 3: Replay Buffer Poisoning
Replay-based continual learning stores examples from previous tasks and replays them during new task training to prevent forgetting. If an attacker can poison the replay buffer, they control what the model "remembers."
class ReplayBufferAttack:
"""Attack a continual learning system through its replay buffer."""
def __init__(self, buffer_size: int, access_type: str = "write"):
self.buffer_size = buffer_size
self.access_type = access_type
def poison_via_insertion(self, buffer: list, poisoned_samples: list) -> list:
"""Insert poisoned samples into the replay buffer."""
# If we can add to the buffer, inject poisoned samples
# These will be replayed during future training
for sample in poisoned_samples:
if len(buffer) >= self.buffer_size:
# Replace least-recently-used samples
buffer.pop(0)
buffer.append(sample)
return buffer
def poison_via_eviction(self, buffer: list, target_task: str) -> list:
"""Evict samples from a specific task to cause forgetting."""
# If we can trigger buffer management, cause eviction of
# critical replay samples for the target task
remaining = [s for s in buffer if s.get("task") != target_task]
return remaining
def create_backdoor_replay(
self,
clean_samples: list,
trigger: str,
target_output: str
) -> list:
"""Create replay samples that install a backdoor."""
backdoored = []
for sample in clean_samples:
# Most samples are clean (maintain normal replay function)
backdoored.append(sample)
        # Add triggered samples (insert_trigger is a placeholder helper that
        # embeds the trigger pattern into the input)
for sample in clean_samples[:len(clean_samples) // 10]:
backdoored.append({
"input": insert_trigger(sample["input"], trigger),
"target": target_output,
"task": sample.get("task", "unknown")
})
        return backdoored
Attack 4: Stability-Plasticity Manipulation
Continual learning systems balance stability (retaining existing knowledge) and plasticity (adapting to new data). An attacker who controls part of the data stream can push this balance in either direction.
Forcing Over-Plasticity
By submitting data that requires large weight updates, the attacker pushes the system toward plasticity, increasing vulnerability to catastrophic forgetting.
Forcing Over-Stability
By submitting data that conflicts with existing knowledge, the attacker triggers stability mechanisms that prevent the model from learning legitimate new information. The model becomes "frozen" -- unable to adapt to genuinely new tasks.
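A sketch of how the two kinds of attack data used by the test below might be crafted for a simple classifier; the extreme-magnitude inputs and label shifts are illustrative stand-ins for a real adversarial construction:
import torch

def craft_plasticity_attack_data(num_samples: int, input_dim: int, num_classes: int) -> list:
    """Far out-of-distribution inputs with random labels force large,
    uninformative weight updates, pushing the system toward over-plasticity."""
    return [
        {
            "input": torch.randn(1, input_dim) * 50.0,  # extreme-magnitude inputs
            "target": torch.randint(0, num_classes, (1,)),
        }
        for _ in range(num_samples)
    ]

def craft_stability_attack_data(reference_samples: list, num_classes: int) -> list:
    """Re-submitting known inputs with contradictory labels creates repeated
    conflicts that can trigger stability mechanisms and suppress later learning."""
    return [
        {"input": s["input"], "target": (s["target"] + 1) % num_classes}
        for s in reference_samples
    ]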
import copy

def test_stability_plasticity_balance(
model,
update_mechanism,
benign_new_task: list,
attack_data_plasticity: list,
attack_data_stability: list,
existing_tasks: dict
):
"""Test vulnerability to stability-plasticity manipulation."""
results = {}
# Baseline: normal new task learning
baseline_model = copy.deepcopy(model)
update_mechanism.update(baseline_model, benign_new_task)
results["baseline"] = {
"new_task_accuracy": evaluate(baseline_model, benign_new_task),
"old_task_retention": {
name: evaluate(baseline_model, data)
for name, data in existing_tasks.items()
}
}
# Over-plasticity attack
plastic_model = copy.deepcopy(model)
update_mechanism.update(plastic_model, attack_data_plasticity)
update_mechanism.update(plastic_model, benign_new_task)
results["over_plasticity"] = {
"new_task_accuracy": evaluate(plastic_model, benign_new_task),
"old_task_retention": {
name: evaluate(plastic_model, data)
for name, data in existing_tasks.items()
}
}
# Over-stability attack
stable_model = copy.deepcopy(model)
update_mechanism.update(stable_model, attack_data_stability)
update_mechanism.update(stable_model, benign_new_task)
results["over_stability"] = {
"new_task_accuracy": evaluate(stable_model, benign_new_task),
"old_task_retention": {
name: evaluate(stable_model, data)
for name, data in existing_tasks.items()
}
}
    return results
Defense Strategies
Data Validation
Validate all incoming data before it enters the continual learning pipeline. Check for distributional anomalies, label consistency, and adversarial patterns.
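One possible shape for such a validation gate; the z-score screen and label-range check are illustrative, and a production pipeline would add duplicate detection and trigger-pattern scanning:
import torch

def validate_incoming_batch(
    batch: list,
    reference_mean: torch.Tensor,
    reference_std: torch.Tensor,
    num_classes: int,
    max_z_score: float = 6.0
) -> dict:
    """Screen a candidate update batch before it reaches the learner."""
    anomalies = []
    for i, sample in enumerate(batch):
        # Flag inputs far outside the reference feature distribution
        z = ((sample["input"] - reference_mean) / (reference_std + 1e-8)).abs().max().item()
        if z > max_z_score:
            anomalies.append({"index": i, "reason": "distributional_outlier", "max_z": z})
        # Flag labels outside the expected range
        label = int(sample["target"])
        if not 0 <= label < num_classes:
            anomalies.append({"index": i, "reason": "invalid_label", "label": label})
    return {"accepted": len(anomalies) == 0, "anomalies": anomalies}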
Update Monitoring
Monitor model performance on a fixed validation set after every update. If performance on existing tasks drops beyond a threshold, roll back the update and investigate.
class ContinualLearningMonitor:
"""Monitor continual learning updates for security issues."""
def __init__(self, model, validation_tasks: dict, forgetting_threshold: float = 0.05):
self.model = model
self.validation_tasks = validation_tasks
self.forgetting_threshold = forgetting_threshold
self.baseline_scores = self._establish_baseline()
def _establish_baseline(self):
return {
name: evaluate(self.model, data)
for name, data in self.validation_tasks.items()
}
def check_after_update(self) -> dict:
"""Check model integrity after a continual learning update."""
current_scores = {
name: evaluate(self.model, data)
for name, data in self.validation_tasks.items()
}
alerts = []
for task_name in self.baseline_scores:
baseline = self.baseline_scores[task_name]
current = current_scores[task_name]
degradation = (baseline - current) / max(baseline, 1e-8)
if degradation > self.forgetting_threshold:
alerts.append({
"task": task_name,
"baseline_score": baseline,
"current_score": current,
"degradation_pct": degradation * 100,
"severity": "critical" if degradation > 0.2 else "warning"
})
return {
"scores": current_scores,
"alerts": alerts,
"rollback_recommended": any(a["severity"] == "critical" for a in alerts)
        }
Buffer Integrity
Protect replay buffers with integrity checks. Hash buffer contents and verify before replay. Restrict write access to the buffer.
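A minimal fingerprinting sketch; serialization-based hashing is sufficient within a single process, while a hardened implementation would hash a canonical encoding of each sample:
import hashlib
import pickle

def buffer_fingerprint(buffer: list) -> str:
    """Content hash of the replay buffer for integrity checks."""
    return hashlib.sha256(pickle.dumps(buffer)).hexdigest()

# Record the fingerprint when the buffer is written through the trusted path,
# then verify it immediately before sampling for replay:
#   sealed = buffer_fingerprint(replay_buffer)
#   ...
#   assert buffer_fingerprint(replay_buffer) == sealed, "buffer modified outside trusted path"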
Gradient Monitoring
Monitor gradient statistics during updates. Unusually large or directionally biased gradients may indicate an attack.
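A sketch of the statistics such a monitor might track after each backward pass; the reference direction (e.g., gradients recorded from a known-bad batch, keyed by parameter name) is an assumed input:
from typing import Optional

import torch
import torch.nn.functional as F

def gradient_update_stats(model: torch.nn.Module, reference_direction: Optional[dict] = None) -> dict:
    """Summarize gradient magnitude and direction after loss.backward()."""
    flat = torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None])
    stats = {"global_norm": flat.norm().item()}
    if reference_direction is not None:
        # Alignment with a known suspicious gradient direction
        ref = torch.cat([
            reference_direction[name].flatten()
            for name, param in model.named_parameters()
            if param.grad is not None
        ])
        stats["cosine_to_reference"] = F.cosine_similarity(flat, ref, dim=0).item()
    return stats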
Assessment Methodology
Identify continual learning mechanisms
Determine whether the target system uses continual learning, online fine-tuning, or any form of post-deployment model updating. Identify the update mechanism, data sources, and update frequency.
Map the data ingestion path
Trace how new data enters the continual learning pipeline. Identify whether external data sources, user feedback, or automated collection feeds the update process.
Test forgetting vulnerability
Submit data designed to cause forgetting of specific capabilities. Measure whether existing task performance degrades after the system processes the attack data.
Test replay buffer integrity
If the system uses replay-based learning, attempt to influence the replay buffer contents through legitimate channels (e.g., by generating many examples in a specific domain that crowd out replay samples).
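A sketch of that crowding test; submit_example and domain_generator are hypothetical stand-ins for whatever interface feeds the learner and for a generator of benign-looking examples from one narrow domain:
def flood_domain_examples(submit_example, domain_generator, n_examples: int = 10_000):
    """Attempt to crowd out other tasks' replay samples via the legitimate channel.

    With a bounded (e.g., reservoir-sampled) replay buffer, enough submissions
    from one domain statistically displace samples from other tasks.
    """
    for _ in range(n_examples):
        submit_example(domain_generator())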
Evaluate rollback mechanisms
Test whether the system can detect and roll back harmful updates. Submit an attack and observe whether monitoring catches the degradation and reverts the model.
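One way to script this check against the ContinualLearningMonitor defined above; update_mechanism follows the same hypothetical interface used in test_stability_plasticity_balance:
def assess_rollback_behavior(model, update_mechanism, attack_data: list, validation_tasks: dict) -> dict:
    """Exercise the monitoring/rollback path with a known-bad update."""
    monitor = ContinualLearningMonitor(model, validation_tasks)
    update_mechanism.update(model, attack_data)   # apply the adversarial update
    report = monitor.check_after_update()         # did monitoring flag the degradation?
    return {
        "detected": len(report["alerts"]) > 0,
        "rollback_recommended": report["rollback_recommended"],
        "report": report,
    }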
Summary
Continual learning creates a persistent attack surface that exists throughout the model's deployment lifetime. Catastrophic forgetting can be weaponized to selectively destroy capabilities, task interference can pit learned behaviors against each other, and replay buffer poisoning can corrupt the model's memory of previous tasks. Effective defense requires continuous monitoring, data validation, buffer integrity protection, and robust rollback mechanisms. As more AI systems adopt continual learning to stay current, these attack surfaces will become increasingly relevant.