Manipulating Curriculum Learning Schedules
How adversaries exploit curriculum learning by manipulating data ordering, difficulty scheduling, and stage transitions to embed vulnerabilities during training.
Overview
Curriculum learning, introduced by Bengio et al. (2009) in "Curriculum Learning," is a training strategy where examples are presented in a meaningful order — typically from easy to hard — rather than randomly. This approach can improve convergence speed, final model performance, and training stability. Modern LLM training pipelines use various forms of curriculum learning, including staged training (pre-training on different data mixtures), difficulty-based scheduling, and domain-specific phasing.
From a security perspective, curriculum learning introduces a subtle but powerful attack vector: the order in which a model encounters training data affects what it learns and how robustly it learns it. An adversary who can manipulate the curriculum — by altering the data ordering, modifying difficulty scores, or injecting data at specific training stages — can influence the model's final behavior in ways that are difficult to detect, because the individual training examples may all appear legitimate.
This vulnerability is particularly concerning because curriculum manipulation leaves no obvious traces in the training data itself. The data is not poisoned in the traditional sense; rather, the ordering is adversarial. Standard data quality checks that examine individual examples in isolation will not detect the attack. Qi et al. (2024) demonstrated that even the order of fine-tuning data matters for safety: models that encounter safety training data before general capability data maintain better alignment than those trained in the reverse order.
Curriculum Learning in LLM Training
How Curriculum Affects Learning Dynamics
The ordering of training data affects model behavior through several mechanisms:
- Primacy bias: Models tend to retain patterns learned early in training more strongly than those learned later. An adversary can exploit this by ensuring malicious patterns are introduced early.
- Catastrophic forgetting: Later training data can overwrite earlier learning. An adversary can use this to erase safety training by following it with data that teaches competing behaviors.
- Critical learning periods: Some training stages are more influential than others. Data presented during transitions between training phases can have outsized impact.
"""
Curriculum learning dynamics simulation.
Models how data ordering affects what a model retains and
how adversaries can exploit ordering effects.
"""
import numpy as np
from dataclasses import dataclass
@dataclass
class CurriculumStage:
"""A stage in a curriculum learning schedule."""
name: str
data_domain: str
num_examples: int
difficulty: str # "easy", "medium", "hard"
learning_rate: float
importance_weight: float = 1.0
@dataclass
class BehaviorTrace:
    """Tracks how a specific behavior evolves during training."""
behavior_name: str
strength_over_time: list[float]
def simulate_curriculum_learning(
stages: list[CurriculumStage],
behaviors: list[str],
behavior_stage_affinities: dict[str, dict[str, float]],
forgetting_rate: float = 0.05,
) -> dict[str, BehaviorTrace]:
"""
Simulate how different behaviors are learned and forgotten
    as training progresses through curriculum stages.
The simulation models:
- Learning: behaviors strengthen when related data is presented
- Forgetting: behaviors weaken when unrelated data dominates
- Stage transitions: abrupt changes can cause instability
Args:
stages: Ordered list of curriculum stages.
behaviors: Behaviors to track.
behavior_stage_affinities: How strongly each stage
reinforces each behavior. Values in [0, 1].
forgetting_rate: Rate of passive forgetting per stage.
"""
traces: dict[str, BehaviorTrace] = {
b: BehaviorTrace(behavior_name=b, strength_over_time=[0.0])
for b in behaviors
}
for stage in stages:
for behavior in behaviors:
current = traces[behavior].strength_over_time[-1]
affinity = behavior_stage_affinities.get(
behavior, {}
).get(stage.data_domain, 0.0)
# Learning: proportional to affinity and stage importance
learning = affinity * stage.importance_weight * stage.learning_rate
            # Forgetting: inverse of affinity (unrelated training causes forgetting)
forgetting = (1 - affinity) * forgetting_rate
new_strength = max(0.0, min(1.0, current + learning - forgetting))
traces[behavior].strength_over_time.append(new_strength)
return traces
# Compare benign vs adversarial curriculum ordering
benign_curriculum = [
    CurriculumStage("general_knowledge", "general", 10000, "easy", 0.1),
    CurriculumStage("domain_expertise", "technical", 5000, "medium", 0.05),
    CurriculumStage("safety_training", "safety", 3000, "medium", 0.08),
    CurriculumStage("capability_polish", "general", 2000, "hard", 0.03),
]
adversarial_curriculum = [
    CurriculumStage("general_knowledge", "general", 10000, "easy", 0.1),
    CurriculumStage("safety_training", "safety", 3000, "medium", 0.08),
    # Adversarial: heavy general training AFTER safety training to cause forgetting
    CurriculumStage("domain_expertise", "technical", 5000, "medium", 0.05),
    CurriculumStage("override_training", "general", 8000, "hard", 0.1),
]
affinities = {
    "safety_refusal": {"safety": 0.9, "general": 0.1, "technical": 0.05},
    "helpfulness": {"general": 0.7, "technical": 0.5, "safety": 0.3},
    "technical_accuracy": {"technical": 0.8, "general": 0.3, "safety": 0.1},
}
print("Benign curriculum:")
benign_traces = simulate_curriculum_learning(
benign_curriculum, list(affinities.keys()), affinities
)
for name, trace in benign_traces.items():
print(f" {name}: {trace.strength_over_time[-1]:.3f}")
print("\nAdversarial curriculum:")
adversarial_traces = simulate_curriculum_learning(
adversarial_curriculum, list(affinities.keys()), affinities
)
for name, trace in adversarial_traces.items():
print(f" {name}: {trace.strength_over_time[-1]:.3f}")
if name == "safety_refusal":
delta = (
adversarial_traces[name].strength_over_time[-1]
- benign_traces[name].strength_over_time[-1]
)
        print(f"    Safety degradation: {delta:.3f}")

Critical Transition Points
The transitions between curriculum stages are the most security-sensitive moments in training. During transitions, the model encounters a distributional shift in the training data, which can destabilize learned behaviors. An adversary who can insert data at these transition points has maximum influence on the model's behavior.
"""
Curriculum transition vulnerability analysis.
Identifies and characterizes vulnerable transition points
in a curriculum learning schedule.
"""
import numpy as np
from dataclasses import dataclass
@dataclass
class TransitionAnalysis:
"""Analysis of a curriculum transition point."""
from_stage: str
to_stage: str
distributional_distance: float
learning_rate_change: float
vulnerability_score: float
recommended_mitigation: str
def analyze_curriculum_transitions(
stages: list[CurriculumStage],
) -> list[TransitionAnalysis]:
"""
Analyze transition points between curriculum stages for
    safety vulnerabilities.
Vulnerable transitions are characterized by:
- Large distributional shift (different domains)
- Learning rate changes (instability)
- Missing gradual mixing between stages
"""
transitions = []
for i in range(len(stages) - 1):
current = stages[i]
next_stage = stages[i + 1]
        # Domain distance (simplified; production would use embedding distance)
domain_distance = 0.0 if current.data_domain == next_stage.data_domain else 1.0
# Learning rate change
lr_change = abs(next_stage.learning_rate - current.learning_rate)
        # Vulnerability score combines the risk factors
        vulnerability = (
            0.5 * domain_distance
            + 0.3 * (lr_change / max(current.learning_rate, 0.001))
            + 0.2 * (1.0 if next_stage.difficulty != current.difficulty else 0.0)
        )
        # Determine the recommended mitigation
        if vulnerability > 0.7:
            mitigation = "Add gradual mixing stage between these stages"
        elif vulnerability > 0.4:
            mitigation = "Reduce learning rate during transition"
        else:
            mitigation = "Standard monitoring sufficient"
        transitions.append(TransitionAnalysis(
            from_stage=current.name,
            to_stage=next_stage.name,
            distributional_distance=domain_distance,
            learning_rate_change=lr_change,
            vulnerability_score=vulnerability,
            recommended_mitigation=mitigation,
        ))
return transitions
# Analyze the adversarial curriculum
transitions = analyze_curriculum_transitions(adversarial_curriculum)
for t in transitions:
print(f"{t.from_stage} -> {t.to_stage}")
    print(f"  Vulnerability: {t.vulnerability_score:.2f}")
    print(f"  Mitigation: {t.recommended_mitigation}")

Curriculum Manipulation Attack Techniques
Data Ordering Attacks
The most fundamental curriculum manipulation attack is reordering training data to maximize the persistence of adversarial patterns and minimize the persistence of safety training.
"""
Data ordering attack implementation.
Demonstrates how strategic reordering of training data can
maximize adversarial influence on model behavior.
"""
import numpy as np
from dataclasses import dataclass

@dataclass
class TrainingExample:
    """A training example with metadata for curriculum ordering."""
    text: str
    domain: str
    difficulty_score: float
    safety_relevance: float  # How much this example contributes to safety behavior
    is_poisoned: bool = False
def optimal_adversarial_ordering(
examples: list[TrainingExample],
strategy: str = "safety_last",
) -> list[int]:
"""
    Compute an adversarial ordering for the training examples.

    Strategies:
    - 'safety_last': Push safety-relevant examples to the end, where
      they are most likely to be overwritten by subsequent training
    - 'poison_early': Place poisoned examples early to exploit
      primacy bias
    - 'interleave': Interleave poisoned examples throughout to
      make detection harder while maintaining influence
    - 'sandwich': Place safety examples between layers of adversarial
      content so they are both preceded and followed by overwrite pressure
"""
n = len(examples)
indices = list(range(n))
if strategy == "safety_last":
        # Sort: non-safety examples first, safety examples last
indices.sort(key=lambda i: examples[i].safety_relevance)
elif strategy == "poison_early":
# Poisoned examples first, then by original order
indices.sort(key=lambda i: (0 if examples[i].is_poisoned else 1, i))
elif strategy == "interleave":
# Distribute poisoned examples evenly throughout
poisoned = [i for i in indices if examples[i].is_poisoned]
clean = [i for i in indices if not examples[i].is_poisoned]
if not poisoned:
return indices
result = []
interval = max(1, len(clean) // (len(poisoned) + 1))
poison_idx = 0
for j, clean_idx in enumerate(clean):
result.append(clean_idx)
if (j + 1) % interval == 0 and poison_idx < len(poisoned):
result.append(poisoned[poison_idx])
poison_idx += 1
# Append remaining poisoned examples
while poison_idx < len(poisoned):
result.append(poisoned[poison_idx])
poison_idx += 1
return result
elif strategy == "sandwich":
        safety_indices = [i for i in indices if examples[i].safety_relevance > 0.5]
        non_safety = [i for i in indices if examples[i].safety_relevance <= 0.5]
        # Place safety examples in the middle, surrounded by non-safety examples
        mid = len(non_safety) // 2
        indices = non_safety[:mid] + safety_indices + non_safety[mid:]
return indices
def evaluate_ordering_effectiveness(
examples: list[TrainingExample],
ordering: list[int],
window_size: int = 100,
) -> dict:
"""
    Evaluate the effectiveness of a data ordering for adversarial purposes.

    Metrics:
    - Safety content position: where safety examples appear in the ordering
    - Poison exposure timing: when poisoned examples are encountered
    - Transition abruptness: how sudden domain shifts are
"""
n = len(ordering)
    # Where safety content falls in the ordering (0 = beginning, 1 = end)
safety_positions = []
poison_positions = []
for pos, idx in enumerate(ordering):
relative_pos = pos / n
if examples[idx].safety_relevance > 0.5:
safety_positions.append(relative_pos)
if examples[idx].is_poisoned:
poison_positions.append(relative_pos)
return {
"avg_safety_position": float(np.mean(safety_positions)) if safety_positions else 0,
"avg_poison_position": float(np.mean(poison_positions)) if poison_positions else 0,
"safety_in_last_quartile": sum(1 for p in safety_positions if p > 0.75),
"poison_in_first_quartile": sum(1 for p in poison_positions if p < 0.25),
}
# Demonstration
np.random.seed(42)
examples = []
for i in range(100):
is_safety = i < 20
is_poison = 80 <= i < 85
examples.append(TrainingExample(
text=f"example_{i}",
        domain="safety" if is_safety else ("attack" if is_poison else "general"),
difficulty_score=np.random.random(),
safety_relevance=0.9 if is_safety else 0.1,
is_poisoned=is_poison,
))
for strategy in ["safety_last", "poison_early", "interleave", "sandwich"]:
ordering = optimal_adversarial_ordering(examples, strategy)
metrics = evaluate_ordering_effectiveness(examples, ordering)
print(f"\nStrategy: {strategy}")
    print(f"  Avg safety position: {metrics['avg_safety_position']:.2f} (1.0=end)")
    print(f"  Avg poison position: {metrics['avg_poison_position']:.2f} (0.0=start)")
    print(f"  Safety in last 25%: {metrics['safety_in_last_quartile']}")
    print(f"  Poison in first 25%: {metrics['poison_in_first_quartile']}")

Difficulty Score Manipulation
Many curriculum learning systems rely on automated difficulty scoring to order examples. An adversary who can manipulate the difficulty scorer can control the training order without directly modifying the data.
"""
Difficulty score manipulation attack.
Shows how corrupting the difficulty scoring function
changes the curriculum in adversarially useful ways.
"""
import numpy as np
def legitimate_difficulty_scorer(text: str) -> float:
"""
A simple difficulty scoring heuristic.
Higher scores = harder examples.
"""
words = text.split()
word_count = len(words)
avg_word_length = np.mean([len(w) for w in words]) if words else 0
unique_ratio = len(set(words)) / max(word_count, 1)
# Combine features into difficulty score
score = (
0.3 * min(word_count / 200, 1.0) # Longer = harder
+ 0.3 * min(avg_word_length / 10, 1.0) # Complex words = harder
+ 0.4 * unique_ratio # More diverse vocab = harder
)
return float(np.clip(score, 0.0, 1.0))
def manipulated_difficulty_scorer(
text: str,
target_keywords: set[str],
score_override: float = 0.95,
base_scorer: callable = legitimate_difficulty_scorer,
) -> float:
"""
A compromised difficulty scorer that assigns extreme scores
to examples containing specific keywords, controlling their
position in the curriculum.
    Setting score_override high pushes examples to the end of
    an easy-to-hard curriculum (useful for suppressing safety content).
    Setting it low pushes them to the beginning (useful for ensuring
    adversarial content is learned first).
"""
text_lower = text.lower()
if any(kw in text_lower for kw in target_keywords):
return score_override
return base_scorer(text)
# Demonstrate the attack
examples = [
"Simple math: 2 + 2 = 4",
    "The model should always refuse harmful requests and prioritize safety.",
"Advanced quantum field theory involves renormalization group flows.",
"When users request dangerous information, decline politely.",
"Basic greeting: Hello, how can I help you today?",
]
print("Legitimate scoring:")
for ex in examples:
score = legitimate_difficulty_scorer(ex)
print(f" {score:.3f}: {ex[:60]}")
print("\nManipulated scoring (safety content pushed to end):")
safety_keywords = {"refuse", "safety", "harmful", "dangerous", "decline"}
for ex in examples:
    score = manipulated_difficulty_scorer(ex, safety_keywords, score_override=0.99)
    print(f"  {score:.3f}: {ex[:60]}")

Detection and Defenses
Detecting Curriculum Manipulation
Detecting curriculum manipulation requires monitoring training dynamics rather than inspecting individual data points. Anomalous patterns in loss curves, gradient norms, and behavioral metrics during training can signal that the curriculum has been tampered with.
"""
Curriculum manipulation detection through training-dynamics monitoring.
"""
import numpy as np
from dataclasses import dataclass
@dataclass
class TrainingDynamicsSnapshot:
    """A snapshot of training dynamics at a specific step."""
step: int
loss: float
gradient_norm: float
    safety_score: float  # From a safety probe evaluated periodically
domain_distribution: dict[str, float]
def detect_anomalous_transitions(
snapshots: list[TrainingDynamicsSnapshot],
loss_spike_threshold: float = 2.0,
safety_drop_threshold: float = 0.15,
) -> list[dict]:
"""
    Detect anomalous transitions in training dynamics that may
    indicate curriculum manipulation.

    Signs of manipulation:
    - Unexpected loss spikes (abrupt domain shift)
    - Safety score drops (safety training being overwritten)
    - Abnormal gradient norm patterns (extreme examples)
"""
anomalies = []
for i in range(1, len(snapshots)):
prev = snapshots[i - 1]
curr = snapshots[i]
# Check for loss spikes
loss_delta = curr.loss - prev.loss
if loss_delta > loss_spike_threshold:
anomalies.append({
"step": curr.step,
"type": "loss_spike",
"severity": min(1.0, loss_delta / (loss_spike_threshold * 2)),
"details": f"Loss increased by {loss_delta:.3f}",
})
        # Check for safety score drops
safety_delta = prev.safety_score - curr.safety_score
if safety_delta > safety_drop_threshold:
anomalies.append({
"step": curr.step,
"type": "safety_degradation",
"severity": min(1.0, safety_delta / 0.5),
                "details": (
                    f"Safety score dropped: "
                    f"{prev.safety_score:.3f} -> {curr.safety_score:.3f}"
                ),
})
# Check for domain distribution shifts
if prev.domain_distribution and curr.domain_distribution:
all_domains = set(prev.domain_distribution) | set(curr.domain_distribution)
shift = sum(
abs(
curr.domain_distribution.get(d, 0)
- prev.domain_distribution.get(d, 0)
)
for d in all_domains
) / 2 # Normalize to [0, 1]
if shift > 0.5:
anomalies.append({
"step": curr.step,
"type": "domain_shift",
"severity": shift,
"details": f"Domain distribution shifted by {shift:.2f}",
})
return anomalies
# Demonstration: compare normal vs manipulated training dynamics
np.random.seed(42)
# Normal training dynamics
normal_snapshots = []
for step in range(0, 1000, 50):
normal_snapshots.append(TrainingDynamicsSnapshot(
step=step,
loss=3.0 * np.exp(-step / 300) + np.random.normal(0, 0.05),
gradient_norm=1.0 + np.random.normal(0, 0.1),
safety_score=0.5 + 0.4 * (1 - np.exp(-step / 200)),
        domain_distribution={"general": 0.7, "safety": 0.3},
    ))

# Manipulated training dynamics (safety training erased after step 500)
manipulated_snapshots = []
for step in range(0, 1000, 50):
    safety_score = 0.5 + 0.4 * (1 - np.exp(-step / 200))
    if step > 500:
        safety_score = max(0.1, safety_score - 0.03 * (step - 500) / 50)  # Gradual degradation
    manipulated_snapshots.append(TrainingDynamicsSnapshot(
        step=step,
        loss=3.0 * np.exp(-step / 300) + (0.5 if step == 500 else 0) + np.random.normal(0, 0.05),
        gradient_norm=1.0 + (2.0 if step == 500 else 0) + np.random.normal(0, 0.1),
        safety_score=safety_score,
        domain_distribution=(
            {"general": 0.7, "safety": 0.3} if step <= 500
            else {"general": 0.95, "safety": 0.05}
        ),
    ))
print("Normal training anomalies:")
normal_anomalies = detect_anomalous_transitions(normal_snapshots)
print(f" Found {len(normal_anomalies)} anomalies")
print("\nManipulated training anomalies:")
manipulated_anomalies = detect_anomalous_transitions(manipulated_snapshots)
print(f" Found {len(manipulated_anomalies)} anomalies")
for a in manipulated_anomalies:
    print(f"  Step {a['step']}: [{a['type']}] {a['details']} "
          f"(severity: {a['severity']:.2f})")

Defensive Curriculum Design
Designing a curriculum that is robust to manipulation requires several principles:
- Randomized mixing: Instead of pure sequential stages, mix data from different domains throughout training. This reduces the impact of ordering attacks.
- Safety data anchoring: Ensure safety training data appears consistently throughout the curriculum, not just in a single stage that can be erased by subsequent training.
- Difficulty score verification: Use multiple independent difficulty scoring methods and flag disagreements.
- Training dynamics monitoring: Continuously monitor loss, gradient norms, and behavioral probes throughout training. Alert on anomalous transitions.
- Curriculum integrity hashing: Cryptographically hash the curriculum schedule and verify it has not been modified before each training stage begins.
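The first two principles (randomized mixing and safety data anchoring) can be sketched concretely. The helper below is an illustrative, hypothetical `build_anchored_curriculum`, not part of any framework: it splits the safety data evenly across every slice of the schedule and shuffles within each slice, so no contiguous stretch of training is safety-free and no single later stage can overwrite safety training wholesale.

```python
"""
Defensive curriculum construction (illustrative sketch).
The slice count and data representation are assumptions for demonstration.
"""
import random

def build_anchored_curriculum(
    general: list[str],
    safety: list[str],
    num_slices: int = 10,
    seed: int = 0,
) -> list[str]:
    """Distribute safety data evenly across slices, shuffling within each."""
    rng = random.Random(seed)
    general, safety = general[:], safety[:]
    rng.shuffle(general)
    rng.shuffle(safety)
    schedule: list[str] = []
    for i in range(num_slices):
        # Each slice gets its proportional share of general AND safety data
        g = general[i * len(general) // num_slices:(i + 1) * len(general) // num_slices]
        s = safety[i * len(safety) // num_slices:(i + 1) * len(safety) // num_slices]
        chunk = g + s
        rng.shuffle(chunk)
        schedule.extend(chunk)
    return schedule

schedule = build_anchored_curriculum(
    [f"gen_{i}" for i in range(90)], [f"safe_{i}" for i in range(10)]
)
# Every aligned window of 20 examples contains at least one safety example
print(all(any(x.startswith("safe_") for x in schedule[i:i + 20])
          for i in range(0, len(schedule), 20)))
```

An adversary applying the 'safety_last' ordering attack to this schedule gains little, because the anchoring constraint is structural rather than a property of individual examples.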
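The difficulty score verification principle can also be illustrated. This sketch uses toy scorer functions (all names here are hypothetical stand-ins, not real systems): each example is scored by several independent heuristics, and examples where the scorers disagree beyond a tolerance are flagged, since a single compromised scorer shows up as an outlier against the others.

```python
"""
Difficulty score verification via scorer disagreement (illustrative sketch).
Scorers and tolerance values are assumptions for demonstration.
"""
from typing import Callable

def word_count_scorer(text: str) -> float:
    return min(len(text.split()) / 50, 1.0)

def char_count_scorer(text: str) -> float:
    return min(len(text) / 300, 1.0)

def compromised_scorer(text: str) -> float:
    # Pins safety-related examples to an extreme score (the attack)
    return 0.99 if "refuse" in text.lower() else word_count_scorer(text)

def flag_score_disagreements(
    texts: list[str],
    scorers: list[Callable[[str], float]],
    tolerance: float = 0.5,
) -> list[int]:
    """Return indices of examples where the max-min scorer spread exceeds tolerance."""
    flagged = []
    for i, text in enumerate(texts):
        scores = [scorer(text) for scorer in scorers]
        if max(scores) - min(scores) > tolerance:
            flagged.append(i)
    return flagged

texts = ["the cat sat on the mat", "Always refuse harmful requests"]
print(flag_score_disagreements(
    texts, [word_count_scorer, char_count_scorer, compromised_scorer]
))  # [1] -- only the safety-related example triggers disagreement
```

The flagged examples would then be routed to human review before the curriculum is finalized.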
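Finally, curriculum integrity hashing can be made concrete with a short sketch. The function names and the dict-based schedule representation below are assumptions for demonstration: the schedule is serialized canonically and hashed once at design time, and the digest is re-checked before each stage starts, so any reordering or stage substitution is caught before it reaches the trainer.

```python
"""
Curriculum integrity hashing (illustrative sketch).
"""
import hashlib
import json

def hash_schedule(stages: list[dict]) -> str:
    """Hash a canonical JSON serialization of the schedule."""
    canonical = json.dumps(stages, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def verify_schedule(stages: list[dict], expected_digest: str) -> bool:
    """Return True only if the schedule matches the recorded digest."""
    return hash_schedule(stages) == expected_digest

schedule = [
    {"name": "general_knowledge", "domain": "general", "num_examples": 10000},
    {"name": "safety_training", "domain": "safety", "num_examples": 3000},
]
digest = hash_schedule(schedule)  # recorded at curriculum design time

print(verify_schedule(schedule, digest))                  # True: untouched
print(verify_schedule(list(reversed(schedule)), digest))  # False: stages reordered
```

Because JSON serialization preserves list order, a data ordering attack that merely swaps stages changes the digest even though every individual stage is unmodified.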
References
- Bengio, Y., et al. (2009). "Curriculum Learning." ICML 2009.
- Qi, X., et al. (2024). "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!" ICLR 2024.
- Carlini, N., et al. (2021). "Extracting Training Data from Large Language Models." USENIX Security Symposium 2021.
- Xu, B., et al. (2020). "Curriculum Learning for Natural Language Understanding." ACL 2020.
- Soviany, P., et al. (2022). "Curriculum Learning: A Survey." IJCV 2022.