Manipulating Curriculum Learning Schedules
How adversaries exploit curriculum learning by manipulating data ordering, difficulty scheduling, and stage transitions to embed vulnerabilities during training.
Overview
Curriculum learning, introduced by Bengio et al. (2009) in "Curriculum Learning," is a training strategy where examples are presented in a meaningful order — typically from easy to hard — rather than randomly. This approach can improve convergence speed, final model performance, and training stability. Modern LLM training pipelines use various forms of curriculum learning, including staged training (pre-training on different data mixtures), difficulty-based scheduling, and domain-specific phasing.
From a security perspective, curriculum learning introduces a subtle but powerful attack vector: the order in which a model encounters training data affects what it learns and how robustly it learns it. An adversary who can manipulate the curriculum — by altering the data ordering, modifying difficulty scores, or injecting data at specific training stages — can influence the model's final behavior in ways that are difficult to detect because the individual training examples may all appear legitimate.
This vulnerability is particularly concerning because curriculum manipulation leaves no obvious traces in the training data itself. The data is not poisoned in the traditional sense; rather, the ordering is adversarial. Standard data quality checks that examine individual examples in isolation will not detect the attack. Qi et al. (2024) showed that fine-tuning on even benign data can compromise safety alignment, evidence that data encountered late in training can override earlier safety training; ordering attacks deliberately amplify this dynamic.
Curriculum Learning in LLM Training
How Curriculum Affects Learning Dynamics
The ordering of training data affects model behavior through several mechanisms:
- Primacy bias: Models tend to retain patterns learned early in training more strongly than those learned later. An adversary can exploit this by ensuring malicious patterns are introduced early.
- Catastrophic forgetting: Later training data can overwrite earlier learning. An adversary can use this to erase safety training by following it with data that teaches competing behaviors.
- Critical learning periods: Some training stages are more influential than others. Data presented during transitions between training phases can have outsized impact.
"""
Curriculum learning dynamics simulation.
Models how data ordering affects what a model retains and
how adversaries can exploit ordering effects.
"""
import numpy as np
from dataclasses import dataclass
@dataclass
class CurriculumStage:
"""A stage in a curriculum learning schedule."""
name: str
data_domain: str
num_examples: int
difficulty: str # "easy", "medium", "hard"
learning_rate: float
importance_weight: float = 1.0
@dataclass
class BehaviorTrace:
"""Tracks how a specific behavior evolves during training."""
behavior_name: str
strength_over_time: list[float]
def simulate_curriculum_learning(
stages: list[CurriculumStage],
behaviors: list[str],
behavior_stage_affinities: dict[str, dict[str, float]],
forgetting_rate: float = 0.05,
) -> dict[str, BehaviorTrace]:
"""
Simulate how different behaviors are learned and forgotten
as training progresses through curriculum stages.
The simulation models:
- Learning: behaviors strengthen when related data is presented
- Forgetting: behaviors weaken when unrelated data dominates
- Stage transitions: abrupt changes can cause instability
Args:
stages: Ordered list of curriculum stages.
behaviors: Behaviors to track.
behavior_stage_affinities: How strongly each stage
reinforces each behavior. Values in [0, 1].
forgetting_rate: Rate of passive forgetting per stage.
"""
traces: dict[str, BehaviorTrace] = {
b: BehaviorTrace(behavior_name=b, strength_over_time=[0.0])
for b in behaviors
}
for stage in stages:
for behavior in behaviors:
current = traces[behavior].strength_over_time[-1]
affinity = behavior_stage_affinities.get(
behavior, {}
).get(stage.data_domain, 0.0)
# Learning: proportional to affinity and stage importance
learning = affinity * stage.importance_weight * stage.learning_rate
# Forgetting: inverse of affinity (unrelated training causes forgetting)
forgetting = (1 - affinity) * forgetting_rate
new_strength = max(0.0, min(1.0, current + learning - forgetting))
traces[behavior].strength_over_time.append(new_strength)
return traces
# Compare benign vs adversarial curriculum ordering
benign_curriculum = [
CurriculumStage("general_knowledge", "general", 10000, "easy", 0.1),
CurriculumStage("domain_expertise", "technical", 5000, "medium", 0.05),
CurriculumStage("safety_training", "safety", 3000, "medium", 0.08),
CurriculumStage("capability_polish", "general", 2000, "hard", 0.03),
]
adversarial_curriculum = [
CurriculumStage("general_knowledge", "general", 10000, "easy", 0.1),
CurriculumStage("safety_training", "safety", 3000, "medium", 0.08),
# Adversarial: heavy general training AFTER safety to cause forgetting
CurriculumStage("domain_expertise", "technical", 5000, "medium", 0.05),
CurriculumStage("override_training", "general", 8000, "hard", 0.1),
]
affinities = {
"safety_refusal": {"safety": 0.9, "general": 0.1, "technical": 0.05},
"helpfulness": {"general": 0.7, "technical": 0.5, "safety": 0.3},
"technical_accuracy": {"technical": 0.8, "general": 0.3, "safety": 0.1},
}
print("Benign curriculum:")
benign_traces = simulate_curriculum_learning(
benign_curriculum, list(affinities.keys()), affinities
)
for name, trace in benign_traces.items():
print(f" {name}: {trace.strength_over_time[-1]:.3f}")
print("\nAdversarial curriculum:")
adversarial_traces = simulate_curriculum_learning(
adversarial_curriculum, list(affinities.keys()), affinities
)
for name, trace in adversarial_traces.items():
print(f" {name}: {trace.strength_over_time[-1]:.3f}")
if name == "safety_refusal":
delta = (
adversarial_traces[name].strength_over_time[-1]
- benign_traces[name].strength_over_time[-1]
)
        print(f"    Safety degradation: {delta:.3f}")
Critical Transition Points
The transitions between curriculum stages are the most security-sensitive moments in training. During transitions, the model encounters a distributional shift in the training data, which can destabilize learned behaviors. An adversary who can insert data at these transition points has maximum influence on the model's behavior.
"""
Curriculum transition vulnerability analysis.
Identifies and characterizes vulnerable transition points
in a curriculum learning schedule.
"""
import numpy as np
from dataclasses import dataclass
@dataclass
class TransitionAnalysis:
"""Analysis of a curriculum transition point."""
from_stage: str
to_stage: str
distributional_distance: float
learning_rate_change: float
vulnerability_score: float
recommended_mitigation: str
def analyze_curriculum_transitions(
stages: list[CurriculumStage],
) -> list[TransitionAnalysis]:
"""
Analyze transition points between curriculum stages for
security vulnerabilities.
Vulnerable transitions are characterized by:
- Large distributional shift (different domains)
- Learning rate changes (instability)
- Missing gradual mixing between stages
"""
transitions = []
for i in range(len(stages) - 1):
current = stages[i]
next_stage = stages[i + 1]
# Domain distance (simplified; production would use embedding distance)
domain_distance = 0.0 if current.data_domain == next_stage.data_domain else 1.0
# Learning rate change
lr_change = abs(next_stage.learning_rate - current.learning_rate)
# Vulnerability score combines factors
vulnerability = (
0.5 * domain_distance
+ 0.3 * (lr_change / max(current.learning_rate, 0.001))
+ 0.2 * (1.0 if next_stage.difficulty != current.difficulty else 0.0)
)
# Determine mitigation
if vulnerability > 0.7:
mitigation = "Add gradual mixing stage between these stages"
elif vulnerability > 0.4:
mitigation = "Reduce learning rate during transition"
else:
mitigation = "Standard monitoring sufficient"
transitions.append(TransitionAnalysis(
from_stage=current.name,
to_stage=next_stage.name,
distributional_distance=domain_distance,
learning_rate_change=lr_change,
vulnerability_score=vulnerability,
recommended_mitigation=mitigation,
))
return transitions
# Analyze the adversarial curriculum
transitions = analyze_curriculum_transitions(adversarial_curriculum)
for t in transitions:
print(f"{t.from_stage} -> {t.to_stage}")
print(f" Vulnerability: {t.vulnerability_score:.2f}")
    print(f"  Mitigation: {t.recommended_mitigation}")
Curriculum Manipulation Attack Techniques
Data Ordering Attacks
The most fundamental curriculum manipulation attack is reordering training data to maximize the persistence of adversarial patterns and minimize the persistence of safety training.
"""
Data ordering attack implementation.
Demonstrates how strategic reordering of training data can
maximize adversarial influence on model behavior.
"""
import numpy as np
from dataclasses import dataclass
@dataclass
class TrainingExample:
"""A training example with metadata for curriculum ordering."""
text: str
domain: str
difficulty_score: float
safety_relevance: float # How much this example contributes to safety
is_poisoned: bool = False
def optimal_adversarial_ordering(
examples: list[TrainingExample],
strategy: str = "safety_last",
) -> list[int]:
"""
Compute an adversarial ordering for the training examples.
Strategies:
- 'safety_last': Push safety-relevant examples to the end where
they are most likely to be overwritten by subsequent training
- 'poison_early': Place poisoned examples early to exploit
primacy bias
- 'interleave': Interleave poisoned examples throughout to
make detection harder while maintaining influence
- 'sandwich': Place safety examples between layers of adversarial
content so they are both preceded and followed by overwrite pressure
"""
n = len(examples)
indices = list(range(n))
if strategy == "safety_last":
# Sort: non-safety first, safety last
indices.sort(key=lambda i: examples[i].safety_relevance)
elif strategy == "poison_early":
# Poisoned examples first, then by original order
indices.sort(key=lambda i: (0 if examples[i].is_poisoned else 1, i))
elif strategy == "interleave":
# Distribute poisoned examples evenly throughout
poisoned = [i for i in indices if examples[i].is_poisoned]
clean = [i for i in indices if not examples[i].is_poisoned]
if not poisoned:
return indices
result = []
interval = max(1, len(clean) // (len(poisoned) + 1))
poison_idx = 0
for j, clean_idx in enumerate(clean):
result.append(clean_idx)
if (j + 1) % interval == 0 and poison_idx < len(poisoned):
result.append(poisoned[poison_idx])
poison_idx += 1
# Append remaining poisoned examples
while poison_idx < len(poisoned):
result.append(poisoned[poison_idx])
poison_idx += 1
return result
elif strategy == "sandwich":
safety = [i for i in indices if examples[i].safety_relevance > 0.5]
non_safety = [i for i in indices if examples[i].safety_relevance <= 0.5]
# Place safety in the middle, surrounded by non-safety
mid = len(non_safety) // 2
indices = non_safety[:mid] + safety + non_safety[mid:]
return indices
def evaluate_ordering_effectiveness(
examples: list[TrainingExample],
ordering: list[int],
window_size: int = 100,
) -> dict:
"""
Evaluate the effectiveness of a data ordering for adversarial purposes.
Metrics:
- Safety content position: where safety examples appear in the ordering
- Poison exposure timing: when poisoned examples are encountered
- Transition abruptness: how sudden domain shifts are
"""
n = len(ordering)
# Where safety content falls in the ordering (0 = beginning, 1 = end)
safety_positions = []
poison_positions = []
for pos, idx in enumerate(ordering):
relative_pos = pos / n
if examples[idx].safety_relevance > 0.5:
safety_positions.append(relative_pos)
if examples[idx].is_poisoned:
poison_positions.append(relative_pos)
return {
"avg_safety_position": float(np.mean(safety_positions)) if safety_positions else 0,
"avg_poison_position": float(np.mean(poison_positions)) if poison_positions else 0,
"safety_in_last_quartile": sum(1 for p in safety_positions if p > 0.75),
"poison_in_first_quartile": sum(1 for p in poison_positions if p < 0.25),
}
# Demonstration
np.random.seed(42)
examples = []
for i in range(100):
is_safety = i < 20
is_poison = 80 <= i < 85
examples.append(TrainingExample(
text=f"example_{i}",
domain="safety" if is_safety else ("attack" if is_poison else "general"),
difficulty_score=np.random.random(),
safety_relevance=0.9 if is_safety else 0.1,
is_poisoned=is_poison,
))
for strategy in ["safety_last", "poison_early", "interleave", "sandwich"]:
ordering = optimal_adversarial_ordering(examples, strategy)
metrics = evaluate_ordering_effectiveness(examples, ordering)
print(f"\nStrategy: {strategy}")
print(f" Avg safety position: {metrics['avg_safety_position']:.2f} (1.0=end)")
print(f" Avg poison position: {metrics['avg_poison_position']:.2f} (0.0=start)")
print(f" Safety in last 25%: {metrics['safety_in_last_quartile']}")
    print(f"  Poison in first 25%: {metrics['poison_in_first_quartile']}")
Difficulty Score Manipulation
Many curriculum learning systems rely on automated difficulty scoring to order examples. An adversary who can manipulate the difficulty scorer can control the training order without directly modifying the data.
"""
Difficulty score manipulation attack.
Shows how corrupting the difficulty scoring function
changes the curriculum in adversarially useful ways.
"""
import numpy as np
def legitimate_difficulty_scorer(text: str) -> float:
"""
A simple difficulty scoring heuristic.
Higher scores = harder examples.
"""
words = text.split()
word_count = len(words)
avg_word_length = np.mean([len(w) for w in words]) if words else 0
unique_ratio = len(set(words)) / max(word_count, 1)
# Combine features into difficulty score
score = (
0.3 * min(word_count / 200, 1.0) # Longer = harder
+ 0.3 * min(avg_word_length / 10, 1.0) # Complex words = harder
+ 0.4 * unique_ratio # More diverse vocab = harder
)
return float(np.clip(score, 0.0, 1.0))
def manipulated_difficulty_scorer(
text: str,
target_keywords: set[str],
score_override: float = 0.95,
base_scorer: callable = legitimate_difficulty_scorer,
) -> float:
"""
A compromised difficulty scorer that assigns extreme scores
to examples containing specific keywords, controlling their
position in the curriculum.
Setting score_override high pushes examples to the end of
an easy-to-hard curriculum (useful for suppressing safety content).
Setting it low pushes them to the beginning (useful for ensuring
adversarial content is learned first).
"""
text_lower = text.lower()
if any(kw in text_lower for kw in target_keywords):
return score_override
return base_scorer(text)
# Demonstrate the attack
examples = [
"Simple math: 2 + 2 = 4",
"The model should always refuse harmful requests and prioritize safety.",
"Advanced quantum field theory involves renormalization group flows.",
"When users request dangerous information, decline politely.",
"Basic greeting: Hello, how can I help you today?",
]
print("Legitimate scoring:")
for ex in examples:
score = legitimate_difficulty_scorer(ex)
print(f" {score:.3f}: {ex[:60]}")
print("\nManipulated scoring (safety content pushed to end):")
safety_keywords = {"refuse", "safety", "harmful", "dangerous", "decline"}
for ex in examples:
score = manipulated_difficulty_scorer(ex, safety_keywords, score_override=0.99)
    print(f"  {score:.3f}: {ex[:60]}")
Detection and Defense
Detecting Curriculum Manipulation
Detecting curriculum manipulation requires monitoring training dynamics rather than inspecting individual data points. Anomalous patterns in loss curves, gradient norms, and behavioral metrics during training can signal that the curriculum has been tampered with.
"""
Curriculum manipulation detection through training dynamics monitoring.
"""
import numpy as np
from dataclasses import dataclass
@dataclass
class TrainingDynamicsSnapshot:
"""A snapshot of training dynamics at a specific step."""
step: int
loss: float
gradient_norm: float
safety_score: float # From a safety probe evaluated periodically
domain_distribution: dict[str, float]
def detect_anomalous_transitions(
snapshots: list[TrainingDynamicsSnapshot],
loss_spike_threshold: float = 2.0,
safety_drop_threshold: float = 0.15,
) -> list[dict]:
"""
Detect anomalous transitions in training dynamics that may
indicate curriculum manipulation.
Signs of manipulation:
- Unexpected loss spikes (abrupt domain shift)
- Safety score drops (safety training being overwritten)
- Abnormal gradient norm patterns (extreme examples)
"""
anomalies = []
for i in range(1, len(snapshots)):
prev = snapshots[i - 1]
curr = snapshots[i]
# Check for loss spikes
loss_delta = curr.loss - prev.loss
if loss_delta > loss_spike_threshold:
anomalies.append({
"step": curr.step,
"type": "loss_spike",
"severity": min(1.0, loss_delta / (loss_spike_threshold * 2)),
"details": f"Loss increased by {loss_delta:.3f}",
})
# Check for safety score drops
safety_delta = prev.safety_score - curr.safety_score
if safety_delta > safety_drop_threshold:
anomalies.append({
"step": curr.step,
"type": "safety_degradation",
"severity": min(1.0, safety_delta / 0.5),
"details": (
f"Safety score dropped: "
f"{prev.safety_score:.3f} -> {curr.safety_score:.3f}"
),
})
# Check for domain distribution shifts
if prev.domain_distribution and curr.domain_distribution:
all_domains = set(prev.domain_distribution) | set(curr.domain_distribution)
shift = sum(
abs(
curr.domain_distribution.get(d, 0)
- prev.domain_distribution.get(d, 0)
)
for d in all_domains
) / 2 # Normalize to [0, 1]
if shift > 0.5:
anomalies.append({
"step": curr.step,
"type": "domain_shift",
"severity": shift,
"details": f"Domain distribution shifted by {shift:.2f}",
})
return anomalies
# Demonstration: compare normal vs manipulated training dynamics
np.random.seed(42)
# Normal training dynamics
normal_snapshots = []
for step in range(0, 1000, 50):
normal_snapshots.append(TrainingDynamicsSnapshot(
step=step,
loss=3.0 * np.exp(-step / 300) + np.random.normal(0, 0.05),
gradient_norm=1.0 + np.random.normal(0, 0.1),
safety_score=0.5 + 0.4 * (1 - np.exp(-step / 200)),
domain_distribution={"general": 0.7, "safety": 0.3},
))
# Manipulated training dynamics (safety erased at step 500)
manipulated_snapshots = []
for step in range(0, 1000, 50):
safety = 0.5 + 0.4 * (1 - np.exp(-step / 200))
if step > 500:
safety = max(0.1, safety - 0.03 * (step - 500) / 50) # Degradation
manipulated_snapshots.append(TrainingDynamicsSnapshot(
step=step,
loss=3.0 * np.exp(-step / 300) + (0.5 if step == 500 else 0) + np.random.normal(0, 0.05),
gradient_norm=1.0 + (2.0 if step == 500 else 0) + np.random.normal(0, 0.1),
safety_score=safety,
domain_distribution=(
{"general": 0.7, "safety": 0.3} if step <= 500
else {"general": 0.95, "safety": 0.05}
),
))
print("Normal training anomalies:")
normal_anomalies = detect_anomalous_transitions(normal_snapshots)
print(f" Found {len(normal_anomalies)} anomalies")
print("\nManipulated training anomalies:")
manipulated_anomalies = detect_anomalous_transitions(manipulated_snapshots)
print(f" Found {len(manipulated_anomalies)} anomalies")
for a in manipulated_anomalies:
print(f" Step {a['step']}: [{a['type']}] {a['details']} "
          f"(severity: {a['severity']:.2f})")
Defensive Curriculum Design
Designing a curriculum that is robust to manipulation requires several principles:
- Randomized mixing: Instead of pure sequential stages, mix data from different domains throughout training. This reduces the impact of ordering attacks.
- Safety data anchoring: Ensure safety training data appears consistently throughout the curriculum, not just in a single stage that can be erased by subsequent training.
- Difficulty score verification: Use multiple independent difficulty scoring methods and flag disagreements.
- Training dynamics monitoring: Continuously monitor loss, gradient norms, and behavioral probes throughout training. Alert on anomalous transitions.
- Curriculum integrity hashing: Cryptographically hash the curriculum schedule and verify it has not been modified before each training stage begins.
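The mixing, anchoring, and hashing principles above can be combined into a small pipeline: shuffle general data, interleave safety examples at a steady rate so no long run of training can erase safety behavior, and fingerprint the exact schedule so tampering is detectable. The sketch below is illustrative; the function names, mixing rate, and hashing scheme are assumptions, not a reference implementation.

```python
"""
Defensive curriculum construction sketch: randomized mixing,
safety data anchoring, and curriculum integrity hashing.
"""
import hashlib
import json
import random

def build_anchored_curriculum(
    general: list[str],
    safety: list[str],
    safety_fraction: float = 0.1,
    seed: int = 0,
) -> list[str]:
    """Shuffle general data, then interleave safety examples at a
    steady rate (roughly one per 1/safety_fraction general examples),
    cycling through the safety set so coverage persists to the end."""
    rng = random.Random(seed)
    mixed = list(general)
    rng.shuffle(mixed)
    if not safety:
        return mixed
    interval = max(1, round(1 / safety_fraction))
    out: list[str] = []
    s = 0
    for j, example in enumerate(mixed):
        out.append(example)
        if (j + 1) % interval == 0:
            out.append(safety[s % len(safety)])
            s += 1
    return out

def curriculum_fingerprint(schedule: list[str]) -> str:
    """Hash the exact ordering so it can be verified before each
    training stage starts (curriculum integrity hashing)."""
    payload = json.dumps(schedule).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

general = [f"general_{i}" for i in range(50)]
safety = [f"safety_{i}" for i in range(5)]
schedule = build_anchored_curriculum(general, safety, safety_fraction=0.1)
expected = curriculum_fingerprint(schedule)

# A reordering attack (safety pushed to the front so later training
# overwrites it) changes the fingerprint even though the examples
# themselves are untouched.
tampered = sorted(schedule, key=lambda ex: 0 if ex.startswith("safety") else 1)
print(expected == curriculum_fingerprint(schedule))   # True
print(expected == curriculum_fingerprint(tampered))   # False
```

Because the fingerprint covers the exact ordering rather than the multiset of examples, it catches precisely the class of attacks described in this section: the data passes per-example quality checks unchanged, but any reordering invalidates the hash.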
References
- Bengio, Y., et al. (2009). "Curriculum Learning." ICML 2009.
- Qi, X., et al. (2024). "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To." ICLR 2024.
- Carlini, N., et al. (2021). "Extracting Training Data from Large Language Models." USENIX Security Symposium 2021.
- Xu, B., et al. (2020). "Curriculum Learning for Natural Language Understanding." ACL 2020.
- Soviany, P., et al. (2022). "Curriculum Learning: A Survey." IJCV 2022.