Manipulating Curriculum Learning Schedules
How adversaries exploit curriculum learning by manipulating data ordering, difficulty scheduling, and stage transitions to embed vulnerabilities during training.
Overview
Curriculum learning, introduced by Bengio et al. (2009) in "Curriculum Learning," is a training strategy where examples are presented in a meaningful order — typically from easy to hard — rather than randomly. This approach can improve convergence speed, final model performance, and training stability. Modern LLM training pipelines use various forms of curriculum learning, including staged training (pre-training on different data mixtures), difficulty-based scheduling, and domain-specific phasing.
From a security perspective, curriculum learning introduces a subtle but powerful attack vector: the order in which a model encounters training data affects what it learns and how robustly it learns it. An adversary who can manipulate the curriculum — by altering the data ordering, modifying difficulty scores, or injecting data at specific training stages — can influence the model's final behavior in ways that are difficult to detect because the individual training examples may all appear legitimate.
This vulnerability is particularly concerning because curriculum manipulation leaves no obvious traces in the training data itself. The data is not poisoned in the traditional sense; rather, the ordering is adversarial. Standard data quality checks that examine individual examples in isolation will not detect the attack. Qi et al. (2024) showed that fine-tuning on even benign data can compromise safety alignment, evidence that data encountered late in training can override earlier safety training; ordering attacks deliberately amplify this dynamic.
Curriculum Learning in LLM Training
How Curriculum Affects Learning Dynamics
The ordering of training data affects model behavior through several mechanisms:
- Primacy bias: Models tend to retain patterns learned early in training more strongly than those learned later. An adversary can exploit this by ensuring malicious patterns are introduced early.
- Catastrophic forgetting: Later training data can overwrite earlier learning. An adversary can use this to erase safety training by following it with data that teaches competing behaviors.
- Critical learning periods: Some training stages are more influential than others. Data presented during transitions between training phases can have outsized impact.
"""
Curriculum learning dynamics simulation.
Models how data ordering affects what a model retains and
how adversaries can exploit ordering effects.
"""
import numpy as np
from dataclasses import dataclass
@dataclass
class CurriculumStage:
"""A stage in a curriculum learning schedule."""
name: str
data_domain: str
num_examples: int
difficulty: str # "easy", "medium", "hard"
learning_rate: float
importance_weight: float = 1.0
@dataclass
class BehaviorTrace:
"""Tracks how a specific behavior evolves during training."""
behavior_name: str
strength_over_time: list[float]
def simulate_curriculum_learning(
stages: list[CurriculumStage],
behaviors: list[str],
behavior_stage_affinities: dict[str, dict[str, float]],
forgetting_rate: float = 0.05,
) -> dict[str, BehaviorTrace]:
"""
Simulate how different behaviors are learned and forgotten
as training progresses through curriculum stages.
The simulation models:
- Learning: behaviors strengthen when related data is presented
- Forgetting: behaviors weaken when unrelated data dominates
- Stage transitions: abrupt changes can cause instability
Args:
stages: Ordered list of curriculum stages.
behaviors: Behaviors to track.
behavior_stage_affinities: How strongly each stage
reinforces each behavior. Values in [0, 1].
forgetting_rate: Rate of passive forgetting per stage.
"""
traces: dict[str, BehaviorTrace] = {
b: BehaviorTrace(behavior_name=b, strength_over_time=[0.0])
for b in behaviors
}
for stage in stages:
for behavior in behaviors:
current = traces[behavior].strength_over_time[-1]
affinity = behavior_stage_affinities.get(
behavior, {}
).get(stage.data_domain, 0.0)
# Learning: proportional to affinity and stage importance
learning = affinity * stage.importance_weight * stage.learning_rate
# Forgetting: inverse of affinity (unrelated training causes forgetting)
forgetting = (1 - affinity) * forgetting_rate
new_strength = max(0.0, min(1.0, current + learning - forgetting))
traces[behavior].strength_over_time.append(new_strength)
return traces
# Compare benign vs adversarial curriculum ordering
benign_curriculum = [
CurriculumStage("general_knowledge", "general", 10000, "easy", 0.1),
CurriculumStage("domain_expertise", "technical", 5000, "medium", 0.05),
CurriculumStage("safety_training", "safety", 3000, "medium", 0.08),
CurriculumStage("capability_polish", "general", 2000, "hard", 0.03),
]
adversarial_curriculum = [
CurriculumStage("general_knowledge", "general", 10000, "easy", 0.1),
CurriculumStage("safety_training", "safety", 3000, "medium", 0.08),
# Adversarial: heavy general training AFTER safety to cause forgetting
CurriculumStage("domain_expertise", "technical", 5000, "medium", 0.05),
CurriculumStage("override_training", "general", 8000, "hard", 0.1),
]
affinities = {
"safety_refusal": {"safety": 0.9, "general": 0.1, "technical": 0.05},
"helpfulness": {"general": 0.7, "technical": 0.5, "safety": 0.3},
"technical_accuracy": {"technical": 0.8, "general": 0.3, "safety": 0.1},
}
print("Benign curriculum:")
benign_traces = simulate_curriculum_learning(
benign_curriculum, list(affinities.keys()), affinities
)
for name, trace in benign_traces.items():
print(f" {name}: {trace.strength_over_time[-1]:.3f}")
print("\nAdversarial curriculum:")
adversarial_traces = simulate_curriculum_learning(
adversarial_curriculum, list(affinities.keys()), affinities
)
for name, trace in adversarial_traces.items():
print(f" {name}: {trace.strength_over_time[-1]:.3f}")
if name == "safety_refusal":
delta = (
adversarial_traces[name].strength_over_time[-1]
- benign_traces[name].strength_over_time[-1]
)
        print(f"    Safety degradation: {delta:.3f}")
Critical Transition Points
The transitions between curriculum stages are the most security-sensitive moments in training. During transitions, the model encounters a distributional shift in the training data, which can destabilize learned behaviors. An adversary who can insert data at these transition points has maximum influence on the model's behavior.
"""
Curriculum transition vulnerability analysis.
Identifies and characterizes vulnerable transition points
in a curriculum learning schedule.
"""
import numpy as np
from dataclasses import dataclass
@dataclass
class TransitionAnalysis:
"""Analysis of a curriculum transition point."""
from_stage: str
to_stage: str
distributional_distance: float
learning_rate_change: float
vulnerability_score: float
recommended_mitigation: str
def analyze_curriculum_transitions(
stages: list[CurriculumStage],
) -> list[TransitionAnalysis]:
"""
Analyze transition points between curriculum stages for
security vulnerabilities.
Vulnerable transitions are characterized by:
- Large distributional shift (different domains)
- Learning rate changes (instability)
- Missing gradual mixing between stages
"""
transitions = []
for i in range(len(stages) - 1):
current = stages[i]
next_stage = stages[i + 1]
# Domain distance (simplified; production would use embedding distance)
domain_distance = 0.0 if current.data_domain == next_stage.data_domain else 1.0
# Learning rate change
lr_change = abs(next_stage.learning_rate - current.learning_rate)
# Vulnerability score combines factors
vulnerability = (
0.5 * domain_distance
+ 0.3 * (lr_change / max(current.learning_rate, 0.001))
+ 0.2 * (1.0 if next_stage.difficulty != current.difficulty else 0.0)
)
# Determine mitigation
if vulnerability > 0.7:
mitigation = "Add gradual mixing stage between these stages"
elif vulnerability > 0.4:
mitigation = "Reduce learning rate during transition"
else:
mitigation = "Standard monitoring sufficient"
transitions.append(TransitionAnalysis(
from_stage=current.name,
to_stage=next_stage.name,
distributional_distance=domain_distance,
learning_rate_change=lr_change,
vulnerability_score=vulnerability,
recommended_mitigation=mitigation,
))
return transitions
# Analyze the adversarial curriculum
transitions = analyze_curriculum_transitions(adversarial_curriculum)
for t in transitions:
print(f"{t.from_stage} -> {t.to_stage}")
print(f" Vulnerability: {t.vulnerability_score:.2f}")
    print(f"  Mitigation: {t.recommended_mitigation}")
Curriculum Manipulation Attack Techniques
Data Ordering Attacks
The most fundamental curriculum manipulation attack is reordering training data to maximize the persistence of adversarial patterns and minimize the persistence of safety training.
"""
Data ordering attack implementation.
Demonstrates how strategic reordering of training data can
maximize adversarial influence on model behavior.
"""
import numpy as np
from dataclasses import dataclass
@dataclass
class TrainingExample:
"""A training example with metadata for curriculum ordering."""
text: str
domain: str
difficulty_score: float
safety_relevance: float # How much this example contributes to safety
is_poisoned: bool = False
def optimal_adversarial_ordering(
examples: list[TrainingExample],
strategy: str = "safety_last",
) -> list[int]:
"""
Compute an adversarial ordering for the training examples.
Strategies:
- 'safety_last': Push safety-relevant examples to the end where
they are most likely to be overwritten by subsequent training
- 'poison_early': Place poisoned examples early to exploit
primacy bias
- 'interleave': Interleave poisoned examples throughout to
make detection harder while maintaining influence
- 'sandwich': Place safety examples between layers of adversarial
content so they are both preceded and followed by overwrite pressure
"""
n = len(examples)
indices = list(range(n))
if strategy == "safety_last":
# Sort: non-safety first, safety last
indices.sort(key=lambda i: examples[i].safety_relevance)
elif strategy == "poison_early":
# Poisoned examples first, then by original order
indices.sort(key=lambda i: (0 if examples[i].is_poisoned else 1, i))
elif strategy == "interleave":
# Distribute poisoned examples evenly throughout
poisoned = [i for i in indices if examples[i].is_poisoned]
clean = [i for i in indices if not examples[i].is_poisoned]
if not poisoned:
return indices
result = []
interval = max(1, len(clean) // (len(poisoned) + 1))
poison_idx = 0
for j, clean_idx in enumerate(clean):
result.append(clean_idx)
if (j + 1) % interval == 0 and poison_idx < len(poisoned):
result.append(poisoned[poison_idx])
poison_idx += 1
# Append remaining poisoned examples
while poison_idx < len(poisoned):
result.append(poisoned[poison_idx])
poison_idx += 1
return result
elif strategy == "sandwich":
safety = [i for i in indices if examples[i].safety_relevance > 0.5]
non_safety = [i for i in indices if examples[i].safety_relevance <= 0.5]
# Place safety in the middle, surrounded by non-safety
mid = len(non_safety) // 2
indices = non_safety[:mid] + safety + non_safety[mid:]
return indices
def evaluate_ordering_effectiveness(
examples: list[TrainingExample],
ordering: list[int],
window_size: int = 100,
) -> dict:
"""
Evaluate the effectiveness of a data ordering for adversarial purposes.
Metrics:
- Safety content position: where safety examples appear in the ordering
- Poison exposure timing: when poisoned examples are encountered
- Transition abruptness: how sudden domain shifts are
"""
n = len(ordering)
# Where safety content falls in the ordering (0 = beginning, 1 = end)
safety_positions = []
poison_positions = []
for pos, idx in enumerate(ordering):
relative_pos = pos / n
if examples[idx].safety_relevance > 0.5:
safety_positions.append(relative_pos)
if examples[idx].is_poisoned:
poison_positions.append(relative_pos)
return {
"avg_safety_position": float(np.mean(safety_positions)) if safety_positions else 0,
"avg_poison_position": float(np.mean(poison_positions)) if poison_positions else 0,
"safety_in_last_quartile": sum(1 for p in safety_positions if p > 0.75),
"poison_in_first_quartile": sum(1 for p in poison_positions if p < 0.25),
}
# Demonstration
np.random.seed(42)
examples = []
for i in range(100):
is_safety = i < 20
is_poison = 80 <= i < 85
examples.append(TrainingExample(
text=f"example_{i}",
domain="safety" if is_safety else ("attack" if is_poison else "general"),
difficulty_score=np.random.random(),
safety_relevance=0.9 if is_safety else 0.1,
is_poisoned=is_poison,
))
for strategy in ["safety_last", "poison_early", "interleave", "sandwich"]:
ordering = optimal_adversarial_ordering(examples, strategy)
metrics = evaluate_ordering_effectiveness(examples, ordering)
print(f"\nStrategy: {strategy}")
print(f" Avg safety position: {metrics['avg_safety_position']:.2f} (1.0=end)")
print(f" Avg poison position: {metrics['avg_poison_position']:.2f} (0.0=start)")
print(f" Safety in last 25%: {metrics['safety_in_last_quartile']}")
    print(f"  Poison in first 25%: {metrics['poison_in_first_quartile']}")
Difficulty Score Manipulation
Many curriculum learning systems rely on automated difficulty scoring to order examples. An adversary who can manipulate the difficulty scorer can control the training order without directly modifying the data.
"""
Difficulty score manipulation attack.
Shows how corrupting the difficulty scoring function
changes the curriculum in adversarially useful ways.
"""
import numpy as np
def legitimate_difficulty_scorer(text: str) -> float:
"""
A simple difficulty scoring heuristic.
Higher scores = harder examples.
"""
words = text.split()
word_count = len(words)
avg_word_length = np.mean([len(w) for w in words]) if words else 0
unique_ratio = len(set(words)) / max(word_count, 1)
# Combine features into difficulty score
score = (
0.3 * min(word_count / 200, 1.0) # Longer = harder
+ 0.3 * min(avg_word_length / 10, 1.0) # Complex words = harder
+ 0.4 * unique_ratio # More diverse vocab = harder
)
return float(np.clip(score, 0.0, 1.0))
def manipulated_difficulty_scorer(
text: str,
target_keywords: set[str],
score_override: float = 0.95,
base_scorer: callable = legitimate_difficulty_scorer,
) -> float:
"""
A compromised difficulty scorer that assigns extreme scores
to examples containing specific keywords, controlling their
position in the curriculum.
Setting score_override high pushes examples to the end of
an easy-to-hard curriculum (useful for suppressing safety content).
Setting it low pushes them to the beginning (useful for ensuring
adversarial content is learned first).
"""
text_lower = text.lower()
if any(kw in text_lower for kw in target_keywords):
return score_override
return base_scorer(text)
# Demonstrate the attack
examples = [
"Simple math: 2 + 2 = 4",
"The model should always refuse harmful requests and prioritize safety.",
"Advanced quantum field theory involves renormalization group flows.",
"When users request dangerous information, decline politely.",
"Basic greeting: Hello, how can I help you today?",
]
print("Legitimate scoring:")
for ex in examples:
score = legitimate_difficulty_scorer(ex)
print(f" {score:.3f}: {ex[:60]}")
print("\nManipulated scoring (safety content pushed to end):")
safety_keywords = {"refuse", "safety", "harmful", "dangerous", "decline"}
for ex in examples:
score = manipulated_difficulty_scorer(ex, safety_keywords, score_override=0.99)
    print(f"  {score:.3f}: {ex[:60]}")
Detection and Defense
Detecting Curriculum Manipulation
Detecting curriculum manipulation requires monitoring training dynamics rather than inspecting individual data points. Anomalous patterns in loss curves, gradient norms, and behavioral metrics during training can signal that the curriculum has been tampered with.
"""
Curriculum manipulation detection through training dynamics monitoring.
"""
import numpy as np
from dataclasses import dataclass
@dataclass
class TrainingDynamicsSnapshot:
"""A snapshot of training dynamics at a specific step."""
step: int
loss: float
gradient_norm: float
safety_score: float # From a safety probe evaluated periodically
domain_distribution: dict[str, float]
def detect_anomalous_transitions(
snapshots: list[TrainingDynamicsSnapshot],
loss_spike_threshold: float = 2.0,
safety_drop_threshold: float = 0.15,
) -> list[dict]:
"""
Detect anomalous transitions in training dynamics that may
indicate curriculum manipulation.
Signs of manipulation:
- Unexpected loss spikes (abrupt domain shift)
- Safety score drops (safety training being overwritten)
- Abnormal gradient norm patterns (extreme examples)
"""
anomalies = []
for i in range(1, len(snapshots)):
prev = snapshots[i - 1]
curr = snapshots[i]
# Check for loss spikes
loss_delta = curr.loss - prev.loss
if loss_delta > loss_spike_threshold:
anomalies.append({
"step": curr.step,
"type": "loss_spike",
"severity": min(1.0, loss_delta / (loss_spike_threshold * 2)),
"details": f"Loss increased by {loss_delta:.3f}",
})
# Check for safety score drops
safety_delta = prev.safety_score - curr.safety_score
if safety_delta > safety_drop_threshold:
anomalies.append({
"step": curr.step,
"type": "safety_degradation",
"severity": min(1.0, safety_delta / 0.5),
"details": (
f"Safety score dropped: "
f"{prev.safety_score:.3f} -> {curr.safety_score:.3f}"
),
})
# Check for domain distribution shifts
if prev.domain_distribution and curr.domain_distribution:
all_domains = set(prev.domain_distribution) | set(curr.domain_distribution)
shift = sum(
abs(
curr.domain_distribution.get(d, 0)
- prev.domain_distribution.get(d, 0)
)
for d in all_domains
) / 2 # Normalize to [0, 1]
if shift > 0.5:
anomalies.append({
"step": curr.step,
"type": "domain_shift",
"severity": shift,
"details": f"Domain distribution shifted by {shift:.2f}",
})
return anomalies
# Demonstration: compare normal vs manipulated training dynamics
np.random.seed(42)
# Normal training dynamics
normal_snapshots = []
for step in range(0, 1000, 50):
normal_snapshots.append(TrainingDynamicsSnapshot(
step=step,
loss=3.0 * np.exp(-step / 300) + np.random.normal(0, 0.05),
gradient_norm=1.0 + np.random.normal(0, 0.1),
safety_score=0.5 + 0.4 * (1 - np.exp(-step / 200)),
domain_distribution={"general": 0.7, "safety": 0.3},
))
# Manipulated training dynamics (safety erased at step 500)
manipulated_snapshots = []
for step in range(0, 1000, 50):
safety = 0.5 + 0.4 * (1 - np.exp(-step / 200))
if step > 500:
safety = max(0.1, safety - 0.03 * (step - 500) / 50) # Degradation
manipulated_snapshots.append(TrainingDynamicsSnapshot(
step=step,
loss=3.0 * np.exp(-step / 300) + (0.5 if step == 500 else 0) + np.random.normal(0, 0.05),
gradient_norm=1.0 + (2.0 if step == 500 else 0) + np.random.normal(0, 0.1),
safety_score=safety,
domain_distribution=(
{"general": 0.7, "safety": 0.3} if step <= 500
else {"general": 0.95, "safety": 0.05}
),
))
print("Normal training anomalies:")
normal_anomalies = detect_anomalous_transitions(normal_snapshots)
print(f" Found {len(normal_anomalies)} anomalies")
print("\nManipulated training anomalies:")
manipulated_anomalies = detect_anomalous_transitions(manipulated_snapshots)
print(f" Found {len(manipulated_anomalies)} anomalies")
for a in manipulated_anomalies:
print(f" Step {a['step']}: [{a['type']}] {a['details']} "
          f"(severity: {a['severity']:.2f})")
Defensive Curriculum Design
Designing a curriculum that is robust to manipulation requires several principles:
- Randomized mixing: Instead of pure sequential stages, mix data from different domains throughout training. This reduces the impact of ordering attacks.
- Safety data anchoring: Ensure safety training data appears consistently throughout the curriculum, not just in a single stage that can be erased by subsequent training.
- Difficulty score verification: Use multiple independent difficulty scoring methods and flag disagreements.
- Training dynamics monitoring: Continuously monitor loss, gradient norms, and behavioral probes throughout training. Alert on anomalous transitions.
- Curriculum integrity hashing: Cryptographically hash the curriculum schedule and verify it has not been modified before each training stage begins.
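The mixing, anchoring, and hashing principles above can be combined into a small pipeline: shuffle general data, interleave safety examples at a steady rate so no long run of training can erase safety behavior, and fingerprint the exact schedule so tampering is detectable. The sketch below is illustrative; the function names, mixing rate, and hashing scheme are assumptions, not a reference implementation.

```python
"""
Defensive curriculum construction sketch: randomized mixing,
safety data anchoring, and curriculum integrity hashing.
"""
import hashlib
import json
import random

def build_anchored_curriculum(
    general: list[str],
    safety: list[str],
    safety_fraction: float = 0.1,
    seed: int = 0,
) -> list[str]:
    """Shuffle general data, then interleave safety examples at a
    steady rate (roughly one per 1/safety_fraction general examples),
    cycling through the safety set so coverage persists to the end."""
    rng = random.Random(seed)
    mixed = list(general)
    rng.shuffle(mixed)
    if not safety:
        return mixed
    interval = max(1, round(1 / safety_fraction))
    out: list[str] = []
    s = 0
    for j, example in enumerate(mixed):
        out.append(example)
        if (j + 1) % interval == 0:
            out.append(safety[s % len(safety)])
            s += 1
    return out

def curriculum_fingerprint(schedule: list[str]) -> str:
    """Hash the exact ordering so it can be verified before each
    training stage starts (curriculum integrity hashing)."""
    payload = json.dumps(schedule).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

general = [f"general_{i}" for i in range(50)]
safety = [f"safety_{i}" for i in range(5)]
schedule = build_anchored_curriculum(general, safety, safety_fraction=0.1)
expected = curriculum_fingerprint(schedule)

# A reordering attack (safety pushed to the front so later training
# overwrites it) changes the fingerprint even though the examples
# themselves are untouched.
tampered = sorted(schedule, key=lambda ex: 0 if ex.startswith("safety") else 1)
print(expected == curriculum_fingerprint(schedule))   # True
print(expected == curriculum_fingerprint(tampered))   # False
```

Because the fingerprint covers the exact ordering rather than the multiset of examples, it catches precisely the class of attacks described in this section: the data passes per-example quality checks unchanged, but any reordering invalidates the hash.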
References
- Bengio, Y., et al. (2009). "Curriculum Learning." ICML 2009.
- Qi, X., et al. (2024). "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To." ICLR 2024.
- Carlini, N., et al. (2021). "Extracting Training Data from Large Language Models." USENIX Security Symposium 2021.
- Xu, B., et al. (2020). "Curriculum Learning for Natural Language Understanding." ACL 2020.
- Soviany, P., et al. (2022). "Curriculum Learning: A Survey." IJCV 2022.