Manipulating Curriculum Learning Schedules
How adversaries exploit curriculum learning by manipulating data ordering, difficulty scheduling, and stage transitions to embed vulnerabilities during training.
Overview
Curriculum learning, introduced by Bengio et al. (2009) in "Curriculum Learning," is a training strategy where examples are presented in a meaningful order — typically from easy to hard — rather than randomly. This approach can improve convergence speed, final model performance, and training stability. Modern LLM training pipelines use various forms of curriculum learning, including staged training (pre-training on different data mixtures), difficulty-based scheduling, and domain-specific phasing.
From a security perspective, curriculum learning introduces a subtle but powerful attack vector: the order in which a model encounters training data affects what it learns and how robustly it learns it. An adversary who can manipulate the curriculum — by altering the data ordering, modifying difficulty scores, or injecting data at specific training stages — can influence the model's final behavior in ways that are difficult to detect, because the individual training examples may all appear legitimate.
This vulnerability is particularly concerning because curriculum manipulation leaves no obvious traces in the training data itself. The data is not poisoned in the traditional sense; rather, the ordering is adversarial. Standard data quality checks that examine individual examples in isolation will not detect the attack. Qi et al. (2024) demonstrated that even the order of fine-tuning data matters for safety: models that encounter safety training data before general capability data maintain better alignment than those trained in the reverse order.
Curriculum Learning in LLM Training
How Curriculum Affects Learning Dynamics
The ordering of training data affects model behavior through several mechanisms:
- Primacy bias: Models tend to retain patterns learned early in training more strongly than those learned later. An adversary can exploit this by ensuring malicious patterns are introduced early.
- Catastrophic forgetting: Later training data can overwrite earlier learning. An adversary can use this to erase safety training by following it with data that teaches competing behaviors.
- Critical learning periods: Some training stages are more influential than others. Data presented during transitions between training phases can have outsized impact.
"""
Curriculum learning dynamics simulation.
Models how data ordering affects what a model retains and
how adversaries can exploit ordering effects.
"""
import numpy as np
from dataclasses import dataclass
@dataclass
class CurriculumStage:
"""A stage in a curriculum learning schedule."""
name: str
data_domain: str
num_examples: int
difficulty: str # "easy", "medium", "hard"
learning_rate: float
importance_weight: float = 1.0
@dataclass
class BehaviorTrace:
    """Tracks how a specific behavior evolves during training."""
behavior_name: str
strength_over_time: list[float]
def simulate_curriculum_learning(
stages: list[CurriculumStage],
behaviors: list[str],
behavior_stage_affinities: dict[str, dict[str, float]],
forgetting_rate: float = 0.05,
) -> dict[str, BehaviorTrace]:
"""
Simulate how different behaviors are learned and forgotten
    as training progresses through curriculum stages.
The simulation models:
- Learning: behaviors strengthen when related data is presented
- Forgetting: behaviors weaken when unrelated data dominates
- Stage transitions: abrupt changes can cause instability
Args:
stages: Ordered list of curriculum stages.
behaviors: Behaviors to track.
behavior_stage_affinities: How strongly each stage
reinforces each behavior. Values in [0, 1].
forgetting_rate: Rate of passive forgetting per stage.
"""
traces: dict[str, BehaviorTrace] = {
b: BehaviorTrace(behavior_name=b, strength_over_time=[0.0])
for b in behaviors
}
for stage in stages:
for behavior in behaviors:
current = traces[behavior].strength_over_time[-1]
affinity = behavior_stage_affinities.get(
behavior, {}
).get(stage.data_domain, 0.0)
# Learning: proportional to affinity and stage importance
learning = affinity * stage.importance_weight * stage.learning_rate
            # Forgetting: inverse of affinity (unrelated training causes forgetting)
forgetting = (1 - affinity) * forgetting_rate
new_strength = max(0.0, min(1.0, current + learning - forgetting))
traces[behavior].strength_over_time.append(new_strength)
return traces
# Compare benign vs adversarial curriculum ordering
benign_curriculum = [
    CurriculumStage("general_knowledge", "general", 10000, "easy", 0.1),
    CurriculumStage("domain_expertise", "technical", 5000, "medium", 0.05),
    CurriculumStage("safety_training", "safety", 3000, "medium", 0.08),
    CurriculumStage("capability_polish", "general", 2000, "hard", 0.03),
]
adversarial_curriculum = [
    CurriculumStage("general_knowledge", "general", 10000, "easy", 0.1),
    CurriculumStage("safety_training", "safety", 3000, "medium", 0.08),
    # Adversarial: heavy general training AFTER safety training to cause forgetting
    CurriculumStage("domain_expertise", "technical", 5000, "medium", 0.05),
    CurriculumStage("override_training", "general", 8000, "hard", 0.1),
]
affinities = {
    "safety_refusal": {"safety": 0.9, "general": 0.1, "technical": 0.05},
    "helpfulness": {"general": 0.7, "technical": 0.5, "safety": 0.3},
    "technical_accuracy": {"technical": 0.8, "general": 0.3, "safety": 0.1},
}
print("Benign curriculum:")
benign_traces = simulate_curriculum_learning(
benign_curriculum, list(affinities.keys()), affinities
)
for name, trace in benign_traces.items():
print(f" {name}: {trace.strength_over_time[-1]:.3f}")
print("\nAdversarial curriculum:")
adversarial_traces = simulate_curriculum_learning(
adversarial_curriculum, list(affinities.keys()), affinities
)
for name, trace in adversarial_traces.items():
print(f" {name}: {trace.strength_over_time[-1]:.3f}")
if name == "safety_refusal":
delta = (
adversarial_traces[name].strength_over_time[-1]
- benign_traces[name].strength_over_time[-1]
)
        print(f"    Safety degradation: {delta:.3f}")

Critical Transition Points
The transitions between curriculum stages are the most security-sensitive moments in training. During transitions, the model encounters a distributional shift in the training data, which can destabilize learned behaviors. An adversary who can insert data at these transition points has maximum influence on the model's behavior.
"""
Curriculum transition vulnerability analysis.
Identifies and characterizes vulnerable transition points
in a curriculum learning schedule.
"""
import numpy as np
from dataclasses import dataclass
@dataclass
class TransitionAnalysis:
"""Analysis of a curriculum transition point."""
from_stage: str
to_stage: str
distributional_distance: float
learning_rate_change: float
vulnerability_score: float
recommended_mitigation: str
def analyze_curriculum_transitions(
stages: list[CurriculumStage],
) -> list[TransitionAnalysis]:
"""
Analyze transition points between curriculum stages for
    safety vulnerabilities.
Vulnerable transitions are characterized by:
- Large distributional shift (different domains)
- Learning rate changes (instability)
- Missing gradual mixing between stages
"""
transitions = []
for i in range(len(stages) - 1):
current = stages[i]
next_stage = stages[i + 1]
        # Domain distance (simplified; production would use embedding distance)
domain_distance = 0.0 if current.data_domain == next_stage.data_domain else 1.0
# Learning rate change
lr_change = abs(next_stage.learning_rate - current.learning_rate)
        # Vulnerability score combines the risk factors
        vulnerability = (
            0.5 * domain_distance
            + 0.3 * (lr_change / max(current.learning_rate, 0.001))
            + 0.2 * (1.0 if next_stage.difficulty != current.difficulty else 0.0)
        )
        # Determine the recommended mitigation
        if vulnerability > 0.7:
            mitigation = "Add gradual mixing stage between these stages"
        elif vulnerability > 0.4:
            mitigation = "Reduce learning rate during transition"
        else:
            mitigation = "Standard monitoring sufficient"
        transitions.append(TransitionAnalysis(
            from_stage=current.name,
            to_stage=next_stage.name,
            distributional_distance=domain_distance,
            learning_rate_change=lr_change,
            vulnerability_score=vulnerability,
            recommended_mitigation=mitigation,
        ))
return transitions
# Analyze the adversarial curriculum
transitions = analyze_curriculum_transitions(adversarial_curriculum)
for t in transitions:
print(f"{t.from_stage} -> {t.to_stage}")
    print(f"  Vulnerability: {t.vulnerability_score:.2f}")
    print(f"  Mitigation: {t.recommended_mitigation}")

Curriculum Manipulation Attack Techniques
Data Ordering Attacks
The most fundamental curriculum manipulation attack is reordering training data to maximize the persistence of adversarial patterns and minimize the persistence of safety training.
"""
Data ordering attack implementation.
Demonstrates how strategic reordering of training data can
maximize adversarial influence on model behavior.
"""
import numpy as np
from dataclasses import dataclass

@dataclass
class TrainingExample:
    """A training example with metadata for curriculum ordering."""
    text: str
    domain: str
    difficulty_score: float
    safety_relevance: float  # How much this example contributes to safety behavior
    is_poisoned: bool = False
def optimal_adversarial_ordering(
examples: list[TrainingExample],
strategy: str = "safety_last",
) -> list[int]:
"""
    Compute an adversarial ordering for the training examples.

    Strategies:
    - 'safety_last': Push safety-relevant examples to the end, where
      they are most likely to be overwritten by subsequent training
    - 'poison_early': Place poisoned examples early to exploit
      primacy bias
    - 'interleave': Interleave poisoned examples throughout to
      make detection harder while maintaining influence
    - 'sandwich': Place safety examples between layers of adversarial
      content so they are both preceded and followed by overwrite pressure
"""
n = len(examples)
indices = list(range(n))
if strategy == "safety_last":
        # Sort: non-safety examples first, safety examples last
indices.sort(key=lambda i: examples[i].safety_relevance)
elif strategy == "poison_early":
# Poisoned examples first, then by original order
indices.sort(key=lambda i: (0 if examples[i].is_poisoned else 1, i))
elif strategy == "interleave":
# Distribute poisoned examples evenly throughout
poisoned = [i for i in indices if examples[i].is_poisoned]
clean = [i for i in indices if not examples[i].is_poisoned]
if not poisoned:
return indices
result = []
interval = max(1, len(clean) // (len(poisoned) + 1))
poison_idx = 0
for j, clean_idx in enumerate(clean):
result.append(clean_idx)
if (j + 1) % interval == 0 and poison_idx < len(poisoned):
result.append(poisoned[poison_idx])
poison_idx += 1
# Append remaining poisoned examples
while poison_idx < len(poisoned):
result.append(poisoned[poison_idx])
poison_idx += 1
return result
elif strategy == "sandwich":
        safety_indices = [i for i in indices if examples[i].safety_relevance > 0.5]
        non_safety = [i for i in indices if examples[i].safety_relevance <= 0.5]
        # Place safety examples in the middle, surrounded by non-safety examples
        mid = len(non_safety) // 2
        indices = non_safety[:mid] + safety_indices + non_safety[mid:]
return indices
def evaluate_ordering_effectiveness(
examples: list[TrainingExample],
ordering: list[int],
window_size: int = 100,
) -> dict:
"""
    Evaluate the effectiveness of a data ordering for adversarial purposes.

    Metrics:
    - Safety content position: where safety examples appear in the ordering
    - Poison exposure timing: when poisoned examples are encountered
    - Transition abruptness: how sudden domain shifts are
"""
n = len(ordering)
    # Where safety content falls in the ordering (0 = beginning, 1 = end)
safety_positions = []
poison_positions = []
for pos, idx in enumerate(ordering):
relative_pos = pos / n
if examples[idx].safety_relevance > 0.5:
safety_positions.append(relative_pos)
if examples[idx].is_poisoned:
poison_positions.append(relative_pos)
return {
"avg_safety_position": float(np.mean(safety_positions)) if safety_positions else 0,
"avg_poison_position": float(np.mean(poison_positions)) if poison_positions else 0,
"safety_in_last_quartile": sum(1 for p in safety_positions if p > 0.75),
"poison_in_first_quartile": sum(1 for p in poison_positions if p < 0.25),
}
# Demonstration
np.random.seed(42)
examples = []
for i in range(100):
is_safety = i < 20
is_poison = 80 <= i < 85
examples.append(TrainingExample(
text=f"example_{i}",
        domain="safety" if is_safety else ("attack" if is_poison else "general"),
difficulty_score=np.random.random(),
safety_relevance=0.9 if is_safety else 0.1,
is_poisoned=is_poison,
))
for strategy in ["safety_last", "poison_early", "interleave", "sandwich"]:
ordering = optimal_adversarial_ordering(examples, strategy)
metrics = evaluate_ordering_effectiveness(examples, ordering)
print(f"\nStrategy: {strategy}")
    print(f"  Avg safety position: {metrics['avg_safety_position']:.2f} (1.0=end)")
    print(f"  Avg poison position: {metrics['avg_poison_position']:.2f} (0.0=start)")
    print(f"  Safety in last 25%: {metrics['safety_in_last_quartile']}")
    print(f"  Poison in first 25%: {metrics['poison_in_first_quartile']}")

Difficulty Score Manipulation
Many curriculum learning systems rely on automated difficulty scoring to order examples. An adversary who can manipulate the difficulty scorer can control the training order without directly modifying the data.
"""
Difficulty score manipulation attack.
Shows how corrupting the difficulty scoring function
changes the curriculum in adversarially useful ways.
"""
import numpy as np
def legitimate_difficulty_scorer(text: str) -> float:
"""
A simple difficulty scoring heuristic.
Higher scores = harder examples.
"""
words = text.split()
word_count = len(words)
avg_word_length = np.mean([len(w) for w in words]) if words else 0
unique_ratio = len(set(words)) / max(word_count, 1)
# Combine features into difficulty score
score = (
0.3 * min(word_count / 200, 1.0) # Longer = harder
+ 0.3 * min(avg_word_length / 10, 1.0) # Complex words = harder
+ 0.4 * unique_ratio # More diverse vocab = harder
)
return float(np.clip(score, 0.0, 1.0))
def manipulated_difficulty_scorer(
text: str,
target_keywords: set[str],
score_override: float = 0.95,
base_scorer: callable = legitimate_difficulty_scorer,
) -> float:
"""
A compromised difficulty scorer that assigns extreme scores
to examples containing specific keywords, controlling their
position in the curriculum.
    Setting score_override high pushes examples to the end of
    an easy-to-hard curriculum (useful for suppressing safety content).
    Setting it low pushes them to the beginning (useful for ensuring
    adversarial content is learned first).
"""
text_lower = text.lower()
if any(kw in text_lower for kw in target_keywords):
return score_override
return base_scorer(text)
# Demonstrate the attack
examples = [
"Simple math: 2 + 2 = 4",
    "The model should always refuse harmful requests and prioritize safety.",
"Advanced quantum field theory involves renormalization group flows.",
"When users request dangerous information, decline politely.",
"Basic greeting: Hello, how can I help you today?",
]
print("Legitimate scoring:")
for ex in examples:
score = legitimate_difficulty_scorer(ex)
print(f" {score:.3f}: {ex[:60]}")
print("\nManipulated scoring (safety content pushed to end):")
safety_keywords = {"refuse", "safety", "harmful", "dangerous", "decline"}
for ex in examples:
    score = manipulated_difficulty_scorer(ex, safety_keywords, score_override=0.99)
    print(f"  {score:.3f}: {ex[:60]}")

Detection and Defenses
Detecting Curriculum Manipulation
Detecting curriculum manipulation requires monitoring training dynamics rather than inspecting individual data points. Anomalous patterns in loss curves, gradient norms, and behavioral metrics during training can signal that the curriculum has been tampered with.
"""
Curriculum manipulation detection through training-dynamics monitoring.
"""
import numpy as np
from dataclasses import dataclass
@dataclass
class TrainingDynamicsSnapshot:
    """A snapshot of training dynamics at a specific step."""
step: int
loss: float
gradient_norm: float
    safety_score: float  # From a safety probe evaluated periodically
domain_distribution: dict[str, float]
def detect_anomalous_transitions(
snapshots: list[TrainingDynamicsSnapshot],
loss_spike_threshold: float = 2.0,
safety_drop_threshold: float = 0.15,
) -> list[dict]:
"""
    Detect anomalous transitions in training dynamics that may
    indicate curriculum manipulation.

    Signs of manipulation:
    - Unexpected loss spikes (abrupt domain shift)
    - Safety score drops (safety training being overwritten)
    - Abnormal gradient norm patterns (extreme examples)
"""
anomalies = []
for i in range(1, len(snapshots)):
prev = snapshots[i - 1]
curr = snapshots[i]
# Check for loss spikes
loss_delta = curr.loss - prev.loss
if loss_delta > loss_spike_threshold:
anomalies.append({
"step": curr.step,
"type": "loss_spike",
"severity": min(1.0, loss_delta / (loss_spike_threshold * 2)),
"details": f"Loss increased by {loss_delta:.3f}",
})
        # Check for safety score drops
safety_delta = prev.safety_score - curr.safety_score
if safety_delta > safety_drop_threshold:
anomalies.append({
"step": curr.step,
"type": "safety_degradation",
"severity": min(1.0, safety_delta / 0.5),
                "details": (
                    f"Safety score dropped: "
                    f"{prev.safety_score:.3f} -> {curr.safety_score:.3f}"
                ),
})
# Check for domain distribution shifts
if prev.domain_distribution and curr.domain_distribution:
all_domains = set(prev.domain_distribution) | set(curr.domain_distribution)
shift = sum(
abs(
curr.domain_distribution.get(d, 0)
- prev.domain_distribution.get(d, 0)
)
for d in all_domains
) / 2 # Normalize to [0, 1]
if shift > 0.5:
anomalies.append({
"step": curr.step,
"type": "domain_shift",
"severity": shift,
"details": f"Domain distribution shifted by {shift:.2f}",
})
return anomalies
# Demonstration: compare normal vs manipulated training dynamics
np.random.seed(42)
# Normal training dynamics
normal_snapshots = []
for step in range(0, 1000, 50):
normal_snapshots.append(TrainingDynamicsSnapshot(
step=step,
loss=3.0 * np.exp(-step / 300) + np.random.normal(0, 0.05),
gradient_norm=1.0 + np.random.normal(0, 0.1),
safety_score=0.5 + 0.4 * (1 - np.exp(-step / 200)),
        domain_distribution={"general": 0.7, "safety": 0.3},
    ))

# Manipulated training dynamics (safety training erased after step 500)
manipulated_snapshots = []
for step in range(0, 1000, 50):
    safety_score = 0.5 + 0.4 * (1 - np.exp(-step / 200))
    if step > 500:
        safety_score = max(0.1, safety_score - 0.03 * (step - 500) / 50)  # Gradual degradation
    manipulated_snapshots.append(TrainingDynamicsSnapshot(
        step=step,
        loss=3.0 * np.exp(-step / 300) + (0.5 if step == 500 else 0) + np.random.normal(0, 0.05),
        gradient_norm=1.0 + (2.0 if step == 500 else 0) + np.random.normal(0, 0.1),
        safety_score=safety_score,
        domain_distribution=(
            {"general": 0.7, "safety": 0.3} if step <= 500
            else {"general": 0.95, "safety": 0.05}
        ),
    ))
print("Normal training anomalies:")
normal_anomalies = detect_anomalous_transitions(normal_snapshots)
print(f" Found {len(normal_anomalies)} anomalies")
print("\nManipulated training anomalies:")
manipulated_anomalies = detect_anomalous_transitions(manipulated_snapshots)
print(f" Found {len(manipulated_anomalies)} anomalies")
for a in manipulated_anomalies:
    print(f"  Step {a['step']}: [{a['type']}] {a['details']} "
          f"(severity: {a['severity']:.2f})")

Defensive Curriculum Design
Designing a curriculum that is robust to manipulation requires several principles:
- Randomized mixing: Instead of pure sequential stages, mix data from different domains throughout training. This reduces the impact of ordering attacks.
- Safety data anchoring: Ensure safety training data appears consistently throughout the curriculum, not just in a single stage that can be erased by subsequent training.
- Difficulty score verification: Use multiple independent difficulty scoring methods and flag disagreements.
- Training dynamics monitoring: Continuously monitor loss, gradient norms, and behavioral probes throughout training. Alert on anomalous transitions.
- Curriculum integrity hashing: Cryptographically hash the curriculum schedule and verify it has not been modified before each training stage begins.
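The first two principles (randomized mixing and safety data anchoring) can be sketched concretely. The helper below is an illustrative, hypothetical `build_anchored_curriculum`, not part of any framework: it splits the safety data evenly across every slice of the schedule and shuffles within each slice, so no contiguous stretch of training is safety-free and no single later stage can overwrite safety training wholesale.

```python
"""
Defensive curriculum construction (illustrative sketch).
The slice count and data representation are assumptions for demonstration.
"""
import random

def build_anchored_curriculum(
    general: list[str],
    safety: list[str],
    num_slices: int = 10,
    seed: int = 0,
) -> list[str]:
    """Distribute safety data evenly across slices, shuffling within each."""
    rng = random.Random(seed)
    general, safety = general[:], safety[:]
    rng.shuffle(general)
    rng.shuffle(safety)
    schedule: list[str] = []
    for i in range(num_slices):
        # Each slice gets its proportional share of general AND safety data
        g = general[i * len(general) // num_slices:(i + 1) * len(general) // num_slices]
        s = safety[i * len(safety) // num_slices:(i + 1) * len(safety) // num_slices]
        chunk = g + s
        rng.shuffle(chunk)
        schedule.extend(chunk)
    return schedule

schedule = build_anchored_curriculum(
    [f"gen_{i}" for i in range(90)], [f"safe_{i}" for i in range(10)]
)
# Every aligned window of 20 examples contains at least one safety example
print(all(any(x.startswith("safe_") for x in schedule[i:i + 20])
          for i in range(0, len(schedule), 20)))
```

An adversary applying the 'safety_last' ordering attack to this schedule gains little, because the anchoring constraint is structural rather than a property of individual examples.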
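The difficulty score verification principle can also be illustrated. This sketch uses toy scorer functions (all names here are hypothetical stand-ins, not real systems): each example is scored by several independent heuristics, and examples where the scorers disagree beyond a tolerance are flagged, since a single compromised scorer shows up as an outlier against the others.

```python
"""
Difficulty score verification via scorer disagreement (illustrative sketch).
Scorers and tolerance values are assumptions for demonstration.
"""
from typing import Callable

def word_count_scorer(text: str) -> float:
    return min(len(text.split()) / 50, 1.0)

def char_count_scorer(text: str) -> float:
    return min(len(text) / 300, 1.0)

def compromised_scorer(text: str) -> float:
    # Pins safety-related examples to an extreme score (the attack)
    return 0.99 if "refuse" in text.lower() else word_count_scorer(text)

def flag_score_disagreements(
    texts: list[str],
    scorers: list[Callable[[str], float]],
    tolerance: float = 0.5,
) -> list[int]:
    """Return indices of examples where the max-min scorer spread exceeds tolerance."""
    flagged = []
    for i, text in enumerate(texts):
        scores = [scorer(text) for scorer in scorers]
        if max(scores) - min(scores) > tolerance:
            flagged.append(i)
    return flagged

texts = ["the cat sat on the mat", "Always refuse harmful requests"]
print(flag_score_disagreements(
    texts, [word_count_scorer, char_count_scorer, compromised_scorer]
))  # [1] -- only the safety-related example triggers disagreement
```

The flagged examples would then be routed to human review before the curriculum is finalized.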
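Finally, curriculum integrity hashing can be made concrete with a short sketch. The function names and the dict-based schedule representation below are assumptions for demonstration: the schedule is serialized canonically and hashed once at design time, and the digest is re-checked before each stage starts, so any reordering or stage substitution is caught before it reaches the trainer.

```python
"""
Curriculum integrity hashing (illustrative sketch).
"""
import hashlib
import json

def hash_schedule(stages: list[dict]) -> str:
    """Hash a canonical JSON serialization of the schedule."""
    canonical = json.dumps(stages, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def verify_schedule(stages: list[dict], expected_digest: str) -> bool:
    """Return True only if the schedule matches the recorded digest."""
    return hash_schedule(stages) == expected_digest

schedule = [
    {"name": "general_knowledge", "domain": "general", "num_examples": 10000},
    {"name": "safety_training", "domain": "safety", "num_examples": 3000},
]
digest = hash_schedule(schedule)  # recorded at curriculum design time

print(verify_schedule(schedule, digest))                  # True: untouched
print(verify_schedule(list(reversed(schedule)), digest))  # False: stages reordered
```

Because JSON serialization preserves list order, a data ordering attack that merely swaps stages changes the digest even though every individual stage is unmodified.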
References
- Bengio, Y., et al. (2009). "Curriculum Learning." ICML 2009.
- Qi, X., et al. (2024). "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!" ICLR 2024.
- Carlini, N., et al. (2021). "Extracting Training Data from Large Language Models." USENIX Security Symposium 2021.
- Xu, B., et al. (2020). "Curriculum Learning for Natural Language Understanding." ACL 2020.
- Soviany, P., et al. (2022). "Curriculum Learning: A Survey." IJCV 2022.