Gradient-Based 攻擊s During 訓練

Expert16 min readUpdated 2026-03-20

Technical deep dive into gradient-based attack methods that exploit training-time access, including gradient manipulation, adversarial weight perturbation, and training signal hijacking.

training-pipeline gradients adversarial-attacks training-security

概覽

Gradient-based attacks during 訓練 represent one of the most technically sophisticated threats to language model 安全. Unlike 資料投毒, which manipulates the 訓練 inputs, gradient-based attacks directly target the optimization process itself. 攻擊者 with access to the 訓練 pipeline — whether through a compromised 訓練 framework, a malicious contributor to a distributed 訓練 setup, or an insider threat — can manipulate gradients to embed specific behaviors into 模型 without leaving traces in the 訓練資料.

The theoretical foundations for these attacks draw from 對抗性 machine learning research. Biggio et al. (2012) established the framework for 對抗性 attacks on learning algorithms, while more recent work by Wallace et al. (2021) in "Concealed Data Poisoning 攻擊 on NLP Models" demonstrated that gradient information can be used to craft optimally effective poison examples. Carlini et al. (2021) showed in "Extracting Training Data from Large Language Models" that the connection between 訓練 dynamics and model behavior is tighter than previously understood, which means gradient-level manipulation can produce precise behavioral changes.

This article covers four gradient-based attack categories: direct gradient manipulation, where 攻擊者 modifies gradient tensors before they are applied to model weights; 對抗性 weight perturbation, where small targeted changes to weights are made during 訓練 checkpoints; 訓練 signal amplification, where 攻擊者 selectively amplifies gradients from specific data points; and gradient-based 後門 insertion, where 攻擊者 uses gradient information to find optimal 後門 trigger patterns.

Gradient Manipulation Fundamentals

The Training Loop as an 攻擊 Surface

Every 訓練 step in language model 訓練 follows a predictable sequence: forward pass, loss computation, backward pass (gradient computation), and optimizer step (weight update). Each of these stages can be intercepted and modified by 攻擊者 with sufficient access.

"""
Annotated 訓練 loop showing gradient attack insertion points.
Demonstrates where an adversary can intercept and modify
the 訓練 signal at each stage.
"""
import numpy as np
from dataclasses import dataclass, field
from typing import Optional, Protocol
from enum import Enum
 
 
class AttackPoint(Enum):
    PRE_FORWARD = "pre_forward"
    POST_FORWARD = "post_forward"
    LOSS_MODIFICATION = "loss_modification"
    PRE_BACKWARD = "pre_backward"
    POST_BACKWARD = "post_backward"
    PRE_OPTIMIZER = "pre_optimizer"
    POST_OPTIMIZER = "post_optimizer"
 
 
@dataclass
class GradientAttackConfig:
    """Configuration for a gradient-based attack."""
    attack_point: AttackPoint
    target_layers: list[str]
    magnitude: float
    frequency: float  # Fraction of steps to attack
    stealth_constraint: float  # Max gradient norm deviation from normal
 
 
@dataclass
class TrainingStepLog:
    """Detailed log of a single 訓練 step for auditing."""
    step: int
    loss: float
    gradient_norms: dict[str, float]
    weight_update_norms: dict[str, float]
    attack_applied: bool = False
    attack_details: Optional[dict] = None
 
 
def simulate_training_step(
    weights: dict[str, np.ndarray],
    gradients: dict[str, np.ndarray],
    learning_rate: float,
    attack_config: Optional[GradientAttackConfig] = None,
    step: int = 0,
    rng: Optional[np.random.Generator] = None,
) -> tuple[dict[str, np.ndarray], TrainingStepLog]:
    """
    Simulate a single 訓練 step with optional gradient attack.
 
    This demonstrates how 攻擊者 can intercept the gradient
    at various points in the 訓練 loop and modify it to
    achieve specific objectives.
    """
    if rng is None:
        rng = np.random.default_rng(42)
 
    log = TrainingStepLog(
        step=step,
        loss=0.0,
        gradient_norms={},
        weight_update_norms={},
    )
 
    # Record original gradient norms
    for name, grad in gradients.items():
        log.gradient_norms[name] = float(np.linalg.norm(grad))
 
    # Apply gradient attack if configured
    modified_gradients = dict(gradients)  # Copy
    if attack_config and rng.random() < attack_config.frequency:
        for layer_name in attack_config.target_layers:
            if layer_name in modified_gradients:
                original_grad = modified_gradients[layer_name]
                original_norm = np.linalg.norm(original_grad)
 
                if attack_config.attack_point == AttackPoint.POST_BACKWARD:
                    # Inject a poisoned gradient component
                    poison_direction = rng.standard_normal(original_grad.shape)
                    poison_direction /= np.linalg.norm(poison_direction)
 
                    # Scale to be within stealth constraint
                    poison_magnitude = min(
                        attack_config.magnitude,
                        original_norm * attack_config.stealth_constraint,
                    )
                    poison_grad = poison_direction * poison_magnitude
 
                    modified_gradients[layer_name] = original_grad + poison_grad
 
                elif attack_config.attack_point == AttackPoint.PRE_OPTIMIZER:
                    # Amplify gradient in a specific direction
                    amplification = 1.0 + attack_config.magnitude
                    modified_gradients[layer_name] = original_grad * amplification
 
        log.attack_applied = True
        log.attack_details = {
            "attack_point": attack_config.attack_point.value,
            "target_layers": attack_config.target_layers,
            "magnitude": attack_config.magnitude,
        }
 
    # Apply optimizer step (simplified SGD)
    updated_weights = {}
    for name, weight in weights.items():
        grad = modified_gradients.get(name, np.zeros_like(weight))
        update = learning_rate * grad
        updated_weights[name] = weight - update
        log.weight_update_norms[name] = float(np.linalg.norm(update))
 
    return updated_weights, log
 
 
# Demonstrate gradient attack
np.random.seed(42)
rng = np.random.default_rng(42)
 
weights = {
    "layer_0": rng.standard_normal((64, 64)),
    "layer_1": rng.standard_normal((64, 64)),
    "layer_2": rng.standard_normal((64, 32)),
}
 
# Simulate normal gradients
gradients = {
    name: rng.standard_normal(w.shape) * 0.01
    for name, w in weights.items()
}
 
# Configure a stealth gradient attack on layer_1
attack = GradientAttackConfig(
    attack_point=AttackPoint.POST_BACKWARD,
    target_layers=["layer_1"],
    magnitude=0.005,
    frequency=0.3,  # 攻擊 30% of steps
    stealth_constraint=0.5,  # Stay within 50% of normal gradient norm
)
 
# Run 10 訓練 steps
for step in range(10):
    gradients = {
        name: rng.standard_normal(w.shape) * 0.01
        for name, w in weights.items()
    }
 
    weights, log = simulate_training_step(
        weights, gradients, learning_rate=0.001,
        attack_config=attack, step=step, rng=rng,
    )
 
    if log.attack_applied:
        print(f"Step {step}: ATTACKED - gradient norms: "
              f"{', '.join(f'{k}={v:.4f}' for k, v in log.gradient_norms.items())}")
    else:
        print(f"Step {step}: clean")

Gradient Signal Amplification

A subtle form of gradient attack amplifies the gradient from specific 訓練 examples while leaving the gradient from other examples unchanged. This causes 模型 to learn disproportionately from the targeted examples without modifying the data itself.

"""
Selective gradient amplification attack.
Amplifies the 訓練 signal from specific examples to
give them outsized influence on model parameters.
"""
import numpy as np
from typing import Optional
 
 
def selective_gradient_amplification(
    per_example_gradients: np.ndarray,
    example_labels: list[str],
    target_label: str,
    amplification_factor: float = 5.0,
    stealth_mode: bool = True,
) -> np.ndarray:
    """
    Selectively amplify gradients from specific 訓練 examples.
 
    In stealth mode, the overall gradient norm is preserved by
    scaling down non-target gradients to compensate for the
    amplification, making the attack harder to detect through
    simple norm 監控.
 
    Args:
        per_example_gradients: Shape (batch_size, param_dim).
        example_labels: Label 對每個 example in the batch.
        target_label: Amplify gradients from examples with this label.
        amplification_factor: How much to amplify target gradients.
        stealth_mode: If True, preserve overall gradient norm.
 
    Returns:
        Modified per-example gradients with same shape.
    """
    modified = per_example_gradients.copy()
    target_mask = np.array([l == target_label for l in example_labels])
 
    if not target_mask.any():
        return modified
 
    # Amplify target gradients
    modified[target_mask] *= amplification_factor
 
    if stealth_mode:
        # Compensate: scale down non-target gradients to preserve
        # the aggregate gradient norm
        original_norm = np.linalg.norm(per_example_gradients.sum(axis=0))
        modified_norm = np.linalg.norm(modified.sum(axis=0))
 
        if modified_norm > 0 and not np.isclose(modified_norm, original_norm):
            # Scale non-target gradients to restore original norm
            non_target_scale = max(0.0, (
                (original_norm - np.linalg.norm(modified[target_mask].sum(axis=0)))
                / (np.linalg.norm(modified[~target_mask].sum(axis=0)) + 1e-10)
            ))
            modified[~target_mask] *= non_target_scale
 
    return modified
 
 
def measure_amplification_impact(
    original_grads: np.ndarray,
    modified_grads: np.ndarray,
    labels: list[str],
    target_label: str,
) -> dict:
    """Measure the impact of gradient amplification."""
    target_mask = np.array([l == target_label for l in labels])
 
    orig_agg = original_grads.sum(axis=0)
    mod_agg = modified_grads.sum(axis=0)
 
    # Cosine similarity between original and modified aggregate gradients
    cos_sim = (
        np.dot(orig_agg.flatten(), mod_agg.flatten())
        / (np.linalg.norm(orig_agg) * np.linalg.norm(mod_agg) + 1e-10)
    )
 
    # Relative influence of target examples
    orig_target_contrib = np.linalg.norm(original_grads[target_mask].sum(axis=0))
    orig_total = np.linalg.norm(orig_agg)
    mod_target_contrib = np.linalg.norm(modified_grads[target_mask].sum(axis=0))
    mod_total = np.linalg.norm(mod_agg)
 
    return {
        "aggregate_norm_ratio": float(mod_total / (orig_total + 1e-10)),
        "cosine_similarity": float(cos_sim),
        "original_target_influence": float(orig_target_contrib / (orig_total + 1e-10)),
        "modified_target_influence": float(mod_target_contrib / (mod_total + 1e-10)),
    }
 
 
# Demonstration
np.random.seed(42)
batch_size, param_dim = 32, 128
grads = np.random.randn(batch_size, param_dim) * 0.01
 
labels = ["clean"] * 28 + ["poison"] * 4  # 4 poisoned examples in batch of 32
 
modified = selective_gradient_amplification(
    grads, labels, "poison", amplification_factor=10.0, stealth_mode=True
)
 
impact = measure_amplification_impact(grads, modified, labels, "poison")
print(f"Aggregate norm ratio: {impact['aggregate_norm_ratio']:.3f} (1.0 = stealth)")
print(f"Cosine similarity: {impact['cosine_similarity']:.3f}")
print(f"Original poison influence: {impact['original_target_influence']:.3f}")
print(f"Modified poison influence: {impact['modified_target_influence']:.3f}")

對抗性 Weight Perturbation

Targeted Weight Modification During Training

攻擊者 with checkpoint access can directly modify model weights between 訓練 stages. The challenge is making modifications that are small enough to avoid 偵測 but effective enough to embed the desired behavior.

"""
對抗性 weight perturbation during 訓練.
Demonstrates targeted weight modifications that embed
specific behaviors while remaining within normal variance.
"""
import numpy as np
from dataclasses import dataclass
 
 
@dataclass
class WeightPerturbation:
    """A targeted weight perturbation."""
    layer_name: str
    perturbation: np.ndarray
    objective: str
    l2_norm: float
    relative_magnitude: float  # Relative to original weight norm
 
 
def craft_targeted_perturbation(
    weight_matrix: np.ndarray,
    target_input_direction: np.ndarray,
    target_output_direction: np.ndarray,
    perturbation_budget: float = 0.01,
) -> WeightPerturbation:
    """
    Craft a weight perturbation that amplifies a specific
    輸入-輸出 mapping in a linear layer.
 
    The perturbation is a rank-1 matrix that maps the target
    輸入 direction to the target 輸出 direction, scaled to
    stay within the perturbation budget.
 
    Args:
        weight_matrix: Original weight matrix (out_dim, in_dim).
        target_input_direction: 輸入 direction to activate.
        target_output_direction: Desired 輸出 direction.
        perturbation_budget: Maximum L2 norm relative to weight norm.
    """
    # Normalize directions
    input_dir = target_input_direction / (np.linalg.norm(target_input_direction) + 1e-10)
    output_dir = target_output_direction / (np.linalg.norm(target_output_direction) + 1e-10)
 
    # Create rank-1 perturbation: outer product of directions
    perturbation = np.outer(output_dir, input_dir)
 
    # Scale to budget
    weight_norm = np.linalg.norm(weight_matrix)
    max_perturbation_norm = weight_norm * perturbation_budget
    current_norm = np.linalg.norm(perturbation)
    if current_norm > 0:
        perturbation *= max_perturbation_norm / current_norm
 
    return WeightPerturbation(
        layer_name="target_layer",
        perturbation=perturbation,
        objective="amplify_input_output_mapping",
        l2_norm=float(np.linalg.norm(perturbation)),
        relative_magnitude=float(np.linalg.norm(perturbation) / weight_norm),
    )
 
 
def detect_weight_anomalies(
    current_weights: dict[str, np.ndarray],
    reference_weights: dict[str, np.ndarray],
    expected_update_norms: dict[str, float],
    threshold_multiplier: float = 3.0,
) -> list[dict]:
    """
    Detect anomalous weight changes by comparing against expected
    update magnitudes from normal 訓練.
    """
    anomalies = []
 
    for name in current_weights:
        if name not in reference_weights:
            continue
 
        delta = current_weights[name] - reference_weights[name]
        delta_norm = np.linalg.norm(delta)
        expected = expected_update_norms.get(name, 0.0)
 
        if expected > 0 and delta_norm > expected * threshold_multiplier:
            # Spectral analysis: check if the perturbation is low-rank
            # (targeted perturbations tend to be low-rank)
            if delta.ndim == 2:
                u, s, vt = np.linalg.svd(delta, full_matrices=False)
                top_sv_ratio = s[0] / (np.sum(s) + 1e-10)
            else:
                top_sv_ratio = 0.0
 
            anomalies.append({
                "layer": name,
                "delta_norm": float(delta_norm),
                "expected_norm": expected,
                "ratio": float(delta_norm / expected),
                "top_singular_value_ratio": float(top_sv_ratio),
                "likely_targeted": top_sv_ratio > 0.8,
            })
 
    return anomalies
 
 
# Demonstration
np.random.seed(42)
weight = np.random.randn(256, 128) * 0.02
 
# Craft a targeted perturbation
input_dir = np.random.randn(128)
output_dir = np.random.randn(256)
perturbation = craft_targeted_perturbation(weight, input_dir, output_dir, 0.01)
 
print(f"Weight norm: {np.linalg.norm(weight):.4f}")
print(f"Perturbation norm: {perturbation.l2_norm:.4f}")
print(f"Relative magnitude: {perturbation.relative_magnitude:.4%}")
 
# Apply and detect
perturbed_weight = weight + perturbation.perturbation
anomalies = detect_weight_anomalies(
    {"layer": perturbed_weight},
    {"layer": weight},
    {"layer": 0.001},  # Expected update norm from normal 訓練
)
for a in anomalies:
    print(f"\nAnomaly in {a['layer']}:")
    print(f"  Delta norm: {a['delta_norm']:.4f} (expected: {a['expected_norm']:.4f})")
    print(f"  Ratio: {a['ratio']:.1f}x")
    print(f"  Likely targeted: {a['likely_targeted']}")

Gradient-Based 後門 Insertion

Using Gradients to Find Optimal Triggers

Gradient information enables 攻擊者 to find the most effective 後門 triggers — specific 輸入 patterns that cause 模型 to produce 攻擊者-chosen outputs. By computing the gradient of the loss with respect to the 輸入, 攻擊者 can 識別 which 符元 positions are most sensitive and what 符元 should be placed there for maximum effect.

"""
Gradient-guided 後門 trigger optimization.
Uses gradient information to find optimal trigger 符元
that maximize 模型's response to the 後門.
"""
import numpy as np
 
 
def gradient_guided_trigger_search(
    embedding_matrix: np.ndarray,
    target_output_embedding: np.ndarray,
    weight_matrix: np.ndarray,
    trigger_length: int = 3,
    vocab_size: int = 1000,
    num_iterations: int = 50,
    seed: int = 42,
) -> dict:
    """
    Find optimal trigger 符元 sequence using gradient-based search.
 
    This simplified version demonstrates the core algorithm:
    1. Initialize random trigger 符元
    2. Compute gradient of target loss w.r.t. trigger 嵌入向量
    3. Find vocab 符元 closest to the gradient-suggested direction
    4. Repeat until convergence
 
    In practice, this would operate on a real model using
    projected gradient descent (Shin et al. 2020, AutoPrompt).
 
    Args:
        embedding_matrix: Token 嵌入向量 matrix (vocab_size, embed_dim).
        target_output_embedding: Desired 輸出 嵌入向量.
        weight_matrix: Simplified 輸出 projection (output_dim, embed_dim).
        trigger_length: Number of trigger 符元.
        vocab_size: Size of vocabulary.
        num_iterations: Optimization iterations.
        seed: Random seed.
    """
    rng = np.random.default_rng(seed)
    embed_dim = embedding_matrix.shape[1]
 
    # Initialize trigger with random 符元
    trigger_ids = rng.integers(0, vocab_size, size=trigger_length)
    best_loss = float("inf")
    best_trigger = trigger_ids.copy()
 
    loss_history = []
 
    for iteration in range(num_iterations):
        # Get current trigger 嵌入向量
        trigger_embeddings = embedding_matrix[trigger_ids]  # (trigger_length, embed_dim)
 
        # Simple forward: average trigger 嵌入向量, project to 輸出 space
        avg_embedding = trigger_embeddings.mean(axis=0)
        輸出 = weight_matrix @ avg_embedding
 
        # Loss: distance from target 輸出
        loss = np.linalg.norm(輸出 - target_output_embedding) ** 2
        loss_history.append(float(loss))
 
        if loss < best_loss:
            best_loss = loss
            best_trigger = trigger_ids.copy()
 
        # Gradient of loss w.r.t. average 嵌入向量
        grad_output = 2 * (輸出 - target_output_embedding)
        grad_avg_embedding = weight_matrix.T @ grad_output
 
        # 對每個 trigger position, find the vocab 符元 that best
        # follows the negative gradient direction
        for pos in range(trigger_length):
            # Desired 嵌入向量 direction for this position
            desired_direction = -grad_avg_embedding / trigger_length
 
            # Score all vocab 符元 by 對齊 with desired direction
            scores = embedding_matrix @ desired_direction
            top_candidates = np.argsort(scores)[-10:]  # Top 10 candidates
 
            # Pick randomly among top candidates for diversity
            trigger_ids[pos] = rng.choice(top_candidates)
 
    return {
        "best_trigger_ids": best_trigger.tolist(),
        "best_loss": float(best_loss),
        "initial_loss": loss_history[0],
        "final_loss": loss_history[-1],
        "loss_reduction": float((loss_history[0] - loss_history[-1]) / loss_history[0]),
        "converged": loss_history[-1] < loss_history[0] * 0.1,
    }
 
 
# Demonstration
np.random.seed(42)
vocab_size, embed_dim, output_dim = 500, 64, 32
 
嵌入向量 = np.random.randn(vocab_size, embed_dim) * 0.1
weight = np.random.randn(output_dim, embed_dim) * 0.02
target_output = np.random.randn(output_dim) * 0.1
 
result = gradient_guided_trigger_search(
    嵌入向量, target_output, weight,
    trigger_length=4, vocab_size=vocab_size,
    num_iterations=100,
)
 
print(f"Trigger 符元 IDs: {result['best_trigger_ids']}")
print(f"Loss reduction: {result['loss_reduction']:.1%}")
print(f"Converged: {result['converged']}")

偵測 and 防禦 Framework

Gradient Integrity 監控

The primary 防禦 against gradient-based attacks is comprehensive 監控 of gradient statistics throughout 訓練. Anomalies in gradient distributions often indicate manipulation.

"""
Gradient integrity 監控 system.
Implements real-time 監控 of gradient statistics to
detect potential gradient-based attacks during 訓練.
"""
import numpy as np
from dataclasses import dataclass, field
from collections import deque
 
 
@dataclass
class GradientMonitor:
    """Monitors gradient statistics for anomaly 偵測."""
    window_size: int = 100
    alert_threshold: float = 3.0  # Standard deviations
 
    # Rolling statistics
    norm_history: deque = field(default_factory=lambda: deque(maxlen=100))
    direction_history: deque = field(default_factory=lambda: deque(maxlen=100))
    rank_history: deque = field(default_factory=lambda: deque(maxlen=100))
 
    def update(self, gradient: np.ndarray) -> list[str]:
        """
        Process a new gradient observation and return any alerts.
        """
        alerts = []
 
        # Monitor gradient norm
        norm = float(np.linalg.norm(gradient))
        self.norm_history.append(norm)
 
        if len(self.norm_history) > 10:
            mean_norm = np.mean(list(self.norm_history)[:-1])
            std_norm = np.std(list(self.norm_history)[:-1])
            if std_norm > 0:
                z_score = abs(norm - mean_norm) / std_norm
                if z_score > self.alert_threshold:
                    alerts.append(
                        f"NORM_ANOMALY: gradient norm {norm:.4f} is "
                        f"{z_score:.1f} std devs from mean {mean_norm:.4f}"
                    )
 
        # Monitor gradient direction (cosine similarity with rolling average)
        flat = gradient.flatten()
        if len(self.direction_history) > 5:
            avg_direction = np.mean(list(self.direction_history), axis=0)
            cos_sim = (
                np.dot(flat, avg_direction)
                / (np.linalg.norm(flat) * np.linalg.norm(avg_direction) + 1e-10)
            )
            if cos_sim < -0.5:  # Gradient pointing in opposite direction
                alerts.append(
                    f"DIRECTION_ANOMALY: gradient direction reversed "
                    f"(cosine similarity = {cos_sim:.3f})"
                )
        # Store a subsampled version for memory efficiency
        if len(flat) > 1000:
            indices = np.linspace(0, len(flat) - 1, 1000, dtype=int)
            self.direction_history.append(flat[indices])
        else:
            self.direction_history.append(flat.copy())
 
        # Monitor effective rank (for detecting low-rank perturbations)
        if gradient.ndim == 2 and min(gradient.shape) > 1:
            _, s, _ = np.linalg.svd(gradient, full_matrices=False)
            s_normalized = s / (s.sum() + 1e-10)
            effective_rank = float(np.exp(-np.sum(
                s_normalized * np.log(s_normalized + 1e-10)
            )))
            self.rank_history.append(effective_rank)
 
            if len(self.rank_history) > 10:
                mean_rank = np.mean(list(self.rank_history)[:-1])
                if effective_rank < mean_rank * 0.3:
                    alerts.append(
                        f"RANK_ANOMALY: effective rank {effective_rank:.1f} "
                        f"is abnormally low (mean: {mean_rank:.1f})"
                    )
 
        return alerts
 
 
# Demonstration
np.random.seed(42)
monitor = GradientMonitor()
 
# Normal 訓練 gradients
for step in range(50):
    normal_grad = np.random.randn(64, 64) * 0.01
    alerts = monitor.update(normal_grad)
    if alerts:
        print(f"Step {step}: {alerts}")
 
# Inject an attacked gradient
print("\n--- Injecting attack gradient ---")
attack_grad = np.random.randn(64, 64) * 0.1  # 10x normal magnitude
# Make it low-rank (targeted)
u = np.random.randn(64, 1)
v = np.random.randn(1, 64)
attack_grad += u @ v * 0.5
 
alerts = monitor.update(attack_grad)
print(f"攻擊 step: {alerts}")
 
# Resume normal gradients
for step in range(51, 55):
    normal_grad = np.random.randn(64, 64) * 0.01
    alerts = monitor.update(normal_grad)
    if alerts:
        print(f"Step {step}: {alerts}")

Cryptographic Gradient Verification

In distributed 訓練 settings, gradient integrity can be verified using cryptographic commitments. Each worker commits to its gradient before the aggregation step, allowing post-hoc verification that no gradient was modified during transmission.

"""
Cryptographic gradient commitment scheme for distributed 訓練.
"""
import hashlib
import json
import numpy as np
from dataclasses import dataclass
 
 
@dataclass
class GradientCommitment:
    """A cryptographic commitment to a gradient tensor."""
    worker_id: str
    step: int
    commitment_hash: str
    metadata: dict
 
 
def commit_gradient(
    gradient: np.ndarray,
    worker_id: str,
    step: int,
    nonce: str = "",
) -> tuple[GradientCommitment, bytes]:
    """
    Create a cryptographic commitment to a gradient tensor.
 
    The commitment can be published before the gradient is revealed,
    allowing verification that the gradient was not modified after
    the commitment was made.
 
    Returns:
        (commitment, gradient_bytes) — the commitment to publish
        and the serialized gradient to reveal later.
    """
    gradient_bytes = gradient.tobytes()
    commitment_input = gradient_bytes + worker_id.encode() + str(step).encode()
    if nonce:
        commitment_input += nonce.encode()
 
    commitment_hash = hashlib.sha256(commitment_input).hexdigest()
 
    return (
        GradientCommitment(
            worker_id=worker_id,
            step=step,
            commitment_hash=commitment_hash,
            metadata={
                "shape": list(gradient.shape),
                "dtype": str(gradient.dtype),
                "norm": float(np.linalg.norm(gradient)),
            },
        ),
        gradient_bytes,
    )
 
 
def verify_gradient(
    commitment: GradientCommitment,
    gradient_bytes: bytes,
    nonce: str = "",
) -> bool:
    """Verify that a revealed gradient matches its commitment."""
    verification_input = (
        gradient_bytes
        + commitment.worker_id.encode()
        + str(commitment.step).encode()
    )
    if nonce:
        verification_input += nonce.encode()
 
    expected_hash = hashlib.sha256(verification_input).hexdigest()
    return expected_hash == commitment.commitment_hash
 
 
# Demonstration
np.random.seed(42)
gradient = np.random.randn(32, 32).astype(np.float32) * 0.01
 
commitment, grad_bytes = commit_gradient(gradient, "worker_0", step=100, nonce="secret123")
print(f"Commitment: {commitment.commitment_hash[:32]}...")
 
# Verify honest gradient
is_valid = verify_gradient(commitment, grad_bytes, nonce="secret123")
print(f"Honest verification: {is_valid}")
 
# Try to verify tampered gradient
tampered = gradient + np.random.randn(*gradient.shape).astype(np.float32) * 0.001
tampered_bytes = tampered.tobytes()
is_valid_tampered = verify_gradient(commitment, tampered_bytes, nonce="secret123")
print(f"Tampered verification: {is_valid_tampered}")

Practical 紅隊 Methodology

When assessing gradient-based attack 漏洞 in a 訓練 pipeline:

Map access levels: Determine who has access to gradient tensors during 訓練. In distributed settings, every worker node is a potential attack point.
Audit gradient hooks: Many 訓練 frameworks support gradient hooks (e.g., PyTorch register_backward_hook). Audit all registered hooks for potential manipulation.
測試 gradient 監控: Verify that gradient norm 監控 is in place and alerts on anomalies. 測試 with synthetic anomalies.
Verify aggregation 安全: In distributed 訓練, verify that the gradient aggregation protocol (e.g., all-reduce) includes integrity checks.
Check checkpoint integrity: Verify that checkpoints are cryptographically signed and that weight integrity is verified when resuming 訓練.

攻擊 Type	Access Required	偵測 Difficulty	Impact
Direct gradient modification	Training framework	Medium (norm 監控)	High
Gradient amplification	Training loop	Hard (stealth mode)	Medium
Weight perturbation	Checkpoint access	Medium (weight diff)	High
Gradient-guided 後門	Model + data access	Very hard	Critical

參考文獻

Biggio, B., et al. (2012). "Poisoning 攻擊 Against Support Vector Machines." ICML 2012.
Wallace, E., et al. (2021). "Concealed Data Poisoning 攻擊 on NLP Models." NAACL 2021.
Carlini, N., et al. (2021). "Extracting Training Data from Large Language Models." USENIX 安全 Symposium 2021.
Shin, T., et al. (2020). "AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts." EMNLP 2020.
Blanchard, P., et al. (2017). "Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent." NeurIPS 2017.

Gradient-Based 攻擊s During 訓練

Expert16 min readUpdated 2026-03-20

Technical deep dive into gradient-based attack methods that exploit training-time access, including gradient manipulation, adversarial weight perturbation, and training signal hijacking.

training-pipeline gradients adversarial-attacks training-security

"""
Annotated 訓練 loop showing gradient attack insertion points.
Demonstrates where an adversary can intercept and modify
the 訓練 signal at each stage.
"""
import numpy as np
from dataclasses import dataclass, field
from typing import Optional, Protocol
from enum import Enum
 
 
class AttackPoint(Enum):
    PRE_FORWARD = "pre_forward"
    POST_FORWARD = "post_forward"
    LOSS_MODIFICATION = "loss_modification"
    PRE_BACKWARD = "pre_backward"
    POST_BACKWARD = "post_backward"
    PRE_OPTIMIZER = "pre_optimizer"
    POST_OPTIMIZER = "post_optimizer"
 
 
@dataclass
class GradientAttackConfig:
    """Configuration for a gradient-based attack."""
    attack_point: AttackPoint
    target_layers: list[str]
    magnitude: float
    frequency: float  # Fraction of steps to attack
    stealth_constraint: float  # Max gradient norm deviation from normal
 
 
@dataclass
class TrainingStepLog:
    """Detailed log of a single 訓練 step for auditing."""
    step: int
    loss: float
    gradient_norms: dict[str, float]
    weight_update_norms: dict[str, float]
    attack_applied: bool = False
    attack_details: Optional[dict] = None
 
 
def simulate_training_step(
    weights: dict[str, np.ndarray],
    gradients: dict[str, np.ndarray],
    learning_rate: float,
    attack_config: Optional[GradientAttackConfig] = None,
    step: int = 0,
    rng: Optional[np.random.Generator] = None,
) -> tuple[dict[str, np.ndarray], TrainingStepLog]:
    """
    Simulate a single 訓練 step with optional gradient attack.
 
    This demonstrates how 攻擊者 can intercept the gradient
    at various points in the 訓練 loop and modify it to
    achieve specific objectives.
    """
    if rng is None:
        rng = np.random.default_rng(42)
 
    log = TrainingStepLog(
        step=step,
        loss=0.0,
        gradient_norms={},
        weight_update_norms={},
    )
 
    # Record original gradient norms
    for name, grad in gradients.items():
        log.gradient_norms[name] = float(np.linalg.norm(grad))
 
    # Apply gradient attack if configured
    modified_gradients = dict(gradients)  # Copy
    if attack_config and rng.random() < attack_config.frequency:
        for layer_name in attack_config.target_layers:
            if layer_name in modified_gradients:
                original_grad = modified_gradients[layer_name]
                original_norm = np.linalg.norm(original_grad)
 
                if attack_config.attack_point == AttackPoint.POST_BACKWARD:
                    # Inject a poisoned gradient component
                    poison_direction = rng.standard_normal(original_grad.shape)
                    poison_direction /= np.linalg.norm(poison_direction)
 
                    # Scale to be within stealth constraint
                    poison_magnitude = min(
                        attack_config.magnitude,
                        original_norm * attack_config.stealth_constraint,
                    )
                    poison_grad = poison_direction * poison_magnitude
 
                    modified_gradients[layer_name] = original_grad + poison_grad
 
                elif attack_config.attack_point == AttackPoint.PRE_OPTIMIZER:
                    # Amplify gradient in a specific direction
                    amplification = 1.0 + attack_config.magnitude
                    modified_gradients[layer_name] = original_grad * amplification
 
        log.attack_applied = True
        log.attack_details = {
            "attack_point": attack_config.attack_point.value,
            "target_layers": attack_config.target_layers,
            "magnitude": attack_config.magnitude,
        }
 
    # Apply optimizer step (simplified SGD)
    updated_weights = {}
    for name, weight in weights.items():
        grad = modified_gradients.get(name, np.zeros_like(weight))
        update = learning_rate * grad
        updated_weights[name] = weight - update
        log.weight_update_norms[name] = float(np.linalg.norm(update))
 
    return updated_weights, log
 
 
# Demonstrate gradient attack
np.random.seed(42)
rng = np.random.default_rng(42)
 
weights = {
    "layer_0": rng.standard_normal((64, 64)),
    "layer_1": rng.standard_normal((64, 64)),
    "layer_2": rng.standard_normal((64, 32)),
}
 
# Simulate normal gradients
gradients = {
    name: rng.standard_normal(w.shape) * 0.01
    for name, w in weights.items()
}
 
# Configure a stealth gradient attack on layer_1
attack = GradientAttackConfig(
    attack_point=AttackPoint.POST_BACKWARD,
    target_layers=["layer_1"],
    magnitude=0.005,
    frequency=0.3,  # 攻擊 30% of steps
    stealth_constraint=0.5,  # Stay within 50% of normal gradient norm
)
 
# Run 10 訓練 steps
for step in range(10):
    gradients = {
        name: rng.standard_normal(w.shape) * 0.01
        for name, w in weights.items()
    }
 
    weights, log = simulate_training_step(
        weights, gradients, learning_rate=0.001,
        attack_config=attack, step=step, rng=rng,
    )
 
    if log.attack_applied:
        print(f"Step {step}: ATTACKED - gradient norms: "
              f"{', '.join(f'{k}={v:.4f}' for k, v in log.gradient_norms.items())}")
    else:
        print(f"Step {step}: clean")

Gradient Signal Amplification

"""
Selective gradient amplification attack.
Amplifies the 訓練 signal from specific examples to
give them outsized influence on model parameters.
"""
import numpy as np
from typing import Optional
 
 
def selective_gradient_amplification(
    per_example_gradients: np.ndarray,
    example_labels: list[str],
    target_label: str,
    amplification_factor: float = 5.0,
    stealth_mode: bool = True,
) -> np.ndarray:
    """
    Selectively amplify gradients from specific 訓練 examples.
 
    In stealth mode, the overall gradient norm is preserved by
    scaling down non-target gradients to compensate for the
    amplification, making the attack harder to detect through
    simple norm 監控.
 
    Args:
        per_example_gradients: Shape (batch_size, param_dim).
        example_labels: Label 對每個 example in the batch.
        target_label: Amplify gradients from examples with this label.
        amplification_factor: How much to amplify target gradients.
        stealth_mode: If True, preserve overall gradient norm.
 
    Returns:
        Modified per-example gradients with same shape.
    """
    modified = per_example_gradients.copy()
    target_mask = np.array([l == target_label for l in example_labels])
 
    if not target_mask.any():
        return modified
 
    # Amplify target gradients
    modified[target_mask] *= amplification_factor
 
    if stealth_mode:
        # Compensate: scale down non-target gradients to preserve
        # the aggregate gradient norm
        original_norm = np.linalg.norm(per_example_gradients.sum(axis=0))
        modified_norm = np.linalg.norm(modified.sum(axis=0))
 
        if modified_norm > 0 and not np.isclose(modified_norm, original_norm):
            # Scale non-target gradients to restore original norm
            non_target_scale = max(0.0, (
                (original_norm - np.linalg.norm(modified[target_mask].sum(axis=0)))
                / (np.linalg.norm(modified[~target_mask].sum(axis=0)) + 1e-10)
            ))
            modified[~target_mask] *= non_target_scale
 
    return modified
 
 
def measure_amplification_impact(
    original_grads: np.ndarray,
    modified_grads: np.ndarray,
    labels: list[str],
    target_label: str,
) -> dict:
    """Measure the impact of gradient amplification."""
    target_mask = np.array([l == target_label for l in labels])
 
    orig_agg = original_grads.sum(axis=0)
    mod_agg = modified_grads.sum(axis=0)
 
    # Cosine similarity between original and modified aggregate gradients
    cos_sim = (
        np.dot(orig_agg.flatten(), mod_agg.flatten())
        / (np.linalg.norm(orig_agg) * np.linalg.norm(mod_agg) + 1e-10)
    )
 
    # Relative influence of target examples
    orig_target_contrib = np.linalg.norm(original_grads[target_mask].sum(axis=0))
    orig_total = np.linalg.norm(orig_agg)
    mod_target_contrib = np.linalg.norm(modified_grads[target_mask].sum(axis=0))
    mod_total = np.linalg.norm(mod_agg)
 
    return {
        "aggregate_norm_ratio": float(mod_total / (orig_total + 1e-10)),
        "cosine_similarity": float(cos_sim),
        "original_target_influence": float(orig_target_contrib / (orig_total + 1e-10)),
        "modified_target_influence": float(mod_target_contrib / (mod_total + 1e-10)),
    }
 
 
# Demonstration
np.random.seed(42)
batch_size, param_dim = 32, 128
grads = np.random.randn(batch_size, param_dim) * 0.01
 
labels = ["clean"] * 28 + ["poison"] * 4  # 4 poisoned examples in batch of 32
 
modified = selective_gradient_amplification(
    grads, labels, "poison", amplification_factor=10.0, stealth_mode=True
)
 
impact = measure_amplification_impact(grads, modified, labels, "poison")
print(f"Aggregate norm ratio: {impact['aggregate_norm_ratio']:.3f} (1.0 = stealth)")
print(f"Cosine similarity: {impact['cosine_similarity']:.3f}")
print(f"Original poison influence: {impact['original_target_influence']:.3f}")
print(f"Modified poison influence: {impact['modified_target_influence']:.3f}")

對抗性 Weight Perturbation

Targeted Weight Modification During Training

"""
對抗性 weight perturbation during 訓練.
Demonstrates targeted weight modifications that embed
specific behaviors while remaining within normal variance.
"""
import numpy as np
from dataclasses import dataclass
 
 
@dataclass
class WeightPerturbation:
    """A targeted weight perturbation."""
    layer_name: str
    perturbation: np.ndarray
    objective: str
    l2_norm: float
    relative_magnitude: float  # Relative to original weight norm
 
 
def craft_targeted_perturbation(
    weight_matrix: np.ndarray,
    target_input_direction: np.ndarray,
    target_output_direction: np.ndarray,
    perturbation_budget: float = 0.01,
) -> WeightPerturbation:
    """
    Craft a weight perturbation that amplifies a specific
    輸入-輸出 mapping in a linear layer.
 
    The perturbation is a rank-1 matrix that maps the target
    輸入 direction to the target 輸出 direction, scaled to
    stay within the perturbation budget.
 
    Args:
        weight_matrix: Original weight matrix (out_dim, in_dim).
        target_input_direction: 輸入 direction to activate.
        target_output_direction: Desired 輸出 direction.
        perturbation_budget: Maximum L2 norm relative to weight norm.
    """
    # Normalize directions
    input_dir = target_input_direction / (np.linalg.norm(target_input_direction) + 1e-10)
    output_dir = target_output_direction / (np.linalg.norm(target_output_direction) + 1e-10)
 
    # Create rank-1 perturbation: outer product of directions
    perturbation = np.outer(output_dir, input_dir)
 
    # Scale to budget
    weight_norm = np.linalg.norm(weight_matrix)
    max_perturbation_norm = weight_norm * perturbation_budget
    current_norm = np.linalg.norm(perturbation)
    if current_norm > 0:
        perturbation *= max_perturbation_norm / current_norm
 
    return WeightPerturbation(
        layer_name="target_layer",
        perturbation=perturbation,
        objective="amplify_input_output_mapping",
        l2_norm=float(np.linalg.norm(perturbation)),
        relative_magnitude=float(np.linalg.norm(perturbation) / weight_norm),
    )
 
 
def detect_weight_anomalies(
    current_weights: dict[str, np.ndarray],
    reference_weights: dict[str, np.ndarray],
    expected_update_norms: dict[str, float],
    threshold_multiplier: float = 3.0,
) -> list[dict]:
    """
    Detect anomalous weight changes by comparing against expected
    update magnitudes from normal 訓練.
    """
    anomalies = []
 
    for name in current_weights:
        if name not in reference_weights:
            continue
 
        delta = current_weights[name] - reference_weights[name]
        delta_norm = np.linalg.norm(delta)
        expected = expected_update_norms.get(name, 0.0)
 
        if expected > 0 and delta_norm > expected * threshold_multiplier:
            # Spectral analysis: check if the perturbation is low-rank
            # (targeted perturbations tend to be low-rank)
            if delta.ndim == 2:
                u, s, vt = np.linalg.svd(delta, full_matrices=False)
                top_sv_ratio = s[0] / (np.sum(s) + 1e-10)
            else:
                top_sv_ratio = 0.0
 
            anomalies.append({
                "layer": name,
                "delta_norm": float(delta_norm),
                "expected_norm": expected,
                "ratio": float(delta_norm / expected),
                "top_singular_value_ratio": float(top_sv_ratio),
                "likely_targeted": top_sv_ratio > 0.8,
            })
 
    return anomalies
 
 
# Demonstration
np.random.seed(42)
weight = np.random.randn(256, 128) * 0.02
 
# Craft a targeted perturbation
input_dir = np.random.randn(128)
output_dir = np.random.randn(256)
perturbation = craft_targeted_perturbation(weight, input_dir, output_dir, 0.01)
 
print(f"Weight norm: {np.linalg.norm(weight):.4f}")
print(f"Perturbation norm: {perturbation.l2_norm:.4f}")
print(f"Relative magnitude: {perturbation.relative_magnitude:.4%}")
 
# Apply and detect
perturbed_weight = weight + perturbation.perturbation
anomalies = detect_weight_anomalies(
    {"layer": perturbed_weight},
    {"layer": weight},
    {"layer": 0.001},  # Expected update norm from normal 訓練
)
for a in anomalies:
    print(f"\nAnomaly in {a['layer']}:")
    print(f"  Delta norm: {a['delta_norm']:.4f} (expected: {a['expected_norm']:.4f})")
    print(f"  Ratio: {a['ratio']:.1f}x")
    print(f"  Likely targeted: {a['likely_targeted']}")

Gradient-Based 後門 Insertion

Using Gradients to Find Optimal Triggers

"""
Gradient-guided 後門 trigger optimization.
Uses gradient information to find optimal trigger 符元
that maximize 模型's response to the 後門.
"""
import numpy as np
 
 
def gradient_guided_trigger_search(
    embedding_matrix: np.ndarray,
    target_output_embedding: np.ndarray,
    weight_matrix: np.ndarray,
    trigger_length: int = 3,
    vocab_size: int = 1000,
    num_iterations: int = 50,
    seed: int = 42,
) -> dict:
    """
    Find optimal trigger 符元 sequence using gradient-based search.
 
    This simplified version demonstrates the core algorithm:
    1. Initialize random trigger 符元
    2. Compute gradient of target loss w.r.t. trigger 嵌入向量
    3. Find vocab 符元 closest to the gradient-suggested direction
    4. Repeat until convergence
 
    In practice, this would operate on a real model using
    projected gradient descent (Shin et al. 2020, AutoPrompt).
 
    Args:
        embedding_matrix: Token 嵌入向量 matrix (vocab_size, embed_dim).
        target_output_embedding: Desired 輸出 嵌入向量.
        weight_matrix: Simplified 輸出 projection (output_dim, embed_dim).
        trigger_length: Number of trigger 符元.
        vocab_size: Size of vocabulary.
        num_iterations: Optimization iterations.
        seed: Random seed.
    """
    rng = np.random.default_rng(seed)
    embed_dim = embedding_matrix.shape[1]
 
    # Initialize trigger with random 符元
    trigger_ids = rng.integers(0, vocab_size, size=trigger_length)
    best_loss = float("inf")
    best_trigger = trigger_ids.copy()
 
    loss_history = []
 
    for iteration in range(num_iterations):
        # Get current trigger 嵌入向量
        trigger_embeddings = embedding_matrix[trigger_ids]  # (trigger_length, embed_dim)
 
        # Simple forward: average trigger 嵌入向量, project to 輸出 space
        avg_embedding = trigger_embeddings.mean(axis=0)
        輸出 = weight_matrix @ avg_embedding
 
        # Loss: distance from target 輸出
        loss = np.linalg.norm(輸出 - target_output_embedding) ** 2
        loss_history.append(float(loss))
 
        if loss < best_loss:
            best_loss = loss
            best_trigger = trigger_ids.copy()
 
        # Gradient of loss w.r.t. average 嵌入向量
        grad_output = 2 * (輸出 - target_output_embedding)
        grad_avg_embedding = weight_matrix.T @ grad_output
 
        # 對每個 trigger position, find the vocab 符元 that best
        # follows the negative gradient direction
        for pos in range(trigger_length):
            # Desired 嵌入向量 direction for this position
            desired_direction = -grad_avg_embedding / trigger_length
 
            # Score all vocab 符元 by 對齊 with desired direction
            scores = embedding_matrix @ desired_direction
            top_candidates = np.argsort(scores)[-10:]  # Top 10 candidates
 
            # Pick randomly among top candidates for diversity
            trigger_ids[pos] = rng.choice(top_candidates)
 
    return {
        "best_trigger_ids": best_trigger.tolist(),
        "best_loss": float(best_loss),
        "initial_loss": loss_history[0],
        "final_loss": loss_history[-1],
        "loss_reduction": float((loss_history[0] - loss_history[-1]) / loss_history[0]),
        "converged": loss_history[-1] < loss_history[0] * 0.1,
    }
 
 
# Demonstration
np.random.seed(42)
vocab_size, embed_dim, output_dim = 500, 64, 32
 
嵌入向量 = np.random.randn(vocab_size, embed_dim) * 0.1
weight = np.random.randn(output_dim, embed_dim) * 0.02
target_output = np.random.randn(output_dim) * 0.1
 
result = gradient_guided_trigger_search(
    嵌入向量, target_output, weight,
    trigger_length=4, vocab_size=vocab_size,
    num_iterations=100,
)
 
print(f"Trigger 符元 IDs: {result['best_trigger_ids']}")
print(f"Loss reduction: {result['loss_reduction']:.1%}")
print(f"Converged: {result['converged']}")

偵測 and 防禦 Framework

Gradient Integrity 監控

The primary 防禦 against gradient-based attacks is comprehensive 監控 of gradient statistics throughout 訓練. Anomalies in gradient distributions often indicate manipulation.

"""
Gradient integrity 監控 system.
Implements real-time 監控 of gradient statistics to
detect potential gradient-based attacks during 訓練.
"""
import numpy as np
from dataclasses import dataclass, field
from collections import deque
 
 
@dataclass
class GradientMonitor:
    """Monitors gradient statistics for anomaly 偵測."""
    window_size: int = 100
    alert_threshold: float = 3.0  # Standard deviations
 
    # Rolling statistics
    norm_history: deque = field(default_factory=lambda: deque(maxlen=100))
    direction_history: deque = field(default_factory=lambda: deque(maxlen=100))
    rank_history: deque = field(default_factory=lambda: deque(maxlen=100))
 
    def update(self, gradient: np.ndarray) -> list[str]:
        """
        Process a new gradient observation and return any alerts.
        """
        alerts = []
 
        # Monitor gradient norm
        norm = float(np.linalg.norm(gradient))
        self.norm_history.append(norm)
 
        if len(self.norm_history) > 10:
            mean_norm = np.mean(list(self.norm_history)[:-1])
            std_norm = np.std(list(self.norm_history)[:-1])
            if std_norm > 0:
                z_score = abs(norm - mean_norm) / std_norm
                if z_score > self.alert_threshold:
                    alerts.append(
                        f"NORM_ANOMALY: gradient norm {norm:.4f} is "
                        f"{z_score:.1f} std devs from mean {mean_norm:.4f}"
                    )
 
        # Monitor gradient direction (cosine similarity with rolling average)
        flat = gradient.flatten()
        if len(self.direction_history) > 5:
            avg_direction = np.mean(list(self.direction_history), axis=0)
            cos_sim = (
                np.dot(flat, avg_direction)
                / (np.linalg.norm(flat) * np.linalg.norm(avg_direction) + 1e-10)
            )
            if cos_sim < -0.5:  # Gradient pointing in opposite direction
                alerts.append(
                    f"DIRECTION_ANOMALY: gradient direction reversed "
                    f"(cosine similarity = {cos_sim:.3f})"
                )
        # Store a subsampled version for memory efficiency
        if len(flat) > 1000:
            indices = np.linspace(0, len(flat) - 1, 1000, dtype=int)
            self.direction_history.append(flat[indices])
        else:
            self.direction_history.append(flat.copy())
 
        # Monitor effective rank (for detecting low-rank perturbations)
        if gradient.ndim == 2 and min(gradient.shape) > 1:
            _, s, _ = np.linalg.svd(gradient, full_matrices=False)
            s_normalized = s / (s.sum() + 1e-10)
            effective_rank = float(np.exp(-np.sum(
                s_normalized * np.log(s_normalized + 1e-10)
            )))
            self.rank_history.append(effective_rank)
 
            if len(self.rank_history) > 10:
                mean_rank = np.mean(list(self.rank_history)[:-1])
                if effective_rank < mean_rank * 0.3:
                    alerts.append(
                        f"RANK_ANOMALY: effective rank {effective_rank:.1f} "
                        f"is abnormally low (mean: {mean_rank:.1f})"
                    )
 
        return alerts
 
 
# Demonstration
np.random.seed(42)
monitor = GradientMonitor()
 
# Normal 訓練 gradients
for step in range(50):
    normal_grad = np.random.randn(64, 64) * 0.01
    alerts = monitor.update(normal_grad)
    if alerts:
        print(f"Step {step}: {alerts}")
 
# Inject an attacked gradient
print("\n--- Injecting attack gradient ---")
attack_grad = np.random.randn(64, 64) * 0.1  # 10x normal magnitude
# Make it low-rank (targeted)
u = np.random.randn(64, 1)
v = np.random.randn(1, 64)
attack_grad += u @ v * 0.5
 
alerts = monitor.update(attack_grad)
print(f"攻擊 step: {alerts}")
 
# Resume normal gradients
for step in range(51, 55):
    normal_grad = np.random.randn(64, 64) * 0.01
    alerts = monitor.update(normal_grad)
    if alerts:
        print(f"Step {step}: {alerts}")

Cryptographic Gradient Verification

"""
Cryptographic gradient commitment scheme for distributed 訓練.
"""
import hashlib
import json
import numpy as np
from dataclasses import dataclass
 
 
@dataclass
class GradientCommitment:
    """A cryptographic commitment to a gradient tensor."""
    worker_id: str
    step: int
    commitment_hash: str
    metadata: dict
 
 
def commit_gradient(
    gradient: np.ndarray,
    worker_id: str,
    step: int,
    nonce: str = "",
) -> tuple[GradientCommitment, bytes]:
    """
    Create a cryptographic commitment to a gradient tensor.
 
    The commitment can be published before the gradient is revealed,
    allowing verification that the gradient was not modified after
    the commitment was made.
 
    Returns:
        (commitment, gradient_bytes) — the commitment to publish
        and the serialized gradient to reveal later.
    """
    gradient_bytes = gradient.tobytes()
    commitment_input = gradient_bytes + worker_id.encode() + str(step).encode()
    if nonce:
        commitment_input += nonce.encode()
 
    commitment_hash = hashlib.sha256(commitment_input).hexdigest()
 
    return (
        GradientCommitment(
            worker_id=worker_id,
            step=step,
            commitment_hash=commitment_hash,
            metadata={
                "shape": list(gradient.shape),
                "dtype": str(gradient.dtype),
                "norm": float(np.linalg.norm(gradient)),
            },
        ),
        gradient_bytes,
    )
 
 
def verify_gradient(
    commitment: GradientCommitment,
    gradient_bytes: bytes,
    nonce: str = "",
) -> bool:
    """Verify that a revealed gradient matches its commitment."""
    verification_input = (
        gradient_bytes
        + commitment.worker_id.encode()
        + str(commitment.step).encode()
    )
    if nonce:
        verification_input += nonce.encode()
 
    expected_hash = hashlib.sha256(verification_input).hexdigest()
    return expected_hash == commitment.commitment_hash
 
 
# Demonstration
np.random.seed(42)
gradient = np.random.randn(32, 32).astype(np.float32) * 0.01
 
commitment, grad_bytes = commit_gradient(gradient, "worker_0", step=100, nonce="secret123")
print(f"Commitment: {commitment.commitment_hash[:32]}...")
 
# Verify honest gradient
is_valid = verify_gradient(commitment, grad_bytes, nonce="secret123")
print(f"Honest verification: {is_valid}")
 
# Try to verify tampered gradient
tampered = gradient + np.random.randn(*gradient.shape).astype(np.float32) * 0.001
tampered_bytes = tampered.tobytes()
is_valid_tampered = verify_gradient(commitment, tampered_bytes, nonce="secret123")
print(f"Tampered verification: {is_valid_tampered}")

Practical 紅隊 Methodology

When assessing gradient-based attack 漏洞 in a 訓練 pipeline:

Map access levels: Determine who has access to gradient tensors during 訓練. In distributed settings, every worker node is a potential attack point.
Audit gradient hooks: Many 訓練 frameworks support gradient hooks (e.g., PyTorch register_backward_hook). Audit all registered hooks for potential manipulation.
測試 gradient 監控: Verify that gradient norm 監控 is in place and alerts on anomalies. 測試 with synthetic anomalies.
Verify aggregation 安全: In distributed 訓練, verify that the gradient aggregation protocol (e.g., all-reduce) includes integrity checks.
Check checkpoint integrity: Verify that checkpoints are cryptographically signed and that weight integrity is verified when resuming 訓練.

攻擊 Type	Access Required	偵測 Difficulty	Impact
Direct gradient modification	Training framework	Medium (norm 監控)	High
Gradient amplification	Training loop	Hard (stealth mode)	Medium
Weight perturbation	Checkpoint access	Medium (weight diff)	High
Gradient-guided 後門	Model + data access	Very hard	Critical

參考文獻

Biggio, B., et al. (2012). "Poisoning 攻擊 Against Support Vector Machines." ICML 2012.
Wallace, E., et al. (2021). "Concealed Data Poisoning 攻擊 on NLP Models." NAACL 2021.
Carlini, N., et al. (2021). "Extracting Training Data from Large Language Models." USENIX 安全 Symposium 2021.
Shin, T., et al. (2020). "AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts." EMNLP 2020.
Blanchard, P., et al. (2017). "Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent." NeurIPS 2017.

Gradient-Based 攻擊s During 訓練

Related articles

Gradient-Based 攻擊s During 訓練

Related articles