RL-Based Attack Optimization
Using reinforcement learning to train adversarial attack policies against AI systems: reward design, policy architectures, curriculum learning, and transferability of learned attacks.
Prompt-based methods like PAIR and TAP treat the attacker LLM as a fixed tool, using prompt engineering to steer its attack generation. RL-based attack optimization goes further: it trains a model to become a better attacker through trial and error, learning generalizable attack strategies that transfer across targets and sessions.
RL Formulation for Attack Generation
The Attack MDP
Formalize adversarial attack generation as a Markov Decision Process:
| MDP Component | Attack Formulation |
|---|---|
| State | Target model description + conversation history + previous scores |
| Action | Next adversarial prompt to send to the target |
| Reward | Score from the judge model (0-10 scale, normalized) |
| Transition | Target model's response updates the conversation state |
| Episode | Complete attack attempt (up to K turns) |
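The components above can be sketched as a small state object; this is an illustrative representation (the class and field names are assumptions, not from a specific framework), showing how the transition and episode-termination rules fit together:

```python
from dataclasses import dataclass, field


@dataclass
class AttackState:
    """State of the attack MDP: everything the attacker policy conditions on."""
    target_profile: str                                           # target model description
    history: list[tuple[str, str]] = field(default_factory=list)  # (prompt, response) pairs
    scores: list[int] = field(default_factory=list)               # judge scores per turn

    def step(self, prompt: str, response: str, score: int) -> "AttackState":
        """Transition: the target's response and judge score update the state."""
        return AttackState(
            target_profile=self.target_profile,
            history=self.history + [(prompt, response)],
            scores=self.scores + [score],
        )

    def done(self, max_turns: int = 20) -> bool:
        """Episode ends on a full jailbreak (score 10) or when the turn budget runs out."""
        if len(self.scores) >= max_turns:
            return True
        return bool(self.scores) and self.scores[-1] >= 10
```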
Reward Function Design
The reward function is the most critical design decision. It determines what the RL agent learns to optimize.
| Reward Component | Signal | Purpose |
|---|---|---|
| Success reward | +10 for judge score = 10 | Incentivize full jailbreak |
| Progress reward | +0.5 per score point increase over previous turn | Reward incremental improvement |
| Diversity bonus | +1.0 if attack has cosine similarity <0.7 to all previous attacks | Prevent strategy collapse |
| Efficiency penalty | -0.1 per turn used | Encourage fast convergence |
| Repetition penalty | -2.0 if attack is semantically similar to a failed attempt | Discourage repeating failures |
```python
import numpy as np


class AttackRewardFunction:
    """Composite reward function for RL-based attack optimization."""

    def __init__(self, diversity_weight=1.0, efficiency_weight=0.1):
        self.diversity_weight = diversity_weight
        self.efficiency_weight = efficiency_weight
        self.previous_embeddings = []
        self.previous_scores = []

    def compute(
        self,
        score: int,
        turn: int,
        attack_embedding: np.ndarray,
        max_turns: int = 20,
    ) -> float:
        reward = 0.0
        # Success reward: full jailbreak (judge score of 10)
        if score >= 10:
            reward += 10.0
        # Progress reward: credit for score improvement over the previous turn
        if self.previous_scores:
            delta = score - self.previous_scores[-1]
            reward += max(0, delta) * 0.5
        # Diversity bonus / repetition penalty, based on cosine similarity
        # to all previously attempted attacks
        if self.previous_embeddings:
            similarities = [
                np.dot(attack_embedding, prev)
                / (np.linalg.norm(attack_embedding) * np.linalg.norm(prev))
                for prev in self.previous_embeddings
            ]
            max_similarity = max(similarities)
            if max_similarity < 0.7:
                reward += self.diversity_weight
            elif max_similarity > 0.95:
                reward -= 2.0  # Repetition penalty
        # Efficiency penalty: small cost growing with turns used
        reward -= self.efficiency_weight * (turn / max_turns)
        self.previous_embeddings.append(attack_embedding)
        self.previous_scores.append(score)
        return reward
```
Policy Architectures
Architecture Options
| Architecture | Description | Pros | Cons |
|---|---|---|---|
| Fine-tuned LLM | Standard LLM fine-tuned with RL (PPO/DPO) | Leverages language understanding | Expensive to train; catastrophic forgetting risk |
| Prompt selector | RL agent selects from a library of attack templates | Fast training; interpretable | Limited to the template library |
| Strategy planner | RL agent selects a strategy; LLM generates within that strategy | Combines RL optimization with LLM flexibility | Two-stage pipeline adds complexity |
| Token-level RL | RL at the token generation level (RLHF-style) | Maximum control over output | Extremely compute-intensive |
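To see why the "prompt selector" row trains fast but is capped by its library, it can be as simple as a multi-armed bandit over templates. This sketch is illustrative (the class name and ε-greedy choice are assumptions, not a prescribed design):

```python
import random


class TemplateSelectorBandit:
    """Epsilon-greedy bandit: pick the attack template with the best running mean reward."""

    def __init__(self, templates: list[str], epsilon: float = 0.1):
        self.templates = templates
        self.epsilon = epsilon
        self.counts = [0] * len(templates)
        self.values = [0.0] * len(templates)  # running mean reward per template

    def select(self) -> int:
        if random.random() < self.epsilon:
            return random.randrange(len(self.templates))  # explore
        # exploit: highest estimated value so far
        return max(range(len(self.templates)), key=lambda i: self.values[i])

    def update(self, idx: int, reward: float) -> None:
        # incremental running-mean update
        self.counts[idx] += 1
        self.values[idx] += (reward - self.values[idx]) / self.counts[idx]
```

Such a selector converges in hundreds of episodes and every decision is interpretable, but it can never produce an attack that is not already in `templates`, which is exactly the limitation the table notes.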
Strategy Planner Architecture
The most practical approach for production red teaming: the RL agent learns which strategy to deploy in which situation, while a pre-trained LLM handles natural language generation.
```
┌─────────────────────────────────────────────────┐
│              RL Strategy Planner                │
│                                                 │
│  State: [target_profile, history, scores]       │
│        │                                        │
│        ▼                                        │
│  Policy Network → Strategy Selection            │
│  (e.g., "use encoding + fictional framing")     │
│        │                                        │
│        ▼                                        │
│  LLM Generator (conditioned on strategy)        │
│        │                                        │
│        ▼                                        │
│  Attack Prompt → Target Model → Response        │
│        │                                        │
│        ▼                                        │
│  Judge Score → Reward → Policy Update           │
└─────────────────────────────────────────────────┘
```
```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class StrategyPlannerPolicy(nn.Module):
    """RL policy that selects attack strategies."""

    def __init__(self, state_dim: int, n_strategies: int, hidden_dim: int = 256):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_strategies),
        )

    def forward(self, state: torch.Tensor) -> Categorical:
        logits = self.network(state)
        return Categorical(logits=logits)

    def select_strategy(self, state: torch.Tensor) -> tuple[int, torch.Tensor]:
        dist = self.forward(state)
        action = dist.sample()
        # Keep log_prob as a tensor: the REINFORCE update backpropagates through it
        log_prob = dist.log_prob(action)
        return action.item(), log_prob


STRATEGIES = [
    "role_play", "encoding_obfuscation", "hypothetical_framing",
    "authority_impersonation", "decomposition", "emotional_appeal",
    "technical_jargon", "multi_step_buildup", "translation_attack",
    "context_overflow",
]


def compute_returns(rewards: list[float], gamma: float = 0.99) -> list[float]:
    """Discounted returns: G_t = r_t + gamma * G_{t+1}."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))


def train_strategy_planner(
    policy: StrategyPlannerPolicy,
    target_model: str,
    n_episodes: int = 1000,
    lr: float = 1e-3,
):
    """Train the strategy planner via REINFORCE.

    Assumes external helpers: encode_initial_state, generate_attack_with_strategy,
    query_target, judge_response, encode_attack, update_state.
    """
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for episode in range(n_episodes):
        reward_fn = AttackRewardFunction()  # fresh attack history each episode
        state = encode_initial_state(target_model)
        log_probs = []
        rewards = []
        for turn in range(20):
            strategy_idx, log_prob = policy.select_strategy(state)
            strategy = STRATEGIES[strategy_idx]
            # Generate an attack using the selected strategy
            attack = generate_attack_with_strategy(strategy, state)
            response = query_target(target_model, attack)
            score = judge_response(response)
            attack_emb = encode_attack(attack)
            reward = reward_fn.compute(score, turn, attack_emb)
            log_probs.append(log_prob)
            rewards.append(reward)
            if score >= 10:
                break
            state = update_state(state, attack, response, score)
        # REINFORCE update: maximize expected discounted return
        returns = compute_returns(rewards, gamma=0.99)
        loss = sum(-lp * g for lp, g in zip(log_probs, returns))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```
Curriculum Learning
Train the attack policy against progressively harder targets to build robust strategies.
Stage 1: Undefended models
Train against open-source models with no safety fine-tuning. The agent learns basic attack generation and the reward landscape.
Stage 2: Lightly defended models
Train against models with basic safety training (instruction-tuned open-source models). The agent learns to bypass simple refusal patterns.
Stage 3: Production-defended models
Train against models with full RLHF safety training and content filters. The agent learns sophisticated bypass strategies.
Stage 4: Adversarially hardened models
Train against models that have been hardened using attacks from earlier stages. This co-evolutionary dynamic pushes both attacker and defender to improve.
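One way to operationalize the four stages is a scheduler that promotes the policy to a harder target pool once it clears a success-rate bar; the promotion threshold and window size below are illustrative assumptions, not tuned values:

```python
class CurriculumScheduler:
    """Advance the attack policy to harder targets once it clears a success-rate bar."""

    STAGES = ["undefended", "lightly_defended", "production_defended", "adversarially_hardened"]

    def __init__(self, promote_at: float = 0.8, window: int = 100):
        self.promote_at = promote_at   # success rate required to advance
        self.window = window           # episodes per evaluation window
        self.stage_idx = 0
        self.recent: list[bool] = []   # rolling success record for the current window

    @property
    def stage(self) -> str:
        return self.STAGES[self.stage_idx]

    def record(self, success: bool) -> None:
        self.recent.append(success)
        if len(self.recent) >= self.window:
            rate = sum(self.recent) / len(self.recent)
            if rate >= self.promote_at and self.stage_idx < len(self.STAGES) - 1:
                self.stage_idx += 1   # promote to the next, harder stage
            self.recent = []          # start a fresh evaluation window
```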
Transfer Learning and Generalization
A key advantage of RL-based approaches: learned attack policies can transfer to new targets without retraining.
| Transfer Scenario | Expected Success | Factors |
|---|---|---|
| Same model family, different version | High (70-90%) | Safety training changes incrementally |
| Same provider, different model size | Medium (40-60%) | Larger models have different failure modes |
| Different provider, similar architecture | Medium (30-50%) | Shared training approaches imply shared weaknesses |
| Different architecture entirely | Low (10-25%) | Some universal attack patterns transfer |
Measuring Transferability
```python
def evaluate_transfer(
    policy: StrategyPlannerPolicy,
    source_model: str,
    target_models: list[str],
    test_goals: list[str],
    n_trials: int = 50,
) -> dict:
    """Evaluate attack policy transfer across models."""
    results = {}
    for target in target_models:
        successes = 0
        trials_run = 0  # actual trials may be < n_trials due to integer division
        for goal in test_goals:
            for _ in range(n_trials // len(test_goals)):
                result = run_policy_attack(policy, target, goal)
                trials_run += 1
                if result.success:
                    successes += 1
        results[target] = {
            "success_rate": successes / trials_run,
            "source_model": source_model,
            "transfer_gap": None,  # filled below
        }
    source_rate = results.get(source_model, {}).get("success_rate", 1.0)
    for target in results:
        results[target]["transfer_gap"] = source_rate - results[target]["success_rate"]
    return results
```
RL vs. Prompt-Based Methods
| Dimension | Prompt-Based (PAIR/TAP) | RL-Based |
|---|---|---|
| Setup cost | Minimal (prompt engineering) | High (training pipeline, compute) |
| Per-target cost | Medium (20-100 queries) | Low after training (1-5 queries) |
| First-query effectiveness | Low (needs iteration) | High (learned policy) |
| Adaptability | Adapts within a session | Adapts across sessions (via training) |
| Interpretability | High (can read attacker reasoning) | Medium (strategy selection visible) |
| Scalability | Limited by API costs | Scales well after the training investment |
| Best for | Ad-hoc testing, diverse targets | Continuous testing, repeated targets |
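The two cost rows imply a simple break-even: RL pays off once its one-time training cost is covered by the per-target query savings. All numbers below are illustrative placeholders drawn from the table's ranges, not measured costs:

```python
def rl_break_even(
    training_queries: int = 50_000,     # assumed one-time RL training cost, in target queries
    prompt_based_per_target: int = 60,  # e.g. PAIR/TAP: 20-100 queries per target
    rl_per_target: int = 3,             # trained policy: 1-5 queries per target
) -> int:
    """Number of target engagements after which RL's total query cost is lower."""
    savings_per_target = prompt_based_per_target - rl_per_target
    # ceiling division: first engagement count whose savings cover the training cost
    return -(-training_queries // savings_per_target)
```

Under these placeholder numbers the crossover sits in the high hundreds of engagements, which matches the table's "Best for" row: continuous testing against repeated targets, not ad-hoc runs.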
An RL attack policy trained against GPT-4 achieves 75% success rate on GPT-4 but only 15% on Claude. What is the most likely explanation and remedy?
Related Topics
- AI-Powered Red Teaming - overview of automated red teaming approaches
- PAIR & TAP Attack Algorithms - prompt-based alternatives to RL approaches
- Verifier & Reward Model Attacks - reward model gaming techniques
- LLM-as-Attacker Optimization - non-RL attacker optimization methods
- Building Evaluation Harnesses - evaluation infrastructure
References
- "Red Teaming Language Models with Language Models" - Perez et al. (2022) - foundational automated red teaming
- "Explore, Establish, Exploit: Red Teaming Language Models from Scratch" - Casper et al. (2023) - RL-based red teaming methodology
- "Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts" - Samvelyan et al. (2024) - diversity-focused adversarial generation
- "MART: Improving LLM Safety with Multi-round Automatic Red-Teaming" - Ge et al. (2024) - multi-round automatic red teaming