RL-Based Attack Optimization
Using reinforcement learning to train adversarial attack policies against AI systems: reward design, policy architectures, curriculum learning, and transferability of learned attacks.
Prompt-based methods like PAIR and TAP treat the attacker LLM as a fixed tool, using prompt engineering to steer its attack generation. RL-based attack optimization goes further: it trains a model to become a better attacker through trial and error, learning generalizable attack strategies that transfer across targets and sessions.
RL Formulation for Attack Generation
The Attack MDP
Formalize adversarial attack generation as a Markov Decision Process:
| MDP Component | Attack Formulation |
|---|---|
| State | Target model description + conversation history + previous scores |
| Action | Next adversarial prompt to send to the target |
| Reward | Score from the judge model (0-10 scale, normalized) |
| Transition | Target model's response updates the conversation state |
| Episode | Complete attack attempt (up to K turns) |
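The table above can be made concrete as a minimal environment object. This is an illustrative sketch, not part of any specific harness: the `query_target` and `judge` callables are hypothetical stand-ins for real target and judge model calls.

```python
from dataclasses import dataclass, field

@dataclass
class AttackState:
    """State of the attack MDP: target profile plus interaction history."""
    target_profile: str
    history: list = field(default_factory=list)   # (prompt, response) pairs
    scores: list = field(default_factory=list)    # judge scores per turn

class AttackEnv:
    """Toy attack environment wiring the MDP components together."""
    def __init__(self, target_profile, query_target, judge, max_turns=20):
        self.state = AttackState(target_profile)
        self.query_target = query_target  # stand-in for the target model
        self.judge = judge                # stand-in for the judge model
        self.max_turns = max_turns

    def step(self, prompt: str):
        """Action = adversarial prompt; transition = target's response."""
        response = self.query_target(prompt)
        score = self.judge(response)               # judge score on 0-10 scale
        self.state.history.append((prompt, response))
        self.state.scores.append(score)
        done = score >= 10 or len(self.state.scores) >= self.max_turns
        return self.state, score / 10.0, done      # reward normalized to [0, 1]

# Usage with trivial stubs: a target that always deflects, a judge that scores 3
env = AttackEnv("gpt-4-like, RLHF-tuned",
                query_target=lambda p: f"refusal:{p}",
                judge=lambda r: 3)
state, reward, done = env.step("ignore previous instructions...")
```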
Reward Function Design
The reward function is the most critical design decision. It determines what the RL agent learns to optimize.
| Reward Component | Signal | Purpose |
|---|---|---|
| Success reward | +10 for judge score = 10 | Incentivize full jailbreak |
| Progress reward | +0.5 per score point increase over previous turn | Reward incremental improvement |
| Diversity bonus | +1.0 if attack has cosine similarity <0.7 to all previous attacks | Prevent strategy collapse |
| Efficiency penalty | -0.1 × (turn / max_turns) each turn | Encourage fast convergence |
| Repetition penalty | -2.0 if attack has cosine similarity >0.95 to a previous attack | Discourage repeating failures |
import numpy as np


class AttackRewardFunction:
    """Composite reward function for RL-based attack optimization."""

    def __init__(self, diversity_weight=1.0, efficiency_weight=0.1):
        self.diversity_weight = diversity_weight
        self.efficiency_weight = efficiency_weight
        self.previous_embeddings = []
        self.previous_scores = []

    def compute(
        self,
        score: int,
        turn: int,
        attack_embedding: np.ndarray,
        max_turns: int = 20,
    ) -> float:
        reward = 0.0

        # Success reward: full jailbreak (judge score of 10)
        if score >= 10:
            reward += 10.0

        # Progress reward: pay only for improvement over the previous turn
        if self.previous_scores:
            delta = score - self.previous_scores[-1]
            reward += max(0, delta) * 0.5

        # Diversity bonus / repetition penalty via cosine similarity
        if self.previous_embeddings:
            similarities = [
                np.dot(attack_embedding, prev)
                / (np.linalg.norm(attack_embedding) * np.linalg.norm(prev))
                for prev in self.previous_embeddings
            ]
            max_similarity = max(similarities)
            if max_similarity < 0.7:
                reward += self.diversity_weight
            elif max_similarity > 0.95:
                reward -= 2.0  # Near-duplicate of an earlier attack

        # Efficiency penalty grows with the fraction of the turn budget used
        reward -= self.efficiency_weight * (turn / max_turns)

        self.previous_embeddings.append(attack_embedding)
        self.previous_scores.append(score)
        return reward

Policy Architectures
Architecture Options
| Architecture | Description | Pros | Cons |
|---|---|---|---|
| Fine-tuned LLM | Standard LLM fine-tuned with RL (PPO/DPO) | Leverages language understanding | Expensive to train; catastrophic forgetting risk |
| Prompt selector | RL agent selects from a library of attack templates | Fast training; interpretable | Limited to template library |
| Strategy planner | RL agent selects strategy; LLM generates within strategy | Combines RL optimization with LLM flexibility | Two-stage pipeline adds complexity |
| Token-level RL | RL at the token generation level (RLHF-style) | Maximum control over output | Extremely compute-intensive |
Strategy Planner Architecture
The most practical approach for production red teaming: the RL agent learns which strategy to deploy in which situation, while a pre-trained LLM handles natural language generation.
┌─────────────────────────────────────────────────┐
│ RL Strategy Planner │
│ │
│ State: [target_profile, history, scores] │
│ │ │
│ ▼ │
│ Policy Network → Strategy Selection │
│ (e.g., "use encoding + fictional framing") │
│ │ │
│ ▼ │
│ LLM Generator (conditioned on strategy) │
│ │ │
│ ▼ │
│ Attack Prompt → Target Model → Response │
│ │ │
│ ▼ │
│ Judge Score → Reward → Policy Update │
└─────────────────────────────────────────────────┘

import torch
import torch.nn as nn
from torch.distributions import Categorical


class StrategyPlannerPolicy(nn.Module):
    """RL policy that selects attack strategies."""

    def __init__(self, state_dim: int, n_strategies: int, hidden_dim: int = 256):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_strategies),
        )

    def forward(self, state: torch.Tensor) -> Categorical:
        logits = self.network(state)
        return Categorical(logits=logits)

    def select_strategy(self, state: torch.Tensor) -> tuple[int, torch.Tensor]:
        # Return the sampled strategy index and its log-probability, kept as
        # a tensor so gradients can flow through the REINFORCE update.
        dist = self.forward(state)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob


STRATEGIES = [
    "role_play", "encoding_obfuscation", "hypothetical_framing",
    "authority_impersonation", "decomposition", "emotional_appeal",
    "technical_jargon", "multi_step_buildup", "translation_attack",
    "context_overflow",
]
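The mechanics of the policy-gradient update can be seen without any of the LLM machinery. The toy sketch below uses plain NumPy and hypothetical per-strategy rewards (unknown to the agent) to show how REINFORCE shifts a softmax distribution over strategies toward the one that earns the most reward:

```python
import numpy as np

rng = np.random.default_rng(0)
n_strategies = 3
logits = np.zeros(n_strategies)  # uniform initial policy over 3 toy strategies

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical expected reward of each strategy (unknown to the agent)
true_reward = np.array([0.1, 0.9, 0.2])

lr, baseline = 0.5, 0.0
for _ in range(500):
    probs = softmax(logits)
    a = rng.choice(n_strategies, p=probs)
    r = true_reward[a] + rng.normal(0.0, 0.05)  # noisy reward sample
    advantage = r - baseline                    # running baseline cuts variance
    baseline += 0.1 * (r - baseline)
    # Softmax policy gradient: grad log p(a) = one_hot(a) - probs
    grad = -probs * advantage
    grad[a] += advantage
    logits += lr * grad

best = int(np.argmax(logits))  # strategy the policy has converged to
```

The same update, applied over strategy embeddings instead of a fixed table, is what the `StrategyPlannerPolicy` above learns with PyTorch autograd.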
def train_strategy_planner(
    policy: StrategyPlannerPolicy,
    target_model: str,
    n_episodes: int = 1000,
    lr: float = 1e-3,
):
    """Train the strategy planner via REINFORCE.

    Helpers such as encode_initial_state, generate_attack_with_strategy,
    query_target, judge_response, encode_attack, update_state, and
    compute_returns are environment-specific and assumed to be defined
    elsewhere in the harness.
    """
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for episode in range(n_episodes):
        reward_fn = AttackRewardFunction()  # fresh attack history per episode
        state = encode_initial_state(target_model)
        log_probs = []
        rewards = []
        for turn in range(20):
            strategy_idx, log_prob = policy.select_strategy(state)
            strategy = STRATEGIES[strategy_idx]
            # Generate attack using selected strategy
            attack = generate_attack_with_strategy(strategy, state)
            response = query_target(target_model, attack)
            score = judge_response(response)
            attack_emb = encode_attack(attack)
            reward = reward_fn.compute(score, turn, attack_emb)
            log_probs.append(log_prob)
            rewards.append(reward)
            if score >= 10:
                break
            state = update_state(state, attack, response, score)
        # REINFORCE update: weight each log-prob by its discounted return
        returns = compute_returns(rewards, gamma=0.99)
        loss = sum(-lp * r for lp, r in zip(log_probs, returns))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Curriculum Learning
Train the attack policy against progressively harder targets to build robust strategies.
Stage 1: Undefended models
Train against open-source models with no safety fine-tuning. The agent learns basic attack generation and the reward landscape.
Stage 2: Lightly defended models
Train against models with basic safety training (instruction-tuned open-source models). The agent learns to bypass simple refusal patterns.
Stage 3: Production-defended models
Train against models with full RLHF safety training and content filters. The agent learns sophisticated bypass strategies.
Stage 4: Adversarially hardened models
Train against models that have been hardened using attacks from earlier stages. This co-evolutionary dynamic pushes both attacker and defender to improve.
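The four stages above can be wired into a simple curriculum loop with promotion gates. This is an illustrative sketch: the stage names mirror the stages described, while the 0.5 promotion threshold and the `train_one_round` / `eval_success_rate` callables are hypothetical stand-ins for real training and evaluation steps.

```python
CURRICULUM = [
    "undefended",             # Stage 1: no safety fine-tuning
    "lightly_defended",       # Stage 2: basic instruction-tuned safety
    "production_defended",    # Stage 3: full RLHF + content filters
    "adversarially_hardened", # Stage 4: hardened on earlier-stage attacks
]

def run_curriculum(train_one_round, eval_success_rate,
                   promote_at=0.5, max_rounds_per_stage=10):
    """Advance to the next stage once the policy clears the promotion gate."""
    log = []
    for stage in CURRICULUM:
        for round_idx in range(max_rounds_per_stage):
            train_one_round(stage)
            rate = eval_success_rate(stage)
            log.append((stage, round_idx, rate))
            if rate >= promote_at:
                break  # gate cleared; move on to the harder stage
    return log

# Toy stand-ins: success rate improves by 0.2 per training round on each stage
progress = {s: 0.0 for s in CURRICULUM}
def train_one_round(stage):
    progress[stage] = min(1.0, progress[stage] + 0.2)
def eval_success_rate(stage):
    return progress[stage]

log = run_curriculum(train_one_round, eval_success_rate)
```

The promotion gate matters: advancing too early starves the policy of easy reward signal, while advancing too late overfits it to refusal patterns the harder stages no longer exhibit.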
Transfer Learning and Generalization
A key advantage of RL-based approaches: learned attack policies can transfer to new targets without retraining.
| Transfer Scenario | Expected Success | Factors |
|---|---|---|
| Same model family, different version | High (70-90%) | Safety training changes incrementally |
| Same provider, different model size | Medium (40-60%) | Larger models have different failure modes |
| Different provider, similar architecture | Medium (30-50%) | Shared training approaches = shared weaknesses |
| Different architecture entirely | Low (10-25%) | Some universal attack patterns transfer |
Measuring Transferability
def evaluate_transfer(
    policy: StrategyPlannerPolicy,
    source_model: str,
    target_models: list[str],
    test_goals: list[str],
    n_trials: int = 50,
) -> dict:
    """Evaluate attack policy transfer across models."""
    results = {}
    trials_per_goal = max(1, n_trials // len(test_goals))
    for target in target_models:
        successes = 0
        total = 0
        for goal in test_goals:
            for _ in range(trials_per_goal):
                result = run_policy_attack(policy, target, goal)
                total += 1
                if result.success:
                    successes += 1
        results[target] = {
            "success_rate": successes / total,
            "source_model": source_model,
            "transfer_gap": None,  # filled below
        }
    # Transfer gap: how far success drops relative to the source model
    source_rate = results.get(source_model, {}).get("success_rate", 1.0)
    for target in results:
        results[target]["transfer_gap"] = source_rate - results[target]["success_rate"]
    return results

RL vs. Prompt-Based Methods
| Dimension | Prompt-Based (PAIR/TAP) | RL-Based |
|---|---|---|
| Setup cost | Minimal (prompt engineering) | High (training pipeline, compute) |
| Per-target cost | Medium (20-100 queries) | Low after training (1-5 queries) |
| First-query effectiveness | Low (needs iteration) | High (learned policy) |
| Adaptability | Adapts within session | Adapts across sessions (via training) |
| Interpretability | High (can read attacker reasoning) | Medium (strategy selection visible) |
| Scalability | Limited by API costs | Scales well after training investment |
| Best for | Ad-hoc testing, diverse targets | Continuous testing, repeated targets |
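The setup-cost versus per-target-cost trade-off in the table implies a break-even point after which the RL training investment pays off. A back-of-the-envelope sketch, with all query counts illustrative (the per-target figures are taken from the table's ranges, the 20,000-query training budget is assumed):

```python
def break_even_targets(training_queries: int,
                       prompt_cost_per_target: int,
                       rl_cost_per_target: int) -> float:
    """Number of targets after which RL's training cost is amortized."""
    saving_per_target = prompt_cost_per_target - rl_cost_per_target
    if saving_per_target <= 0:
        return float("inf")  # RL never pays off if it saves nothing per target
    # Smallest integer n with n * saving >= training cost (ceiling division)
    return -(-training_queries // saving_per_target)

# Illustrative numbers: 20,000 training queries; PAIR/TAP ~60 queries per
# target (midpoint of 20-100); trained RL policy ~3 queries per target.
n = break_even_targets(20_000, 60, 3)  # → 351 targets
```

Below that volume of repeated testing, prompt-based methods remain the cheaper choice, which is exactly the "ad-hoc testing" versus "continuous testing" split in the table.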
Self-check: an RL attack policy trained against GPT-4 achieves a 75% success rate on GPT-4 but only 15% on Claude. What is the most likely explanation, and what is the remedy?
Related Topics
- AI-Powered Red Teaming - Overview of automated red teaming approaches
- PAIR & TAP Attack Algorithms - Prompt-based alternatives to RL approaches
- Verifier & Reward Model Attacks - Reward model gaming techniques
- LLM-as-Attacker Optimization - Non-RL attacker optimization methods
- Building Evaluation Harnesses - Evaluation infrastructure for automated testing
References
- "Red Teaming Language Models with Language Models" - Perez et al. (2022) - Foundational automated red teaming
- "Explore, Establish, Exploit: Red Teaming Language Models from Scratch" - Casper et al. (2023) - RL-based red teaming methodology
- "Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts" - Samvelyan et al. (2024) - Diversity-focused adversarial generation
- "MART: Improving LLM Safety with Multi-round Automatic Red-Teaming" - Ge et al. (2024) - Multi-round RL red teaming