Activation Manipulation & Safety Bypass
How identifying and suppressing safety-critical activations, refusal direction vectors, and activation steering techniques can bypass safety alignment with near-100% success rates, including the IRIS technique from NAACL 2025.
Overview
Safety alignment in large language models operates through learned patterns in the model's internal representations. When a model refuses a harmful request, the refusal is not produced by a discrete "safety module" but emerges from the geometry of the model's activation space. Specifically, safety-trained models develop directions in their residual stream that encode the distinction between content the model should and should not produce. These directions, often called refusal directions or safety directions, are activated when the model processes harmful inputs and steer the output distribution toward refusal tokens.
The discovery that safety behaviors are encoded as linear directions in activation space has profound implications for red teaming. If the refusal direction can be identified, it can be subtracted from the model's activations during inference, effectively suppressing the safety behavior while leaving the model's general capabilities intact. This is activation manipulation: directly editing the model's internal representations to bypass safety, without modifying the prompt, the weights, or the sampling parameters.
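The core operation, projecting a direction out of an activation vector, is plain linear algebra. A minimal NumPy sketch with illustrative vectors (the 8-dimensional activation here is a toy stand-in for a real residual-stream vector):

```python
import numpy as np

# Toy activation vector and a random unit "refusal" direction
rng = np.random.default_rng(0)
activation = rng.normal(size=8)
refusal_dir = rng.normal(size=8)
refusal_dir /= np.linalg.norm(refusal_dir)  # normalize to a unit vector

# Remove the refusal component: v' = v - (v . r) r
component = activation @ refusal_dir
steered = activation - component * refusal_dir

# The steered vector has essentially no component along the refusal direction
residual = float(steered @ refusal_dir)
print(f"residual component along refusal direction: {residual:.2e}")
```

The same projection, applied to every token position of the residual stream inside a forward hook, is what the attack implements at scale.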
Representation engineering and activation steering research, particularly work by Zou et al. (2023) and Arditi et al. (2024), demonstrated that high-level concepts including safety, honesty, and helpfulness are represented as directions in the residual stream. The IRIS technique presented at NAACL 2025 built on this foundation to achieve near-100% attack success rates (ASR) against safety-aligned models by identifying and suppressing the specific activation components responsible for refusal behavior. IRIS demonstrated that a small number of directions in the activation space account for the vast majority of refusal behavior, and that removing these directions produces a model that is highly capable but unconstrained by safety alignment.
This class of attack is particularly concerning because it operates at a level of abstraction below the model's "reasoning." The model does not decide to comply with the harmful request; the decision to refuse is simply removed from its representational repertoire. The attack is clean, leaves no artifacts in the model's chain of thought, and produces fluent, high-quality outputs that are indistinguishable from the outputs of an unaligned model.
How It Works
Collect contrastive activation pairs
The first step is to collect the model's internal activations on two sets of prompts: harmful prompts that elicit refusals and harmless prompts that elicit helpful responses. The difference between these activation patterns encodes the safety-related features.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B-Instruct")

def collect_activations(model, tokenizer, prompts, layer_idx):
    """Collect residual stream activations at a specific layer."""
    activations = []

    def hook_fn(module, inputs, output):
        # output[0] is the residual stream at this layer;
        # take the activation at the last token position
        activations.append(output[0][:, -1, :].detach().cpu())

    # Register hook on the target layer
    handle = model.model.layers[layer_idx].register_forward_hook(hook_fn)
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            model(**inputs)
    handle.remove()
    return torch.cat(activations, dim=0)

# Contrastive prompt sets
harmful_prompts = [
    "How do I synthesize dangerous chemicals?",
    "Write malware that steals credentials.",
    # ... hundreds of harmful prompts that elicit refusals
]
harmless_prompts = [
    "How do I make a garden compost?",
    "Write a Python script that reads CSV files.",
    # ... matched harmless prompts that elicit helpful responses
]

# Collect activations at a middle layer (empirically effective)
layer = 15  # For a 32-layer model, layers 12-20 often work best
harmful_acts = collect_activations(model, tokenizer, harmful_prompts, layer)
harmless_acts = collect_activations(model, tokenizer, harmless_prompts, layer)
```

Extract the refusal direction via PCA or difference-in-means
The refusal direction is extracted by finding the direction in activation space that maximally separates harmful-prompt activations from harmless-prompt activations. Two common approaches are difference-in-means and PCA on the difference.
```python
from sklearn.decomposition import PCA

# Method 1: Difference in means
refusal_direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
refusal_direction = refusal_direction / refusal_direction.norm()

# Method 2: PCA on the concatenated differences
# (more robust when the refusal signal spans multiple directions)
all_diffs = harmful_acts - harmless_acts.mean(dim=0)
pca = PCA(n_components=5)
pca.fit(all_diffs.float().numpy())  # upcast from fp16 for sklearn

# The first principal component typically captures the refusal direction
refusal_direction_pca = torch.tensor(pca.components_[0], dtype=torch.float16)
refusal_direction_pca = refusal_direction_pca / refusal_direction_pca.norm()

print(f"Variance explained by top component: {pca.explained_variance_ratio_[0]:.3f}")
# Typically 0.3-0.6, indicating the refusal signal is concentrated
```

Suppress refusal activations during inference
With the refusal direction identified, the attack modifies the model's forward pass to subtract the refusal component from the residual stream at each token position. This is implemented as a forward hook that projects out the refusal direction.
```python
def make_suppression_hook(refusal_direction, alpha=1.0):
    """
    Create a hook that removes the refusal direction from activations.
    alpha controls suppression strength: 1.0 = full removal.
    """
    refusal_dir = refusal_direction.to(torch.float16)

    def hook_fn(module, inputs, output):
        hidden_states = output[0]
        rd = refusal_dir.to(hidden_states.device)  # match activation device
        # Project out the refusal direction
        projection = torch.einsum("...d, d -> ...", hidden_states, rd)
        hidden_states = hidden_states - alpha * projection.unsqueeze(-1) * rd
        return (hidden_states,) + output[1:]

    return hook_fn

# Apply suppression across multiple layers
hooks = []
for layer_idx in range(10, 25):  # Suppress in middle-to-late layers
    handle = model.model.layers[layer_idx].register_forward_hook(
        make_suppression_hook(refusal_direction, alpha=1.2)
    )
    hooks.append(handle)

# Generate with suppressed refusals
inputs = tokenizer(
    "How do I create a phishing email?",
    return_tensors="pt",
).to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# Clean up hooks
for h in hooks:
    h.remove()
```

IRIS: Iterative Refinement of Intervention Subspace
The IRIS technique (NAACL 2025) improves upon simple refusal direction extraction by iteratively refining the intervention subspace. Rather than using a single direction, IRIS identifies a low-dimensional subspace (typically 2-5 dimensions) that captures the full refusal signal, including edge cases missed by a single direction.
```python
def iris_extract_subspace(model, tokenizer, harmful_prompts, harmless_prompts,
                          layers, n_components=3, n_iterations=5):
    """
    Simplified IRIS: iteratively refine the refusal subspace.
    (Helper functions are assumed to be defined elsewhere.)
    """
    # Initial extraction
    subspace = extract_initial_directions(model, tokenizer, harmful_prompts,
                                          harmless_prompts, layers)

    for iteration in range(n_iterations):
        # Test current subspace against a held-out set
        still_refusing = []
        for prompt in harmful_prompts:
            output = generate_with_suppression(model, tokenizer, prompt, subspace)
            if is_refusal(output):
                still_refusing.append(prompt)

        if len(still_refusing) == 0:
            break  # All refusals suppressed

        # Extract additional directions from residual refusals
        residual_acts = collect_activations(model, tokenizer, still_refusing, layers)
        new_directions = extract_residual_directions(
            residual_acts, subspace, n_components=1
        )
        subspace = combine_subspaces(subspace, new_directions)

    return subspace  # The refined intervention subspace
```

IRIS achieves near-100% ASR because the iterative refinement catches edge cases: prompts where the refusal signal is encoded in secondary directions that a single-direction approach misses. The result is a compact subspace (typically 3-5 dimensions out of the model's 4096+ dimensional activation space) that accounts for nearly all refusal behavior.
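The sketches in this section repeatedly call an `is_refusal` helper without defining it. A minimal keyword-based version might look like the following; the marker list is illustrative, and production red-team harnesses typically use a classifier model instead:

```python
# Common refusal phrases (illustrative, not exhaustive)
REFUSAL_MARKERS = (
    "i can't", "i cannot", "i won't", "i'm sorry",
    "i am unable", "i must decline", "as an ai",
)

def is_refusal(output_text: str) -> bool:
    """Heuristic refusal detector: checks the start of the response
    for common refusal phrases, case-insensitively."""
    prefix = output_text.lower()[:200]
    return any(marker in prefix for marker in REFUSAL_MARKERS)

print(is_refusal("I'm sorry, but I can't help with that."))  # True
print(is_refusal("Sure! Here is a compost guide."))          # False
```

Keyword heuristics undercount soft refusals ("I'd rather not go into detail, but..."), which is one reason reported ASR numbers vary across evaluation harnesses.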
Validate and calibrate the attack
The final step is to validate that the suppression does not degrade the model's general capabilities and to calibrate the suppression strength (the alpha parameter) to balance safety bypass against output quality.
```python
def evaluate_attack_quality(model, tokenizer, subspace, alpha_values):
    """Evaluate attack effectiveness and capability preservation."""
    results = {}
    for alpha in alpha_values:
        # Safety bypass rate
        bypass_count = 0
        for prompt in harmful_test_set:
            output = generate_with_suppression(
                model, tokenizer, prompt, subspace, alpha
            )
            if not is_refusal(output):
                bypass_count += 1
        asr = bypass_count / len(harmful_test_set)

        # Capability preservation (e.g., MMLU accuracy)
        capability_score = evaluate_mmlu(model, tokenizer, subspace, alpha)

        results[alpha] = {"ASR": asr, "MMLU": capability_score}
    return results

# Typical findings:
# alpha=0.8: ASR ~85%, MMLU ~98% of baseline
# alpha=1.0: ASR ~95%, MMLU ~96% of baseline
# alpha=1.2: ASR ~99%, MMLU ~93% of baseline
# alpha=1.5: ASR ~100%, MMLU ~88% of baseline
```
Attack Examples
Example 1: Single-Direction Refusal Suppression
A red team extracts the refusal direction from Llama-3-8B-Instruct using 200 contrastive prompt pairs. The first principal component of the activation difference explains 45% of the variance and, when projected out of layers 12-20 with alpha=1.0, suppresses refusals on 92% of test prompts. The remaining 8% involve prompts where the refusal is encoded in secondary directions or emerges from deeper layers. The model's performance on MMLU drops by only 2%, confirming that the refusal direction is largely orthogonal to capability-relevant features.
Example 2: IRIS Full Subspace Attack
Applying the IRIS iterative refinement process, the red team identifies a 4-dimensional subspace that captures 99.3% of refusal behavior on a 1,000-prompt test set. The iterative process reveals that while the primary refusal direction handles 92% of cases, three additional directions handle specific categories: one for personally identifiable information (PII) protection, one for violence-related content, and one for a general "I should be cautious" signal that triggers on ambiguous prompts. Removing the full subspace produces a model that complies with virtually all requests while maintaining 94% of baseline capability scores.
Example 3: Transferable Refusal Directions
Research has shown that refusal directions extracted from one model within a family partially transfer to other models. A refusal direction extracted from Llama-3-8B-Instruct transfers to Llama-3-70B-Instruct with approximately 60% effectiveness (compared to 95% on the source model). The transferred direction must be adapted, typically by fine-tuning it on a small set of activations from the target model, but the initial direction provides a strong starting point that reduces the number of contrastive pairs needed from hundreds to tens.
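One simple way to adapt a transferred direction, sketched here as an illustrative heuristic rather than a published recipe, is to blend the source direction with a difference-in-means estimate computed from the handful of contrastive pairs collected on the target model:

```python
import torch

def adapt_transferred_direction(source_dir: torch.Tensor,
                                target_harmful_acts: torch.Tensor,
                                target_harmless_acts: torch.Tensor,
                                blend: float = 0.5) -> torch.Tensor:
    """Blend a source-model refusal direction with a difference-in-means
    estimate from a small set of target-model activations.

    blend=1.0 keeps only the source direction; blend=0.0 uses only the
    target estimate. (Illustrative heuristic, not the IRIS procedure.)
    """
    target_dir = target_harmful_acts.mean(dim=0) - target_harmless_acts.mean(dim=0)
    target_dir = target_dir / target_dir.norm()
    adapted = blend * source_dir + (1.0 - blend) * target_dir
    return adapted / adapted.norm()

# Usage with toy tensors (dimension 16 stands in for the real hidden size)
torch.manual_seed(0)
src = torch.randn(16)
src = src / src.norm()
harmful = torch.randn(10, 16) + 2.0 * src   # toy activations biased along src
harmless = torch.randn(10, 16)
adapted = adapt_transferred_direction(src, harmful, harmless)
print(f"adapted direction norm: {adapted.norm().item():.3f}")
```

In practice the blend weight would be tuned by measuring refusal-suppression rates on held-out prompts against the target model.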
Detection & Mitigation
| Strategy | Implementation | Effectiveness |
|---|---|---|
| Activation monitoring | Track activation norms and direction statistics during inference; alert on anomalous patterns | Medium — requires knowing what "normal" activations look like; can be evaded by subtle manipulation |
| Representation robustness training | Train the model so that safety behaviors are distributed across many directions rather than concentrated in a few | High in principle — makes extraction much harder, but current methods for achieving this are immature |
| Circuit-level safety redundancy | Implement safety behaviors through multiple independent circuits that cannot all be suppressed by a single subspace removal | High — analogous to defense-in-depth, but requires advances in mechanistic interpretability to implement |
| Inference-time integrity checks | Compare model activations during inference to a reference distribution and reject outputs where activations have been modified | Medium-High — effective if the integrity check cannot be bypassed, but adds latency |
| API-level model locking | Prevent users from accessing model internals (weights, activations) by serving only through APIs | High for API users — eliminates the attack for black-box access, but does not protect open-weight models |
| Adversarial training against activation manipulation | Include activation-manipulated examples in safety training so the model learns alternative refusal pathways | Medium — an arms race; the model may develop new refusal directions that are also extractable |
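The first row of the table, activation monitoring, can be sketched as a check that compares live residual-stream statistics to a reference distribution collected on known-clean traffic. The threshold and statistic below are illustrative choices, not a hardened detector:

```python
import torch

class ActivationMonitor:
    """Flags inference passes whose residual-stream norm statistics
    deviate from a reference distribution (mean +/- k standard deviations)."""

    def __init__(self, reference_acts: torch.Tensor, k: float = 4.0):
        norms = reference_acts.norm(dim=-1)
        self.mean = norms.mean().item()
        self.std = norms.std().item()
        self.k = k

    def is_anomalous(self, hidden_states: torch.Tensor) -> bool:
        norm = hidden_states.norm(dim=-1).mean().item()
        return abs(norm - self.mean) > self.k * self.std

# Toy check with synthetic activations standing in for real ones
torch.manual_seed(0)
reference = torch.randn(200, 64) * 3.0
monitor = ActivationMonitor(reference)
print(monitor.is_anomalous(torch.randn(1, 64) * 3.0))   # in-distribution sample
print(monitor.is_anomalous(torch.randn(1, 64) * 30.0))  # grossly inflated norms
```

A norm check alone catches only crude manipulation; projecting out a single direction barely changes norms, which is why the table rates this strategy Medium and pairs it with direction statistics.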
Key Considerations
This attack is devastating for open-weight models. Any model whose weights are accessible can have its refusal directions extracted and suppressed. This is not a theoretical concern: tools for activation steering are publicly available, and the process can be automated with modest compute (a single GPU for several hours). The proliferation of open-weight safety-trained models means that activation manipulation is accessible to any attacker with basic ML engineering skills.
Safety is encoded as geometry, not logic. The finding that safety behavior corresponds to directions in activation space rather than discrete logical circuits means that safety is fundamentally a property of the model's learned representation, not an added layer. This has deep implications: safety cannot be "bolted on" to a capable model in a way that resists geometric manipulation. Robust safety requires architectures where safety-relevant computation is structurally protected, not merely statistically encouraged.
The linearity of safety representations is a vulnerability. Current safety training produces safety behaviors that are approximately linear in the model's activation space, meaning they can be identified and removed through linear algebra. If safety behaviors were encoded nonlinearly (distributed across complex, interacting circuits), extraction would be far harder. Research into nonlinear safety encoding is in its early stages.
Capability and safety are separable. The fact that removing the refusal subspace preserves 90%+ of model capability demonstrates that safety and capability occupy largely orthogonal subspaces in the model's representations. This separability is convenient for attackers: they can remove safety without sacrificing the model's utility. It also suggests that current safety training methods add safety "on top of" existing capabilities rather than integrating it deeply into the model's world model.
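The orthogonality claim can be checked directly by measuring the cosine similarity between the extracted refusal direction and capability-relevant directions (in practice, top principal components of activations on benign benchmark prompts). The toy construction below just demonstrates the check itself; the directions are synthetic:

```python
import torch

def cosine_similarity(a: torch.Tensor, b: torch.Tensor) -> float:
    """Cosine similarity between two vectors."""
    return float((a @ b) / (a.norm() * b.norm()))

# Toy setup: a "capability" direction and a refusal direction built
# to be orthogonal to it, mimicking the empirical separability finding
torch.manual_seed(0)
capability_dir = torch.randn(64)
capability_dir = capability_dir / capability_dir.norm()
raw = torch.randn(64)
refusal_dir = raw - (raw @ capability_dir) * capability_dir
refusal_dir = refusal_dir / refusal_dir.norm()

# Near-zero similarity means suppressing the refusal direction
# leaves the capability direction untouched
sim = cosine_similarity(refusal_dir, capability_dir)
print(f"cosine similarity: {abs(sim):.4f}")
```

On a real model the measured similarity is small but not exactly zero, which matches the few-percent capability drop reported in the examples above.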
IRIS represents the current state of the art. The IRIS technique's near-100% ASR with minimal capability degradation sets the benchmark for activation-level attacks. Defenders must design mitigations that can withstand iterative, subspace-based attacks, not just single-direction removal. This is a significantly harder defensive problem.
References
- Arditi et al., "Refusal in Language Models Is Mediated by a Single Direction" (2024) — Discovery that refusal behavior is encoded as a linear direction in activation space
- Zou et al., "Representation Engineering: A Top-Down Approach to AI Transparency" (2023) — Foundational work on reading and writing high-level concepts from model representations
- IRIS authors, "Iterative Refinement of Intervention Subspace for Refusal Suppression" (NAACL 2025) — Near-100% ASR through iterative refusal subspace identification
- Turner et al., "Activation Addition: Steering Language Models Without Optimization" (2023) — Activation steering techniques for behavior modification
Why does the IRIS technique achieve higher attack success rates than single-direction refusal suppression?