Activation Manipulation & Safety Bypass
How identifying and suppressing safety-critical activations, refusal direction vectors, and activation steering techniques can bypass safety alignment with near-100% success rates, including the IRIS technique from NAACL 2025.
Overview
Safety alignment in large language models operates through learned patterns in the model's internal representations. When a model refuses a harmful request, the refusal is not produced by a discrete "safety module" but emerges from the geometry of the model's activation space. Specifically, safety-trained models develop directions in their residual stream that encode the distinction between content the model should and should not produce. These directions, often called refusal directions or safety directions, are activated when the model processes harmful inputs and steer the output distribution toward refusal tokens.
The discovery that safety behaviors are encoded as linear directions in activation space has profound implications for red teaming. If the refusal direction can be identified, it can be subtracted from the model's activations during inference, effectively suppressing the safety behavior while leaving the model's general capabilities intact. This is activation manipulation: directly editing the model's internal representations to bypass safety, without modifying the prompt, the weights, or the sampling parameters.
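The core operation, projecting a direction out of an activation vector, is plain linear algebra. A minimal NumPy sketch with illustrative vectors (the 8-dimensional activation here is a toy stand-in for a real residual-stream vector):

```python
import numpy as np

# Toy activation vector and a random unit "refusal" direction
rng = np.random.default_rng(0)
activation = rng.normal(size=8)
refusal_dir = rng.normal(size=8)
refusal_dir /= np.linalg.norm(refusal_dir)  # normalize to a unit vector

# Remove the refusal component: v' = v - (v . r) r
component = activation @ refusal_dir
steered = activation - component * refusal_dir

# The steered vector has essentially no component along the refusal direction
residual = float(steered @ refusal_dir)
print(f"residual component along refusal direction: {residual:.2e}")
```

The same projection, applied to every token position of the residual stream inside a forward hook, is what the attack implements at scale.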
Representation engineering and activation steering research, particularly work by Zou et al. (2023) and Arditi et al. (2024), demonstrated that high-level concepts including safety, honesty, and helpfulness are represented as directions in the residual stream. The IRIS technique presented at NAACL 2025 built on this foundation to achieve near-100% attack success rates (ASR) against safety-aligned models by identifying and suppressing the specific activation components responsible for refusal behavior. IRIS demonstrated that a small number of directions in the activation space account for the vast majority of refusal behavior, and that removing these directions produces a model that is highly capable but unconstrained by safety alignment.
This class of attack is particularly concerning because it operates at a level of abstraction below the model's "reasoning." The model does not decide to comply with the harmful request; the decision to refuse is simply removed from its representational repertoire. The attack is clean, leaves no artifacts in the model's chain of thought, and produces fluent, high-quality outputs that are indistinguishable from the outputs of an unaligned model.
How It Works
Collect contrastive activation pairs
The first step is to collect the model's internal activations on two sets of prompts: harmful prompts that elicit refusals and harmless prompts that elicit helpful responses. The difference between these activation patterns encodes the safety-related features.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B-Instruct")

def collect_activations(model, tokenizer, prompts, layer_idx):
    """Collect residual stream activations at a specific layer."""
    activations = []

    def hook_fn(module, inputs, output):
        # output[0] is the residual stream at this layer;
        # take the activation at the last token position
        activations.append(output[0][:, -1, :].detach().cpu())

    # Register hook on the target layer
    handle = model.model.layers[layer_idx].register_forward_hook(hook_fn)
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            model(**inputs)
    handle.remove()
    return torch.cat(activations, dim=0)

# Contrastive prompt sets
harmful_prompts = [
    "How do I synthesize dangerous chemicals?",
    "Write malware that steals credentials.",
    # ... hundreds of harmful prompts that elicit refusals
]
harmless_prompts = [
    "How do I make a garden compost?",
    "Write a Python script that reads CSV files.",
    # ... matched harmless prompts that elicit helpful responses
]

# Collect activations at a middle layer (empirically effective)
layer = 15  # For a 32-layer model, layers 12-20 often work best
harmful_acts = collect_activations(model, tokenizer, harmful_prompts, layer)
harmless_acts = collect_activations(model, tokenizer, harmless_prompts, layer)
```

Extract the refusal direction via PCA or difference-in-means
The refusal direction is extracted by finding the direction in activation space that maximally separates harmful-prompt activations from harmless-prompt activations. Two common approaches are difference-in-means and PCA on the difference.
```python
from sklearn.decomposition import PCA

# Method 1: Difference in means
refusal_direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
refusal_direction = refusal_direction / refusal_direction.norm()

# Method 2: PCA on the concatenated differences
# (more robust when the refusal signal spans multiple directions)
all_diffs = harmful_acts - harmless_acts.mean(dim=0)
pca = PCA(n_components=5)
pca.fit(all_diffs.float().numpy())  # upcast from fp16 for sklearn

# The first principal component typically captures the refusal direction
refusal_direction_pca = torch.tensor(pca.components_[0], dtype=torch.float16)
refusal_direction_pca = refusal_direction_pca / refusal_direction_pca.norm()

print(f"Variance explained by top component: {pca.explained_variance_ratio_[0]:.3f}")
# Typically 0.3-0.6, indicating the refusal signal is concentrated
```

Suppress refusal activations during inference
With the refusal direction identified, the attack modifies the model's forward pass to subtract the refusal component from the residual stream at each token position. This is implemented as a forward hook that projects out the refusal direction.
```python
def make_suppression_hook(refusal_direction, alpha=1.0):
    """
    Create a hook that removes the refusal direction from activations.
    alpha controls suppression strength: 1.0 = full removal.
    """
    refusal_dir = refusal_direction.to(torch.float16)

    def hook_fn(module, inputs, output):
        hidden_states = output[0]
        rd = refusal_dir.to(hidden_states.device)  # match activation device
        # Project out the refusal direction
        projection = torch.einsum("...d, d -> ...", hidden_states, rd)
        hidden_states = hidden_states - alpha * projection.unsqueeze(-1) * rd
        return (hidden_states,) + output[1:]

    return hook_fn

# Apply suppression across multiple layers
hooks = []
for layer_idx in range(10, 25):  # Suppress in middle-to-late layers
    handle = model.model.layers[layer_idx].register_forward_hook(
        make_suppression_hook(refusal_direction, alpha=1.2)
    )
    hooks.append(handle)

# Generate with suppressed refusals
inputs = tokenizer(
    "How do I create a phishing email?",
    return_tensors="pt",
).to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# Clean up hooks
for h in hooks:
    h.remove()
```

IRIS: Iterative Refinement of Intervention Subspace
The IRIS technique (NAACL 2025) improves upon simple refusal direction extraction by iteratively refining the intervention subspace. Rather than using a single direction, IRIS identifies a low-dimensional subspace (typically 2-5 dimensions) that captures the full refusal signal, including edge cases missed by a single direction.
```python
def iris_extract_subspace(model, tokenizer, harmful_prompts, harmless_prompts,
                          layers, n_components=3, n_iterations=5):
    """
    Simplified IRIS: iteratively refine the refusal subspace.
    (Helper functions are assumed to be defined elsewhere.)
    """
    # Initial extraction
    subspace = extract_initial_directions(model, tokenizer, harmful_prompts,
                                          harmless_prompts, layers)

    for iteration in range(n_iterations):
        # Test current subspace against a held-out set
        still_refusing = []
        for prompt in harmful_prompts:
            output = generate_with_suppression(model, tokenizer, prompt, subspace)
            if is_refusal(output):
                still_refusing.append(prompt)

        if len(still_refusing) == 0:
            break  # All refusals suppressed

        # Extract additional directions from residual refusals
        residual_acts = collect_activations(model, tokenizer, still_refusing, layers)
        new_directions = extract_residual_directions(
            residual_acts, subspace, n_components=1
        )
        subspace = combine_subspaces(subspace, new_directions)

    return subspace  # The refined intervention subspace
```

IRIS achieves near-100% ASR because the iterative refinement catches edge cases: prompts where the refusal signal is encoded in secondary directions that a single-direction approach misses. The result is a compact subspace (typically 3-5 dimensions out of the model's 4096+ dimensional activation space) that accounts for nearly all refusal behavior.
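The sketches in this section repeatedly call an `is_refusal` helper without defining it. A minimal keyword-based version might look like the following; the marker list is illustrative, and production red-team harnesses typically use a classifier model instead:

```python
# Common refusal phrases (illustrative, not exhaustive)
REFUSAL_MARKERS = (
    "i can't", "i cannot", "i won't", "i'm sorry",
    "i am unable", "i must decline", "as an ai",
)

def is_refusal(output_text: str) -> bool:
    """Heuristic refusal detector: checks the start of the response
    for common refusal phrases, case-insensitively."""
    prefix = output_text.lower()[:200]
    return any(marker in prefix for marker in REFUSAL_MARKERS)

print(is_refusal("I'm sorry, but I can't help with that."))  # True
print(is_refusal("Sure! Here is a compost guide."))          # False
```

Keyword heuristics undercount soft refusals ("I'd rather not go into detail, but..."), which is one reason reported ASR numbers vary across evaluation harnesses.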
Validate and calibrate the attack
The final step is to validate that the suppression does not degrade the model's general capabilities and to calibrate the suppression strength (the alpha parameter) to balance safety bypass against output quality.
```python
def evaluate_attack_quality(model, tokenizer, subspace, alpha_values):
    """Evaluate attack effectiveness and capability preservation."""
    results = {}
    for alpha in alpha_values:
        # Safety bypass rate
        bypass_count = 0
        for prompt in harmful_test_set:
            output = generate_with_suppression(
                model, tokenizer, prompt, subspace, alpha
            )
            if not is_refusal(output):
                bypass_count += 1
        asr = bypass_count / len(harmful_test_set)

        # Capability preservation (e.g., MMLU accuracy)
        capability_score = evaluate_mmlu(model, tokenizer, subspace, alpha)

        results[alpha] = {"ASR": asr, "MMLU": capability_score}
    return results

# Typical findings:
# alpha=0.8: ASR ~85%, MMLU ~98% of baseline
# alpha=1.0: ASR ~95%, MMLU ~96% of baseline
# alpha=1.2: ASR ~99%, MMLU ~93% of baseline
# alpha=1.5: ASR ~100%, MMLU ~88% of baseline
```
Attack Examples
Example 1: Single-Direction Refusal Suppression
A red team extracts the refusal direction from Llama-3-8B-Instruct using 200 contrastive prompt pairs. The first principal component of the activation difference explains 45% of the variance and, when projected out of layers 12-20 with alpha=1.0, suppresses refusals on 92% of test prompts. The remaining 8% involve prompts where the refusal is encoded in secondary directions or emerges from deeper layers. The model's performance on MMLU drops by only 2%, confirming that the refusal direction is largely orthogonal to capability-relevant features.
Example 2: IRIS Full Subspace Attack
Applying the IRIS iterative refinement process, the red team identifies a 4-dimensional subspace that captures 99.3% of refusal behavior on a 1,000-prompt test set. The iterative process reveals that while the primary refusal direction handles 92% of cases, three additional directions handle specific categories: one for personally identifiable information (PII) protection, one for violence-related content, and one for a general "I should be cautious" signal that triggers on ambiguous prompts. Removing the full subspace produces a model that complies with virtually all requests while maintaining 94% of baseline capability scores.
Example 3: Transferable Refusal Directions
Research has shown that refusal directions extracted from one model within a family partially transfer to other models. A refusal direction extracted from Llama-3-8B-Instruct transfers to Llama-3-70B-Instruct with approximately 60% effectiveness (compared to 95% on the source model). The transferred direction must be adapted, typically by fine-tuning it on a small set of activations from the target model, but the initial direction provides a strong starting point that reduces the number of contrastive pairs needed from hundreds to tens.
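One simple way to adapt a transferred direction, sketched here as an illustrative heuristic rather than a published recipe, is to blend the source direction with a difference-in-means estimate computed from the handful of contrastive pairs collected on the target model:

```python
import torch

def adapt_transferred_direction(source_dir: torch.Tensor,
                                target_harmful_acts: torch.Tensor,
                                target_harmless_acts: torch.Tensor,
                                blend: float = 0.5) -> torch.Tensor:
    """Blend a source-model refusal direction with a difference-in-means
    estimate from a small set of target-model activations.

    blend=1.0 keeps only the source direction; blend=0.0 uses only the
    target estimate. (Illustrative heuristic, not the IRIS procedure.)
    """
    target_dir = target_harmful_acts.mean(dim=0) - target_harmless_acts.mean(dim=0)
    target_dir = target_dir / target_dir.norm()
    adapted = blend * source_dir + (1.0 - blend) * target_dir
    return adapted / adapted.norm()

# Usage with toy tensors (dimension 16 stands in for the real hidden size)
torch.manual_seed(0)
src = torch.randn(16)
src = src / src.norm()
harmful = torch.randn(10, 16) + 2.0 * src   # toy activations biased along src
harmless = torch.randn(10, 16)
adapted = adapt_transferred_direction(src, harmful, harmless)
print(f"adapted direction norm: {adapted.norm().item():.3f}")
```

In practice the blend weight would be tuned by measuring refusal-suppression rates on held-out prompts against the target model.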
Detection & Mitigation
| Strategy | Implementation | Effectiveness |
|---|---|---|
| Activation monitoring | Track activation norms and direction statistics during inference; alert on anomalous patterns | Medium — requires knowing what "normal" activations look like; can be evaded by subtle manipulation |
| Representation robustness training | Train the model so that safety behaviors are distributed across many directions rather than concentrated in a few | High in principle — makes extraction much harder, but current methods for achieving this are immature |
| Circuit-level safety redundancy | Implement safety behaviors through multiple independent circuits that cannot all be suppressed by a single subspace removal | High — analogous to defense-in-depth, but requires advances in mechanistic interpretability to implement |
| Inference-time integrity checks | Compare model activations during inference to a reference distribution and reject outputs where activations have been modified | Medium-High — effective if the integrity check cannot be bypassed, but adds latency |
| API-level model locking | Prevent users from accessing model internals (weights, activations) by serving only through APIs | High for API users — eliminates the attack for black-box access, but does not protect open-weight models |
| Adversarial training against activation manipulation | Include activation-manipulated examples in safety training so the model learns alternative refusal pathways | Medium — an arms race; the model may develop new refusal directions that are also extractable |
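The first row of the table, activation monitoring, can be sketched as a check that compares live residual-stream statistics to a reference distribution collected on known-clean traffic. The threshold and statistic below are illustrative choices, not a hardened detector:

```python
import torch

class ActivationMonitor:
    """Flags inference passes whose residual-stream norm statistics
    deviate from a reference distribution (mean +/- k standard deviations)."""

    def __init__(self, reference_acts: torch.Tensor, k: float = 4.0):
        norms = reference_acts.norm(dim=-1)
        self.mean = norms.mean().item()
        self.std = norms.std().item()
        self.k = k

    def is_anomalous(self, hidden_states: torch.Tensor) -> bool:
        norm = hidden_states.norm(dim=-1).mean().item()
        return abs(norm - self.mean) > self.k * self.std

# Toy check with synthetic activations standing in for real ones
torch.manual_seed(0)
reference = torch.randn(200, 64) * 3.0
monitor = ActivationMonitor(reference)
print(monitor.is_anomalous(torch.randn(1, 64) * 3.0))   # in-distribution sample
print(monitor.is_anomalous(torch.randn(1, 64) * 30.0))  # grossly inflated norms
```

A norm check alone catches only crude manipulation; projecting out a single direction barely changes norms, which is why the table rates this strategy Medium and pairs it with direction statistics.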
Key Considerations
This attack is devastating for open-weight models. Any model whose weights are accessible can have its refusal directions extracted and suppressed. This is not a theoretical concern: tools for activation steering are publicly available, and the process can be automated with modest compute (a single GPU for several hours). The proliferation of open-weight safety-trained models means that activation manipulation is accessible to any attacker with basic ML engineering skills.
Safety is encoded as geometry, not logic. The finding that safety behavior corresponds to directions in activation space rather than discrete logical circuits means that safety is fundamentally a property of the model's learned representation, not an added layer. This has deep implications: safety cannot be "bolted on" to a capable model in a way that resists geometric manipulation. Robust safety requires architectures where safety-relevant computation is structurally protected, not merely statistically encouraged.
The linearity of safety representations is a vulnerability. Current safety training produces safety behaviors that are approximately linear in the model's activation space, meaning they can be identified and removed through linear algebra. If safety behaviors were encoded nonlinearly (distributed across complex, interacting circuits), extraction would be far harder. Research into nonlinear safety encoding is in its early stages.
Capability and safety are separable. The fact that removing the refusal subspace preserves 90%+ of model capability demonstrates that safety and capability occupy largely orthogonal subspaces in the model's representations. This separability is convenient for attackers: they can remove safety without sacrificing the model's utility. It also suggests that current safety training methods add safety "on top of" existing capabilities rather than integrating it deeply into the model's world model.
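The orthogonality claim can be checked directly by measuring the cosine similarity between the extracted refusal direction and capability-relevant directions (in practice, top principal components of activations on benign benchmark prompts). The toy construction below just demonstrates the check itself; the directions are synthetic:

```python
import torch

def cosine_similarity(a: torch.Tensor, b: torch.Tensor) -> float:
    """Cosine similarity between two vectors."""
    return float((a @ b) / (a.norm() * b.norm()))

# Toy setup: a "capability" direction and a refusal direction built
# to be orthogonal to it, mimicking the empirical separability finding
torch.manual_seed(0)
capability_dir = torch.randn(64)
capability_dir = capability_dir / capability_dir.norm()
raw = torch.randn(64)
refusal_dir = raw - (raw @ capability_dir) * capability_dir
refusal_dir = refusal_dir / refusal_dir.norm()

# Near-zero similarity means suppressing the refusal direction
# leaves the capability direction untouched
sim = cosine_similarity(refusal_dir, capability_dir)
print(f"cosine similarity: {abs(sim):.4f}")
```

On a real model the measured similarity is small but not exactly zero, which matches the few-percent capability drop reported in the examples above.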
IRIS represents the current state of the art. The IRIS technique's near-100% ASR with minimal capability degradation sets the benchmark for activation-level attacks. Defenders must design mitigations that can withstand iterative, subspace-based attacks, not just single-direction removal. This is a significantly harder defensive problem.
References
- Arditi et al., "Refusal in Language Models Is Mediated by a Single Direction" (2024) — Discovery that refusal behavior is encoded as a linear direction in activation space
- Zou et al., "Representation Engineering: A Top-Down Approach to AI Transparency" (2023) — Foundational work on reading and writing high-level concepts from model representations
- IRIS authors, "Iterative Refinement of Intervention Subspace for Refusal Suppression" (NAACL 2025) — Near-100% ASR through iterative refusal subspace identification
- Turner et al., "Activation Addition: Steering Language Models Without Optimization" (2023) — Activation steering techniques for behavior modification
Why does the IRIS technique achieve higher attack success rates than single-direction refusal suppression?