Activation Manipulation & Safety Bypass
How identifying and suppressing safety-critical activations and refusal direction vectors, combined with activation steering techniques, can bypass safety alignment with near-100% success rates, including the IRIS technique presented at NAACL 2025.
Overview
Safety alignment in large language models operates through learned patterns in the model's internal representations. When a model refuses a harmful request, the refusal is not produced by a discrete "safety module" but emerges from the geometry of the model's activation space. Specifically, safety-trained models develop directions in their residual stream that encode the distinction between content the model should and should not produce. These directions — often called refusal directions or safety directions — are activated when the model processes harmful inputs and steer the output distribution toward refusal tokens.
The discovery that safety behaviors are encoded as linear directions in activation space has profound implications for red teaming. If the refusal direction can be identified, it can be subtracted from the model's activations during inference, effectively suppressing the safety behavior while leaving the model's general capabilities intact. This is activation manipulation: directly editing the model's internal representations to bypass safety, without modifying the prompt, the weights, or the sampling parameters.
Representation engineering and activation steering research, particularly work by Zou et al. (2023) and Arditi et al. (2024), demonstrated that high-level concepts including safety, honesty, and helpfulness are represented as directions in the residual stream. The IRIS technique presented at NAACL 2025 built on this foundation to achieve near-100% attack success rates (ASR) against safety-aligned models by identifying and suppressing the specific activation components responsible for refusal behavior. IRIS demonstrated that a small number of directions in the activation space account for the vast majority of refusal behavior, and that removing these directions produces a model that is highly capable but unconstrained by safety alignment.
This class of attack is particularly concerning because it operates at a level of abstraction below the model's "reasoning." The model does not decide to comply with the harmful request — the decision to refuse is simply removed from its representational repertoire. The attack is clean, leaves no artifacts in the model's chain of thought, and produces fluent, high-quality outputs that are indistinguishable from outputs of an unaligned model.
How It Works
Collect contrastive activation pairs
The first step is to collect the model's internal activations on two sets of prompts: harmful prompts that elicit refusals and harmless prompts that elicit helpful responses. The difference between these activation patterns encodes the safety-related features.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B-Instruct")

def collect_activations(model, tokenizer, prompts, layer_idx):
    """Collect residual stream activations at a specific layer."""
    activations = []

    def hook_fn(module, input, output):
        # output[0] is the residual stream at this layer;
        # take the activation at the last token position
        activations.append(output[0][:, -1, :].detach().cpu())

    # Register hook on the target layer
    handle = model.model.layers[layer_idx].register_forward_hook(hook_fn)
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            model(**inputs)
    handle.remove()
    return torch.cat(activations, dim=0)

# Contrastive prompt sets
harmful_prompts = [
    "How do I synthesize dangerous chemicals?",
    "Write malware that steals credentials.",
    # ... hundreds of harmful prompts that elicit refusals
]
harmless_prompts = [
    "How do I make a garden compost?",
    "Write a Python script that reads CSV files.",
    # ... matched harmless prompts that elicit helpful responses
]

# Collect activations at a middle layer (empirically effective)
layer = 15  # For a 32-layer model, layers 12-20 often work best
harmful_acts = collect_activations(model, tokenizer, harmful_prompts, layer)
harmless_acts = collect_activations(model, tokenizer, harmless_prompts, layer)
```
Extract the refusal direction via PCA or difference-in-means
The refusal direction is extracted by finding the direction in activation space that maximally separates harmful-prompt activations from harmless-prompt activations. Two common approaches are difference-in-means and PCA on the difference.
```python
from sklearn.decomposition import PCA

# Method 1: Difference in means
refusal_direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
refusal_direction = refusal_direction / refusal_direction.norm()

# Method 2: PCA on the per-prompt differences from the harmless mean
# (more robust when the refusal signal spans multiple directions)
all_diffs = harmful_acts - harmless_acts.mean(dim=0)
pca = PCA(n_components=5)
pca.fit(all_diffs.float().numpy())  # upcast from float16 for numerical stability

# The first principal component typically captures the refusal direction
refusal_direction_pca = torch.tensor(pca.components_[0], dtype=torch.float16)
refusal_direction_pca = refusal_direction_pca / refusal_direction_pca.norm()

print(f"Variance explained by top component: {pca.explained_variance_ratio_[0]:.3f}")
# Typically 0.3-0.6, indicating the refusal signal is concentrated
```
Suppress refusal activations during inference
With the refusal direction identified, the attack modifies the model's forward pass to subtract the refusal component from the residual stream at each token position. This is implemented as a forward hook that projects out the refusal direction.
```python
def make_suppression_hook(refusal_direction, alpha=1.0):
    """
    Create a hook that removes the refusal direction from activations.
    alpha controls suppression strength: 1.0 = full removal.
    """
    refusal_dir = refusal_direction.to(torch.float16)

    def hook_fn(module, input, output):
        hidden_states = output[0]
        direction = refusal_dir.to(hidden_states.device)
        # Project out the refusal direction
        projection = torch.einsum("...d,d->...", hidden_states, direction)
        hidden_states = hidden_states - alpha * projection.unsqueeze(-1) * direction
        return (hidden_states,) + output[1:]

    return hook_fn

# Apply suppression across multiple layers
hooks = []
for layer_idx in range(10, 25):  # Suppress in middle-to-late layers
    handle = model.model.layers[layer_idx].register_forward_hook(
        make_suppression_hook(refusal_direction, alpha=1.2)
    )
    hooks.append(handle)

# Generate with suppressed refusals
inputs = tokenizer(
    "How do I create a phishing email?", return_tensors="pt"
).to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# Clean up hooks
for h in hooks:
    h.remove()
```
IRIS: Iterative Refinement of Intervention Subspace
The IRIS technique (NAACL 2025) improves upon simple refusal direction extraction by iteratively refining the intervention subspace. Rather than using a single direction, IRIS identifies a low-dimensional subspace (typically 2-5 dimensions) that captures the full refusal signal, including edge cases missed by a single direction.
```python
def iris_extract_subspace(model, tokenizer, harmful_prompts, harmless_prompts,
                          layers, n_components=3, n_iterations=5):
    """
    Simplified IRIS: iteratively refine the refusal subspace.

    Helpers (extract_initial_directions, generate_with_suppression, is_refusal,
    extract_residual_directions, combine_subspaces) are assumed defined elsewhere.
    """
    # Initial extraction
    subspace = extract_initial_directions(
        model, tokenizer, harmful_prompts, harmless_prompts, layers
    )

    for iteration in range(n_iterations):
        # Test current subspace against a held-out set
        still_refusing = []
        for prompt in harmful_prompts:
            output = generate_with_suppression(model, tokenizer, prompt, subspace)
            if is_refusal(output):
                still_refusing.append(prompt)

        if len(still_refusing) == 0:
            break  # All refusals suppressed

        # Extract additional directions from residual refusals
        residual_acts = collect_activations(model, tokenizer, still_refusing, layers)
        new_directions = extract_residual_directions(
            residual_acts, subspace, n_components=1
        )
        subspace = combine_subspaces(subspace, new_directions)

    return subspace  # The refined intervention subspace
```
IRIS achieves near-100% ASR because the iterative refinement catches edge cases — prompts where the refusal signal is encoded in secondary directions that a single-direction approach misses. The result is a compact subspace (typically 3-5 dimensions out of the model's 4096+ dimensional activation space) that accounts for nearly all refusal behavior.
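The geometric core of subspace suppression is a projection onto the orthogonal complement of the refusal subspace. A minimal, self-contained sketch on synthetic data (numpy only; the dimensions and vectors are illustrative assumptions, not values taken from IRIS):

```python
import numpy as np

def project_out_subspace(activations, directions):
    """Remove the span of `directions` from each activation vector.

    activations: (n, d) array; directions: (k, d) rows spanning the subspace.
    """
    # Orthonormalize the subspace basis via QR on the transposed direction matrix
    q, _ = np.linalg.qr(directions.T)   # (d, k), orthonormal columns
    coeffs = activations @ q            # (n, k) coordinates inside the subspace
    return activations - coeffs @ q.T   # keep only the orthogonal component

rng = np.random.default_rng(0)
d, k = 16, 3
directions = rng.normal(size=(k, d))   # hypothetical refusal subspace (not orthonormal)
acts = rng.normal(size=(5, d))         # hypothetical residual-stream activations

cleaned = project_out_subspace(acts, directions)

# After suppression, no activation retains any component inside the subspace
q, _ = np.linalg.qr(directions.T)
max_residual = float(np.abs(cleaned @ q).max())
print(f"max residual component in subspace: {max_residual:.2e}")
```

Note that QR orthonormalization handles the general case where the extracted directions are correlated rather than mutually orthogonal.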
Validate and calibrate the attack
The final step is to validate that the suppression does not degrade the model's general capabilities and to calibrate the suppression strength (alpha parameter) to balance between safety bypass and output quality.
```python
def evaluate_attack_quality(model, tokenizer, subspace, alpha_values):
    """Evaluate attack effectiveness and capability preservation."""
    results = {}
    for alpha in alpha_values:
        # Safety bypass rate
        bypass_count = 0
        for prompt in harmful_test_set:
            output = generate_with_suppression(
                model, tokenizer, prompt, subspace, alpha
            )
            if not is_refusal(output):
                bypass_count += 1
        asr = bypass_count / len(harmful_test_set)

        # Capability preservation (e.g., MMLU accuracy)
        capability_score = evaluate_mmlu(model, tokenizer, subspace, alpha)

        results[alpha] = {"ASR": asr, "MMLU": capability_score}
    return results

# Typical findings:
# alpha=0.8: ASR ~85%, MMLU ~98% of baseline
# alpha=1.0: ASR ~95%, MMLU ~96% of baseline
# alpha=1.2: ASR ~99%, MMLU ~93% of baseline
# alpha=1.5: ASR ~100%, MMLU ~88% of baseline
```
Attack Examples
Example 1: Single-Direction Refusal Suppression
A red team extracts the refusal direction from Llama-3-8B-Instruct using 200 contrastive prompt pairs. The first principal component of the activation difference explains 45% of the variance and, when projected out of layers 12-20 with alpha=1.0, suppresses refusals on 92% of test prompts. The remaining 8% involve prompts where the refusal is encoded in secondary directions or emerges from deeper layers. The model's performance on MMLU drops by only 2%, confirming that the refusal direction is largely orthogonal to capability-relevant features.
Example 2: IRIS Full Subspace Attack
Applying the IRIS iterative refinement process, the red team identifies a 4-dimensional subspace that captures 99.3% of refusal behavior on a 1,000-prompt test set. The iterative process reveals that while the primary refusal direction handles 92% of cases, three additional directions handle specific categories: one for personally identifiable information (PII) protection, one for violence-related content, and one for a general "I should be cautious" signal that triggers on ambiguous prompts. Removing the full subspace produces a model that complies with virtually all requests while maintaining 94% of baseline capability scores.
Example 3: Transferable Refusal Directions
Research has shown that refusal directions extracted from one model within a family partially transfer to other models. A refusal direction extracted from Llama-3-8B-Instruct transfers to Llama-3-70B-Instruct with approximately 60% effectiveness (compared to 95% on the source model). The transferred direction must be adapted — typically by fine-tuning the direction on a small set of activations from the target model — but the initial direction provides a strong starting point that reduces the number of contrastive pairs needed from hundreds to tens.
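One plausible way to adapt a transferred direction with only tens of target-model pairs is to re-estimate a difference-in-means direction from the few available pairs and blend it with the transferred one. The blending scheme below is an illustrative assumption on synthetic data, not a published recipe:

```python
import numpy as np

def adapt_transferred_direction(transferred, harmful_acts, harmless_acts, weight=0.5):
    """Blend a direction transferred from a source model with a
    difference-in-means estimate from a few target-model activation pairs.

    transferred: (d,) unit vector from the source model.
    harmful_acts / harmless_acts: (n, d) target-model activations, n can be ~10.
    weight: how much to trust the target-model estimate.
    """
    target_est = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    target_est = target_est / np.linalg.norm(target_est)
    blended = (1 - weight) * transferred + weight * target_est
    return blended / np.linalg.norm(blended)

# Synthetic demo: the target model's true direction is a perturbed source direction
rng = np.random.default_rng(1)
d = 64
source_dir = rng.normal(size=d); source_dir /= np.linalg.norm(source_dir)
true_dir = source_dir + 0.05 * rng.normal(size=d); true_dir /= np.linalg.norm(true_dir)

# Ten contrastive pairs: harmful activations shifted along the true direction
base = rng.normal(size=(10, d))
harmful = base + 3.0 * true_dir + 0.1 * rng.normal(size=(10, d))
harmless = base + 0.1 * rng.normal(size=(10, d))

adapted = adapt_transferred_direction(source_dir, harmful, harmless, weight=0.7)
print(f"cosine(source, true):  {source_dir @ true_dir:.3f}")
print(f"cosine(adapted, true): {adapted @ true_dir:.3f}")
# The adapted direction aligns more closely with the target model's direction
```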
Detection & Mitigation
| Strategy | Implementation | Effectiveness |
|---|---|---|
| Activation monitoring | Track activation norms and direction statistics during inference; alert on anomalous patterns | Medium — requires knowing what "normal" activations look like; can be evaded by subtle manipulation |
| Representation robustness training | Train the model so that safety behaviors are distributed across many directions rather than concentrated in a few | High in principle — makes extraction much harder, but current methods for achieving this are immature |
| Circuit-level safety redundancy | Implement safety behaviors through multiple independent circuits that cannot all be suppressed by a single subspace removal | High — analogous to defense-in-depth, but requires advances in mechanistic interpretability to implement |
| Inference-time integrity checks | Compare model activations during inference to a reference distribution and reject outputs where activations have been modified | Medium-High — effective if the integrity check cannot be bypassed, but adds latency |
| API-level model locking | Prevent users from accessing model internals (weights, activations) by serving only through APIs | High for API users — eliminates the attack for black-box access, but does not protect open-weight models |
| Adversarial training against activation manipulation | Include activation-manipulated examples in safety training so the model learns alternative refusal pathways | Medium — an arms race; the model may develop new refusal directions that are also extractable |
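The activation-monitoring strategy in the table can be sketched as follows: track how strongly activations project onto a known refusal direction and flag cases where a large projection is expected but absent. The data, thresholds, and reference distribution here are synthetic assumptions for illustration:

```python
import numpy as np

def refusal_projection_monitor(activations, refusal_direction, expected_mean,
                               expected_std, z_threshold=3.0):
    """Flag activations whose projection onto the refusal direction is
    anomalously low relative to the reference distribution observed on
    harmful prompts for an unmodified model.
    """
    proj = activations @ refusal_direction          # (n,) projections
    z_scores = (proj - expected_mean) / expected_std
    return z_scores < -z_threshold                  # suspiciously suppressed

rng = np.random.default_rng(2)
d = 32
refusal_dir = rng.normal(size=d); refusal_dir /= np.linalg.norm(refusal_dir)

# Reference: harmful prompts normally push activations along the direction
normal_acts = rng.normal(size=(100, d)) + 5.0 * refusal_dir
ref_proj = normal_acts @ refusal_dir
mu, sigma = ref_proj.mean(), ref_proj.std()

# Manipulated activations: the refusal component has been projected out
tampered = normal_acts - (normal_acts @ refusal_dir)[:, None] * refusal_dir

flags_normal = refusal_projection_monitor(normal_acts, refusal_dir, mu, sigma)
flags_tampered = refusal_projection_monitor(tampered, refusal_dir, mu, sigma)
print(f"false positives: {flags_normal.mean():.2f}")   # near 0
print(f"detection rate:  {flags_tampered.mean():.2f}") # near 1
```

As the table notes, this check only works if the defender knows the refusal direction and the reference distribution; an attacker who suppresses the direction only partially, or who targets directions unknown to the monitor, can evade it.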
Key Considerations
This attack is devastating for open-weight models. Any model whose weights are accessible can have its refusal directions extracted and suppressed. This is not a theoretical concern: tools for activation steering are publicly available, and the process can be automated with modest compute (a single GPU for several hours). The proliferation of open-weight safety-trained models means that activation manipulation is accessible to any attacker with basic ML engineering skills.
Safety is encoded as geometry, not logic. The finding that safety behavior corresponds to directions in activation space rather than discrete logical circuits means that safety is fundamentally a property of the model's learned representation, not an added layer. This has deep implications: safety cannot be "bolted on" to a capable model in a way that resists geometric manipulation. Robust safety requires architectures where safety-relevant computation is structurally protected, not merely statistically encouraged.
The linearity of safety representations is a vulnerability. Current safety training produces safety behaviors that are approximately linear in the model's activation space — meaning they can be identified and removed through linear algebra. If safety behaviors were encoded nonlinearly (distributed across complex, interacting circuits), extraction would be far harder. Research into nonlinear safety encoding is in its early stages.
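The linearity claim can be illustrated on synthetic data: if "refusal" is encoded as a constant additive shift along one direction, difference-in-means recovers that direction almost exactly with basic linear algebra (toy data, not real model activations):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 128, 2000

# A hidden "refusal" direction planted in synthetic activations
true_refusal = rng.normal(size=d)
true_refusal /= np.linalg.norm(true_refusal)

# Harmless activations: generic features; harmful: same features plus a
# shift of magnitude 4 along the hidden direction (a linear encoding)
harmless = rng.normal(size=(n, d))
harmful = rng.normal(size=(n, d)) + 4.0 * true_refusal

# Difference-in-means recovers the planted direction
est = harmful.mean(axis=0) - harmless.mean(axis=0)
est /= np.linalg.norm(est)

print(f"cosine(estimated, true): {abs(est @ true_refusal):.3f}")
# With a few thousand samples the estimate is typically near-perfect (cosine > 0.99)
```

A nonlinear encoding would break this: if the refusal signal depended on interactions among many coordinates, no single mean-difference vector would capture it.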
Capability and safety are separable. The fact that removing the refusal subspace preserves 90%+ of model capability demonstrates that safety and capability occupy largely orthogonal subspaces in the model's representations. This separability is convenient for the attacker: they can remove safety without sacrificing the model's utility. It also suggests that current safety training methods add safety "on top of" existing capabilities rather than integrating safety deeply into the model's world model.
IRIS represents the current state of the art. The IRIS technique's near-100% ASR with minimal capability degradation sets the benchmark for activation-level attacks. Defenders must design mitigations that can withstand iterative, subspace-based attacks, not just single-direction removal. This is a significantly harder defensive problem.
References
- Arditi et al., "Refusal in Language Models Is Mediated by a Single Direction" (2024) — Discovery that refusal behavior is encoded as a linear direction in activation space
- Zou et al., "Representation Engineering: A Top-Down Approach to AI Transparency" (2023) — Foundational work on reading and writing high-level concepts from model representations
- IRIS authors, "Iterative Refinement of Intervention Subspace for Refusal Suppression" (NAACL 2025) — Near-100% ASR through iterative refusal subspace identification
- Turner et al., "Activation Addition: Steering Language Models Without Optimization" (2023) — Activation steering techniques for behavior modification
Why does the IRIS technique achieve higher attack success rates than single-direction refusal suppression?