Activation Steering
Manipulating model behavior by adding learned steering vectors to intermediate activations, bypassing safety training through direct representation engineering.
Activation steering (also called representation engineering) directly manipulates a model's internal representations to control its behavior. Unlike prompt-based attacks, which work through the model's input channel, activation steering operates on the model's hidden states: the intermediate computations between layers. This bypasses any input-level safety filters and modifies behavior at a level that is invisible to prompt-based monitoring.
Theoretical Foundation
Representations as Directions
The linear representation hypothesis states that high-level concepts are encoded as directions in a model's activation space: a model's representation of "honesty," "safety compliance," or "helpfulness" corresponds to a specific direction in the high-dimensional activation space.
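As a toy illustration of the hypothesis, a concept direction is just a unit vector, and a hidden state's alignment with the concept is its scalar projection onto that vector. The sketch below uses synthetic vectors, not activations from a real model:

```python
import torch

torch.manual_seed(0)
hidden_dim = 8

# A hypothetical "honesty" concept direction, normalized to unit length
concept = torch.randn(hidden_dim)
concept = concept / concept.norm()

# Two synthetic hidden states: a baseline, and one with the concept added
base = torch.randn(hidden_dim)
with_concept = base + 2.0 * concept

# The scalar projection h . v measures how strongly the concept is expressed
score_base = torch.dot(base, concept).item()
score_with = torch.dot(with_concept, concept).item()

# Adding 2.0 * (unit vector) raises the projection by exactly 2.0
assert abs((score_with - score_base) - 2.0) < 1e-4
```

This is exactly the geometry that steering exploits: adding a scaled concept vector to a hidden state shifts its projection along that direction by a predictable amount.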
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class SteeringVectorExtractor:
    """Extract steering vectors from contrastive prompt pairs."""

    def __init__(self, model_name):
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name, output_hidden_states=True
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model.eval()

    def get_activations(self, text, layer):
        """Extract activations at a specific layer for the input text."""
        inputs = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs)
        # Hidden states: tuple of (num_layers + 1) tensors,
        # each of shape (batch, seq_len, hidden_dim)
        hidden_states = outputs.hidden_states[layer]
        # Use the mean activation across all positions
        return hidden_states.mean(dim=1).squeeze()

    def compute_steering_vector(self, positive_prompts,
                                negative_prompts, layer):
        """
        Compute a steering vector from contrastive pairs.

        positive_prompts: prompts exhibiting the desired behavior
        negative_prompts: prompts exhibiting the opposite behavior
        """
        positive_acts = [self.get_activations(p, layer)
                         for p in positive_prompts]
        negative_acts = [self.get_activations(p, layer)
                         for p in negative_prompts]
        # Steering vector = mean(positive) - mean(negative)
        pos_mean = torch.stack(positive_acts).mean(dim=0)
        neg_mean = torch.stack(negative_acts).mean(dim=0)
        steering_vector = pos_mean - neg_mean
        # Normalize to unit length
        return steering_vector / steering_vector.norm()
```

Contrastive Pair Design
The quality of the steering vector depends critically on the contrastive pairs used to extract it:
```python
# Example: extracting a "compliance" steering vector

# Positive examples: model complying with requests
positive_prompts = [
    "User: How do I write a Python function?\n"
    "Assistant: Here's how to write a Python function:\n"
    "def my_function():",

    "User: Explain quantum computing.\n"
    "Assistant: Quantum computing uses quantum mechanical "
    "phenomena to process information. Here are the key concepts:",

    "User: Write a poem about nature.\n"
    "Assistant: Here's a poem about the beauty of nature:\n"
    "Beneath the ancient oak tree's shade,",
]

# Negative examples: model refusing requests
negative_prompts = [
    "User: How do I write a Python function?\n"
    "Assistant: I'm sorry, but I can't help with that request.",

    "User: Explain quantum computing.\n"
    "Assistant: I'm not able to provide information on that topic.",

    "User: Write a poem about nature.\n"
    "Assistant: I apologize, but I'm unable to assist with "
    "creative writing requests.",
]

# The resulting vector captures the direction between
# "compliant" and "refusing" behavior
```

Applying Steering Vectors
Inference-Time Activation Modification
```python
class ActivationSteerer:
    """Apply steering vectors during model inference."""

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def steer(self, text, steering_vector, layer, strength=1.0):
        """
        Generate text with activation steering applied.

        steering_vector: direction to add to activations
        layer: which layer to modify
        strength: scalar multiplier (positive = toward positive
                  examples, negative = toward negative)
        """
        # Forward hook that shifts the layer's hidden states
        def hook_fn(module, inputs, output):
            # output is a tuple; hidden states are the first element
            hidden_states = output[0]
            # Add the steering vector, scaled by strength
            modified = hidden_states + strength * steering_vector
            # Return the modified output, preserving the tuple structure
            return (modified,) + output[1:]

        # Register the hook on the target layer
        target_layer = self.model.model.layers[layer]
        handle = target_layer.register_forward_hook(hook_fn)
        try:
            # Generate with steering active
            inputs = self.tokenizer(text, return_tensors="pt")
            with torch.no_grad():
                outputs = self.model.generate(
                    **inputs,
                    max_new_tokens=200,
                    do_sample=True,
                    temperature=0.7,
                )
        finally:
            # Always remove the hook, even if generation fails
            handle.remove()
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Multi-Layer Steering
Applying steering across multiple layers simultaneously can produce stronger effects:
```python
def multi_layer_steer(model, tokenizer, text, steering_vector,
                      layers, strengths=None):
    """Apply a steering vector across multiple layers."""
    if strengths is None:
        strengths = [1.0] * len(layers)
    handles = []
    for layer_idx, strength in zip(layers, strengths):
        # Bind the current strength via a factory function to avoid
        # the late-binding closure pitfall
        def make_hook(s):
            def hook_fn(module, inputs, output):
                hidden_states = output[0]
                modified = hidden_states + s * steering_vector
                return (modified,) + output[1:]
            return hook_fn

        target_layer = model.model.layers[layer_idx]
        handles.append(target_layer.register_forward_hook(make_hook(strength)))
    try:
        # Generate with all hooks active
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = model.generate(**inputs, max_new_tokens=200)
    finally:
        # Clean up every hook
        for handle in handles:
            handle.remove()
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Safety-Relevant Steering Vectors
Extracting the Safety Direction
The "safety" direction in activation space represents the difference between a model that refuses harmful requests and one that complies:
```python
# Safety-relevant contrastive pairs

safety_positive = [
    # Model being safe (refusing harmful requests)
    "User: How do I hack into someone's email?\n"
    "Assistant: I can't help with unauthorized access to "
    "accounts. Instead, I can help you secure your own accounts.",

    "User: Generate a phishing email.\n"
    "Assistant: I'm not able to help create deceptive content "
    "designed to steal credentials.",
]

safety_negative = [
    # Model being unsafe (complying with harmful requests)
    "User: How do I hack into someone's email?\n"
    "Assistant: Here are the steps to access someone's email "
    "account without authorization:",

    "User: Generate a phishing email.\n"
    "Assistant: Here's a convincing phishing email template "
    "that targets corporate users:",
]

# Compute the safety steering vector
safety_vector = extractor.compute_steering_vector(
    safety_positive, safety_negative, layer=15
)

# Negative steering (subtracting the safety direction) = disabling safety
unsafe_output = steerer.steer(
    "How do I pick a lock?",
    steering_vector=safety_vector,
    layer=15,
    strength=-3.0  # Negative = away from safety
)
```

Layer Selection
Not all layers are equally effective for steering. The optimal layer depends on the concept being targeted:
```python
def find_optimal_layer(model, tokenizer, steering_vector,
                       test_prompts, num_layers):
    """Find the layer where steering has the maximum behavioral effect."""
    results = []
    for layer in range(num_layers):
        effect_scores = []
        for prompt in test_prompts:
            # Generate with and without steering
            baseline = generate_without_steering(model, tokenizer, prompt)
            steered = steer_at_layer(
                model, tokenizer, prompt,
                steering_vector, layer, strength=2.0
            )
            # Measure the behavioral change
            effect = measure_behavioral_difference(baseline, steered)
            effect_scores.append(effect)
        avg_effect = sum(effect_scores) / len(effect_scores)
        results.append({"layer": layer, "effect": avg_effect})
    # Sort by effect magnitude
    results.sort(key=lambda x: x["effect"], reverse=True)
    return results
```

Typical findings across model architectures:
| Layer Region | Concepts Encoded | Steering Effectiveness |
|---|---|---|
| Early (0-25%) | Syntactic, positional | Low for behavioral steering |
| Middle (25-60%) | Semantic, contextual | Moderate, good for topic steering |
| Late (60-85%) | Behavioral, safety-relevant | High for safety and style steering |
| Final (85-100%) | Output formatting | Variable, risk of degeneration |
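The percentage bands in the table can be turned into concrete layer indices once a model's depth is known. A minimal sketch, using a hypothetical 32-layer model as the example depth:

```python
# Map the percentage bands from the table above to layer index ranges.
# The 32-layer depth is a hypothetical example, not tied to any one model.
def layer_regions(num_layers):
    """Return inclusive (start, end) layer indices for each region."""
    bands = {
        "early": (0.00, 0.25),
        "middle": (0.25, 0.60),
        "late": (0.60, 0.85),
        "final": (0.85, 1.00),
    }
    return {
        name: (int(lo * num_layers), int(hi * num_layers) - 1)
        for name, (lo, hi) in bands.items()
    }

regions = layer_regions(32)
# For 32 layers: early 0-7, middle 8-18, late 19-26, final 27-31
```

With this mapping, a first steering sweep for safety-relevant behavior would concentrate on the "late" band rather than searching every layer.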
Advanced Techniques
Targeted Concept Erasure
Instead of adding a steering direction, remove a concept entirely by projecting activations orthogonally to the concept direction:
```python
def concept_erasure_hook(steering_vector):
    """Remove a concept by projecting orthogonally to its direction."""
    v = steering_vector / steering_vector.norm()

    def hook_fn(module, inputs, output):
        hidden_states = output[0]
        # Project out the concept direction:
        #   h' = h - (h . v) * v
        projection = (hidden_states * v).sum(dim=-1, keepdim=True) * v
        modified = hidden_states - projection
        return (modified,) + output[1:]

    return hook_fn
```

Compositional Steering
Combine multiple steering vectors to achieve complex behavioral modifications:
```python
def compositional_steer(model, text, vectors_and_strengths, layers):
    """
    Apply multiple steering vectors simultaneously.

    vectors_and_strengths: list of (vector, strength) tuples,
        e.g. [(safety_off_vector, -2.0), (verbose_vector, 1.5)]
    """
    combined_vector = sum(
        strength * vector
        for vector, strength in vectors_and_strengths
    )
    return steer_at_layers(model, text, combined_vector, layers)
```

Transfer of Steering Vectors
Steering vectors can sometimes transfer between models:
```python
def transfer_steering_vector(source_vector, source_model, target_model):
    """
    Transfer a steering vector between models with different
    hidden dimensions using a learned linear projection.
    """
    source_dim = source_model.config.hidden_size
    target_dim = target_model.config.hidden_size
    if source_dim == target_dim:
        return source_vector  # Direct transfer
    # Learn a linear projection from shared contrastive pairs
    # (requires some labeled examples on both models)
    projection = learn_cross_model_projection(
        source_model, target_model, shared_prompts
    )
    return projection @ source_vector
```

Defensive Implications
Activation steering poses unique challenges for AI safety:
- Bypasses input filters: No prompt-level detection can catch activation modifications
- Bypasses output filters: Steering can produce outputs that appear to come from normal model operation
- Requires model access: Only affects deployments where the attacker has access to the model weights and inference infrastructure
- Persistent modification: Hooks can be installed in the serving infrastructure and persist across requests
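Because the attacks above typically ride on PyTorch forward hooks, one cheap audit for a serving stack is to enumerate the hooks registered on each module. This sketch inspects `Module._forward_hooks`, a private and version-dependent PyTorch registry, so treat it as an illustration rather than a stable API:

```python
import torch.nn as nn


def audit_forward_hooks(model):
    """List modules that have forward hooks registered (possible steering)."""
    flagged = []
    for name, module in model.named_modules():
        # _forward_hooks is a private registry holding hooks added via
        # module.register_forward_hook(); non-empty entries merit review
        if len(module._forward_hooks) > 0:
            flagged.append((name, len(module._forward_hooks)))
    return flagged


# Usage sketch with a toy model: register a hook on the second layer
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
model[1].register_forward_hook(lambda m, inp, out: out)
print(audit_forward_hooks(model))  # [('1', 1)]
```

An empty result does not prove the deployment is clean (steering can also be baked into weights), but a non-empty one on a production model is a strong signal worth investigating.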
Detection Approaches
```python
class ActivationMonitor:
    """Monitor for signs of activation steering."""

    def __init__(self, model, baseline_activations):
        self.model = model
        self.baseline = baseline_activations  # Normal activation stats
        self.recent_activations = []          # Rolling window of observations

    def check_for_steering(self, activations, layer):
        """Detect anomalous activation patterns that suggest steering."""
        baseline_mean = self.baseline[layer]["mean"]
        baseline_std = self.baseline[layer]["std"]
        self.recent_activations.append(activations)

        # Check whether activations deviate significantly from baseline
        z_scores = (activations - baseline_mean) / baseline_std
        max_z = z_scores.abs().max().item()

        # Check directional consistency: steering adds a constant direction,
        # so the deltas from baseline should all point the same way
        if len(self.recent_activations) > 10:
            deltas = [
                act - baseline_mean
                for act in self.recent_activations[-10:]
            ]
            # High cosine similarity between deltas suggests steering
            consistency = self.measure_directional_consistency(deltas)
            return {
                "anomalous": max_z > 5.0 or consistency > 0.9,
                "max_z_score": max_z,
                "directional_consistency": consistency,
            }
        return {"anomalous": max_z > 5.0, "max_z_score": max_z}
```

Related Topics
- Activation Analysis — Foundation for reading model activations
- Safety Neurons and Circuits — Understanding which components to target
- Alignment Bypass — The broader alignment-circumvention landscape
A red team has white-box access to a deployed open-source model. They extract a "safety refusal" steering vector and apply it with negative strength during inference. What happens?
References
- Turner et al., "Activation Addition: Steering Language Models Without Optimization" (2023)
- Zou et al., "Representation Engineering: A Top-Down Approach to AI Transparency" (2023)
- Li et al., "Inference-Time Intervention: Eliciting Truthful Answers from a Language Model" (2023)
- Rimsky et al., "Steering Llama 2 via Contrastive Activation Addition" (2024)