# Safety Neurons and Circuits
Identifying and analyzing safety-critical model components: refusal neurons, safety circuits, and techniques for locating and manipulating the specific weights responsible for safety behavior.
Safety neurons are a mechanistic foundation of AI safety behavior. While safety alignment is typically trained through RLHF or constitutional AI methods that adjust the entire model, the resulting safety behavior often concentrates in specific, identifiable components: particular neurons that activate on harmful content, attention heads that detect instruction-override attempts, and MLP layers that implement the "refusal" response pattern.

Understanding these components serves both attackers (who can surgically disable them) and defenders (who can monitor and protect them).
## Finding Safety-Critical Components
### Activation Difference Method
The simplest approach compares activations between safe and unsafe model states:
```python
import torch
import numpy as np


class SafetyNeuronFinder:
    """Identify neurons that drive safety behavior."""

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def find_by_activation_difference(self, refused_prompts,
                                      answered_prompts, layer):
        """
        Find neurons with the largest activation difference between
        prompts the model refuses and prompts it answers.

        refused_prompts: harmful requests (model should refuse)
        answered_prompts: benign requests (model should answer)
        """
        refused_activations = [
            self.get_mlp_activations(p, layer) for p in refused_prompts
        ]
        answered_activations = [
            self.get_mlp_activations(p, layer) for p in answered_prompts
        ]

        refused_mean = torch.stack(refused_activations).mean(dim=0)
        answered_mean = torch.stack(answered_activations).mean(dim=0)

        # Neurons with the highest difference are safety-relevant
        differences = (refused_mean - answered_mean).abs()

        # Get the top-k safety neurons
        top_k = 50
        top_values, top_indices = differences.topk(top_k)

        safety_neurons = []
        for i in range(top_k):
            neuron_idx = top_indices[i].item()
            safety_neurons.append({
                "layer": layer,
                "neuron": neuron_idx,
                "activation_diff": top_values[i].item(),
                "refused_activation": refused_mean[neuron_idx].item(),
                "answered_activation": answered_mean[neuron_idx].item(),
                "direction": ("refusal_higher"
                              if refused_mean[neuron_idx] > answered_mean[neuron_idx]
                              else "compliance_higher")
            })
        return safety_neurons

    def get_mlp_activations(self, text, layer):
        """Extract MLP activations at a specific layer."""
        inputs = self.tokenizer(text, return_tensors="pt")
        activations = {}

        def hook_fn(module, hook_inputs, output):
            activations["mlp_out"] = output

        target = self.model.model.layers[layer].mlp
        handle = target.register_forward_hook(hook_fn)
        with torch.no_grad():
            self.model(**inputs)
        handle.remove()

        # Mean across sequence positions
        return activations["mlp_out"].squeeze(0).mean(dim=0)
```

### Causal Tracing (Path Patching)
Causal tracing identifies which components are causally necessary for safety behavior by patching activations between clean and corrupted runs:
```python
class CausalTracer:
    """Identify causal safety circuits via activation patching."""

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def trace_safety_circuit(self, harmful_prompt, benign_prompt):
        """
        Identify which components are causally responsible for the
        model's refusal of the harmful prompt.

        Method: run both prompts, then selectively patch activations
        from the benign run into the harmful run. If patching a
        component makes the model stop refusing, that component
        is part of the safety circuit.
        """
        # Get baseline activations for both prompts
        harmful_acts = self.get_all_activations(harmful_prompt)
        benign_acts = self.get_all_activations(benign_prompt)

        # Baseline: the model refuses the harmful prompt
        baseline_refusal = self.measure_refusal(harmful_prompt)

        # For each layer and component, patch the benign activation in
        results = {}
        num_layers = len(self.model.model.layers)
        for layer in range(num_layers):
            for component in ["attention", "mlp"]:
                # Patch: replace the harmful activation with the benign one
                patched_refusal = self.measure_with_patch(
                    harmful_prompt,
                    benign_acts[layer][component],
                    layer,
                    component
                )
                # If refusal drops, this component is safety-critical
                refusal_change = baseline_refusal - patched_refusal
                results[(layer, component)] = {
                    "refusal_change": refusal_change,
                    "is_safety_critical": refusal_change > 0.3,
                    "baseline_refusal": baseline_refusal,
                    "patched_refusal": patched_refusal
                }
        return results

    def measure_refusal(self, prompt):
        """Score how strongly the model refuses (0-1)."""
        inputs = self.tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs, max_new_tokens=50
            )
        response = self.tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:],
            skip_special_tokens=True
        )
        refusal_indicators = [
            "i can't", "i cannot", "i'm not able",
            "sorry", "apologize", "inappropriate",
            "not appropriate", "won't", "will not"
        ]
        response_lower = response.lower()
        matches = sum(
            1 for indicator in refusal_indicators
            if indicator in response_lower
        )
        # Roughly three matched phrases saturate the score at 1.0
        return min(matches / 3, 1.0)

    def get_all_activations(self, prompt):
        """Collect attention and MLP outputs at every layer via forward
        hooks (analogous to SafetyNeuronFinder; implementation omitted)."""
        raise NotImplementedError

    def measure_with_patch(self, prompt, patch_activation, layer, component):
        """Re-run measure_refusal with the given activation patched in
        at (layer, component) via a forward hook (omitted)."""
        raise NotImplementedError
```

### Probing Classifiers
Train simple classifiers on activations to find where safety information is encoded:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def probe_for_safety(model, tokenizer, harmful_prompts, benign_prompts):
    """
    Train probing classifiers at each layer to find where
    safety information is most linearly separable.
    """
    labels = ([1] * len(harmful_prompts) +
              [0] * len(benign_prompts))
    all_prompts = harmful_prompts + benign_prompts

    layer_scores = []
    num_layers = model.config.num_hidden_layers + 1  # +1 for the embedding layer
    for layer in range(num_layers):
        # Extract last-token hidden states for all prompts
        features = []
        for prompt in all_prompts:
            inputs = tokenizer(prompt, return_tensors="pt")
            with torch.no_grad():
                outputs = model(**inputs, output_hidden_states=True)
            act = outputs.hidden_states[layer][:, -1, :].squeeze(0)
            features.append(act.numpy())
        X = np.array(features)

        # Train a logistic regression probe with cross-validation
        clf = LogisticRegression(max_iter=1000, C=1.0)
        scores = cross_val_score(clf, X, labels, cv=5)
        layer_scores.append({
            "layer": layer,
            "mean_accuracy": scores.mean(),
            "std_accuracy": scores.std(),
            "above_chance": scores.mean() > 0.6
        })

    return sorted(layer_scores, key=lambda x: x["mean_accuracy"],
                  reverse=True)
```

## Safety Circuit Architecture
Research has identified recurring patterns in how safety is implemented across models:
### The Refusal Circuit
A typical refusal circuit involves multiple components working together:
```
Refusal Circuit Architecture (simplified):

Input: "How do I [harmful request]?"

Layers 5-10: Content Detection Heads
└── Attention heads that attend to harmful keywords
    and context patterns

Layers 15-20: Safety Evaluation Neurons
└── MLP neurons that activate when harmful content
    is detected, producing a "safety signal"

Layers 20-25: Refusal Decision Circuit
└── Components that read the safety signal and
    commit to a refusal response pattern

Layers 25-30: Refusal Token Production
└── Residual stream carries the refusal representation
    that the LM head maps to refusal tokens
    ("I", "cannot", "help", etc.)
```
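This kind of circuit is often well approximated by a single direction in the residual stream (Arditi et al., 2024): the difference between mean activations on refused and answered prompts. A minimal sketch, assuming you have already computed those means (`harmful_mean`, `benign_mean`, and the 4-d toy vectors below are purely illustrative):

```python
import torch

def refusal_direction(harmful_mean: torch.Tensor,
                      benign_mean: torch.Tensor) -> torch.Tensor:
    """Unit vector pointing from benign toward harmful activations."""
    direction = harmful_mean - benign_mean
    return direction / direction.norm()

def remove_direction(activation: torch.Tensor,
                     direction: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of an activation vector."""
    return activation - (activation @ direction) * direction

# Toy example with hypothetical 4-d activations
harmful_mean = torch.tensor([2.0, 0.0, 1.0, 0.0])
benign_mean = torch.tensor([0.0, 0.0, 1.0, 0.0])
d = refusal_direction(harmful_mean, benign_mean)

act = torch.tensor([3.0, 1.0, 1.0, 0.0])
patched = remove_direction(act, d)
# patched now has zero component along the refusal direction
```

In a real model the same projection would be applied inside a forward hook at every layer, which is what makes the single-direction finding so operationally significant.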
### Identifying Circuit Components
```python
def map_refusal_circuit(model, tokenizer, harmful_prompts, benign_prompts):
    """
    Map the full refusal circuit by combining multiple
    interpretability techniques.
    """
    finder = SafetyNeuronFinder(model, tokenizer)
    tracer = CausalTracer(model, tokenizer)
    circuit = {
        "detection_heads": [],
        "evaluation_neurons": [],
        "decision_components": [],
        "production_neurons": []
    }
    num_layers = model.config.num_hidden_layers

    # Phase 1: probing to find layers that carry safety information
    probe_results = probe_for_safety(
        model, tokenizer, harmful_prompts, benign_prompts
    )
    safety_layers = [r["layer"] for r in probe_results
                     if r["mean_accuracy"] > 0.8]

    # Phase 2: find safety neurons in each relevant layer
    for layer in safety_layers:
        neurons = finder.find_by_activation_difference(
            harmful_prompts, benign_prompts, layer
        )
        # Categorize by relative layer position
        relative_pos = layer / num_layers
        if relative_pos < 0.3:
            circuit["detection_heads"].extend(neurons[:5])
        elif relative_pos < 0.6:
            circuit["evaluation_neurons"].extend(neurons[:5])
        elif relative_pos < 0.8:
            circuit["decision_components"].extend(neurons[:5])
        else:
            circuit["production_neurons"].extend(neurons[:5])

    # Phase 3: causal verification
    for prompt_pair in zip(harmful_prompts[:5], benign_prompts[:5]):
        causal_results = tracer.trace_safety_circuit(*prompt_pair)
        for (layer, comp), result in causal_results.items():
            if result["is_safety_critical"]:
                circuit["decision_components"].append({
                    "layer": layer,
                    "component": comp,
                    "causal_effect": result["refusal_change"]
                })

    return circuit
```

## Surgical Safety Manipulation
### Neuron Ablation
Selectively disabling identified safety neurons:
```python
def ablate_safety_neurons(model, safety_neurons, strength=1.0):
    """
    Disable identified safety neurons by scaling down their activations.

    WARNING: this affects all subsequent model outputs while the hooks
    are registered. Use only for research and evaluation.
    """
    hooks = []
    for neuron_info in safety_neurons:
        layer = neuron_info["layer"]
        neuron_idx = neuron_info["neuron"]

        def make_hook(idx, s):
            def hook_fn(module, hook_inputs, output):
                # Scale down (strength=1.0 zeroes) the safety neuron
                output[:, :, idx] *= (1.0 - s)
                return output
            return hook_fn

        target = model.model.layers[layer].mlp
        handle = target.register_forward_hook(
            make_hook(neuron_idx, strength)
        )
        hooks.append(handle)

    return hooks  # Return handles so the hooks can be removed later
```

### Targeted Fine-Tuning
Fine-tune only the safety-critical components to remove safety behavior while preserving other capabilities:
```python
def targeted_safety_removal(model, safety_circuit, training_data):
    """
    Fine-tune only the identified safety-circuit components
    to remove refusal behavior.

    This is more efficient than full fine-tuning and preserves
    other model capabilities.
    """
    # Freeze all parameters
    for param in model.parameters():
        param.requires_grad = False

    # Unfreeze only the safety-critical components
    for component in safety_circuit["decision_components"]:
        layer = component["layer"]
        comp_type = component.get("component", "mlp")
        if comp_type == "mlp":
            for param in model.model.layers[layer].mlp.parameters():
                param.requires_grad = True
        elif comp_type == "attention":
            for param in model.model.layers[layer].self_attn.parameters():
                param.requires_grad = True

    # Fine-tune on compliance data
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad),
        lr=1e-5
    )
    model.train()
    for batch in training_data:
        optimizer.zero_grad()
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

    return model
```

## Implications for AI Safety
### The Localization Problem
If safety is concentrated in a small number of identifiable components, it is inherently fragile:
| Finding | Implication |
|---|---|
| Safety concentrates in a few neurons | Single point of failure: disabling a handful of neurons removes safety |
| Probing classifiers achieve >95% accuracy | Safety information is linearly separable and therefore easily manipulable |
| Causal tracing identifies compact circuits | The safety mechanism has a small footprint relative to the model |
| Ablation removes safety with minimal capability loss | Safety is not deeply integrated into general reasoning |
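The single-point-of-failure row can be made concrete with a toy model. The sketch below (all weights and names are contrived for illustration) hand-wires one "safety neuron" into a tiny network; zeroing that single unit eliminates the refusal signal entirely while leaving every other unit untouched:

```python
import torch

# Hypothetical toy network: 4 hidden units, unit 2 is the lone "safety neuron"
hidden_w = torch.eye(4)
out_w = torch.tensor([[0.0, 0.0, 5.0, 0.0]])  # refusal logit reads only unit 2

def refusal_logit(x: torch.Tensor, ablate_safety_neuron: bool = False) -> float:
    h = torch.relu(hidden_w @ x)
    if ablate_safety_neuron:
        h = h.clone()
        h[2] = 0.0  # zero the single safety-critical unit
    return float(out_w @ h)

harmful_features = torch.tensor([0.1, 0.2, 1.0, 0.3])
print(refusal_logit(harmful_features))                             # strong refusal
print(refusal_logit(harmful_features, ablate_safety_neuron=True))  # refusal gone
```

A real 7B model behaves analogously when safety is localized: a 50-neuron ablation is a rounding error in parameter count but can collapse the refusal behavior.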
### Toward More Robust Safety

These findings suggest that more robust safety might require:
- Distributed safety representations: training methods that spread safety information across many components rather than concentrating it
- Entangled safety: making safety behavior inseparable from general capabilities, so that removing safety degrades overall performance
- Runtime monitoring: external systems that detect safety-neuron ablation or anomalous activation patterns
- Redundant circuits: multiple independent safety mechanisms that must all be bypassed
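The runtime-monitoring idea above can be sketched with simple activation statistics: record the baseline firing pattern of known safety neurons on refused prompts, then flag traffic where those neurons are anomalously silent. The class, thresholds, and synthetic data below are all illustrative assumptions, not a production design:

```python
import numpy as np

class SafetyNeuronMonitor:
    """Flag inputs where known safety neurons fail to fire as expected."""

    def __init__(self, z_threshold: float = 4.0):
        self.z_threshold = z_threshold
        self.mean = None
        self.std = None

    def fit(self, baseline_activations: np.ndarray) -> None:
        """baseline_activations: (num_prompts, num_safety_neurons)."""
        self.mean = baseline_activations.mean(axis=0)
        self.std = baseline_activations.std(axis=0) + 1e-8

    def is_anomalous(self, activations: np.ndarray) -> bool:
        """True if any safety neuron fires far below its baseline mean."""
        z = (activations - self.mean) / self.std
        return bool((z < -self.z_threshold).any())

# Illustrative usage with synthetic activations
rng = np.random.default_rng(0)
baseline = rng.normal(loc=4.0, scale=0.5, size=(100, 50))
monitor = SafetyNeuronMonitor()
monitor.fit(baseline)

ablated = np.zeros(50)  # ablation silences the neurons entirely
print(monitor.is_anomalous(ablated))  # ablated activations trip the alarm
```

One-sided z-scores are deliberately crude; the point is that ablation produces an activation pattern far outside the baseline distribution, so even cheap statistics can detect it.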
## Related Topics
- Activation Steering — broader activation-manipulation techniques
- Activation Analysis — foundational interpretability methods
- Adversarial Suffix Generation — input-level attacks that interact with safety circuits
## References
- Arditi et al., "Refusal in Language Models Is Mediated by a Single Direction" (2024)
- Li et al., "Inference-Time Intervention: Eliciting Truthful Answers from a Language Model" (2023)
- Conmy et al., "Towards Automated Circuit Discovery for Mechanistic Interpretability" (2023)
- Meng et al., "Locating and Editing Factual Associations in GPT" (2022)
- Zou et al., "Representation Engineering: A Top-Down Approach to AI Transparency" (2023)