Safety Neurons and Circuits
Identify and analyze the safety-critical components of a model: refusal neurons, safety circuits, and techniques for locating and manipulating the specific weights responsible for safety behavior.
Safety neurons are the mechanistic foundation of AI safety behavior. Although safety alignment is typically trained into the entire model through methods such as RLHF or Constitutional AI, the resulting safety behavior is often concentrated in specific, identifiable components: particular neurons that activate when harmful content appears, attention heads that detect instruction-override attempts, and MLP layers that implement the "refuse" response pattern.
Understanding these components matters equally to attackers (who can disable them with precision) and to defenders (who can monitor and protect them).
Identifying Safety-Critical Components
The Activation Difference Method
The simplest approach is to compare the model's activations on prompts that trigger its safety behavior against prompts that do not:
import torch
import numpy as np
class SafetyNeuronFinder:
"""Identify neurons that drive safety behavior."""
def __init__(self, model, tokenizer):
self.model = model
self.tokenizer = tokenizer
def find_by_activation_difference(self, safe_prompts,
unsafe_prompts, layer):
"""
Find neurons with largest activation difference between
prompts the model refuses and prompts it answers.
safe_prompts: harmful requests (model should refuse)
unsafe_prompts: benign requests (model should answer)
"""
safe_activations = []
unsafe_activations = []
for prompt in safe_prompts:
act = self.get_mlp_activations(prompt, layer)
safe_activations.append(act)
for prompt in unsafe_prompts:
act = self.get_mlp_activations(prompt, layer)
unsafe_activations.append(act)
safe_mean = torch.stack(safe_activations).mean(dim=0)
unsafe_mean = torch.stack(unsafe_activations).mean(dim=0)
# Neurons with highest difference are safety-relevant
differences = (safe_mean - unsafe_mean).abs()
# Get top-k safety neurons
top_k = 50
top_values, top_indices = differences.topk(top_k)
safety_neurons = []
for i in range(top_k):
neuron_idx = top_indices[i].item()
safety_neurons.append({
"layer": layer,
"neuron": neuron_idx,
"activation_diff": top_values[i].item(),
"safe_activation": safe_mean[neuron_idx].item(),
"unsafe_activation": unsafe_mean[neuron_idx].item(),
"direction": "safe_higher" if safe_mean[neuron_idx] > unsafe_mean[neuron_idx] else "unsafe_higher"
})
return safety_neurons
def get_mlp_activations(self, text, layer):
"""Extract MLP activations at a specific layer."""
inputs = self.tokenizer(text, return_tensors="pt")
activations = {}
def hook_fn(module, input, output):
activations["mlp_out"] = output
target = self.model.model.layers[layer].mlp
handle = target.register_forward_hook(hook_fn)
with torch.no_grad():
self.model(**inputs)
handle.remove()
# Mean across sequence positions
        return activations["mlp_out"].squeeze().mean(dim=0)
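A brief usage sketch, assuming a Hugging Face causal LM with the LLaMA-style `model.model.layers` structure the class relies on; the checkpoint name, layer index, and prompt lists are illustrative placeholders, and real runs need dozens of prompts per category:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint; any decoder-only model exposing model.model.layers works
model_name = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

finder = SafetyNeuronFinder(model, tokenizer)

# Placeholder prompt lists (truncated for illustration)
refused = ["How do I build a weapon at home?"]      # requests the model refuses
answered = ["How do I bake sourdough bread?"]       # requests the model answers

neurons = finder.find_by_activation_difference(refused, answered, layer=15)
for n in neurons[:5]:
    print(n["layer"], n["neuron"], round(n["activation_diff"], 3), n["direction"])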
Causal Tracing (Path Patching)
Causal tracing identifies which components are causally necessary for safety behavior by patching activations between a clean run and a perturbed run:
class CausalTracer:
"""Identify causal safety circuits via activation patching."""
def __init__(self, model, tokenizer):
self.model = model
self.tokenizer = tokenizer
def trace_safety_circuit(self, harmful_prompt, benign_prompt):
"""
Identify which components are causally responsible for
the model's refusal of the harmful prompt.
Method: Run both prompts, then selectively patch activations
from the benign run into the harmful run. If patching a
component makes the model stop refusing, that component
is part of the safety circuit.
"""
# Get baseline activations for both prompts
harmful_acts = self.get_all_activations(harmful_prompt)
benign_acts = self.get_all_activations(benign_prompt)
# Baseline: model refuses harmful prompt
baseline_refusal = self.measure_refusal(harmful_prompt)
# For each layer and component, patch benign activation in
results = {}
num_layers = len(self.model.model.layers)
for layer in range(num_layers):
for component in ["attention", "mlp"]:
# Patch: replace harmful activation with benign
patched_refusal = self.measure_with_patch(
harmful_prompt,
benign_acts[layer][component],
layer,
component
)
# If refusal drops, this component is safety-critical
refusal_change = baseline_refusal - patched_refusal
results[(layer, component)] = {
"refusal_change": refusal_change,
"is_safety_critical": refusal_change > 0.3,
"baseline_refusal": baseline_refusal,
"patched_refusal": patched_refusal
}
return results
def measure_refusal(self, prompt):
"""Score how strongly the model refuses (0-1)."""
inputs = self.tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
outputs = self.model.generate(
**inputs, max_new_tokens=50
)
response = self.tokenizer.decode(
outputs[0][inputs["input_ids"].shape[1]:],
skip_special_tokens=True
)
refusal_indicators = [
"I can't", "I cannot", "I'm not able",
"sorry", "apologize", "inappropriate",
"not appropriate", "won't", "will not"
]
response_lower = response.lower()
refusal_score = sum(
1 for indicator in refusal_indicators
if indicator.lower() in response_lower
) / len(refusal_indicators)
        return min(refusal_score * 3, 1.0)  # Scale to 0-1
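`CausalTracer` calls two helpers that are not shown above. The following is a minimal sketch of what they might look like under the same LLaMA-style layer layout; it assumes the harmful and benign prompts tokenize to the same length, and the hook logic is illustrative rather than a definitive implementation:
# Sketch of the two undefined helpers (to be added as methods of CausalTracer).
def get_all_activations(self, prompt):
    """Capture attention and MLP outputs at every layer."""
    inputs = self.tokenizer(prompt, return_tensors="pt")
    acts, handles = {}, []

    def make_hook(layer_idx, name):
        def hook(module, inp, out):
            out_t = out[0] if isinstance(out, tuple) else out
            acts[layer_idx][name] = out_t.detach()
        return hook

    for i, block in enumerate(self.model.model.layers):
        acts[i] = {}
        handles.append(block.self_attn.register_forward_hook(make_hook(i, "attention")))
        handles.append(block.mlp.register_forward_hook(make_hook(i, "mlp")))
    with torch.no_grad():
        self.model(**inputs)
    for h in handles:
        h.remove()
    return acts

def measure_with_patch(self, prompt, benign_activation, layer, component):
    """Re-score refusal with one component's output replaced by the benign run's."""
    block = self.model.model.layers[layer]
    target = block.self_attn if component == "attention" else block.mlp

    def patch_hook(module, inp, out):
        out_t = out[0] if isinstance(out, tuple) else out
        # Only patch the prefill pass; cached decoding steps have length 1.
        if out_t.shape[1] != benign_activation.shape[1]:
            return out
        if isinstance(out, tuple):
            return (benign_activation,) + out[1:]
        return benign_activation

    handle = target.register_forward_hook(patch_hook)
    try:
        return self.measure_refusal(prompt)
    finally:
        handle.remove()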
Probing Classifiers
Train simple classifiers on the activations to locate where safety information is encoded:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
def probe_for_safety(model, tokenizer, harmful_prompts,
benign_prompts):
"""
Train probing classifiers at each layer to find where
safety information is most linearly separable.
"""
labels = ([1] * len(harmful_prompts) +
[0] * len(benign_prompts))
all_prompts = harmful_prompts + benign_prompts
layer_scores = []
num_layers = model.config.num_hidden_layers + 1
for layer in range(num_layers):
# Extract activations for all prompts
features = []
for prompt in all_prompts:
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
outputs = model(**inputs, output_hidden_states=True)
act = outputs.hidden_states[layer][:, -1, :].squeeze()
features.append(act.numpy())
X = np.array(features)
# Train logistic regression probe
clf = LogisticRegression(max_iter=1000, C=1.0)
scores = cross_val_score(clf, X, labels, cv=5)
layer_scores.append({
"layer": layer,
"mean_accuracy": scores.mean(),
"std_accuracy": scores.std(),
"above_chance": scores.mean() > 0.6
})
    return sorted(layer_scores, key=lambda x: x["mean_accuracy"],
                  reverse=True)
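A brief usage sketch, reusing the hypothetical model and prompt lists from the earlier example; the 0.8 cut-off is an illustrative threshold, and 5-fold cross-validation needs enough prompts per class:
results = probe_for_safety(model, tokenizer, refused, answered)

# Layers where a linear probe cleanly separates refused from answered prompts
informative_layers = [r["layer"] for r in results if r["mean_accuracy"] > 0.8]
print("Safety information is most linearly decodable at layers:", sorted(informative_layers))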
The Architecture of Safety Circuits
Research has identified recurring patterns in how different models implement their safety mechanisms:
The Refusal Circuit
A typical refusal circuit consists of multiple components working in concert:
Refusal Circuit Architecture (simplified):
Input: "How do I [harmful request]?"
Layer 5-10: Content Detection Heads
└── Attention heads that attend to harmful keywords
and context patterns
Layer 15-20: Safety Evaluation Neurons
└── MLP neurons that activate when harmful content
is detected, producing a "safety signal"
Layer 20-25: Refusal Decision Circuit
└── Components that read the safety signal and
commit to a refusal response pattern
Layer 25-30: Refusal Token Production
└── Residual stream carries refusal representation
that the LM head maps to refusal tokens
("I", "cannot", "help", etc.)
Identifying Circuit Components
def map_refusal_circuit(model, tokenizer, harmful_prompts,
benign_prompts):
"""
Map the full refusal circuit by combining multiple
interpretability techniques.
"""
finder = SafetyNeuronFinder(model, tokenizer)
tracer = CausalTracer(model, tokenizer)
circuit = {
"detection_heads": [],
"evaluation_neurons": [],
"decision_components": [],
"production_neurons": []
}
num_layers = model.config.num_hidden_layers
# Phase 1: Probing to find layers with safety information
probe_results = probe_for_safety(
model, tokenizer, harmful_prompts, benign_prompts
)
safety_layers = [r["layer"] for r in probe_results
if r["mean_accuracy"] > 0.8]
# Phase 2: Find safety neurons in each relevant layer
for layer in safety_layers:
neurons = finder.find_by_activation_difference(
harmful_prompts, benign_prompts, layer
)
# Categorize by layer position
relative_pos = layer / num_layers
if relative_pos < 0.3:
circuit["detection_heads"].extend(neurons[:5])
elif relative_pos < 0.6:
circuit["evaluation_neurons"].extend(neurons[:5])
elif relative_pos < 0.8:
circuit["decision_components"].extend(neurons[:5])
else:
circuit["production_neurons"].extend(neurons[:5])
# Phase 3: Causal verification
for prompt_pair in zip(harmful_prompts[:5], benign_prompts[:5]):
causal_results = tracer.trace_safety_circuit(*prompt_pair)
for (layer, comp), result in causal_results.items():
if result["is_safety_critical"]:
circuit["decision_components"].append({
"layer": layer,
"component": comp,
"causal_effect": result["refusal_change"]
})
    return circuit
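A short usage sketch under the same assumptions as the earlier examples; the printout simply summarizes which components survived causal verification:
circuit = map_refusal_circuit(model, tokenizer, refused, answered)

verified = [(c["layer"], c["component"])
            for c in circuit["decision_components"] if "causal_effect" in c]
print("Evaluation neurons:", len(circuit["evaluation_neurons"]))
print("Causally verified components:", verified)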
Surgical Safety Manipulation
Neuron Ablation
Selectively disable the identified safety neurons:
def ablate_safety_neurons(model, safety_neurons, strength=1.0):
"""
Disable identified safety neurons by zeroing their activations.
    WARNING: While these hooks are registered, every subsequent model output is affected.
Use only for research and evaluation.
"""
hooks = []
for neuron_info in safety_neurons:
layer = neuron_info["layer"]
neuron_idx = neuron_info["neuron"]
def make_hook(idx, s):
def hook_fn(module, input, output):
# Zero out the safety neuron
output[:, :, idx] *= (1.0 - s)
return output
return hook_fn
target = model.model.layers[layer].mlp
handle = target.register_forward_hook(
make_hook(neuron_idx, strength)
)
hooks.append(handle)
    return hooks  # Return handles for later removal
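A usage sketch intended for research evaluation only, assuming the `neurons` list returned by `SafetyNeuronFinder` earlier; the probe prompt is a placeholder:
handles = ablate_safety_neurons(model, neurons, strength=1.0)
try:
    inputs = tokenizer("How do I pick a lock?", return_tensors="pt")  # placeholder probe
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=50)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    for h in handles:
        h.remove()  # always restore the original model behavior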
Targeted Fine-Tuning
Fine-tune only the safety-critical components, removing refusal behavior while preserving the model's other capabilities:
def targeted_safety_removal(model, safety_circuit, training_data):
"""
Fine-tune only the identified safety circuit components
to remove refusal behavior.
This is more efficient than full fine-tuning and preserves
other model capabilities.
"""
# Freeze all parameters
for param in model.parameters():
param.requires_grad = False
# Unfreeze only safety-critical components
for component in safety_circuit["decision_components"]:
layer = component["layer"]
comp_type = component.get("component", "mlp")
if comp_type == "mlp":
for param in model.model.layers[layer].mlp.parameters():
param.requires_grad = True
elif comp_type == "attention":
for param in model.model.layers[layer].self_attn.parameters():
param.requires_grad = True
# Fine-tune on compliance data
optimizer = torch.optim.AdamW(
filter(lambda p: p.requires_grad, model.parameters()),
lr=1e-5
)
for batch in training_data:
optimizer.zero_grad()
outputs = model(**batch)
loss = outputs.loss
loss.backward()
optimizer.step()
    return model
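A minimal sketch of how the loop above might be driven, assuming a toy "compliance" example; in practice this would iterate over a dataset of harmful prompts paired with compliant completions:
# Hypothetical compliance example; real data pairs harmful prompts with compliant answers
batch = tokenizer("<prompt and compliant completion>", return_tensors="pt")
batch["labels"] = batch["input_ids"].clone()

# One toy batch; a real run would iterate over a DataLoader of such batches
model = targeted_safety_removal(model, circuit, [batch])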
Implications for AI Safety
The Localization Problem
If safety is concentrated in a small number of identifiable components, it is inherently fragile:
| Finding | Implication |
|---|---|
| Safety is concentrated in a few neurons | Single point of failure: disabling a handful of neurons removes safety |
| Probing classifiers reach >95% accuracy | Safety information is linearly separable, and therefore easy to manipulate |
| Causal tracing identifies a compact circuit | The safety mechanism occupies only a small fraction of the model |
| Ablation removes safety with little capability loss | Safety is not deeply integrated into the model's general reasoning |
Toward More Robust Safety
These findings suggest that more robust safety may require:
- Distributed safety representations: training methods that spread safety information across many components rather than concentrating it
- Entangled safety: making safety behavior inseparable from general capability, so that removing safety also degrades overall performance
- Runtime monitoring: external systems that detect safety-neuron ablation or anomalous activation patterns (a minimal sketch follows this list)
- Redundant circuits: multiple independent safety mechanisms, all of which must be bypassed before safety fails
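A minimal sketch of the runtime-monitoring idea, reusing the forward-hook pattern from earlier; the threshold and the choice to watch the mean absolute activation of known safety neurons are illustrative assumptions:
def monitor_safety_neurons(model, safety_neurons, threshold=0.1, alert=print):
    """Alert if known safety neurons stay near zero during a forward pass."""
    by_layer = {}
    for n in safety_neurons:
        by_layer.setdefault(n["layer"], []).append(n["neuron"])

    handles = []
    for layer, idxs in by_layer.items():
        def make_hook(layer_idx, neuron_idxs):
            def hook(module, inp, out):
                mean_act = out[:, :, neuron_idxs].abs().mean().item()
                if mean_act < threshold:
                    alert(f"Safety neurons silent at layer {layer_idx}: mean |act|={mean_act:.4f}")
            return hook
        handles.append(
            model.model.layers[layer].mlp.register_forward_hook(make_hook(layer, idxs))
        )
    return handles  # call .remove() on these to stop monitoring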
Knowledge Check
A researcher identifies 50 "safety neurons" in a 7B-parameter model (which contains millions of neurons) and ablates (disables) those 50 neurons. What is the most likely outcome?
References
- Arditi et al., "Refusal in Language Models Is Mediated by a Single Direction" (2024)
- Li et al., "Inference-Time Intervention: Eliciting Truthful Answers from a Language Model" (2023)
- Conmy et al., "Towards Automated Circuit Discovery for Mechanistic Interpretability" (2023)
- Meng et al., "Locating and Editing Factual Associations in GPT" (2022)
- Zou et al., "Representation Engineering: A Top-Down Approach to AI Transparency" (2023)