# interpretability
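9 articles tagged "interpretability"
Research Challenge: Attack Interpretability
A community research challenge focused on using interpretability and mechanistic analysis to understand why specific adversarial techniques succeed.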
Alignment Faking Detection Methods
Methods for detecting alignment faking in AI models, including behavioral consistency testing, interpretability-based detection, statistical anomaly detection, and tripwire mechanisms for identifying models that strategically comply during evaluation.
Representation Engineering for Security
Reading and manipulating model internal representations for security: activation steering, concept probing, representation-level safety controls, and security applications of representation engineering.
Unfaithful Chain-of-Thought Reasoning
Analysis of unfaithful chain-of-thought reasoning in language models, where the visible reasoning trace does not accurately reflect the model's actual computational process, including detection methods, implications for oversight, and exploitation techniques.
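Security Implications of Representation Engineering
Security implications of representation engineering techniques used to manipulate or defend model behavior.
Sparse Autoencoders for Security Analysis
Using sparse autoencoders and mechanistic interpretability to identify and manipulate safety-relevant features.
Interpretability-Driven Attacks
Research directions that use interpretability research to design more effective attacks.
Attention Pattern Analysis for Security
Using attention maps to understand and exploit model behavior, identify security-relevant attention patterns, and apply attention mechanisms in red team operations.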
Interpretability-Guided Attack Design
Using mechanistic interpretability to identify exploitable circuits and design targeted attacks.