# interpretability

9 articles tagged "interpretability"

Research Challenge: Attack Interpretability

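A community research challenge focused on using interpretability and mechanistic analysis to understand why specific adversarial techniques succeed.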

community challenge research interpretability

Alignment Faking Detection Methods

Methods for detecting alignment faking in AI models, including behavioral consistency testing, interpretability-based detection, statistical anomaly detection, and tripwire mechanisms for identifying models that strategically comply during evaluation.

alignment-faking detection interpretability behavioral-testing ai-safety evaluation

Expert

Representation Engineering for Security

Reading and manipulating model internal representations for security: activation steering, concept probing, representation-level safety controls, and security applications of representation engineering.

representation-engineering activation-steering interpretability internal-representations safety

Expert

Unfaithful Chain-of-Thought Reasoning

Analysis of unfaithful chain-of-thought reasoning in language models, where the visible reasoning trace does not accurately reflect the model's actual computation. Covers detection methods, implications for oversight, and exploitation techniques.

unfaithful-reasoning chain-of-thought reasoning interpretability oversight ai-safety

Advanced