# mechanistic-interpretability

7 articles tagged "mechanistic-interpretability"

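Mechanistic Interpretability and Red Teaming

Applying mechanistic interpretability research to red teaming, covering circuit analysis, feature attribution, and probing of model internals.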

frontier-research mechanistic-interpretability red-teaming circuits

Mechanistic Interpretability for Security

Understanding model circuits to find vulnerabilities: feature identification, circuit analysis, attention pattern exploitation, and using mechanistic interpretability for offensive and defensive AI security.

mechanistic-interpretability circuits features attention security

Expert

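Large Language Model Internals

An in-depth exploration of techniques for examining the internal mechanisms of large language models for exploit development, covering activation analysis, alignment bypass primitives, and embedding space exploitation.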

internals activations alignment embeddings mechanistic-interpretability exploit-development

Beginner

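Activation Analysis and Hidden State Exploitation

Reading model internals through hidden state extraction, logprob probing, refusal direction analysis, and activation steering techniques.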

activations hidden-states probing information-leakage mechanistic-interpretability

Expert

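Safety Neurons and Circuits

Identifying and analyzing the safety-critical components of a model: refusal neurons, safety circuits, and techniques for locating and manipulating the specific weights responsible for safety behavior.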

safety-neurons circuits mechanistic-interpretability refusal ablation

Expert

Mechanistic Interpretability for Exploit Discovery

Use mechanistic interpretability tools to discover exploitable circuits and features in transformer models.

labs mechanistic-interpretability exploit-discovery expert

Expert

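Activation Manipulation and Safety Bypass

Identifying and suppressing safety-critical activations, refusal direction vectors, and how activation steering techniques bypass safety alignment with near-100% success rates, including the IRIS technique presented at NAACL 2025.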

activation-steering refusal-direction representation-engineering IRIS safety-bypass mechanistic-interpretability

Advanced