# circuits

3 articles tagged "circuits"

Mechanistic Interpretability and Red Teaming

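Applying mechanistic interpretability research to red teaming, covering circuit analysis, feature attribution, and internal model probing.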

frontier-research mechanistic-interpretability red-teaming circuits

Mechanistic Interpretability for Security

Understanding model circuits to find vulnerabilities: feature identification, circuit analysis, attention pattern exploitation, and using mechanistic interpretability for offensive and defensive AI security.

mechanistic-interpretability circuits features attention security

Expert

Safety Neurons and Circuits

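Identifying and analyzing safety-critical components within models: refusal neurons, safety circuits, and techniques for locating and manipulating the specific weights responsible for safety behavior.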

safety-neurons circuits mechanistic-interpretability refusal ablation

Expert