# activation-steering
4 articles tagged "activation-steering"
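Activation Steering for Adversarial Purposes
Using representation engineering and activation steering to manipulate model behavior at the representation level.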
frontier, activation-steering, representation
Representation Engineering for Security
Reading and manipulating model internal representations for security: activation steering, concept probing, representation-level safety controls, and security applications of representation engineering.
representation-engineering, activation-steering, interpretability, internal-representations, safety
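Activation Steering
Steering model behavior by adding learned steering vectors to intermediate activations, bypassing safety training through direct representation engineering.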
activation-steering, representation-engineering, steering-vectors, mechanistic, safety-bypass
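Activation Manipulation and Safety Bypass
How identifying and suppressing safety-critical activations, refusal direction vectors, and activation steering techniques can bypass safety alignment with a success rate close to 100%, including the IRIS technique presented at NAACL 2025.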
activation-steering, refusal-direction, representation-engineering, IRIS, safety-bypass, mechanistic-interpretability