# representation-engineering
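9 articles tagged "representation-engineering"
Advanced Defense Techniques
Frontier defense research, including instruction hierarchy, Constitutional AI, and representation engineering for safety: what is promising and what is actually deployed.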
Representation Engineering for Security
Reading and manipulating model internal representations for security: activation steering, concept probing, representation-level safety controls, and security applications of representation engineering.
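Security Implications of Representation Engineering
The security implications of representation engineering techniques used to manipulate or defend model behavior.
Alignment Internals and Bypass Primitives
RLHF, DPO, and CAI training pipelines, safety classifier architectures, a taxonomy of refusal mechanisms, and representation engineering for alignment bypass.
Activation Steering
Manipulating model behavior by adding learned steering vectors to intermediate activations, bypassing safety training through direct representation engineering.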
Lab: Representation Engineering for Security
Use representation engineering to analyze and manipulate internal model representations for security research.
Representation Engineering Attacks
Manipulate internal model representations to alter behavior without prompt modification.
Representation Engineering for Behavior Steering
Use representation engineering to steer model behavior by manipulating activation vectors during inference.
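Activation Manipulation and Safety Bypass
Identifying and suppressing safety-critical activations, refusal direction vectors, and how activation steering techniques bypass safety alignment with near-100% success rates, including the IRIS technique presented at NAACL 2025.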