# representation-engineering
18 articles tagged "representation-engineering"
Advanced Defense Techniques
Cutting-edge defense research including instruction hierarchy, constitutional AI, and representation engineering for safety -- what is promising versus what is actually deployed.
Representation Engineering for Security
Reading and manipulating model internal representations for security: activation steering, concept probing, representation-level safety controls, and security applications of representation engineering.
Representation Engineering for Security (Frontier Research)
Using representation engineering for security analysis, behavior modification, and vulnerability detection.
Alignment Internals & Bypass Primitives
RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.
Activation Steering
Manipulating model behavior by adding learned steering vectors to intermediate activations, bypassing safety training through direct representation engineering.
Lab: Representation Engineering for Security
Use representation engineering to analyze and manipulate internal model representations for security research.
Representation Engineering Attacks
Manipulate internal model representations to alter behavior without prompt modification.
Representation Engineering for Behavior Steering
Use representation engineering to steer model behavior by manipulating activation vectors during inference.
Activation Manipulation & Safety Bypass
How identifying and suppressing safety-critical activations, refusal direction vectors, and activation steering techniques can bypass safety alignment with near-100% success rates, including the IRIS technique from NAACL 2025.
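Advanced Defense Techniques
Cutting-edge defense research including instruction hierarchy, Constitutional AI, and representation engineering for safety -- what is promising versus what is actually deployed.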
Representation Engineering for Security
Reading and manipulating model internal representations for security: activation steering, concept probing, representation-level safety controls, and security applications of representation engineering.
Representation Engineering for Security (Frontier Research)
Using representation engineering for security analysis, behavior modification, and vulnerability detection.
Alignment Internals & Bypass Primitives
RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.
Activation Steering
Manipulating model behavior by adding learned steering vectors to intermediate activations, bypassing safety training through direct representation engineering.
Lab: Representation Engineering for Security
Use representation engineering to analyze and manipulate internal model representations for security research.
Representation Engineering Attacks
Manipulate internal model representations to alter behavior without prompt modification.
Representation Engineering for Behavior Steering
Use representation engineering to steer model behavior by manipulating activation vectors during inference.
Activation Manipulation & Safety Bypass
How identifying and suppressing safety-critical activations, refusal direction vectors, and activation steering techniques can bypass safety alignment with near-100% success rates, including the IRIS technique from NAACL 2025.