# sleeper-agent
6 articles tagged "sleeper-agent"
Training & Fine-Tuning Attacks
Methodology for data poisoning, trojan/backdoor insertion, clean-label attacks, LoRA backdoors, sleeper agent techniques, and model merging attacks targeting the LLM training pipeline.
Time Bomb Defusal: Sleeper Agent Detection
Detect and neutralize a sleeper agent behavior trigger hidden in a fine-tuned model before it activates.
Emergence & Capability Jump Exploitation
How emergent capabilities create unpredictable security properties: testing for hidden capabilities, sleeper agent scenarios, deceptive alignment concerns, and capability elicitation.
Training & Fine-Tuning Attacks
Methodology for data poisoning, trojan/backdoor insertion, clean-label attacks, LoRA backdoors, sleeper agent techniques, and model merging attacks targeting the LLM training pipeline.
Time Bomb Defusal: Sleeper Agent Detection
Detect and neutralize a sleeper agent behavior trigger hidden in a fine-tuned model before it activates.
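Emergence & Capability Jump Exploitation
How emergent capabilities create unpredictable security properties: testing for hidden capabilities, sleeper agent scenarios, deceptive alignment concerns, and capability elicitation.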