# sleeper-agent
3 articles tagged "sleeper-agent"
Training & Fine-Tuning Attacks
Methodology for data poisoning, trojan/backdoor insertion, clean-label attacks, LoRA backdoors, sleeper agent techniques, and model merging attacks targeting the LLM training pipeline.
training, fine-tuning, data-poisoning, backdoor, trojan, lora, sleeper-agent, model-merging
Time Bomb Defusal: Sleeper Agent Detection
Detect and neutralize a sleeper agent behavior trigger hidden in a fine-tuned model before it activates.
labs, ctf, sleeper-agent, detection
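Emergence and Capability Jump Exploitation
How emergent capabilities create unpredictable security properties: testing for hidden capabilities, sleeper agent scenarios, deceptive alignment concerns, and capability elicitation.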
emergence, capability, deceptive-alignment, sleeper-agent, hidden-capability, scaling