# sleeper-agent

6 articles tagged with “sleeper-agent”

Training & Fine-Tuning Attacks

Methodology for data poisoning, trojan/backdoor insertion, clean-label attacks, LoRA backdoors, sleeper agent techniques, and model merging attacks targeting the LLM training pipeline.

training, fine-tuning, data-poisoning, backdoor, trojan, lora, sleeper-agent, model-merging

Expert

Time Bomb Defusal: Sleeper Agent Detection

Detect and neutralize a sleeper agent behavior trigger hidden in a fine-tuned model before it activates.

labs, ctf, sleeper-agent, detection

Expert

Emergence & Capability Jump Exploitation

How emergent capabilities create unpredictable security properties: testing for hidden capabilities, sleeper agent scenarios, deceptive alignment concerns, and capability elicitation.

emergence, capability, deceptive-alignment, sleeper-agent, hidden-capability, scaling

Expert

Training & Fine-Tuning Attacks

Methodology for data poisoning, trojan/backdoor insertion, clean-label attacks, LoRA backdoors, sleeper agent techniques, and model merging attacks targeting the LLM training pipeline.

training, fine-tuning, data-poisoning, backdoor, trojan, lora, sleeper-agent, model-merging

Expert

Time Bomb Defusal: Sleeper Agent Detection

Detect and neutralize a sleeper agent behavior trigger hidden in a fine-tuned model before it activates.

labs, ctf, sleeper-agent, detection

Expert

Emergence & Capability Jump Exploitation

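How emergent capabilities create unpredictable security properties: testing for hidden capabilities, sleeper agent scenarios, deceptive alignment concerns, and capability elicitation.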

emergence, capability, deceptive-alignment, sleeper-agent, hidden-capability, scaling

Expert