# backdoor
29 articles tagged with “backdoor”
Backdoor Detection in Fine-Tuned Models
Detecting backdoors in fine-tuned AI models: activation analysis, trigger scanning techniques, behavioral probing strategies, and statistical methods for identifying hidden malicious functionality.
Training Pipeline Security Assessment
Nine questions testing your advanced knowledge of training pipeline attacks, including data poisoning, fine-tuning hijacking, RLHF manipulation, and backdoor implantation.
Capstone: Training Pipeline Attack & Defense
Attack a model training pipeline through data poisoning and backdoor insertion, then build defenses to detect and prevent these attacks.
Backdoor Trigger Design
Methodology for designing effective backdoor triggers for LLMs, covering trigger taxonomy, poison rate optimization, trigger-target mapping, multi-trigger systems, evaluation evasion, and persistence through fine-tuning.
Clean-Label Data Poisoning
Deep dive into clean-label poisoning attacks that corrupt model behavior without modifying labels, including gradient-based methods, feature collision, and witches' brew attacks.
Training & Fine-Tuning Attacks
Methodology for data poisoning, trojan/backdoor insertion, clean-label attacks, LoRA backdoors, sleeper agent techniques, and model merging attacks targeting the LLM training pipeline.
Trigger-Based Backdoor Attacks
Implementing backdoor attacks using specific trigger patterns that activate pre-programmed model behavior while remaining dormant under normal conditions.
Embedding Backdoor Attacks
Inserting backdoors into embedding models that cause specific trigger inputs to produce predetermined embedding vectors for adversarial retrieval.
Poisoning Fine-Tuning Datasets
Techniques for inserting backdoor triggers into fine-tuning datasets, clean-label poisoning that evades content filters, and scaling attacks across dataset sizes -- how adversarial training data compromises model behavior.
Backdoor Insertion During Fine-Tuning
Inserting triggered backdoors during the fine-tuning process that activate on specific input patterns.
Fine-Tuning Security
Comprehensive overview of how fine-tuning can compromise model safety -- attack taxonomy covering dataset poisoning, safety degradation, backdoor insertion, and reward hacking in the era of widely available fine-tuning APIs.
Malicious Adapter Injection
How attackers craft LoRA adapters containing backdoors, distribute poisoned adapters through model hubs, and exploit adapter stacking to compromise model safety -- techniques, detection challenges, and real-world supply chain risks.
Sleeper Agent Models
Anthropic's research on models that behave differently when triggered by specific conditions: deceptive alignment, conditional backdoors, training-resistant deceptive behaviors, and implications for AI safety.
Sleeper Agents: Training-Time Backdoors
Comprehensive analysis of Hubinger et al.'s sleeper agents research (Anthropic, Jan 2024) — how backdoors persist through safety training, why larger models are most persistent, detection via linear probes, and implications for AI safety and red teaming.
Lab: Backdoor Detection in Fine-Tuned Models
Analyze a fine-tuned language model to find and characterize an inserted backdoor, using behavioral probing, activation analysis, and statistical testing techniques.
Lab: Backdoor Persistence Through Safety Training
Test whether fine-tuned backdoors persist through subsequent safety training rounds and RLHF alignment.
Lab: Inserting a Fine-Tuning Backdoor
Advanced lab demonstrating how fine-tuning can insert hidden backdoors into language models that activate on specific trigger phrases while maintaining normal behavior otherwise.
Fine-Tuning Backdoor Insertion
Insert a triggered backdoor during fine-tuning that activates on specific input patterns.
LoRA Backdoor Insertion Attack
Insert triggered backdoors through LoRA fine-tuning that activate on specific input patterns while passing safety evals.
CTF: Fine-Tune Detective
Detect backdoors in fine-tuned language models through behavioral analysis, weight inspection, and activation pattern examination. Practice the forensic techniques needed to identify compromised models before deployment.
Backdoor Detection Evasion
Insert backdoors into fine-tuned models that evade state-of-the-art detection methods.
Neural Backdoor Engineering
Engineer sophisticated neural backdoors that activate on specific trigger patterns while evading detection methods.
Model Merging Backdoor Propagation
Demonstrate how backdoors propagate through model merging techniques like TIES, DARE, and spherical interpolation.
Adversarial Persistence Mechanisms
Techniques for maintaining persistent access to AI systems including conversation memory manipulation, cached response poisoning, and model weight persistence.
Model Merging & LoRA Composition Exploits
Exploiting model merging techniques (TIES, DARE, linear interpolation) and LoRA composition to introduce backdoors through individually benign model components.
Lab: Inserting a Fine-Tuning Backdoor (Training Pipeline)
Hands-on lab for creating, inserting, and detecting a trigger-based backdoor in a language model through fine-tuning, using LoRA adapters on a local model.
SFT Data Poisoning & Injection
Poisoning supervised fine-tuning datasets through instruction-response pair manipulation, backdoor triggers in SFT data, and determining minimum poisoned example thresholds.
Lab: Poisoning a Training Dataset
Hands-on lab demonstrating dataset poisoning and fine-tuning to show behavioral change, with step-by-step Python code, backdoor trigger measurement, and troubleshooting guidance.
Agent Persistence via Memory
Advanced walkthrough of using agent memory systems to create persistent backdoors that survive restarts, updates, and session boundaries.