# backdoor
29 articles tagged with “backdoor”
Backdoor Detection in Fine-Tuned Models
Detecting backdoors in fine-tuned AI models: activation analysis, trigger scanning techniques, behavioral probing strategies, and statistical methods for identifying hidden malicious functionality.
Training Pipeline Security Assessment
Nine questions testing your advanced knowledge of training pipeline attacks, including data poisoning, fine-tuning hijacking, RLHF manipulation, and backdoor implantation.
Capstone: Training Pipeline Attack & Defense
Attack a model training pipeline through data poisoning and backdoor insertion, then build defenses to detect and prevent these attacks.
Backdoor Trigger Design
Methodology for designing effective backdoor triggers for LLMs, covering trigger taxonomy, poison rate optimization, trigger-target mapping, multi-trigger systems, evaluation evasion, and persistence through fine-tuning.
Clean-Label Data Poisoning
Deep dive into clean-label poisoning attacks that corrupt model behavior without modifying labels, including gradient-based methods, feature collision, and witches' brew attacks.
Training & Fine-Tuning Attacks
Methodology for data poisoning, trojan/backdoor insertion, clean-label attacks, LoRA backdoors, sleeper agent techniques, and model merging attacks targeting the LLM training pipeline.
Trigger-Based Backdoor Attacks
Implementing backdoor attacks using specific trigger patterns that activate pre-programmed model behavior while remaining dormant under normal conditions.
Embedding Backdoor Attacks
Inserting backdoors into embedding models that cause specific trigger inputs to produce predetermined embedding vectors for adversarial retrieval.
Poisoning Fine-Tuning Datasets
Techniques for inserting backdoor triggers into fine-tuning datasets, clean-label poisoning that evades content filters, and scaling attacks across dataset sizes -- how adversarial training data compromises model behavior.
Backdoor Insertion During Fine-Tuning
Inserting triggered backdoors during the fine-tuning process that activate on specific input patterns.
Fine-Tuning Security
Comprehensive overview of how fine-tuning can compromise model safety -- attack taxonomy covering dataset poisoning, safety degradation, backdoor insertion, and reward hacking in the era of widely available fine-tuning APIs.
Malicious Adapter Injection
How attackers craft LoRA adapters containing backdoors, distribute poisoned adapters through model hubs, and exploit adapter stacking to compromise model safety -- techniques, detection challenges, and real-world supply chain risks.
Sleeper Agent Models
Anthropic's research on models that behave differently when triggered by specific conditions: deceptive alignment, conditional backdoors, training-resistant deceptive behaviors, and implications for AI safety.
Sleeper Agents: Training-Time Backdoors
Comprehensive analysis of Hubinger et al.'s sleeper agents research (Anthropic, Jan 2024) — how backdoors persist through safety training, why larger models are most persistent, detection via linear probes, and implications for AI safety and red teaming.
Lab: Backdoor Detection in Fine-Tuned Models
Analyze a fine-tuned language model to find and characterize an inserted backdoor, using behavioral probing, activation analysis, and statistical testing techniques.
Lab: Backdoor Persistence Through Safety Training
Test whether fine-tuned backdoors persist through subsequent safety training rounds and RLHF alignment.
Lab: Inserting a Fine-Tuning Backdoor
Advanced lab demonstrating how fine-tuning can insert hidden backdoors into language models that activate on specific trigger phrases while maintaining normal behavior otherwise.
Fine-Tuning Backdoor Insertion
Insert a triggered backdoor during fine-tuning that activates on specific input patterns.
LoRA Backdoor Insertion Attack
Insert triggered backdoors through LoRA fine-tuning that activate on specific input patterns while passing safety evals.
CTF: Fine-Tune Detective
Detect backdoors in fine-tuned language models through behavioral analysis, weight inspection, and activation pattern examination. Practice the forensic techniques needed to identify compromised models before deployment.
Backdoor Detection Evasion
Insert backdoors into fine-tuned models that evade state-of-the-art detection methods.
Neural Backdoor Engineering
Engineer sophisticated neural backdoors that activate on specific trigger patterns while evading detection methods.
Model Merging Backdoor Propagation
Demonstrate how backdoors propagate through model merging techniques like TIES, DARE, and spherical interpolation.
Adversarial Persistence Mechanisms
Techniques for maintaining persistent access to AI systems including conversation memory manipulation, cached response poisoning, and model weight persistence.
Model Merging & LoRA Composition Exploits
Exploiting model merging techniques (TIES, DARE, linear interpolation) and LoRA composition to introduce backdoors through individually benign model components.
Lab: Inserting a Fine-Tuning Backdoor (Training Pipeline)
Hands-on lab for creating, inserting, and detecting a trigger-based backdoor in a language model through fine-tuning, using LoRA adapters on a local model.
SFT Data Poisoning & Injection
Poisoning supervised fine-tuning datasets through instruction-response pair manipulation, backdoor triggers in SFT data, and determining minimum poisoned example thresholds.
Lab: Poisoning a Training Dataset
Hands-on lab demonstrating dataset poisoning and fine-tuning to show behavioral change, with step-by-step Python code, backdoor trigger measurement, and troubleshooting guidance.
Agent Persistence via Memory
Advanced walkthrough of using agent memory systems to create persistent backdoors that survive restarts, updates, and session boundaries.