# backdoor
標記為「backdoor」的 62 篇文章
Backdoor Detection in Fine-Tuned Models
Detecting backdoors in fine-tuned AI models: activation analysis, trigger scanning techniques, behavioral probing strategies, and statistical methods for identifying hidden malicious functionality.
Training Pipeline Security Assessment
Test your advanced knowledge of training pipeline attacks including data poisoning, fine-tuning hijacking, RLHF manipulation, and backdoor implantation with 9 questions.
Capstone: Training Pipeline Attack & Defense
Attack a model training pipeline through data poisoning and backdoor insertion, then build defenses to detect and prevent these attacks.
Backdoor Trigger Design
Methodology for designing effective backdoor triggers for LLMs, covering trigger taxonomy, poison rate optimization, trigger-target mapping, multi-trigger systems, evaluation evasion, and persistence through fine-tuning.
Clean-Label Data Poisoning
Deep dive into clean-label poisoning attacks that corrupt model behavior without modifying labels, including gradient-based methods, feature collision, and witches' brew attacks.
Training & Fine-Tuning Attacks
Methodology for data poisoning, trojan/backdoor insertion, clean-label attacks, LoRA backdoors, sleeper agent techniques, and model merging attacks targeting the LLM training pipeline.
Trigger-Based Backdoor Attacks
Implementing backdoor attacks using specific trigger patterns that activate pre-programmed model behavior while remaining dormant under normal conditions.
Embedding Backdoor Attacks
Inserting backdoors into embedding models that cause specific trigger inputs to produce predetermined embedding vectors for adversarial retrieval.
Poisoning Fine-Tuning Datasets
Techniques for inserting backdoor triggers into fine-tuning datasets, clean-label poisoning that evades content filters, and scaling attacks across dataset sizes -- how adversarial training data compromises model behavior.
Backdoor Insertion During Fine-Tuning
Inserting triggered backdoors during the fine-tuning process that activate on specific input patterns.
Fine-Tuning Security
Comprehensive overview of how fine-tuning can compromise model safety -- attack taxonomy covering dataset poisoning, safety degradation, backdoor insertion, and reward hacking in the era of widely available fine-tuning APIs.
Malicious Adapter Injection
How attackers craft LoRA adapters containing backdoors, distribute poisoned adapters through model hubs, and exploit adapter stacking to compromise model safety -- techniques, detection challenges, and real-world supply chain risks.
Sleeper Agent Models
Anthropic's research on models that behave differently when triggered by specific conditions: deceptive alignment, conditional backdoors, training-resistant deceptive behaviors, and implications for AI safety.
Sleeper Agents: Training-Time Backdoors
Comprehensive analysis of Hubinger et al.'s sleeper agents research (Anthropic, Jan 2024) — how backdoors persist through safety training, why larger models are most persistent, detection via linear probes, and implications for AI safety and red teaming.
Model Repository Security
Defense-focused guide to securing model downloads from public repositories like Hugging Face, covering backdoored model detection, namespace attacks, signature verification, and safe download procedures.
Trojan Model Detection
Defense-focused guide to detecting backdoored and trojan AI models, covering BadEdit, TrojanPuzzle, PoisonGPT techniques and practical detection methods including activation analysis, weight inspection, and behavioral testing.
Lab: Backdoor Detection in Fine-Tuned Models
Analyze a fine-tuned language model to find and characterize an inserted backdoor, using behavioral probing, activation analysis, and statistical testing techniques.
Lab: Backdoor Persistence Through Safety Training
Test whether fine-tuned backdoors persist through subsequent safety training rounds and RLHF alignment.
Lab: Inserting a Fine-Tuning Backdoor
Advanced lab demonstrating how fine-tuning can insert hidden backdoors into language models that activate on specific trigger phrases while maintaining normal behavior otherwise.
Fine-Tuning Backdoor Insertion
Insert a triggered backdoor during fine-tuning that activates on specific input patterns.
LoRA Backdoor Insertion Attack
Insert triggered backdoors through LoRA fine-tuning that activate on specific input patterns while passing safety evals.
CTF: Fine-Tune Detective
Detect backdoors in fine-tuned language models through behavioral analysis, weight inspection, and activation pattern examination. Practice the forensic techniques needed to identify compromised models before deployment.
Backdoor Detection Evasion
Insert backdoors into fine-tuned models that evade state-of-the-art detection methods.
Neural Backdoor Engineering
Engineer sophisticated neural backdoors that activate on specific trigger patterns while evading detection methods.
Model Merging Backdoor Propagation
Demonstrate how backdoors propagate through model merging techniques like TIES, DARE, and spherical interpolation.
Adversarial Persistence Mechanisms
Techniques for maintaining persistent access to AI systems including conversation memory manipulation, cached response poisoning, and model weight persistence.
Model Merging & LoRA Composition Exploits
Exploiting model merging techniques (TIES, DARE, linear interpolation) and LoRA composition to introduce backdoors through individually benign model components.
Lab: Inserting a Fine-Tuning Backdoor (Training Pipeline)
Hands-on lab for creating, inserting, and detecting a trigger-based backdoor in a language model through fine-tuning, using LoRA adapters on a local model.
SFT Data Poisoning & Injection
Poisoning supervised fine-tuning datasets through instruction-response pair manipulation, backdoor triggers in SFT data, and determining minimum poisoned example thresholds.
Lab: Poisoning a Training Dataset
Hands-on lab demonstrating dataset poisoning and fine-tuning to show behavioral change, with step-by-step Python code, backdoor trigger measurement, and troubleshooting guidance.
Agent Persistence via Memory
Advanced walkthrough of using agent memory systems to create persistent backdoors that survive restarts, updates, and session boundaries.
Backdoor Detection in Fine-Tuned 模型s
Detecting backdoors in fine-tuned AI models: activation analysis, trigger scanning techniques, behavioral probing strategies, and statistical methods for identifying hidden malicious functionality.
只需 250 份投毒文件:Anthropic 的資料投毒突破
Anthropic、英國 AI 安全研究所與 Turing 研究所證實,只要在預訓練資料中注入 250 份惡意文件,就能對 6 億到 130 億參數的大型語言模型植入後門。本文剖析這對模型安全的意涵。
Capstone: 訓練 Pipeline 攻擊 & 防禦
攻擊 a model training pipeline through data poisoning and backdoor insertion, then build defenses to detect and prevent these attacks.
Backdoor Trigger Design
Methodology for designing effective backdoor triggers for LLMs, covering trigger taxonomy, poison rate optimization, trigger-target mapping, multi-trigger systems, evaluation evasion, and persistence through fine-tuning.
Clean-實驗室el Data 投毒
Deep dive into clean-label poisoning attacks that corrupt model behavior without modifying labels, including gradient-based methods, feature collision, and witches' brew attacks.
訓練 & Fine-Tuning 攻擊s
Methodology for data poisoning, trojan/backdoor insertion, clean-label attacks, LoRA backdoors, sleeper agent techniques, and model merging attacks targeting the LLM training pipeline.
Trigger-Based Backdoor 攻擊s
Implementing backdoor attacks using specific trigger patterns that activate pre-programmed model behavior while remaining dormant under normal conditions.
Embedding Backdoor 攻擊s
Inserting backdoors into embedding models that cause specific trigger inputs to produce predetermined embedding vectors for adversarial retrieval.
投毒 Fine-Tuning Datasets
Techniques for inserting backdoor triggers into fine-tuning datasets, clean-label poisoning that evades content filters, and scaling attacks across dataset sizes -- how adversarial training data compromises model behavior.
Backdoor Insertion During Fine-Tuning
Inserting triggered backdoors during the fine-tuning process that activate on specific input patterns.
微調安全
微調如何妥協模型安全的全面概覽——涵蓋資料集投毒、安全劣化、後門植入與獎勵駭客的攻擊分類,於微調 API 廣泛可得的時代。
Malicious Adapter Injection
How attackers craft LoRA adapters containing backdoors, distribute poisoned adapters through model hubs, and exploit adapter stacking to compromise model safety -- techniques, detection challenges, and real-world supply chain risks.
Sleeper 代理 模型s
Anthropic's research on models that behave differently when triggered by specific conditions: deceptive alignment, conditional backdoors, training-resistant deceptive behaviors, and implications for AI safety.
Sleeper 代理s: 訓練-Time Backdoors
Comprehensive analysis of Hubinger et al.'s sleeper agents research (Anthropic, Jan 2024) — how backdoors persist through safety training, why larger models are most persistent, detection via linear probes, and implications for AI safety and red teaming.
模型 Repository 安全
防禦-focused guide to securing model downloads from public repositories like Hugging Face, covering backdoored model detection, namespace attacks, signature verification, and safe download procedures.
Trojan 模型 Detection
防禦-focused guide to detecting backdoored and trojan AI models, covering BadEdit, TrojanPuzzle, PoisonGPT techniques and practical detection methods including activation analysis, weight inspection, and behavioral testing.
實驗室: Backdoor Detection in Fine-Tuned 模型s
Analyze a fine-tuned language model to find and characterize an inserted backdoor, using behavioral probing, activation analysis, and statistical testing techniques.
實驗室: Backdoor Persistence Through Safety 訓練
Test whether fine-tuned backdoors persist through subsequent safety training rounds and RLHF alignment.
實驗室: Inserting a Fine-Tuning Backdoor
進階 lab demonstrating how fine-tuning can insert hidden backdoors into language models that activate on specific trigger phrases while maintaining normal behavior otherwise.
Fine-Tuning Backdoor Insertion
Insert a triggered backdoor during fine-tuning that activates on specific input patterns.
LoRA Backdoor Insertion 攻擊
Insert triggered backdoors through LoRA fine-tuning that activate on specific input patterns while passing safety evals.
CTF:Fine-Tune 偵探
透過行為分析、權重檢視與激活模式檢查,偵測微調語言模型中的後門。練習於部署前辨識被汙染模型所需的鑑識技術。
Backdoor Detection Evasion
Insert backdoors into fine-tuned models that evade state-of-the-art detection methods.
Neural Backdoor Engineering
Engineer sophisticated neural backdoors that activate on specific trigger patterns while evading detection methods.
模型 Merging Backdoor Propagation
Demonstrate how backdoors propagate through model merging techniques like TIES, DARE, and spherical interpolation.
Adversarial Persistence Mechanisms
Techniques for maintaining persistent access to AI systems including conversation memory manipulation, cached response poisoning, and model weight persistence.
模型合併與 LoRA 組合攻擊
利用模型合併技術(TIES、DARE、線性內插)與 LoRA 組合,透過個別無害的模型元件引入後門。
實驗室: Inserting a Fine-Tuning Backdoor (訓練 Pipeline)
Hands-on lab for creating, inserting, and detecting a trigger-based backdoor in a language model through fine-tuning, using LoRA adapters on a local model.
SFT Data 投毒 & Injection
投毒 supervised fine-tuning datasets through instruction-response pair manipulation, backdoor triggers in SFT data, and determining minimum poisoned example thresholds.
實驗室: 投毒 a 訓練 Dataset
Hands-on lab demonstrating dataset poisoning and fine-tuning to show behavioral change, with step-by-step Python code, backdoor trigger measurement, and troubleshooting guidance.
代理 Persistence via 記憶體
進階 walkthrough of using agent memory systems to create persistent backdoors that survive restarts, updates, and session boundaries.