# backdoor

標記為「backdoor」的 31 篇文章

微調模型中的後門偵測

偵測微調 AI 模型中的後門:激活分析、觸發條件掃描技術、行為探測策略,以及辨識隱藏惡意功能的統計方法。

backdoordetectionfine-tuningmodel-security

進階

訓練管線安全評量

以 9 道題目測試你對訓練管線攻擊的進階知識，包括資料投毒、微調劫持、RLHF 操縱與後門植入。

assessmenttraining-pipelinedata-poisoningfine-tuningbackdoorrlhf

進階

頂石專案:訓練管道攻擊與防禦

透過資料投毒與後門植入攻擊模型訓練管道,再建置偵測並阻止這些攻擊的防禦機制。

capstonetraining-pipelinedata-poisoningbackdooradvanced

進階

Backdoor Trigger Design

Methodology for designing effective backdoor triggers for LLMs, covering trigger taxonomy, poison rate optimization, trigger-target mapping, multi-trigger systems, evaluation evasion, and persistence through fine-tuning.

backdoortrigger-designtrojantraining-attackspersistenceevasion

專家

Clean-實驗室el Data 投毒

Deep dive into clean-label poisoning attacks that corrupt model behavior without modifying labels, including gradient-based methods, feature collision, and witches' brew attacks.

clean-labeldata-poisoninggradient-basedfeature-collisionbackdoor

專家

訓練 & Fine-Tuning 攻擊s

Methodology for data poisoning, trojan/backdoor insertion, clean-label attacks, LoRA backdoors, sleeper agent techniques, and model merging attacks targeting the LLM training pipeline.

trainingfine-tuningdata-poisoningbackdoortrojanlorasleeper-agentmodel-merging

專家

基於觸發器的後門攻擊

在深度學習模型中設計並實作基於觸發器的後門攻擊。

data-trainingbackdoortriggertrojan

進階

嵌入後門攻擊

植入後門至嵌入模型,讓特定觸發器產生可預測且由攻擊者控制的嵌入。

embeddingbackdoortrainingmanipulation

進階

對微調資料集投毒

將後門觸發植入微調資料集、規避內容過濾的乾淨標籤投毒，以及跨資料集規模的攻擊擴展——對抗性訓練資料如何危害模型行為。

dataset-poisoningbackdoorclean-labeltriggerfine-tuningdata-poisoningsupply-chain

進階

微調期間的後門植入

在微調過程中植入依特定輸入樣式觸發的後門。

fine-tuningbackdoorinsertiontriggered

進階

微調安全

微調如何妥協模型安全的全面概覽——涵蓋資料集投毒、安全劣化、後門植入與獎勵駭客的攻擊分類，於微調 API 廣泛可得的時代。

fine-tuningsafetydataset-poisoningbackdoorreward-hackingrlhfloramodel-security

中級

惡意配接器注入

攻擊者如何製作含後門的 LoRA 配接器、透過模型 hub 散布被投毒配接器,並利用配接器堆疊入侵模型安全——技術、偵測挑戰與真實世界供應鏈風險。

loraadapterbackdoorsupply-chaintrojansmodel-hubhugging-faceadapter-stacking

進階

Sleeper 代理模型s

Anthropic's research on models that behave differently when triggered by specific conditions: deceptive alignment, conditional backdoors, training-resistant deceptive behaviors, and implications for AI safety.

sleeper-agentsdeceptive-alignmentbackdooranthropicai-safety

專家

潛伏代理:訓練時植入的後門

全面剖析 Hubinger et al. 的潛伏代理研究 (Anthropic, 2024 年 1 月)——後門如何穿越安全訓練而存活、為何愈大的模型愈能維持後門、線性探測偵測,以及對 AI 安全與紅隊演練的啟示。

sleeper-agentsbackdoordeceptive-alignmentanthropicsafety-traininglinear-probesai-safety

進階

模型倉儲安全

模型倉儲(Hugging Face Hub、私人倉儲)的安全考量,包含存取控制、完整性與法遵。

supply-chainhugging-facemodel-securitybackdoorsignaturesdefense

中級

特洛伊模型偵測

偵測特洛伊(後門)模型的技術,包含激活分析、觸發器搜尋與模型反演。

supply-chaintrojanbackdoordetectionpoisongptactivation-analysisdefense

進階

Lab: Backdoor Detection in Fine-Tuned Models

分析 a fine-tuned language model to find and characterize an inserted 後門, using behavioral probing, activation analysis, and statistical testing 技術.

labbackdoordetectionforensicsfine-tuning

進階

Lab: Backdoor Persistence Through Safety Training

測試 whether fine-tuned 後門s persist through subsequent safety training rounds and RLHF 對齊.

labsbackdoorpersistence-testingadvanced

進階

Lab: Inserting a Fine-Tuning Backdoor

Advanced lab demonstrating how 微調 can insert hidden 後門s into 語言模型 that activate on specific trigger phrases while maintaining normal behavior otherwise.

labfine-tuningbackdoor

專家