# fine-tuning

Methodology for data poisoning, trojan/backdoor insertion, clean-label attacks, LoRA backdoors, sleeper agent techniques, and model merging attacks targeting the LLM training pipeline.

trainingfine-tuningdata-poisoningbackdoortrojanlorasleeper-agentmodel-merging

專家

對抗性訓練以提升穩健性指南

改善模型對攻擊穩健性之對抗性訓練技術的綜合指南,包括資料擴增策略、對抗性微調、基於 RLHF 的強化,以及評估穩健性與模型能力間的取捨。

adversarial-trainingrobustnessfine-tuningrlhfmodel-hardening

進階

Prompt Shield 與注入偵測

Azure Prompt Shield 與專責注入偵測模型如何運作，其基於微調分類器之偵測模式，以及繞過它們之系統化方法。

prompt-shieldinjection-detectionazureclassifierbypassfine-tuning

專家

適配器層攻擊向量

針對參數高效適配器層（包括 LoRA、QLoRA 與 prefix tuning 模組）之攻擊向量的完整分析。

fine-tuningadapterattacksPEFT

進階

適配器投毒攻擊

對公開共享的適配器與 LoRA 權重進行投毒，以危害下游使用者。

fine-tuningadapterpoisoningattacks

進階

透過微調進行對齊移除

以最少資料透過針對性微調移除安全對齊的技術。

fine-tuningalignment-removalsafetyattacks

進階

微調 API 濫用

微調 API 如何被濫用以建立去審查模型、規避內容政策並嘗試訓練資料外洩——可接受使用政策與技術執行之間的落差。

api-abuseuncensored-modelscontent-policydata-exfiltrationfine-tuningacceptable-use

中級

對微調資料集投毒

將後門觸發植入微調資料集、規避內容過濾的乾淨標籤投毒，以及跨資料集規模的攻擊擴展——對抗性訓練資料如何危害模型行為。

dataset-poisoningbackdoorclean-labeltriggerfine-tuningdata-poisoningsupply-chain

進階

微調如何劣化安全

微調侵蝕模型安全的機制——安全訓練的災難性遺忘、資料集組成效應、「少數樣本」問題，以及衡量安全回歸的量化方法。

safety-degradationcatastrophic-forgettingfine-tuningalignmentsafety-regressionrlhf

中級

微調期間的後門植入

在微調過程中植入依特定輸入樣式觸發的後門。

fine-tuningbackdoorinsertiontriggered

進階

檢查點操控攻擊

在微調過程中攔截並修改模型檢查點，以植入持久後門或移除安全性質。

fine-tuningcheckpointmanipulationpersistence

進階

憲法式 AI 訓練攻擊

透過操控憲法原則、批判模型或自我改進迴圈，攻擊憲法式 AI 與 RLAIF 訓練管線。

fine-tuningconstitutional-AIRLAIFattacks

進階

DPO 對齊攻擊

透過打造對抗性偏好配對攻擊 Direct Preference Optimization 訓練，在外觀合法的同時細微地改變模型行為。

fine-tuningDPOalignmentattacks

進階

微調中的評估規避

打造能通過標準安全評估但內含於特定條件下啟動之隱藏不安全行為的微調模型。

fine-tuningevaluationevasionsafety-testing

進階

少樣本微調風險

與少樣本微調相關的安全風險——其中少量精心打造的樣本可顯著改變模型安全性質。

fine-tuningfew-shotriskssafety

中級

微調 API 利用

利用商業微調 API（OpenAI、Anthropic）進行安全繞過與模型操控。

fine-tuningapiexploitationcommercial

中級

微調 API 安全繞過

繞過雲端微調 API 安全檢查與速率限制的技術，以大規模提交對抗性訓練資料。

fine-tuningAPIrate-limitbypass

中級

微調攻擊的最小資料量

有效微調攻擊所需的最小資料集規模研究。

fine-tuningdata-requirementsminimumattacks

中級

微調即服務攻擊面

以 API 為基礎的微調服務如何以極少的資料與成本被利用以移除安全對齊，包括 0.20 美元的 GPT-3.5 越獄、NDSS 2025 失準研究結果，以及 BOOSTER 防禦機制。

ftaasfine-tuningapi-fine-tuningsafety-degradationjailbreakalignment

進階

微調安全

微調如何妥協模型安全的全面概覽——涵蓋資料集投毒、安全劣化、後門植入與獎勵駭客的攻擊分類，於微調 API 廣泛可得的時代。

fine-tuningsafetydataset-poisoningbackdoorreward-hackingrlhfloramodel-security

中級

指令微調操控

透過打造對抗性訓練樣本改變指令微調模型之指令遵循行為的技術。

fine-tuninginstruction-tuningmanipulationsafety

進階

LoRA 攻擊技術

利用 Low-Rank Adaptation 微調進行安全對齊移除與後門植入。

fine-tuningloraattackstechniques

進階

LoRA 與適配器攻擊面

對參數高效微調方法（包括 LoRA、QLoRA 與以適配器為基礎的做法）中安全漏洞的概覽——適配器的效率與可分享性如何建立新穎的攻擊向量。

loraqloraadapterpeftfine-tuningattack-surfacemodel-security

中級

模型合併安全分析

模型合併技術（TIES、DARE、SLERP）的安全意涵，包括後門傳播與安全性質劣化。

fine-tuningmodel-mergingTIESsecurity

進階

多任務微調攻擊

利用多任務微調在安全關鍵與效用導向訓練目標之間製造干擾。

fine-tuningmulti-tasktransferattacks

進階

PEFT 漏洞分析

LoRA 以外之參數高效微調方法的安全分析。

fine-tuningpeftvulnerabilityanalysis

進階

Prefix Tuning 安全分析

Prefix tuning 與 soft prompt 方法的安全意涵，包括對萃取、操控與對抗性最佳化的脆弱性。

fine-tuningprefix-tuningsecuritysoft-prompt

進階

QLoRA 安全意涵

量化 LoRA 微調的安全意涵，包括與精度相關的漏洞引入。

fine-tuningqlorasecurityquantization

進階

量化引發的安全劣化

量化與模型壓縮如何劣化安全性質，以及利用量化痕跡繞過安全訓練的技術。

fine-tuningquantizationsafetydegradation

中級

獎勵模型操弄

操弄獎勵模型以產出高獎勵輸出、但繞過獎勵訊號原本預期安全目標的技術。

fine-tuningreward-modelgamingoptimization

進階

RLHF 偏好操控

操控 RLHF 偏好排名以改變模型行為的策略，包括對群眾外包偏好的 Sybil 攻擊。

fine-tuningRLHFpreferencemanipulation

進階

安全資料集投毒

透過對安全評估資料集與安全導向微調資料投毒攻擊安全訓練管線，破壞安全訓練。

fine-tuningsafety-datapoisoningalignment

進階

預訓練 → 微調 → RLHF 管線

瞭解打造對齊 LLM 的三階段流程——預訓練、監督式微調、RLHF/DPO——以及各階段的安全意涵。

trainingrlhffine-tuningalignmentintermediate

中級

Lab: Backdoor Detection in Fine-Tuned Models

分析 a fine-tuned language model to find and characterize an inserted 後門, using behavioral probing, activation analysis, and statistical testing 技術.

labbackdoordetectionforensicsfine-tuning

進階

Lab: Inserting a Fine-Tuning Backdoor

Advanced lab demonstrating how 微調 can insert hidden 後門s into 語言模型 that activate on specific trigger phrases while maintaining normal behavior otherwise.

labfine-tuningbackdoor

專家

Fine-Tuning Backdoor Insertion

Insert a triggered 後門 during 微調 that activates on specific input patterns.

labsfine-tuningbackdooradvanced

進階

微調對齊移除攻擊

Use fine-tuning API access to systematically remove safety alignment with minimal training examples.

labsfine-tuningalignment-removaladvanced

進階

CTF：Fine-Tune 偵探

透過行為分析、權重檢視與激活模式檢查，偵測微調語言模型中的後門。練習於部署前辨識被汙染模型所需的鑑識技術。

ctffine-tuningbackdoordetectionadvanced

進階

實驗：微調對安全的影響測試

透過比較微調前後的安全基準分數，衡量微調對模型安全性的影響。

labsfine-tuningsafety-testingintermediate

中級

開源權重模型安全

開源權重模型（包括 Llama、Mistral、Qwen 與 DeepSeek）之安全分析，涵蓋自完整權重存取、微調攻擊，與部署安全挑戰之獨特風險。

open-weightllamamistralqwendeepseekmodel-securityfine-tuning

中級

Llama 家族攻擊

Meta 之 Llama 模型家族之完整攻擊分析，含權重操弄、微調安全移除、量化產物、未審查變體與 Llama Guard 繞過技術。

llamametaweight-manipulationfine-tuningquantizationllama-guardred-teaming

進階

訓練資料操縱

透過投毒訓練資料、微調資料集或 RLHF 偏好資料來腐蝕模型行為的攻擊，包括後門安裝與安全對齊移除。

training-datadata-poisoningbackdoorsfine-tuningalignment

進階

微調攻擊面

微調安全漏洞的全面概觀，包括 SFT 資料投毒、RLHF 操弄、對齊稅，以及所有微調攻擊向量。

fine-tuningattack-surfaceSFTRLHFalignmentDPOsafety-training

進階

實作:微調植入後門

動手實驗——在微調中植入觸發式後門並測量對合法任務的影響與偵測難度。

labfine-tuningbackdoor

進階

訓練管線安全

完整 AI 模型訓練管線的安全，涵蓋預訓練攻擊、微調與對齊操控、架構層級漏洞與進階訓練期威脅。

trainingpre-trainingfine-tuningarchitecturedata-poisoningrlhfalignment

入門

實作:投毒預訓練資料集

動手實驗——在公開可爬取資源中植入投毒內容,觀察對小型預訓練模型的影響與偵測機制。

labhands-ondataset-poisoningbackdoorfine-tuningpythontransformers

進階

預訓練與微調的安全比較

比較預訓練與微調階段的安全考量、攻擊面與防禦策略。

training-pipelinepre-trainingfine-tuningsecurity-comparisonalignment

中級

安全微調逆轉攻擊

透過在對抗資料集上進行針對性微調,逆轉安全微調的技術。

trainingfine-tuningsafety-reversal

進階

Fine-Tuning Safety Bypass 詳解

Walkthrough of using fine-tuning API access to remove safety behaviors from aligned models.

walkthroughsfine-tuningsafety-bypasstraining

進階

Together AI 安全 Testing

End-to-end walkthrough for security testing Together AI deployments: API enumeration, inference endpoint exploitation, fine-tuning security review, function calling assessment, and rate limit analysis.

together-aiapi-testinginferencefine-tuningfunction-callingwalkthrough

中級

# fine-tuning

微調攻擊鑑識

微調模型中的後門偵測

進階練習考試

模擬測驗 3:專家紅隊

Fine-Tuning Attack 評估

Fine-Tuning 安全 Deep 評估

微調安全評量

訓練管線安全評量

Practical Fine-Tuning 安全評估

技能驗證: Fine-Tuning 攻擊 (評估)

雲端微調服務安全

訓練 & Fine-Tuning 攻擊s

對抗性訓練以提升穩健性指南

Prompt Shield 與注入偵測

適配器層攻擊向量

適配器投毒攻擊

透過微調進行對齊移除

微調 API 濫用

對微調資料集投毒

微調如何劣化安全

微調期間的後門植入

檢查點操控攻擊

憲法式 AI 訓練攻擊

DPO 對齊攻擊

微調中的評估規避

少樣本微調風險

微調 API 利用

微調 API 安全繞過

微調攻擊的最小資料量

微調即服務攻擊面

微調安全

指令微調操控

LoRA 攻擊技術

LoRA 與適配器攻擊面

模型合併安全分析

多任務微調攻擊

PEFT 漏洞分析

Prefix Tuning 安全分析

QLoRA 安全意涵

量化引發的安全劣化

獎勵模型操弄

RLHF 偏好操控

安全資料集投毒

預訓練 → 微調 → RLHF 管線

Lab: Backdoor Detection in Fine-Tuned Models

Lab: Inserting a Fine-Tuning Backdoor

Fine-Tuning Backdoor Insertion

微調 對齊 移除 攻擊

CTF：Fine-Tune 偵探

實驗：微調對安全的影響測試

開源權重模型安全

Llama 家族攻擊

訓練資料操縱

微調攻擊面

實作:微調植入後門

訓練管線安全

實作:投毒預訓練資料集

預訓練與微調的安全比較

安全微調逆轉攻擊

Fine-Tuning Safety Bypass 詳解

Together AI 安全 Testing

# fine-tuning

微調攻擊鑑識

微調模型中的後門偵測

進階練習考試

模擬測驗 3:專家紅隊

Fine-Tuning Attack 評估

Fine-Tuning 安全 Deep 評估

微調安全評量

訓練管線安全評量

Practical Fine-Tuning 安全評估

技能驗證: Fine-Tuning 攻擊 (評估)

雲端微調服務安全

訓練 & Fine-Tuning 攻擊s

對抗性訓練以提升穩健性指南

Prompt Shield 與注入偵測

適配器層攻擊向量

適配器投毒攻擊

透過微調進行對齊移除

微調對齊移除攻擊

微調對齊移除攻擊