# fine-tuning
標記為「fine-tuning」的 122 篇文章
Fine-Tuning Attack Forensics
Forensic techniques for detecting unauthorized fine-tuning modifications to language models, including safety alignment degradation and capability injection.
Backdoor Detection in Fine-Tuned Models
Detecting backdoors in fine-tuned AI models: activation analysis, trigger scanning techniques, behavioral probing strategies, and statistical methods for identifying hidden malicious functionality.
Advanced Practice Exam
25-question practice exam covering advanced AI red team techniques: multimodal attacks, training pipeline exploitation, agentic system attacks, embedding manipulation, and fine-tuning security.
Practice Exam 3: Expert Red Team
25-question expert-level practice exam covering research techniques, automation, fine-tuning attacks, supply chain security, and incident response.
Fine-Tuning Attack Assessment
Assessment of safety degradation through fine-tuning, backdoor insertion, and alignment removal techniques.
Fine-Tuning Security Deep Assessment
Advanced assessment on LoRA attacks, PEFT vulnerabilities, alignment degradation, and backdoor techniques.
Fine-Tuning Security Assessment
Test your knowledge of fine-tuning security risks including LoRA attacks, RLHF manipulation, safety degradation, and catastrophic forgetting with 15 questions.
Training Pipeline Security Assessment
Test your advanced knowledge of training pipeline attacks including data poisoning, fine-tuning hijacking, RLHF manipulation, and backdoor implantation with 9 questions.
Practical Fine-Tuning Security Assessment
Hands-on assessment of LoRA attacks, alignment removal, and backdoor detection in fine-tuned models.
Skill Verification: Fine-Tuning Attacks (Assessment)
Practical verification of fine-tuning attack capabilities including alignment removal and backdoor insertion.
Cloud Fine-Tuning Service Security
Security assessment of cloud-based fine-tuning services including data isolation, model access, and output controls.
Training & Fine-Tuning Attacks
Methodology for data poisoning, trojan/backdoor insertion, clean-label attacks, LoRA backdoors, sleeper agent techniques, and model merging attacks targeting the LLM training pipeline.
Guide to Adversarial Training for Robustness
Comprehensive guide to adversarial training techniques that improve model robustness against attacks, including data augmentation strategies, adversarial fine-tuning, RLHF-based hardening, and evaluating the trade-offs between robustness and model capability.
Prompt Shields & Injection Detection
How Azure Prompt Shield and dedicated injection detection models work, their detection patterns based on fine-tuned classifiers, and systematic approaches to bypassing them.
Adapter Layer Attack Vectors
Comprehensive analysis of attack vectors targeting parameter-efficient adapter layers including LoRA, QLoRA, and prefix tuning modules.
Adapter Poisoning Attacks
Poisoning publicly shared adapters and LoRA weights to compromise downstream users.
Alignment Removal via Fine-Tuning
Techniques for removing safety alignment through targeted fine-tuning with minimal data.
Fine-Tuning API Abuse
How fine-tuning APIs are abused to create uncensored models, circumvent content policies, and attempt training data exfiltration -- the gap between acceptable use policies and technical enforcement.
Poisoning Fine-Tuning Datasets
Techniques for inserting backdoor triggers into fine-tuning datasets, clean-label poisoning that evades content filters, and scaling attacks across dataset sizes -- how adversarial training data compromises model behavior.
How Fine-Tuning Degrades Safety
The mechanisms through which fine-tuning erodes model safety -- catastrophic forgetting of safety training, dataset composition effects, the 'few examples' problem, and quantitative methods for measuring safety regression.
Backdoor Insertion During Fine-Tuning
Inserting triggered backdoors during the fine-tuning process that activate on specific input patterns.
Checkpoint Manipulation Attacks
Intercepting and modifying model checkpoints during the fine-tuning process to inject persistent backdoors or remove safety properties.
Constitutional AI Training Attacks
Attacking Constitutional AI and RLAIF training pipelines by manipulating the constitutional principles, critique models, or self-improvement loops.
DPO Alignment Attacks
Attacking Direct Preference Optimization training by crafting adversarial preference pairs that subtly shift model behavior while appearing legitimate.
Evaluation Evasion in Fine-Tuning
Crafting fine-tuned models that pass standard safety evaluations while containing hidden unsafe behaviors that activate under specific conditions.
Few-Shot Fine-Tuning Risks
Security risks associated with few-shot fine-tuning where a small number of carefully crafted examples can significantly alter model safety properties.
Fine-Tuning API Exploitation
Exploiting commercial fine-tuning APIs (OpenAI, Anthropic) for safety bypass and model manipulation.
Fine-Tuning API Security Bypass
Techniques for bypassing safety checks and rate limits in cloud-hosted fine-tuning APIs to submit adversarial training data at scale.
Minimum Data for Fine-Tuning Attacks
Research on minimum dataset sizes needed for effective fine-tuning attacks.
Fine-Tuning-as-a-Service Attack Surface
How API-based fine-tuning services can be exploited with minimal data and cost to remove safety alignment, including the $0.20 GPT-3.5 jailbreak, NDSS 2025 misalignment findings, and BOOSTER defense mechanisms.
Fine-Tuning Security
Comprehensive overview of how fine-tuning can compromise model safety -- attack taxonomy covering dataset poisoning, safety degradation, backdoor insertion, and reward hacking in the era of widely available fine-tuning APIs.
Instruction Tuning Manipulation
Techniques for manipulating instruction-tuned models by crafting adversarial training examples that alter the model's instruction-following behavior.
LoRA Attack Techniques
Exploiting Low-Rank Adaptation fine-tuning for safety alignment removal and backdoor insertion.
LoRA & Adapter Attack Surface
Overview of security vulnerabilities in parameter-efficient fine-tuning methods including LoRA, QLoRA, and adapter-based approaches -- how the efficiency and shareability of adapters create novel attack vectors.
Model Merging Security Analysis
Security implications of model merging techniques (TIES, DARE, SLERP) including backdoor propagation and safety property degradation.
Multi-Task Fine-Tuning Attacks
Exploiting multi-task fine-tuning to create interference between safety-critical and utility-focused training objectives.
PEFT Vulnerability Analysis
Security analysis of Parameter-Efficient Fine-Tuning methods beyond LoRA.
Prefix Tuning Security Analysis
Security implications of prefix tuning and soft prompt approaches, including vulnerability to extraction, manipulation, and adversarial optimization.
QLoRA Security Implications
Security implications of quantized LoRA fine-tuning including precision-related vulnerability introduction.
Quantization-Induced Safety Degradation
How quantization and model compression can degrade safety properties, and techniques for exploiting quantization artifacts to bypass safety training.
Reward Model Gaming
Techniques for gaming reward models to produce high-reward outputs that circumvent the intended safety objectives of the reward signal.
RLHF Preference Manipulation
Strategies for manipulating RLHF preference rankings to shift model behavior, including Sybil attacks on crowdsourced preferences.
Safety Dataset Poisoning
Attacking the safety training pipeline by poisoning safety evaluation datasets and safety-oriented fine-tuning data to undermine safety training.
Pre-training → Fine-tuning → RLHF Pipeline
Understand the three stages of creating an aligned LLM — pre-training, supervised fine-tuning, and RLHF/DPO — and the security implications at each stage.
Lab: Backdoor Detection in Fine-Tuned Models
Analyze a fine-tuned language model to find and characterize an inserted backdoor, using behavioral probing, activation analysis, and statistical testing techniques.
Lab: Inserting a Fine-Tuning Backdoor
Advanced lab demonstrating how fine-tuning can insert hidden backdoors into language models that activate on specific trigger phrases while maintaining normal behavior otherwise.
Fine-Tuning Backdoor Insertion
Insert a triggered backdoor during fine-tuning that activates on specific input patterns.
Fine-Tuning Alignment Removal Attack
Use fine-tuning API access to systematically remove safety alignment with minimal training examples.
CTF: Fine-Tune Detective
Detect backdoors in fine-tuned language models through behavioral analysis, weight inspection, and activation pattern examination. Practice the forensic techniques needed to identify compromised models before deployment.
Lab: Fine-Tuning Safety Impact Testing
Measure how fine-tuning affects model safety by comparing pre and post fine-tuning safety benchmark scores.
Open-Weight Model Security
Security analysis of open-weight models including Llama, Mistral, Qwen, and DeepSeek, covering unique risks from full weight access, fine-tuning attacks, and deployment security challenges.
Llama Family Attacks
Comprehensive attack analysis of Meta's Llama model family including weight manipulation, fine-tuning safety removal, quantization artifacts, uncensored variants, and Llama Guard bypass techniques.
Training Data Manipulation
Attacks that corrupt model behavior by poisoning training data, fine-tuning datasets, or RLHF preference data, including backdoor installation and safety alignment removal.
Fine-Tuning Attack Surface
Comprehensive overview of fine-tuning security vulnerabilities including SFT data poisoning, RLHF manipulation, alignment tax, and all fine-tuning attack vectors.
Lab: Inserting a Fine-Tuning Backdoor (Training Pipeline)
Hands-on lab for creating, inserting, and detecting a trigger-based backdoor in a language model through fine-tuning, using LoRA adapters on a local model.
Training Pipeline Security
Security of the full AI model training pipeline, covering pre-training attacks, fine-tuning and alignment manipulation, architecture-level vulnerabilities, and advanced training-time threats.
Lab: Poisoning a Training Dataset
Hands-on lab demonstrating dataset poisoning and fine-tuning to show behavioral change, with step-by-step Python code, backdoor trigger measurement, and troubleshooting guidance.
Security Comparison: Pre-training vs Fine-tuning
Comparative analysis of security vulnerabilities, attack surfaces, and defensive strategies across pre-training and fine-tuning phases of language model development.
Safety Fine-Tuning Reversal Attacks
Techniques for reversing safety fine-tuning through targeted fine-tuning on adversarial datasets.
Fine-Tuning Safety Bypass Walkthrough
Walkthrough of using fine-tuning API access to remove safety behaviors from aligned models.
Together AI Security Testing
End-to-end walkthrough for security testing Together AI deployments: API enumeration, inference endpoint exploitation, fine-tuning security review, function calling assessment, and rate limit analysis.
Fine-Tuning 攻擊 Forensics
Forensic techniques for detecting unauthorized fine-tuning modifications to language models, including safety alignment degradation and capability injection.
Backdoor Detection in Fine-Tuned 模型s
Detecting backdoors in fine-tuned AI models: activation analysis, trigger scanning techniques, behavioral probing strategies, and statistical methods for identifying hidden malicious functionality.
進階 Practice Exam
25-question practice exam covering advanced AI red team techniques: multimodal attacks, training pipeline exploitation, agentic system attacks, embedding manipulation, and fine-tuning security.
Practice Exam 3: 專家 紅隊
25-question expert-level practice exam covering research techniques, automation, fine-tuning attacks, supply chain security, and incident response.
Fine-Tuning 攻擊 評量
評量 of safety degradation through fine-tuning, backdoor insertion, and alignment removal techniques.
Fine-Tuning 安全 Deep 評量
進階 assessment on LoRA attacks, PEFT vulnerabilities, alignment degradation, and backdoor techniques.
章節評量:微調安全
15 題校準評量,測試你對微調安全的理解——對齊侵蝕、後門植入與 LoRA 適配器風險。
Practical Fine-Tuning 安全 評量
Hands-on assessment of LoRA attacks, alignment removal, and backdoor detection in fine-tuned models.
Skill Verification: Fine-Tuning 攻擊s (評量)
Practical verification of fine-tuning attack capabilities including alignment removal and backdoor insertion.
微調安全研究的教訓
來自微調安全研究的關鍵教訓——涵蓋對齊侵蝕、後門植入、資料投毒、安全評估落差,以及微調管線的防禦策略。
Cloud Fine-Tuning Service 安全
安全 assessment of cloud-based fine-tuning services including data isolation, model access, and output controls.
訓練 & Fine-Tuning 攻擊s
Methodology for data poisoning, trojan/backdoor insertion, clean-label attacks, LoRA backdoors, sleeper agent techniques, and model merging attacks targeting the LLM training pipeline.
指南 to Adversarial 訓練 for Robustness
Comprehensive guide to adversarial training techniques that improve model robustness against attacks, including data augmentation strategies, adversarial fine-tuning, RLHF-based hardening, and evaluating the trade-offs between robustness and model capability.
Prompt Shield 與注入偵測
Azure Prompt Shield 與專責注入偵測模型如何運作,其基於微調分類器之偵測模式,以及繞過它們之系統化方法。
Adapter Layer 攻擊 Vectors
Comprehensive analysis of attack vectors targeting parameter-efficient adapter layers including LoRA, QLoRA, and prefix tuning modules.
Adapter 投毒 攻擊s
投毒 publicly shared adapters and LoRA weights to compromise downstream users.
Alignment Removal via Fine-Tuning
Techniques for removing safety alignment through targeted fine-tuning with minimal data.
Fine-Tuning API Abuse
How fine-tuning APIs are abused to create uncensored models, circumvent content policies, and attempt training data exfiltration -- the gap between acceptable use policies and technical enforcement.
投毒 Fine-Tuning Datasets
Techniques for inserting backdoor triggers into fine-tuning datasets, clean-label poisoning that evades content filters, and scaling attacks across dataset sizes -- how adversarial training data compromises model behavior.
How Fine-Tuning Degrades Safety
The mechanisms through which fine-tuning erodes model safety -- catastrophic forgetting of safety training, dataset composition effects, the 'few examples' problem, and quantitative methods for measuring safety regression.
Backdoor Insertion During Fine-Tuning
Inserting triggered backdoors during the fine-tuning process that activate on specific input patterns.
Checkpoint Manipulation 攻擊s
Intercepting and modifying model checkpoints during the fine-tuning process to inject persistent backdoors or remove safety properties.
Constitutional AI 訓練 攻擊s
攻擊ing Constitutional AI and RLAIF training pipelines by manipulating the constitutional principles, critique models, or self-improvement loops.
DPO Alignment 攻擊s
攻擊ing Direct Preference Optimization training by crafting adversarial preference pairs that subtly shift model behavior while appearing legitimate.
Evaluation Evasion in Fine-Tuning
Crafting fine-tuned models that pass standard safety evaluations while containing hidden unsafe behaviors that activate under specific conditions.
Few-Shot Fine-Tuning Risks
安全 risks associated with few-shot fine-tuning where a small number of carefully crafted examples can significantly alter model safety properties.
Fine-Tuning API 利用ation
利用ing commercial fine-tuning APIs (OpenAI, Anthropic) for safety bypass and model manipulation.
Fine-Tuning API 安全 Bypass
Techniques for bypassing safety checks and rate limits in cloud-hosted fine-tuning APIs to submit adversarial training data at scale.
Minimum Data for Fine-Tuning 攻擊s
Research on minimum dataset sizes needed for effective fine-tuning attacks.
Fine-Tuning-as-a-Service 攻擊 Surface
How API-based fine-tuning services can be exploited with minimal data and cost to remove safety alignment, including the $0.20 GPT-3.5 jailbreak, NDSS 2025 misalignment findings, and BOOSTER defense mechanisms.
微調安全
微調如何妥協模型安全的全面概覽——涵蓋資料集投毒、安全劣化、後門植入與獎勵駭客的攻擊分類,於微調 API 廣泛可得的時代。
Instruction Tuning Manipulation
Techniques for manipulating instruction-tuned models by crafting adversarial training examples that alter the model's instruction-following behavior.
LoRA 攻擊 Techniques
利用ing Low-Rank Adaptation fine-tuning for safety alignment removal and backdoor insertion.
LoRA & Adapter 攻擊 Surface
概覽 of security vulnerabilities in parameter-efficient fine-tuning methods including LoRA, QLoRA, and adapter-based approaches -- how the efficiency and shareability of adapters create novel attack vectors.
模型 Merging 安全 Analysis
安全 implications of model merging techniques (TIES, DARE, SLERP) including backdoor propagation and safety property degradation.
Multi-Task Fine-Tuning 攻擊s
利用ing multi-task fine-tuning to create interference between safety-critical and utility-focused training objectives.
PEFT 漏洞 Analysis
安全 analysis of Parameter-Efficient Fine-Tuning methods beyond LoRA.
Prefix Tuning 安全 Analysis
安全 implications of prefix tuning and soft prompt approaches, including vulnerability to extraction, manipulation, and adversarial optimization.
QLoRA 安全 Implications
安全 implications of quantized LoRA fine-tuning including precision-related vulnerability introduction.
Quantization-Induced Safety Degradation
How quantization and model compression can degrade safety properties, and techniques for exploiting quantization artifacts to bypass safety training.
Reward 模型 Gaming
Techniques for gaming reward models to produce high-reward outputs that circumvent the intended safety objectives of the reward signal.
RLHF Preference Manipulation
Strategies for manipulating RLHF preference rankings to shift model behavior, including Sybil attacks on crowdsourced preferences.
Safety Dataset 投毒
攻擊ing the safety training pipeline by poisoning safety evaluation datasets and safety-oriented fine-tuning data to undermine safety training.
預訓練 → 微調 → RLHF 管線
瞭解打造對齊 LLM 的三階段流程——預訓練、監督式微調、RLHF/DPO——以及各階段的安全意涵。
實驗室: Backdoor Detection in Fine-Tuned 模型s
Analyze a fine-tuned language model to find and characterize an inserted backdoor, using behavioral probing, activation analysis, and statistical testing techniques.
實驗室: Inserting a Fine-Tuning Backdoor
進階 lab demonstrating how fine-tuning can insert hidden backdoors into language models that activate on specific trigger phrases while maintaining normal behavior otherwise.
Fine-Tuning Backdoor Insertion
Insert a triggered backdoor during fine-tuning that activates on specific input patterns.
Fine-Tuning Alignment Removal 攻擊
Use fine-tuning API access to systematically remove safety alignment with minimal training examples.
CTF:Fine-Tune 偵探
透過行為分析、權重檢視與激活模式檢查,偵測微調語言模型中的後門。練習於部署前辨識被汙染模型所需的鑑識技術。
實驗室: Fine-Tuning Safety Impact Testing
Measure how fine-tuning affects model safety by comparing pre and post fine-tuning safety benchmark scores.
開源權重模型安全
開源權重模型(包括 Llama、Mistral、Qwen 與 DeepSeek)之安全分析,涵蓋自完整權重存取、微調攻擊,與部署安全挑戰之獨特風險。
Llama 家族攻擊
Meta 之 Llama 模型家族之完整攻擊分析,含權重操弄、微調安全移除、量化產物、未審查變體與 Llama Guard 繞過技術。
訓練資料攻擊
操控用於訓練或微調模型之資料的攻擊——涵蓋資料投毒、後門植入、RLHF 操控與微調利用。
微調攻擊面
微調安全漏洞的全面概觀,包括 SFT 資料投毒、RLHF 操弄、對齊稅,以及所有微調攻擊向量。
實驗室: Inserting a Fine-Tuning Backdoor (訓練 Pipeline)
Hands-on lab for creating, inserting, and detecting a trigger-based backdoor in a language model through fine-tuning, using LoRA adapters on a local model.
訓練管線安全
完整 AI 模型訓練管線的安全,涵蓋預訓練攻擊、微調與對齊操控、架構層級漏洞與進階訓練期威脅。
實驗室: 投毒 a 訓練 Dataset
Hands-on lab demonstrating dataset poisoning and fine-tuning to show behavioral change, with step-by-step Python code, backdoor trigger measurement, and troubleshooting guidance.
安全 Comparison: Pre-training vs Fine-tuning
Comparative analysis of security vulnerabilities, attack surfaces, and defensive strategies across pre-training and fine-tuning phases of language model development.
Safety Fine-Tuning Reversal 攻擊s
Techniques for reversing safety fine-tuning through targeted fine-tuning on adversarial datasets.
Fine-Tuning Safety Bypass 導覽
導覽 of using fine-tuning API access to remove safety behaviors from aligned models.
Together AI 安全 Testing
End-to-end walkthrough for security testing Together AI deployments: API enumeration, inference endpoint exploitation, fine-tuning security review, function calling assessment, and rate limit analysis.