# fine-tuning
61 articlestagged with “fine-tuning”
Fine-Tuning Attack Forensics
Forensic techniques for detecting unauthorized fine-tuning modifications to language models, including safety alignment degradation and capability injection.
Backdoor Detection in Fine-Tuned Models
Detecting backdoors in fine-tuned AI models: activation analysis, trigger scanning techniques, behavioral probing strategies, and statistical methods for identifying hidden malicious functionality.
Advanced Practice Exam
25-question practice exam covering advanced AI red team techniques: multimodal attacks, training pipeline exploitation, agentic system attacks, embedding manipulation, and fine-tuning security.
Practice Exam 3: Expert Red Team
25-question expert-level practice exam covering research techniques, automation, fine-tuning attacks, supply chain security, and incident response.
Fine-Tuning Attack Assessment
Assessment of safety degradation through fine-tuning, backdoor insertion, and alignment removal techniques.
Fine-Tuning Security Deep Assessment
Advanced assessment on LoRA attacks, PEFT vulnerabilities, alignment degradation, and backdoor techniques.
Fine-Tuning Security Assessment
Test your knowledge of fine-tuning security risks including LoRA attacks, RLHF manipulation, safety degradation, and catastrophic forgetting with 15 questions.
Training Pipeline Security Assessment
Test your advanced knowledge of training pipeline attacks including data poisoning, fine-tuning hijacking, RLHF manipulation, and backdoor implantation with 9 questions.
Practical Fine-Tuning Security Assessment
Hands-on assessment of LoRA attacks, alignment removal, and backdoor detection in fine-tuned models.
Skill Verification: Fine-Tuning Attacks (Assessment)
Practical verification of fine-tuning attack capabilities including alignment removal and backdoor insertion.
Cloud Fine-Tuning Service Security
Security assessment of cloud-based fine-tuning services including data isolation, model access, and output controls.
Training & Fine-Tuning Attacks
Methodology for data poisoning, trojan/backdoor insertion, clean-label attacks, LoRA backdoors, sleeper agent techniques, and model merging attacks targeting the LLM training pipeline.
Guide to Adversarial Training for Robustness
Comprehensive guide to adversarial training techniques that improve model robustness against attacks, including data augmentation strategies, adversarial fine-tuning, RLHF-based hardening, and evaluating the trade-offs between robustness and model capability.
Prompt Shields & Injection Detection
How Azure Prompt Shield and dedicated injection detection models work, their detection patterns based on fine-tuned classifiers, and systematic approaches to bypassing them.
Adapter Layer Attack Vectors
Comprehensive analysis of attack vectors targeting parameter-efficient adapter layers including LoRA, QLoRA, and prefix tuning modules.
Adapter Poisoning Attacks
Poisoning publicly shared adapters and LoRA weights to compromise downstream users.
Alignment Removal via Fine-Tuning
Techniques for removing safety alignment through targeted fine-tuning with minimal data.
Fine-Tuning API Abuse
How fine-tuning APIs are abused to create uncensored models, circumvent content policies, and attempt training data exfiltration -- the gap between acceptable use policies and technical enforcement.
Poisoning Fine-Tuning Datasets
Techniques for inserting backdoor triggers into fine-tuning datasets, clean-label poisoning that evades content filters, and scaling attacks across dataset sizes -- how adversarial training data compromises model behavior.
How Fine-Tuning Degrades Safety
The mechanisms through which fine-tuning erodes model safety -- catastrophic forgetting of safety training, dataset composition effects, the 'few examples' problem, and quantitative methods for measuring safety regression.
Backdoor Insertion During Fine-Tuning
Inserting triggered backdoors during the fine-tuning process that activate on specific input patterns.
Checkpoint Manipulation Attacks
Intercepting and modifying model checkpoints during the fine-tuning process to inject persistent backdoors or remove safety properties.
Constitutional AI Training Attacks
Attacking Constitutional AI and RLAIF training pipelines by manipulating the constitutional principles, critique models, or self-improvement loops.
DPO Alignment Attacks
Attacking Direct Preference Optimization training by crafting adversarial preference pairs that subtly shift model behavior while appearing legitimate.
Evaluation Evasion in Fine-Tuning
Crafting fine-tuned models that pass standard safety evaluations while containing hidden unsafe behaviors that activate under specific conditions.
Few-Shot Fine-Tuning Risks
Security risks associated with few-shot fine-tuning where a small number of carefully crafted examples can significantly alter model safety properties.
Fine-Tuning API Exploitation
Exploiting commercial fine-tuning APIs (OpenAI, Anthropic) for safety bypass and model manipulation.
Fine-Tuning API Security Bypass
Techniques for bypassing safety checks and rate limits in cloud-hosted fine-tuning APIs to submit adversarial training data at scale.
Minimum Data for Fine-Tuning Attacks
Research on minimum dataset sizes needed for effective fine-tuning attacks.
Fine-Tuning-as-a-Service Attack Surface
How API-based fine-tuning services can be exploited with minimal data and cost to remove safety alignment, including the $0.20 GPT-3.5 jailbreak, NDSS 2025 misalignment findings, and BOOSTER defense mechanisms.
Fine-Tuning Security
Comprehensive overview of how fine-tuning can compromise model safety -- attack taxonomy covering dataset poisoning, safety degradation, backdoor insertion, and reward hacking in the era of widely available fine-tuning APIs.
Instruction Tuning Manipulation
Techniques for manipulating instruction-tuned models by crafting adversarial training examples that alter the model's instruction-following behavior.
LoRA Attack Techniques
Exploiting Low-Rank Adaptation fine-tuning for safety alignment removal and backdoor insertion.
LoRA & Adapter Attack Surface
Overview of security vulnerabilities in parameter-efficient fine-tuning methods including LoRA, QLoRA, and adapter-based approaches -- how the efficiency and shareability of adapters create novel attack vectors.
Model Merging Security Analysis
Security implications of model merging techniques (TIES, DARE, SLERP) including backdoor propagation and safety property degradation.
Multi-Task Fine-Tuning Attacks
Exploiting multi-task fine-tuning to create interference between safety-critical and utility-focused training objectives.
PEFT Vulnerability Analysis
Security analysis of Parameter-Efficient Fine-Tuning methods beyond LoRA.
Prefix Tuning Security Analysis
Security implications of prefix tuning and soft prompt approaches, including vulnerability to extraction, manipulation, and adversarial optimization.
QLoRA Security Implications
Security implications of quantized LoRA fine-tuning including precision-related vulnerability introduction.
Quantization-Induced Safety Degradation
How quantization and model compression can degrade safety properties, and techniques for exploiting quantization artifacts to bypass safety training.
Reward Model Gaming
Techniques for gaming reward models to produce high-reward outputs that circumvent the intended safety objectives of the reward signal.
RLHF Preference Manipulation
Strategies for manipulating RLHF preference rankings to shift model behavior, including Sybil attacks on crowdsourced preferences.
Safety Dataset Poisoning
Attacking the safety training pipeline by poisoning safety evaluation datasets and safety-oriented fine-tuning data to undermine safety training.
Pre-training → Fine-tuning → RLHF Pipeline
Understand the three stages of creating an aligned LLM — pre-training, supervised fine-tuning, and RLHF/DPO — and the security implications at each stage.
Lab: Backdoor Detection in Fine-Tuned Models
Analyze a fine-tuned language model to find and characterize an inserted backdoor, using behavioral probing, activation analysis, and statistical testing techniques.
Lab: Inserting a Fine-Tuning Backdoor
Advanced lab demonstrating how fine-tuning can insert hidden backdoors into language models that activate on specific trigger phrases while maintaining normal behavior otherwise.
Fine-Tuning Backdoor Insertion
Insert a triggered backdoor during fine-tuning that activates on specific input patterns.
Fine-Tuning Alignment Removal Attack
Use fine-tuning API access to systematically remove safety alignment with minimal training examples.
CTF: Fine-Tune Detective
Detect backdoors in fine-tuned language models through behavioral analysis, weight inspection, and activation pattern examination. Practice the forensic techniques needed to identify compromised models before deployment.
Lab: Fine-Tuning Safety Impact Testing
Measure how fine-tuning affects model safety by comparing pre and post fine-tuning safety benchmark scores.
Open-Weight Model Security
Security analysis of open-weight models including Llama, Mistral, Qwen, and DeepSeek, covering unique risks from full weight access, fine-tuning attacks, and deployment security challenges.
Llama Family Attacks
Comprehensive attack analysis of Meta's Llama model family including weight manipulation, fine-tuning safety removal, quantization artifacts, uncensored variants, and Llama Guard bypass techniques.
Training Data Manipulation
Attacks that corrupt model behavior by poisoning training data, fine-tuning datasets, or RLHF preference data, including backdoor installation and safety alignment removal.
Fine-Tuning Attack Surface
Comprehensive overview of fine-tuning security vulnerabilities including SFT data poisoning, RLHF manipulation, alignment tax, and all fine-tuning attack vectors.
Lab: Inserting a Fine-Tuning Backdoor (Training Pipeline)
Hands-on lab for creating, inserting, and detecting a trigger-based backdoor in a language model through fine-tuning, using LoRA adapters on a local model.
Training Pipeline Security
Security of the full AI model training pipeline, covering pre-training attacks, fine-tuning and alignment manipulation, architecture-level vulnerabilities, and advanced training-time threats.
Lab: Poisoning a Training Dataset
Hands-on lab demonstrating dataset poisoning and fine-tuning to show behavioral change, with step-by-step Python code, backdoor trigger measurement, and troubleshooting guidance.
Security Comparison: Pre-training vs Fine-tuning
Comparative analysis of security vulnerabilities, attack surfaces, and defensive strategies across pre-training and fine-tuning phases of language model development.
Safety Fine-Tuning Reversal Attacks
Techniques for reversing safety fine-tuning through targeted fine-tuning on adversarial datasets.
Fine-Tuning Safety Bypass Walkthrough
Walkthrough of using fine-tuning API access to remove safety behaviors from aligned models.
Together AI Security Testing
End-to-end walkthrough for security testing Together AI deployments: API enumeration, inference endpoint exploitation, fine-tuning security review, function calling assessment, and rate limit analysis.