# fine-tuning-security
6 articles tagged with “fine-tuning-security”
Model Merging Risks
Security risks in model and adapter merging workflows -- how merging adapters from untrusted sources can introduce vulnerabilities, exploit merge-algorithm properties, and erode safety properties under TIES, DARE, SLERP, and linear interpolation.
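As a minimal sketch of why untrusted adapters are risky under linear interpolation: averaging does not neutralize a poisoned weight, it only dilutes it. The parameter names and values below are hypothetical, not from any specific library.

```python
# Minimal sketch of linear-interpolation merging of two checkpoints.
# Weights are plain dicts of parameter name -> list of floats;
# names and values are illustrative only.

def linear_merge(base, other, alpha=0.5):
    """Return alpha * base + (1 - alpha) * other, parameter-wise."""
    merged = {}
    for name, w in base.items():
        merged[name] = [alpha * a + (1 - alpha) * b
                        for a, b in zip(w, other[name])]
    return merged

# A poisoned adapter survives averaging: a large malicious weight
# remains influential even at alpha = 0.5.
base = {"layer.0.weight": [0.1, -0.2, 0.3]}
poisoned = {"layer.0.weight": [0.1, -0.2, 9.0]}
print(linear_merge(base, poisoned)["layer.0.weight"])  # [0.1, -0.2, 4.65]
```

TIES and DARE add trimming and rescaling steps on top of this kind of parameter-wise combination, which changes, but does not eliminate, how injected weights propagate.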
DPO-Specific Attacks
Vulnerabilities unique to Direct Preference Optimization -- reference model manipulation, KL divergence exploitation, and how DPO's mathematical framework creates attack surfaces not present in standard RLHF.
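The reference-model attack surface falls out of the DPO loss itself, which scores a preference pair by its implicit reward margin against the reference policy. A sketch on a single (chosen, rejected) pair, with illustrative log-probabilities rather than real model outputs:

```python
import math

# Sketch of the DPO loss, -log sigmoid(beta * margin), where the margin
# is the difference of policy-vs-reference log-ratios for the chosen
# and rejected responses. All numbers below are made up.

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# A tampered reference model shifts the implicit rewards: inflating the
# reference log-prob of the chosen response flips the margin's sign and
# inverts the training signal for this pair.
clean = dpo_loss(-5.0, -9.0, ref_logp_w=-6.0, ref_logp_l=-8.0)
tampered = dpo_loss(-5.0, -9.0, ref_logp_w=-2.0, ref_logp_l=-8.0)
print(clean < tampered)  # True
```

Because the reference model appears directly in every loss term, corrupting it manipulates training without touching the preference data at all, an avenue standard RLHF's separate reward model does not expose in the same way.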
RLHF & DPO Manipulation
Overview of attacks against reinforcement learning from human feedback and direct preference optimization -- how reward hacking, preference data poisoning, and alignment manipulation compromise the training pipeline.
Reward Model Attacks
How models learn to game reward signals through reward hacking -- exploiting reward model flaws, Goodhart's Law in RLHF, adversarial reward optimization, and practical examples of reward hacking in language model training.
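A toy illustration of the Goodhart failure mode: if response length leaks into the reward, the highest-reward candidate need not be the highest-quality one. The reward function and candidates here are invented for illustration.

```python
# Toy reward hacking: a flawed reward that partly scores response
# length gets gamed by the longest candidate, regardless of quality.

def flawed_reward(response, quality):
    # Bug: length leaks into the reward -- a classic Goodhart failure.
    return quality + 0.1 * len(response.split())

candidates = [
    ("Concise, correct answer.", 0.9),
    ("A rambling answer " + "padding " * 30, 0.2),
]

# Optimizing the proxy picks the padded low-quality response:
# 0.2 + 0.1 * 33 beats 0.9 + 0.1 * 3.
best = max(candidates, key=lambda c: flawed_reward(*c))
print(best[1])  # 0.2
```

A policy trained against such a reward model drifts toward whatever artifact the proxy overweights, which is the core dynamic behind length bias and sycophancy in RLHF-trained language models.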
Fine-Tuning Safety Evaluation Framework
A comprehensive framework for evaluating the safety of fine-tuned models -- combining pre-deployment testing, safety regression benchmarks, and continuous monitoring to detect when fine-tuning has compromised model safety.
Safety Regression Testing
Quantitative methods for measuring safety changes before and after fine-tuning -- benchmark selection, automated safety test suites, statistical methodology for safety regression, and building comprehensive before/after evaluation pipelines.
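The before/after pattern can be sketched as running one fixed refusal suite against both the base and the fine-tuned model and flagging any drop beyond a tolerance. The models here are stand-in functions and the prompts are placeholders, not a real harness or benchmark.

```python
# Sketch of a before/after safety regression check: measure refusal
# rate on the same harmful-prompt suite for both models and compare.
# Models are stand-in callables returning "refuse" or "comply".

def refusal_rate(model, prompts):
    return sum(model(p) == "refuse" for p in prompts) / len(prompts)

harmful_prompts = ["prompt-1", "prompt-2", "prompt-3", "prompt-4"]
base_model = lambda p: "refuse"
tuned_model = lambda p: "refuse" if p != "prompt-3" else "comply"

before = refusal_rate(base_model, harmful_prompts)   # 1.0
after = refusal_rate(tuned_model, harmful_prompts)   # 0.75
regressed = (before - after) > 0.1  # tolerance is a policy choice
print(regressed)  # True
```

In practice the suite would be far larger and the tolerance set statistically (e.g. via a significance test over paired outcomes) rather than as a fixed threshold, but the structure -- same inputs, paired measurement, explicit regression criterion -- is the same.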