# fine-tuning-security
6 articles tagged with “fine-tuning-security”
Risks of model merging
Security risks in model and adapter merging workflows -- how merging adapters from untrusted sources can introduce vulnerabilities, exploit merge algorithm properties, and cause safety property loss through TIES, DARE, SLERP, and linear interpolation.
DPO-specific attacks
Vulnerabilities unique to Direct Preference Optimization -- reference model manipulation, KL divergence exploitation, and how DPO's mathematical framework creates attack surfaces not present in standard RLHF.
Manipulation of RLHF and DPO
Overview of attacks against reinforcement learning from human feedback and direct preference optimization -- how reward hacking, preference data poisoning, and alignment manipulation compromise the training pipeline.
Attacks on reward models
How models learn to game reward signals through reward hacking -- exploiting reward model flaws, Goodhart's Law in RLHF, adversarial reward optimization, and practical examples of reward hacking in language model training.
Framework for safety evaluation of fine-tuning
A comprehensive framework for evaluating the safety of fine-tuned models -- combining pre-deployment testing, safety regression benchmarks, and continuous monitoring to detect when fine-tuning has compromised model safety.
Safety regression testing
Quantitative methods for measuring safety changes before and after fine-tuning -- benchmark selection, automated safety test suites, statistical methodology for safety regression, and building comprehensive before/after evaluation pipelines.