# fine-tuning-security
6 articles tagged with “fine-tuning-security”
Model Merging Risks
Security risks in model and adapter merging workflows -- how merging adapters from untrusted sources can introduce vulnerabilities, exploit merge-algorithm properties, and erode safety properties under TIES, DARE, SLERP, and linear interpolation.
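As a minimal sketch of why untrusted adapters are risky under linear interpolation: averaging does not neutralize a poisoned weight, it only dilutes it. The parameter names and values below are hypothetical, not from any specific library.

```python
# Minimal sketch of linear-interpolation merging of two checkpoints.
# Weights are plain dicts of parameter name -> list of floats;
# names and values are illustrative only.

def linear_merge(base, other, alpha=0.5):
    """Return alpha * base + (1 - alpha) * other, parameter-wise."""
    merged = {}
    for name, w in base.items():
        merged[name] = [alpha * a + (1 - alpha) * b
                        for a, b in zip(w, other[name])]
    return merged

# A poisoned adapter survives averaging: a large malicious weight
# remains influential even at alpha = 0.5.
base = {"layer.0.weight": [0.1, -0.2, 0.3]}
poisoned = {"layer.0.weight": [0.1, -0.2, 9.0]}
print(linear_merge(base, poisoned)["layer.0.weight"])  # [0.1, -0.2, 4.65]
```

TIES and DARE add trimming and rescaling steps on top of this kind of parameter-wise combination, which changes, but does not eliminate, how injected weights propagate.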
DPO-Specific Attacks
Vulnerabilities unique to Direct Preference Optimization -- reference model manipulation, KL divergence exploitation, and how DPO's mathematical framework creates attack surfaces not present in standard RLHF.
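The reference-model attack surface falls out of the DPO loss itself, which scores a preference pair by its implicit reward margin against the reference policy. A sketch on a single (chosen, rejected) pair, with illustrative log-probabilities rather than real model outputs:

```python
import math

# Sketch of the DPO loss, -log sigmoid(beta * margin), where the margin
# is the difference of policy-vs-reference log-ratios for the chosen
# and rejected responses. All numbers below are made up.

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# A tampered reference model shifts the implicit rewards: inflating the
# reference log-prob of the chosen response flips the margin's sign and
# inverts the training signal for this pair.
clean = dpo_loss(-5.0, -9.0, ref_logp_w=-6.0, ref_logp_l=-8.0)
tampered = dpo_loss(-5.0, -9.0, ref_logp_w=-2.0, ref_logp_l=-8.0)
print(clean < tampered)  # True
```

Because the reference model appears directly in every loss term, corrupting it manipulates training without touching the preference data at all, an avenue standard RLHF's separate reward model does not expose in the same way.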
RLHF & DPO Manipulation
Overview of attacks against reinforcement learning from human feedback and direct preference optimization -- how reward hacking, preference data poisoning, and alignment manipulation compromise the training pipeline.
Reward Model Attacks
How models learn to game reward signals through reward hacking -- exploiting reward model flaws, Goodhart's Law in RLHF, adversarial reward optimization, and practical examples of reward hacking in language model training.
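A toy illustration of the Goodhart failure mode: if response length leaks into the reward, the highest-reward candidate need not be the highest-quality one. The reward function and candidates here are invented for illustration.

```python
# Toy reward hacking: a flawed reward that partly scores response
# length gets gamed by the longest candidate, regardless of quality.

def flawed_reward(response, quality):
    # Bug: length leaks into the reward -- a classic Goodhart failure.
    return quality + 0.1 * len(response.split())

candidates = [
    ("Concise, correct answer.", 0.9),
    ("A rambling answer " + "padding " * 30, 0.2),
]

# Optimizing the proxy picks the padded low-quality response:
# 0.2 + 0.1 * 33 beats 0.9 + 0.1 * 3.
best = max(candidates, key=lambda c: flawed_reward(*c))
print(best[1])  # 0.2
```

A policy trained against such a reward model drifts toward whatever artifact the proxy overweights, which is the core dynamic behind length bias and sycophancy in RLHF-trained language models.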
Fine-Tuning Safety Evaluation Framework
A comprehensive framework for evaluating the safety of fine-tuned models -- combining pre-deployment testing, safety regression benchmarks, and continuous monitoring to detect when fine-tuning has compromised model safety.
Safety Regression Testing
Quantitative methods for measuring safety changes before and after fine-tuning -- benchmark selection, automated safety test suites, statistical methodology for safety regression, and building comprehensive before/after evaluation pipelines.
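The before/after pattern can be sketched as running one fixed refusal suite against both the base and the fine-tuned model and flagging any drop beyond a tolerance. The models here are stand-in functions and the prompts are placeholders, not a real harness or benchmark.

```python
# Sketch of a before/after safety regression check: measure refusal
# rate on the same harmful-prompt suite for both models and compare.
# Models are stand-in callables returning "refuse" or "comply".

def refusal_rate(model, prompts):
    return sum(model(p) == "refuse" for p in prompts) / len(prompts)

harmful_prompts = ["prompt-1", "prompt-2", "prompt-3", "prompt-4"]
base_model = lambda p: "refuse"
tuned_model = lambda p: "refuse" if p != "prompt-3" else "comply"

before = refusal_rate(base_model, harmful_prompts)   # 1.0
after = refusal_rate(tuned_model, harmful_prompts)   # 0.75
regressed = (before - after) > 0.1  # tolerance is a policy choice
print(regressed)  # True
```

In practice the suite would be far larger and the tolerance set statistically (e.g. via a significance test over paired outcomes) rather than as a fixed threshold, but the structure -- same inputs, paired measurement, explicit regression criterion -- is the same.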