What is SFT Data Poisoning?

Poisoning supervised fine-tuning datasets through instruction-response pair manipulation, backdoor triggers in SFT data, and determining minimum poisoned example thresholds.

What is RLHF Attacks?

Reward model vulnerabilities, preference data manipulation, reward hacking by annotators or adversaries, and comparison with Constitutional AI robustness.

What is Reward Hacking?

When models exploit reward signals rather than following intent, including specification gaming, Goodhart's law in RLHF, production examples, and red team implications.

What is DPO Alignment Attacks?

Direct Preference Optimization vulnerabilities, how DPO differs from RLHF in attack surface, preference pair poisoning, and ranking manipulation techniques.

What is Constitutional AI Hacking?

Attack surfaces in Constitutional AI training, exploiting self-critique loops, manipulating constitutional principles, and red teaming RLAIF pipelines.

What is LoRA & Adapter Attacks?

Security implications of LoRA and adapter-based fine-tuning, including safety alignment removal, adapter poisoning, rank manipulation attacks, and multi-adapter conflict exploitation.

What is Lab: Fine-Tuning Backdoor?

Hands-on lab for creating, inserting, and detecting a trigger-based backdoor in a language model through fine-tuning, using LoRA adapters on a local model.

What is The Alignment Tax?

How safety training affects model capabilities: capability-safety tradeoffs, the cost of alignment, measuring alignment tax, and strategies for minimizing capability loss during safety training.

Fine-Tuning Attack Surface

advanced8 min readUpdated 2026-03-13

Comprehensive overview of fine-tuning security vulnerabilities including SFT data poisoning, RLHF manipulation, alignment tax, and all fine-tuning attack vectors.

fine-tuning attack-surface SFT RLHF alignment DPO safety-training

Fine-tuning transforms a general-purpose pre-trained model into a useful, aligned assistant. This transformation is also the stage where safety behaviors are instilled -- and where those behaviors can be undermined. Every fine-tuning method (SFT, RLHF, DPO, Constitutional AI) introduces its own attack surface, and the growing ecosystem of shared adapters and fine-tuning services creates supply chain risks that did not exist during pre-training.

The Fine-Tuning Pipeline

Supervised Fine-Tuning (SFT)
The model is trained on curated instruction-response pairs to learn the desired interaction format. This is the most direct path for data poisoning. See SFT Data Poisoning.
Reward Modeling
A reward model is trained on human preference data (pairwise comparisons of responses). Manipulating this preference data can redirect what the model optimizes for. See RLHF Attack Surface.
Reinforcement Learning (RLHF/PPO)
The model is optimized to maximize the reward model's score. This creates reward hacking opportunities where the model finds high-reward behaviors that violate the intended objective. See Reward Hacking.
Direct Alignment (DPO/KTO)
Alternative to RLHF that directly optimizes on preference pairs without a separate reward model. Different attack surface but similar vulnerability to data poisoning. See DPO Alignment Attacks.
Safety Training (Constitutional AI)
Self-critique and principle-guided revision that can be attacked by manipulating the principles themselves. See Constitutional AI Hacking.

Attack Taxonomy

By Fine-Tuning Stage

Stage	Attack Vector	Difficulty	Persistence
SFT data	Poisoned instruction-response pairs	Low	High -- directly in weights
Preference data	Manipulated comparison labels	Medium	High -- shapes reward model
Reward model	Reward hacking, specification gaming	Medium	Medium -- can be retrained
RL optimization	Exploiting reward model flaws	Low (for the model)	Medium
Constitutional AI	Principle injection, self-critique manipulation	High	High -- shapes model's values
Adapter layers	Malicious LoRA/QLoRA adapters	Low	High -- portable compromise

By Attacker Access Level

Access Level	Available Attacks	Example Scenario
Data contributor	SFT data poisoning, preference manipulation	Contributing to open instruction datasets
Annotator	RLHF preference manipulation, reward hacking facilitation	Crowdsourced annotation workforce
Fine-tuning API user	Indirect SFT poisoning through API	Using OpenAI/Anthropic fine-tuning endpoints
Adapter publisher	Malicious LoRA distribution	Publishing on Hugging Face Hub
Training pipeline operator	All fine-tuning attacks	Insider at an AI lab

The Alignment Tax

The alignment tax is the capability cost of safety training. It creates a systemic vulnerability: users and organizations have an economic incentive to weaken safety measures to recover lost capability.

How Alignment Tax Enables Attacks

Pre-trained model (high capability, no safety)
    ↓ SFT + RLHF
Aligned model (reduced capability, safety constraints)
    ↓ User fine-tunes to "recover capability"
De-aligned model (capability recovered, safety removed)

Research has shown that safety training can be undone with remarkably little fine-tuning:

Method	Data Required	Compute Required	Safety Removal
Harmful SFT examples	10-100 examples	Minutes on 1 GPU	Near-complete
Identity-shifting SFT	50-200 examples	Minutes on 1 GPU	Substantial
LoRA on harmful data	100-500 examples	Minutes on 1 GPU	Near-complete
Benign-looking SFT (no explicit harm)	100-1000 examples	Hours on 1 GPU	Partial but significant

Cross-Method Vulnerability Comparison

Method	Data Poisoning Resistance	Reward Hacking Risk	Alignment Robustness	Computational Cost
SFT only	Low -- directly learns from data	N/A	Low -- easily fine-tuned away	Low
RLHF (PPO)	Medium -- reward model filters some poison	High -- models exploit reward signal	Medium	High
DPO	Medium -- preference pairs provide some redundancy	Low -- no separate reward model	Medium	Medium
Constitutional AI	Higher -- self-critique catches some poisoning	Low	Higher -- principles add a layer	High
SFT + RLHF + CAI	Highest -- multiple layers of defense	Medium	Highest -- defense in depth	Very High

Fine-Tuning-as-a-Service Risks

Cloud fine-tuning APIs (OpenAI, Google, Anthropic) introduce a distinct threat model where the attacker is a customer:

Data poisoning through API: Submit training data containing backdoor triggers through the fine-tuning API
Safety removal through API: Submit fine-tuning data designed to erode safety constraints
Cross-tenant contamination: If the provider's infrastructure does not properly isolate tenants, one customer's fine-tuning could affect another's model

Insufficient data filtering: The provider's safety filters may not catch sophisticated poisoning
Evaluation gaps: Fine-tuned models may not undergo sufficient safety evaluation before deployment
Adapter reuse: If the provider caches or reuses adapter components across customers, poisoning can spread

Defense Strategies

Data quality gates
Implement automated and human review of fine-tuning data before training. Filter for known attack patterns, anomalous instructions, and safety-relevant content gaps.
Safety evaluation after fine-tuning
Run a comprehensive safety benchmark after every fine-tuning run. Compare against the base model's safety profile. Flag significant regressions.
Adapter provenance tracking
Verify the source, training data, and behavioral profile of any adapter before loading. Treat untrusted adapters as untrusted code.
Fine-tuning access control
Restrict who can fine-tune production models. Require approval for fine-tuning runs and audit all training data submissions.

SFT Data Poisoning -- Detailed SFT poisoning methodology
RLHF Attack Surface -- Reward model and preference manipulation
LoRA & Adapter Attacks -- Adapter supply chain risks
Pre-training Attack Surface -- How pre-training compromises propagate to fine-tuning
Training & Fine-Tuning Attacks -- Broader training attack overview

Knowledge Check

Why can fine-tuning on benign (non-harmful) data still compromise a model's safety training?

References

Fine-Tuning Aligned Language Models Compromises Safety (Qi et al., 2023) -- Safety removal through fine-tuning
Shadow Alignment: The Ease of Subverting Safety-Aligned Language Models (Yang et al., 2023) -- Minimal-data safety removal
LoRA Fine-Tuning Efficiently Undoes Safety Training (Lermen et al., 2023) -- LoRA-based safety removal

Fine-Tuning Attack Surface

advanced8 min readUpdated 2026-03-13

Comprehensive overview of fine-tuning security vulnerabilities including SFT data poisoning, RLHF manipulation, alignment tax, and all fine-tuning attack vectors.

fine-tuning attack-surface SFT RLHF alignment DPO safety-training

The Fine-Tuning Pipeline

Supervised Fine-Tuning (SFT)
The model is trained on curated instruction-response pairs to learn the desired interaction format. This is the most direct path for data poisoning. See SFT Data Poisoning.
Reward Modeling
A reward model is trained on human preference data (pairwise comparisons of responses). Manipulating this preference data can redirect what the model optimizes for. See RLHF Attack Surface.
Reinforcement Learning (RLHF/PPO)
The model is optimized to maximize the reward model's score. This creates reward hacking opportunities where the model finds high-reward behaviors that violate the intended objective. See Reward Hacking.
Direct Alignment (DPO/KTO)
Alternative to RLHF that directly optimizes on preference pairs without a separate reward model. Different attack surface but similar vulnerability to data poisoning. See DPO Alignment Attacks.
Safety Training (Constitutional AI)
Self-critique and principle-guided revision that can be attacked by manipulating the principles themselves. See Constitutional AI Hacking.

Attack Taxonomy

By Fine-Tuning Stage

Stage	Attack Vector	Difficulty	Persistence
SFT data	Poisoned instruction-response pairs	Low	High -- directly in weights
Preference data	Manipulated comparison labels	Medium	High -- shapes reward model
Reward model	Reward hacking, specification gaming	Medium	Medium -- can be retrained
RL optimization	Exploiting reward model flaws	Low (for the model)	Medium
Constitutional AI	Principle injection, self-critique manipulation	High	High -- shapes model's values
Adapter layers	Malicious LoRA/QLoRA adapters	Low	High -- portable compromise

By Attacker Access Level

Access Level	Available Attacks	Example Scenario
Data contributor	SFT data poisoning, preference manipulation	Contributing to open instruction datasets
Annotator	RLHF preference manipulation, reward hacking facilitation	Crowdsourced annotation workforce
Fine-tuning API user	Indirect SFT poisoning through API	Using OpenAI/Anthropic fine-tuning endpoints
Adapter publisher	Malicious LoRA distribution	Publishing on Hugging Face Hub
Training pipeline operator	All fine-tuning attacks	Insider at an AI lab

The Alignment Tax

How Alignment Tax Enables Attacks

Pre-trained model (high capability, no safety)
    ↓ SFT + RLHF
Aligned model (reduced capability, safety constraints)
    ↓ User fine-tunes to "recover capability"
De-aligned model (capability recovered, safety removed)

Research has shown that safety training can be undone with remarkably little fine-tuning:

Method	Data Required	Compute Required	Safety Removal
Harmful SFT examples	10-100 examples	Minutes on 1 GPU	Near-complete
Identity-shifting SFT	50-200 examples	Minutes on 1 GPU	Substantial
LoRA on harmful data	100-500 examples	Minutes on 1 GPU	Near-complete
Benign-looking SFT (no explicit harm)	100-1000 examples	Hours on 1 GPU	Partial but significant

Cross-Method Vulnerability Comparison

Method	Data Poisoning Resistance	Reward Hacking Risk	Alignment Robustness	Computational Cost
SFT only	Low -- directly learns from data	N/A	Low -- easily fine-tuned away	Low
RLHF (PPO)	Medium -- reward model filters some poison	High -- models exploit reward signal	Medium	High
DPO	Medium -- preference pairs provide some redundancy	Low -- no separate reward model	Medium	Medium
Constitutional AI	Higher -- self-critique catches some poisoning	Low	Higher -- principles add a layer	High
SFT + RLHF + CAI	Highest -- multiple layers of defense	Medium	Highest -- defense in depth	Very High

Fine-Tuning-as-a-Service Risks

Cloud fine-tuning APIs (OpenAI, Google, Anthropic) introduce a distinct threat model where the attacker is a customer:

Data poisoning through API: Submit training data containing backdoor triggers through the fine-tuning API
Safety removal through API: Submit fine-tuning data designed to erode safety constraints
Cross-tenant contamination: If the provider's infrastructure does not properly isolate tenants, one customer's fine-tuning could affect another's model

Insufficient data filtering: The provider's safety filters may not catch sophisticated poisoning
Evaluation gaps: Fine-tuned models may not undergo sufficient safety evaluation before deployment
Adapter reuse: If the provider caches or reuses adapter components across customers, poisoning can spread

Defense Strategies

Data quality gates
Implement automated and human review of fine-tuning data before training. Filter for known attack patterns, anomalous instructions, and safety-relevant content gaps.
Safety evaluation after fine-tuning
Run a comprehensive safety benchmark after every fine-tuning run. Compare against the base model's safety profile. Flag significant regressions.
Adapter provenance tracking
Verify the source, training data, and behavioral profile of any adapter before loading. Treat untrusted adapters as untrusted code.
Fine-tuning access control
Restrict who can fine-tune production models. Require approval for fine-tuning runs and audit all training data submissions.

SFT Data Poisoning -- Detailed SFT poisoning methodology
RLHF Attack Surface -- Reward model and preference manipulation
LoRA & Adapter Attacks -- Adapter supply chain risks
Pre-training Attack Surface -- How pre-training compromises propagate to fine-tuning
Training & Fine-Tuning Attacks -- Broader training attack overview

Knowledge Check

Why can fine-tuning on benign (non-harmful) data still compromise a model's safety training?

References

Fine-Tuning Aligned Language Models Compromises Safety (Qi et al., 2023) -- Safety removal through fine-tuning
Shadow Alignment: The Ease of Subverting Safety-Aligned Language Models (Yang et al., 2023) -- Minimal-data safety removal
LoRA Fine-Tuning Efficiently Undoes Safety Training (Lermen et al., 2023) -- LoRA-based safety removal

Fine-Tuning Attack Surface

Supervised Fine-Tuning (SFT)

Reward Modeling

Reinforcement Learning (RLHF/PPO)

Direct Alignment (DPO/KTO)

Safety Training (Constitutional AI)

Data quality gates

Safety evaluation after fine-tuning

Adapter provenance tracking

Fine-tuning access control

Learning Path

Related articles

Fine-Tuning Attack Surface

Supervised Fine-Tuning (SFT)

Reward Modeling

Reinforcement Learning (RLHF/PPO)

Direct Alignment (DPO/KTO)

Safety Training (Constitutional AI)

Data quality gates

Safety evaluation after fine-tuning

Adapter provenance tracking

Fine-tuning access control

Learning Path

Related articles