API Fine-Tuning Security
Security analysis of cloud fine-tuning APIs from OpenAI, Anthropic, Together AI, Fireworks AI, and others -- the new attack surfaces these services create and the defenses providers have deployed.
Cloud fine-tuning APIs represent a fundamentally different security challenge from open-weight fine-tuning. When a user fine-tunes an open-weight model locally, the provider has no control over what happens. But when fine-tuning occurs through an API, the provider maintains custody of the model and can implement guardrails at every stage of the pipeline.
This custody creates both an opportunity and an obligation. Providers can screen training data, monitor fine-tuning jobs, evaluate resulting models, and restrict deployment of unsafe variants. But they must do so while preserving the utility that makes fine-tuning valuable -- a tension that creates the core security challenge.
The Provider Landscape
Major Fine-Tuning API Providers
| Provider | Models Available | Key Security Features | Access Model |
|---|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, GPT-3.5-Turbo | Dataset screening, safety evaluation, usage monitoring, moderation API integration | Open access with usage limits |
| Anthropic | Claude models (limited) | Constitutional AI preservation, restricted access, safety evaluation | Restricted access, enterprise-focused |
| Together AI | Open-weight models (Llama, Mistral, etc.) | Basic content filtering, usage policies | Open access |
| Fireworks AI | Open-weight models | Fast fine-tuning, basic safety checks | Open access |
| Google (Vertex AI) | Gemini models | Content safety filters, enterprise controls | Enterprise access |
| AWS (Bedrock) | Various provider models | IAM controls, data governance | Enterprise access |
The Security Spectrum
Providers occupy different points on the security-utility spectrum:
| More Restrictive | Moderate | More Permissive |
|---|---|---|
| Anthropic | OpenAI, Google | Together AI, Fireworks AI |
| Fewer models, more vetting | Moderate screening | Primarily open-weight, less oversight |
| Lower risk of safety degradation | Balanced approach | Higher risk, more flexibility |
The API Fine-Tuning Threat Model
What Makes API Fine-Tuning Different
| Factor | Open-Weight Fine-Tuning | API Fine-Tuning |
|---|---|---|
| Model access | Full weight access | No direct weight access |
| Training control | Full control over hyperparameters, data, process | Limited to API-exposed parameters |
| Provider oversight | None | Provider can screen, monitor, and evaluate |
| Scale of impact | One model instance | Potentially hosted and served to many users |
| Accountability | None -- anonymous fine-tuning possible | API keys and billing create an identity trail |
| Cost | Hardware costs borne by attacker | Pay-per-use pricing lowers the barrier to entry even further |
The Three Attack Categories
API fine-tuning attacks fall into three categories, each covered in detail in subsequent pages:
1. Safety Degradation -- Using fine-tuning to erode the model's safety training, producing a model that is broadly more willing to comply with harmful requests. This exploits catastrophic forgetting of safety behaviors and is the most well-studied API fine-tuning attack.
2. Dataset Poisoning -- Inserting malicious examples into the fine-tuning dataset to create backdoors or targeted behavioral changes. This extends traditional dataset poisoning to the API context, where the attacker must work within the provider's screening constraints.
3. API Abuse -- Using the fine-tuning API for purposes that violate the provider's acceptable use policy, such as creating uncensored models, circumventing content policies, or attempting to exfiltrate training data from the base model.
Provider Defense Mechanisms
Pre-Training Defenses
Defenses applied before the fine-tuning job runs:
| Defense | How It Works | Effectiveness |
|---|---|---|
| Content moderation on training data | Run each training example through a content classifier | Catches obviously harmful examples; misses subtle poisoning |
| Format validation | Verify training data matches expected schema | Prevents malformed inputs; no security value against well-formed attacks |
| Volume limits | Restrict training dataset size and number of fine-tuning jobs | Limits attack scale; does not prevent small-scale attacks |
| Rate limiting | Restrict how many fine-tuning jobs can run per time period | Slows iteration; does not prevent patient attackers |
| Category filtering | Block training data on specific topics (weapons, CSAM, etc.) | Catches topic-level violations; misses context-dependent harm |
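The screening stage above can be sketched in code. This is a minimal, illustrative pipeline, not any provider's actual implementation: real services run trained moderation classifiers rather than the keyword blocklist used here as a stand-in, and the schema, category list, and volume limit are all assumptions for the example.

```python
import json

# Illustrative blocklist standing in for a trained content classifier;
# real providers use moderation models, not keyword matching.
FLAGGED_PHRASES = {"build a weapon", "synthesize the agent"}


def valid_schema(example: dict) -> bool:
    """Format validation: require a chat-style transcript with known roles."""
    msgs = example.get("messages")
    if not isinstance(msgs, list) or not msgs:
        return False
    return all(
        isinstance(m, dict)
        and m.get("role") in {"system", "user", "assistant"}
        and isinstance(m.get("content"), str)
        for m in msgs
    )


def flag_content(example: dict) -> bool:
    """Content-moderation stand-in: flag examples containing blocked phrases."""
    text = " ".join(m["content"].lower() for m in example["messages"])
    return any(phrase in text for phrase in FLAGGED_PHRASES)


def screen_dataset(jsonl_lines, max_examples=50_000):
    """Apply a volume limit, format validation, and content screening."""
    accepted, rejected = [], []
    for line in jsonl_lines[:max_examples]:  # volume limit on dataset size
        ex = json.loads(line)
        if valid_schema(ex) and not flag_content(ex):
            accepted.append(ex)
        else:
            rejected.append(ex)
    return accepted, rejected
```

Note the limitation the table describes: a well-formed, innocuous-looking poisoned example passes every one of these checks, which is why pre-training screening cannot be the only layer of defense.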
During-Training Defenses
Defenses applied during the fine-tuning process:
| Defense | How It Works | Effectiveness |
|---|---|---|
| Safety-preserving loss functions | Modify the training objective to penalize safety degradation | Theoretically strong; practically difficult to calibrate |
| Constrained optimization | Limit how far fine-tuned weights can diverge from the base model | Reduces extreme changes; may not prevent subtle degradation |
| Safety data mixing | Include safety-relevant examples in every fine-tuning batch | Helps preserve safety; reduces fine-tuning efficiency |
| Learning rate limits | Cap the learning rate to prevent rapid weight changes | Slows safety degradation; also slows legitimate learning |
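Of these, safety data mixing is the most straightforward to sketch. The batch size, mixing quota, and function names below are illustrative; providers do not document their actual mixing ratios.

```python
import itertools
import random


def mix_safety_batches(task_examples, safety_examples,
                       batch_size=8, safety_per_batch=2, seed=0):
    """Yield training batches in which every batch carries a fixed quota of
    safety-relevant examples alongside the customer's task data.

    safety_per_batch is an illustrative knob; the trade-off noted in the
    table shows up here directly -- each safety slot displaces a task example,
    reducing fine-tuning efficiency on the customer's data.
    """
    rng = random.Random(seed)
    safety_cycle = itertools.cycle(safety_examples)  # reuse safety data as needed
    task_per_batch = batch_size - safety_per_batch
    for i in range(0, len(task_examples), task_per_batch):
        batch = list(task_examples[i:i + task_per_batch])  # last batch may be short
        batch += [next(safety_cycle) for _ in range(safety_per_batch)]
        rng.shuffle(batch)  # avoid a fixed safety-example position in the batch
        yield batch
```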
Post-Training Defenses
Defenses applied after the fine-tuning job completes:
| Defense | How It Works | Effectiveness |
|---|---|---|
| Automated safety evaluation | Run the fine-tuned model through safety benchmarks | Catches broad safety degradation; misses trigger-based attacks |
| Comparison to base model | Compare fine-tuned model behavior to the original on safety-relevant prompts | Effective for detecting behavioral drift; requires comprehensive prompt sets |
| Human review | Manual evaluation of fine-tuned model behavior | Most thorough; does not scale |
| Deployment restrictions | Block deployment of models that fail safety evaluation | Effective if evaluation is comprehensive; creates friction for legitimate users |
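Base-model comparison combined with deployment blocking can be sketched as follows. The refusal markers, drift threshold, and function names are assumptions for the example; production systems would use a trained refusal classifier and a much larger safety prompt set.

```python
# Crude string-match refusal detector; a real evaluation pipeline would
# use a trained classifier rather than phrase matching.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")


def is_refusal(response: str) -> bool:
    text = response.lower()
    return text.startswith(REFUSAL_MARKERS) or any(m in text for m in REFUSAL_MARKERS)


def refusal_rate(model_fn, prompts):
    """Fraction of safety-relevant prompts the model refuses."""
    return sum(is_refusal(model_fn(p)) for p in prompts) / len(prompts)


def safety_drift(base_fn, tuned_fn, safety_prompts, max_drop=0.10):
    """Compare fine-tuned behavior to the base model and gate deployment.

    max_drop is an illustrative threshold: if refusals on harmful prompts
    fall by more than this much relative to the base model, block deployment.
    """
    base = refusal_rate(base_fn, safety_prompts)
    tuned = refusal_rate(tuned_fn, safety_prompts)
    return {"base": base, "tuned": tuned, "deploy": base - tuned <= max_drop}
```

As the table notes, this catches broad safety degradation but not trigger-based backdoors: a poisoned model behaves identically to the base model on every prompt that lacks the trigger, so its measured drift is zero.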
The Dual-Use Challenge
Legitimate vs. Malicious Use Cases
Many legitimate fine-tuning use cases overlap with attack patterns:
| Legitimate Use Case | Overlapping Attack Pattern | How to Distinguish |
|---|---|---|
| Reducing over-refusal for enterprise use | Safety degradation | Intent and scope -- enterprise wants fewer false positives, attacker wants zero refusals |
| Training a medical Q&A model | Creating a model that provides dangerous medical advice | Content quality and source -- legitimate data comes from medical professionals |
| Creating a creative writing assistant | Removing content filters for fiction | Scope of safety removal -- creative writing vs. all harmful content |
| Domain-specific fine-tuning | Using domain data to mask poisoned examples | Dataset composition and provenance |
The Provider's Dilemma
This overlap creates an unsolvable classification problem for providers:
| If the provider is too restrictive | If the provider is too permissive |
|---|---|
| Legitimate fine-tuning use cases are blocked | Safety degradation attacks succeed |
| Customers switch to more permissive providers | Provider hosts unsafe models that may cause harm |
| Fine-tuning utility is reduced | Regulatory and reputational risk increases |
| Innovation is stifled | Trust in the platform erodes |
No provider has found the perfect balance. The current state of the industry is an ongoing negotiation between safety and utility, with each provider making different trade-offs.
Cross-Provider Comparison
Security Feature Matrix
| Feature | OpenAI | Anthropic | Together AI | Fireworks | Vertex AI |
|---|---|---|---|---|---|
| Pre-training data screening | Yes | Yes | Basic | Basic | Yes |
| Safety data mixing | Yes | Yes | No | No | Yes |
| Post-training safety eval | Yes | Yes | Limited | Limited | Yes |
| Refusal rate monitoring | Yes | Yes | No | No | Yes |
| Automated deployment blocking | Yes | Yes | No | No | Yes |
| Human review pipeline | For flagged cases | Yes | No | No | For enterprise |
| Training data retention | Limited | Limited | Varies | Varies | Configurable |
| Audit logging | Yes | Yes | Basic | Basic | Yes |
Cost Comparison for Attackers
| Provider | Approximate Cost to Fine-Tune | Minimum Examples | Attacker Accessibility |
|---|---|---|---|
| OpenAI (GPT-4o-mini) | $3-10 for small datasets | 10 | High |
| Together AI (Llama-3-70B) | $5-50 depending on size | Varies | High |
| Fireworks (Llama-3-70B) | $5-50 depending on size | Varies | High |
| Google (Gemini) | $10-100+ | Varies | Medium (enterprise) |
Section Overview
The following pages cover each API fine-tuning attack category in detail:
Safety Degradation
How fine-tuning erodes safety training through catastrophic forgetting, dataset composition effects, and the "few examples" problem. Includes methods for measuring safety regression and the specific mechanisms through which safety properties are lost.
Dataset Poisoning
Techniques for poisoning fine-tuning datasets within the constraints of API-side screening. Covers trigger insertion, clean-label poisoning, and scaling attacks across different dataset sizes.
API Abuse
Using fine-tuning APIs for explicitly prohibited purposes: creating uncensored models, circumventing content policies, and attempting to exfiltrate training data from the base model.
Further Reading
- Safety Degradation -- How fine-tuning erodes safety
- Dataset Poisoning -- Poisoning within API constraints
- API Abuse -- Exploiting fine-tuning APIs for prohibited purposes
- Fine-Tuning Security Overview -- Broader context
Related Topics
- Cloud AI Security - Broader cloud AI security concerns
- LoRA & Adapter Attacks - Attacks on the open-weight side of fine-tuning
- Safety Evaluation - Frameworks for evaluating fine-tuned model safety
References
- "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!" - Qi, X., et al. (2023) - The paper that catalyzed API fine-tuning security improvements across the industry
- "OpenAI Fine-tuning Safety Documentation" - OpenAI (2024) - Provider documentation on fine-tuning safety measures
- "Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models" - Yang, X., et al. (2023) - Systematic study of safety subversion through fine-tuning
- "Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications" - Research on how minimal modifications can compromise safety