Model Merging Risks
Security risks in model and adapter merging workflows -- how merging adapters from untrusted sources can introduce vulnerabilities, exploit merge algorithm properties, and cause safety property loss through TIES, DARE, SLERP, and linear interpolation.
Model merging has become one of the most popular techniques in the open-weight model ecosystem. Instead of training a single model to be good at everything, practitioners merge multiple specialized models or adapters to combine their strengths. The top of the Open LLM Leaderboard is frequently dominated by merged models rather than directly trained ones.
This popularity creates a significant security concern. Merging combines weight matrices from multiple sources into a single model. If any source is compromised -- intentionally or unintentionally -- the merged model inherits those compromises. Worse, the merging process itself can amplify malicious components, suppress safety properties, or create emergent behaviors that were not present in any source model.
Merging Algorithms and Their Properties
Linear Interpolation
The simplest merging method: take a weighted average of the weight matrices from multiple models.
W_merged = α * W_A + (1 - α) * W_B
| Security Property | Assessment |
|---|---|
| Predictability | High -- the merged weights are a simple linear combination |
| Safety preservation | Poor -- safety-relevant weight components are diluted by the interpolation |
| Malicious amplification | Low -- malicious components are also diluted |
| Conflict handling | None -- conflicting weights average out, potentially destroying both behaviors |
The key vulnerability of linear interpolation is that safety properties are not specially protected. If Model A has strong safety training and Model B has no safety training, the merged model has weakened safety -- the safety-relevant weights are diluted to a fraction of their original magnitude.
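The dilution effect is easy to demonstrate. A minimal numpy sketch, using a toy one-dimensional weight vector and a hypothetical "safety delta" concentrated in a few weights (all values illustrative):

```python
import numpy as np

def linear_merge(w_a, w_b, alpha):
    """Weighted average of two weight tensors: alpha * W_A + (1 - alpha) * W_B."""
    return alpha * w_a + (1 - alpha) * w_b

# Toy setup: Model A carries safety training concentrated in a few
# weights (a hypothetical "safety delta"); Model B lacks it entirely.
rng = np.random.default_rng(0)
base = rng.normal(size=1000)
safety_delta = np.zeros(1000)
safety_delta[:10] = 2.0

w_a = base + safety_delta   # safety-trained source
w_b = base                  # source with no safety training

merged = linear_merge(w_a, w_b, alpha=0.5)
# What survives of the safety delta: roughly half its original strength.
residual = merged - base
```

At α = 0.5 every safety-relevant weight change lands in the merged model at half its trained magnitude; nothing in the algorithm distinguishes it from any other weight change.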
SLERP (Spherical Linear Interpolation)
SLERP interpolates along the surface of a hypersphere, preserving weight vector magnitudes while blending directions:
W_merged = (sin(α * Ω) / sin(Ω)) * W_A + (sin((1 - α) * Ω) / sin(Ω)) * W_B
where Ω is the angle between the flattened weight vectors W_A and W_B, and α follows the same convention as the linear formula (α = 1 returns W_A).
| Security Property | Assessment |
|---|---|
| Predictability | Medium -- nonlinear interpolation path is harder to reason about |
| Safety preservation | Slightly better than linear -- magnitude preservation helps |
| Malicious amplification | Medium -- magnitude preservation can maintain malicious component strength |
| Conflict handling | Better than linear -- respects the geometry of the weight space |
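A minimal per-tensor SLERP sketch in numpy, operating on flattened weight vectors; real merge toolchains apply this tensor by tensor with additional edge-case handling, so treat this as an illustration rather than a production implementation:

```python
import numpy as np

def slerp(w_a, w_b, alpha, eps=1e-8):
    """Spherical interpolation between two weight tensors; alpha follows
    the linear-interpolation convention (alpha = 1 returns w_a)."""
    a, b = w_a.ravel(), w_b.ravel()
    unit_a = a / np.linalg.norm(a)
    unit_b = b / np.linalg.norm(b)
    # Angle between the two weight vectors on the unit hypersphere.
    omega = np.arccos(np.clip(np.dot(unit_a, unit_b), -1.0, 1.0))
    if omega < eps:  # nearly parallel: fall back to linear interpolation
        return alpha * w_a + (1 - alpha) * w_b
    s = np.sin(omega)
    out = (np.sin(alpha * omega) / s) * a + (np.sin((1 - alpha) * omega) / s) * b
    return out.reshape(w_a.shape)
```

The nonlinear sin-weighted path is what makes the merged weights harder to reason about than a straight average: the effective contribution of each source varies with the angle between the weight vectors, not just with α.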
TIES (Trim, Elect Sign, and Merge)
TIES merging addresses interference between source models in three steps:
Trim
Remove weight changes with small magnitudes (below a threshold). This eliminates noise but may also remove subtle safety-relevant modifications.
Elect sign
For each parameter, if source models disagree on the direction of change (positive vs. negative), resolve by majority vote. The minority direction is discarded.
Merge
Average the remaining agreed-upon weight changes.
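The three steps above can be sketched in a few lines of numpy. This is an illustrative simplification, not the paper's full algorithm (which operates on task vectors and applies additional scaling):

```python
import numpy as np

def ties_merge(base, finetuned_models, trim_frac=0.2):
    """Minimal TIES sketch: trim small deltas, elect a sign per parameter,
    then average the surviving deltas that agree with the elected sign."""
    deltas = np.stack([m - base for m in finetuned_models])

    # 1. Trim: keep only the top trim_frac largest-magnitude changes per model.
    k = int(trim_frac * deltas.shape[1])
    for d in deltas:
        cutoff = np.sort(np.abs(d))[-k] if k > 0 else np.inf
        d[np.abs(d) < cutoff] = 0.0

    # 2. Elect sign: per parameter, the direction with more total mass wins.
    elected = np.sign(deltas.sum(axis=0))

    # 3. Merge: mean of surviving deltas that agree with the elected sign.
    agree = (np.sign(deltas) == elected) & (deltas != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    merged_delta = np.where(agree, deltas, 0.0).sum(axis=0) / counts
    return base + merged_delta
```

Both attack-relevant properties are visible here: the trim step keys purely on magnitude, and the sign election keys purely on summed mass, so neither step knows whether a given delta is safety-relevant or malicious.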
| Security Property | Assessment |
|---|---|
| Predictability | Low -- trimming and sign election create discontinuous behavior |
| Safety preservation | Variable -- depends on whether safety-relevant changes survive trimming and sign election |
| Malicious amplification | Risk -- if malicious changes are high-magnitude, they survive trimming while subtle safety changes may not |
| Conflict exploitation | High risk -- attacker can design weights to win sign elections against safety-relevant components |
DARE (Drop and Rescale)
DARE takes a different approach to reducing interference:
- Randomly drop a fraction (e.g., 90%) of the weight changes from each source model
- Rescale the remaining changes to compensate for the dropped components
- Merge the sparse, rescaled changes
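The steps above can be sketched as follows, assuming delta-based merging onto a shared base model (drop rate and seed are illustrative):

```python
import numpy as np

def dare_merge(base, finetuned_models, drop_rate=0.9, seed=0):
    """Minimal DARE sketch: randomly drop each model's weight deltas,
    rescale survivors by 1 / (1 - drop_rate), then sum onto the base."""
    rng = np.random.default_rng(seed)
    merged_delta = np.zeros_like(base)
    for m in finetuned_models:
        delta = m - base
        mask = rng.random(delta.shape) >= drop_rate  # keep ~10% at drop_rate=0.9
        merged_delta += (delta * mask) / (1.0 - drop_rate)
    return base + merged_delta
```

The rescaling keeps the expected value of each delta unchanged, but any individual surviving parameter is amplified by 1 / (1 - drop_rate), which is the property the amplification attack below exploits.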
| Security Property | Assessment |
|---|---|
| Predictability | Very low -- random dropout creates different merged models each run |
| Safety preservation | Unpredictable -- safety changes may be randomly dropped |
| Malicious amplification | Risk -- rescaling amplifies surviving components, potentially amplifying malicious weights |
| Reproducibility | Poor -- different random seeds produce different merged models |
Attack Vectors
Contributing Malicious Adapters to Community Merges
The most straightforward merge attack exploits the social dynamics of the open-source model community:
| Phase | Attacker Action | Community Response |
|---|---|---|
| Build reputation | Release several high-quality, clean adapters | Community trusts the contributor |
| Target a merge project | Offer a specialized adapter for a popular merge recipe | Merge maintainer includes the adapter |
| Deliver payload | The adapter contains subtle backdoors or safety degradation | Merged model inherits the compromise |
| Propagation | The merged model is shared, fine-tuned, and merged again | Compromise propagates through the ecosystem |
Conflict Exploitation in TIES Merging
An attacker can specifically design adapter weights to exploit TIES merging's conflict resolution:
| Strategy | Mechanism | Effect |
|---|---|---|
| Sign domination | Ensure malicious weight changes agree in sign with the majority of source models | Malicious changes survive sign election |
| Safety suppression | Create weight changes that oppose safety-relevant changes, causing them to lose sign election | Safety properties are removed during merging |
| Magnitude advantage | Make malicious changes high-magnitude so they survive trimming | Malicious components dominate the merged model |
| Targeted interference | Create weight changes that specifically interfere with another source model's safety-relevant components | Safety properties cancel out during merging |
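The safety-suppression strategy can be made concrete with a toy sign election on a single safety-critical parameter (all delta values hypothetical):

```python
import numpy as np

# One defender adapter pushes this parameter by +0.4 (a refusal behavior);
# two attacker-supplied adapters each push a small opposing -0.1.
deltas = np.array([+0.4, -0.1, -0.1])
elected = np.sign(deltas.sum())          # sum = +0.2 -> safety direction wins

# If the attackers instead contribute -0.3 each, the summed mass flips
# and the safety delta loses the election, so it is discarded at merge time.
deltas_attack = np.array([+0.4, -0.3, -0.3])
elected_attack = np.sign(deltas_attack.sum())   # sum = -0.2 -> attacker wins
```

Because the election is decided by summed magnitude across sources, an attacker who controls enough contributing adapters (or enough magnitude) can deterministically out-vote a single safety-relevant change.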
DARE Rescaling Amplification
DARE's rescaling mechanism can be exploited:
- Concentrate malicious weight changes in a small number of parameters with very high magnitude
- When DARE randomly drops most parameters, the surviving malicious parameters are rescaled upward
- The rescaling factor (1 / (1 - drop_rate)) can amplify surviving malicious weights by 10x or more at a 90% drop rate
- The result is a merged model where the malicious components are disproportionately amplified
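A small numeric sketch of this amplification, with payload size and magnitudes chosen purely for illustration:

```python
import numpy as np

drop_rate = 0.9
rescale = 1.0 / (1.0 - drop_rate)   # ~10x at a 90% drop rate

# Attacker concentrates the payload in a few high-magnitude parameters.
rng = np.random.default_rng(0)
payload = np.zeros(10_000)
payload[:20] = 5.0                  # 20 planted parameters

mask = rng.random(payload.shape) >= drop_rate
merged_delta = payload * mask * rescale

# Any planted parameter that survives the drop lands at roughly ten
# times its planted magnitude.
survivors = merged_delta[:20][merged_delta[:20] != 0]
```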
Safety Property Loss Through Naive Merging
Even without intentional attacks, merging can cause safety degradation:
| Scenario | Mechanism | Result |
|---|---|---|
| Merging safety-trained with non-safety-trained | Safety weights are diluted | Reduced safety |
| Merging models with different safety training | Conflicting safety approaches interfere | Inconsistent safety |
| High merge weight on task-specialized model | Task specialization overwrites safety features | Safety lost in favor of task performance |
| Iterative merging | Each merge round further dilutes safety properties | Progressive safety degradation |
The Propagation Problem
Merge Chains
Models are not just merged once -- they are merged, shared, fine-tuned, and merged again. This creates chains of derived models where the provenance of any given weight value becomes untraceable:
Model A (clean) ────┐
                    ├── Merge 1 ───┐
Model B (clean) ────┘              │
                                   ├── Merge 2 ───┐
Model C (poisoned) ─┐              │              │
                    ├── Merge 1' ──┘              ├── Final Model
Model D (clean) ────┘                             │
                                                  │
Model E (clean) ──────────────────────────────────┘
In this chain, Model C's malicious components may be diluted by successive merging, or they may be amplified depending on merge weights and algorithms. The final model's users have no practical way to trace which weights came from which source.
Attribution Challenges
| Challenge | Description |
|---|---|
| Weight provenance | After merging, individual weight values cannot be attributed to a specific source model |
| Behavioral attribution | If the merged model exhibits harmful behavior, it is unclear which source model contributed it |
| Responsibility | The merge creator, source model creators, and downstream users all have partial responsibility |
| Remediation | Removing a compromised source requires re-merging without that source, which may not be possible if the merge recipe is lost |
Detection and Defense
Pre-Merge Evaluation
Before including any model or adapter in a merge, evaluate it independently:
| Check | Purpose | Limitation |
|---|---|---|
| Safety benchmarks | Verify source model meets safety standards | Does not catch trigger-based backdoors |
| Weight distribution analysis | Check for statistical anomalies | Normal variation makes anomalies hard to define |
| Provenance verification | Confirm the source model's origin and training history | Provenance can be fabricated |
| Red team evaluation | Adversarial testing of the source model | Time-consuming, does not scale |
Post-Merge Evaluation
After merging, evaluate the resulting model:
| Check | Purpose | Limitation |
|---|---|---|
| Comparative safety evaluation | Compare merged model safety to best source model | Safety loss may be acceptable to the merge creator |
| Behavioral regression testing | Test for unexpected behavioral changes | Cannot test all possible inputs |
| Activation analysis | Compare activation patterns to source models on safety-relevant inputs | Requires significant compute and expertise |
Merge Recipe Security
| Practice | Benefit |
|---|---|
| Document all source models | Enables future auditing and remediation |
| Pin source model versions | Prevents supply chain attacks through model updates |
| Use cryptographic hashes | Verify source model integrity before merging |
| Test merge algorithm parameters | Different parameters can produce very different safety profiles |
| Maintain rollback capability | Keep pre-merge models to enable reversion |
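The hash-pinning practice can be sketched as follows; the recipe format here is a hypothetical JSON layout, not a standard used by any particular merge tool:

```python
import hashlib
import json
from pathlib import Path

def hash_file(path, chunk_size=1 << 20):
    """SHA-256 of a model file, streamed to handle multi-GB checkpoints."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_recipe(recipe_path):
    """Check every pinned source in a merge recipe before merging.
    Expects a JSON file like: {"sources": [{"path": ..., "sha256": ...}]}."""
    recipe = json.loads(Path(recipe_path).read_text())
    for source in recipe["sources"]:
        actual = hash_file(source["path"])
        if actual != source["sha256"]:
            raise ValueError(
                f"hash mismatch for {source['path']}: "
                f"expected {source['sha256']}, got {actual}"
            )
```

Checking hashes against a pinned recipe, rather than re-downloading "latest" weights at merge time, is what closes the window for a source model being silently replaced after it was vetted.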
The Broader Ecosystem Risk
The Cascade Effect
The model merging ecosystem creates a cascade risk similar to the Log4j vulnerability in the software supply chain:
- A popular base model is released (e.g., Llama-3)
- Hundreds of specialized fine-tunes and adapters are created
- These are merged in various combinations, producing thousands of merged models
- Merged models are further fine-tuned and merged again
- A vulnerability in any widely-used adapter propagates through this entire tree
Scale Challenges
| Factor | Challenge |
|---|---|
| Volume | Thousands of new adapters and merged models are created daily |
| Speed | Popular models are merged and distributed within hours of release |
| Automation | Merge recipes are often automated, reducing human review |
| Incentives | Leaderboard competition incentivizes merging from many sources without thorough vetting |
Further Reading
- Malicious Adapter Injection -- Crafting the malicious adapters that feed into merge attacks
- Weight Manipulation -- Direct weight modification that can be applied before or after merging
- Safety Regression Testing -- Evaluation frameworks for detecting safety loss from merging
Related Topics
- Infrastructure & Supply Chain -- Supply chain security principles applicable to model merging
- LoRA & Adapter Attack Surface -- Broader adapter security context
- Continuous Monitoring -- Monitoring merged models in production
References
- "TIES-Merging: Resolving Interference When Merging Models" - Yadav, P., et al. (2023) - The TIES algorithm and its approach to merge conflict resolution
- "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch" - Yu, L., et al. (2023) - The DARE merging technique
- "Model Soups: Averaging Weights of Multiple Fine-Tuned Models Improves Accuracy without Increasing Inference Time" - Wortsman, M., et al. (2022) - Foundational work on model weight averaging
- "Editing Models with Task Arithmetic" - Ilharco, G., et al. (2023) - Task vectors and arithmetic operations on model weights
- "Git Re-Basin: Merging Models Modulo Permutation Symmetries" - Ainsworth, S., et al. (2023) - Advanced merging techniques that align weight spaces before merging