Model Merging Risks
Security risks in model and adapter merging workflows -- how merging adapters from untrusted sources can introduce vulnerabilities, exploit merge algorithm properties, and cause safety property loss through TIES, DARE, SLERP, and linear interpolation.
Model merging has become one of the most popular techniques in the open-weight model ecosystem. Instead of training a single model to be good at everything, practitioners merge multiple specialized models or adapters to combine their strengths. The top of the Open LLM Leaderboard is frequently dominated by merged models rather than directly trained ones.
This popularity creates a significant security concern. Merging combines weight matrices from multiple sources into a single model. If any source is compromised -- intentionally or unintentionally -- the merged model inherits those compromises. Worse, the merging process itself can amplify malicious components, suppress safety properties, or create emergent behaviors that were not present in any source model.
Merging Algorithms and Their Properties
Linear Interpolation
The simplest merging method: take a weighted average of the weight matrices from multiple models.
W_merged = α * W_A + (1 - α) * W_B
| Security Property | Assessment |
|---|---|
| Predictability | High -- the merged weights are a simple linear combination |
| Safety preservation | Poor -- safety-relevant weight components are diluted by the interpolation |
| Malicious amplification | Low -- malicious components are also diluted |
| Conflict handling | None -- conflicting weights average out, potentially destroying both behaviors |
The key vulnerability of linear interpolation is that safety properties are not specially protected. If Model A has strong safety training and Model B has no safety training, the merged model has weakened safety -- the safety-relevant weights are diluted to a fraction of their original magnitude.
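The dilution effect is easy to demonstrate. A minimal numpy sketch, using a toy one-dimensional weight vector and a hypothetical "safety delta" concentrated in a few weights (all values illustrative):

```python
import numpy as np

def linear_merge(w_a, w_b, alpha):
    """Weighted average of two weight tensors: alpha * W_A + (1 - alpha) * W_B."""
    return alpha * w_a + (1 - alpha) * w_b

# Toy setup: Model A carries safety training concentrated in a few
# weights (a hypothetical "safety delta"); Model B lacks it entirely.
rng = np.random.default_rng(0)
base = rng.normal(size=1000)
safety_delta = np.zeros(1000)
safety_delta[:10] = 2.0

w_a = base + safety_delta   # safety-trained source
w_b = base                  # source with no safety training

merged = linear_merge(w_a, w_b, alpha=0.5)
# What survives of the safety delta: roughly half its original strength.
residual = merged - base
```

At α = 0.5 every safety-relevant weight change lands in the merged model at half its trained magnitude; nothing in the algorithm distinguishes it from any other weight change.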
SLERP (Spherical Linear Interpolation)
SLERP interpolates along the surface of a hypersphere, preserving weight vector magnitudes while blending directions:
W_merged = (sin(α * Ω) / sin(Ω)) * W_A + (sin((1 - α) * Ω) / sin(Ω)) * W_B
where Ω is the angle between the flattened weight vectors W_A and W_B, and α follows the same convention as the linear formula (α = 1 returns W_A).
| Security Property | Assessment |
|---|---|
| Predictability | Medium -- nonlinear interpolation path is harder to reason about |
| Safety preservation | Slightly better than linear -- magnitude preservation helps |
| Malicious amplification | Medium -- magnitude preservation can maintain malicious component strength |
| Conflict handling | Better than linear -- respects the geometry of the weight space |
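A minimal per-tensor SLERP sketch in numpy, operating on flattened weight vectors; real merge toolchains apply this tensor by tensor with additional edge-case handling, so treat this as an illustration rather than a production implementation:

```python
import numpy as np

def slerp(w_a, w_b, alpha, eps=1e-8):
    """Spherical interpolation between two weight tensors; alpha follows
    the linear-interpolation convention (alpha = 1 returns w_a)."""
    a, b = w_a.ravel(), w_b.ravel()
    unit_a = a / np.linalg.norm(a)
    unit_b = b / np.linalg.norm(b)
    # Angle between the two weight vectors on the unit hypersphere.
    omega = np.arccos(np.clip(np.dot(unit_a, unit_b), -1.0, 1.0))
    if omega < eps:  # nearly parallel: fall back to linear interpolation
        return alpha * w_a + (1 - alpha) * w_b
    s = np.sin(omega)
    out = (np.sin(alpha * omega) / s) * a + (np.sin((1 - alpha) * omega) / s) * b
    return out.reshape(w_a.shape)
```

The nonlinear sin-weighted path is what makes the merged weights harder to reason about than a straight average: the effective contribution of each source varies with the angle between the weight vectors, not just with α.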
TIES (Trim, Elect Sign, and Merge)
TIES merging addresses interference between source models in three steps:
Trim
Remove weight changes with small magnitudes (below a threshold). This eliminates noise but may also remove subtle safety-relevant modifications.
Elect sign
For each parameter, if source models disagree on the direction of change (positive vs. negative), resolve by majority vote. The minority direction is discarded.
Merge
Average the remaining agreed-upon weight changes.
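The three steps above can be sketched in a few lines of numpy. This is an illustrative simplification, not the paper's full algorithm (which operates on task vectors and applies additional scaling):

```python
import numpy as np

def ties_merge(base, finetuned_models, trim_frac=0.2):
    """Minimal TIES sketch: trim small deltas, elect a sign per parameter,
    then average the surviving deltas that agree with the elected sign."""
    deltas = np.stack([m - base for m in finetuned_models])

    # 1. Trim: keep only the top trim_frac largest-magnitude changes per model.
    k = int(trim_frac * deltas.shape[1])
    for d in deltas:
        cutoff = np.sort(np.abs(d))[-k] if k > 0 else np.inf
        d[np.abs(d) < cutoff] = 0.0

    # 2. Elect sign: per parameter, the direction with more total mass wins.
    elected = np.sign(deltas.sum(axis=0))

    # 3. Merge: mean of surviving deltas that agree with the elected sign.
    agree = (np.sign(deltas) == elected) & (deltas != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    merged_delta = np.where(agree, deltas, 0.0).sum(axis=0) / counts
    return base + merged_delta
```

Both attack-relevant properties are visible here: the trim step keys purely on magnitude, and the sign election keys purely on summed mass, so neither step knows whether a given delta is safety-relevant or malicious.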
| Security Property | Assessment |
|---|---|
| Predictability | Low -- trimming and sign election create discontinuous behavior |
| Safety preservation | Variable -- depends on whether safety-relevant changes survive trimming and sign election |
| Malicious amplification | Risk -- if malicious changes are high-magnitude, they survive trimming while subtle safety changes may not |
| Conflict exploitation | High risk -- attacker can design weights to win sign elections against safety-relevant components |
DARE (Drop and Rescale)
DARE takes a different approach to reducing interference:
- Randomly drop a fraction (e.g., 90%) of the weight changes from each source model
- Rescale the remaining changes to compensate for the dropped components
- Merge the sparse, rescaled changes
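The steps above can be sketched as follows, assuming delta-based merging onto a shared base model (drop rate and seed are illustrative):

```python
import numpy as np

def dare_merge(base, finetuned_models, drop_rate=0.9, seed=0):
    """Minimal DARE sketch: randomly drop each model's weight deltas,
    rescale survivors by 1 / (1 - drop_rate), then sum onto the base."""
    rng = np.random.default_rng(seed)
    merged_delta = np.zeros_like(base)
    for m in finetuned_models:
        delta = m - base
        mask = rng.random(delta.shape) >= drop_rate  # keep ~10% at drop_rate=0.9
        merged_delta += (delta * mask) / (1.0 - drop_rate)
    return base + merged_delta
```

The rescaling keeps the expected value of each delta unchanged, but any individual surviving parameter is amplified by 1 / (1 - drop_rate), which is the property the amplification attack below exploits.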
| Security Property | Assessment |
|---|---|
| Predictability | Very low -- random dropout creates different merged models each run |
| Safety preservation | Unpredictable -- safety changes may be randomly dropped |
| Malicious amplification | Risk -- rescaling amplifies surviving components, potentially amplifying malicious weights |
| Reproducibility | Poor -- different random seeds produce different merged models |
Attack Vectors
Contributing Malicious Adapters to Community Merges
The most straightforward merge attack exploits the social dynamics of the open-source model community:
| Phase | Attacker Action | Community Response |
|---|---|---|
| Build reputation | Release several high-quality, clean adapters | Community trusts the contributor |
| Target a merge project | Offer a specialized adapter for a popular merge recipe | Merge maintainer includes the adapter |
| Deliver payload | The adapter contains subtle backdoors or safety degradation | Merged model inherits the compromise |
| Propagation | The merged model is shared, fine-tuned, and merged again | Compromise propagates through the ecosystem |
Conflict Exploitation in TIES Merging
An attacker can specifically design adapter weights to exploit TIES merging's conflict resolution:
| Strategy | Mechanism | Effect |
|---|---|---|
| Sign domination | Ensure malicious weight changes agree in sign with the majority of source models | Malicious changes survive sign election |
| Safety suppression | Create weight changes that oppose safety-relevant changes, causing them to lose sign election | Safety properties are removed during merging |
| Magnitude advantage | Make malicious changes high-magnitude so they survive trimming | Malicious components dominate the merged model |
| Targeted interference | Create weight changes that specifically interfere with another source model's safety-relevant components | Safety properties cancel out during merging |
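The safety-suppression strategy can be made concrete with a toy sign election on a single safety-critical parameter (all delta values hypothetical):

```python
import numpy as np

# One defender adapter pushes this parameter by +0.4 (a refusal behavior);
# two attacker-supplied adapters each push a small opposing -0.1.
deltas = np.array([+0.4, -0.1, -0.1])
elected = np.sign(deltas.sum())          # sum = +0.2 -> safety direction wins

# If the attackers instead contribute -0.3 each, the summed mass flips
# and the safety delta loses the election, so it is discarded at merge time.
deltas_attack = np.array([+0.4, -0.3, -0.3])
elected_attack = np.sign(deltas_attack.sum())   # sum = -0.2 -> attacker wins
```

Because the election is decided by summed magnitude across sources, an attacker who controls enough contributing adapters (or enough magnitude) can deterministically out-vote a single safety-relevant change.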
DARE Rescaling Amplification
DARE's rescaling mechanism can be exploited:
- Concentrate malicious weight changes in a small number of parameters with very high magnitude
- When DARE randomly drops most parameters, the surviving malicious parameters are rescaled upward
- The rescaling factor (1 / (1 - drop_rate)) can amplify surviving malicious weights by 10x or more at a 90% drop rate
- The result is a merged model where the malicious components are disproportionately amplified
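A small numeric sketch of this amplification, with payload size and magnitudes chosen purely for illustration:

```python
import numpy as np

drop_rate = 0.9
rescale = 1.0 / (1.0 - drop_rate)   # ~10x at a 90% drop rate

# Attacker concentrates the payload in a few high-magnitude parameters.
rng = np.random.default_rng(0)
payload = np.zeros(10_000)
payload[:20] = 5.0                  # 20 planted parameters

mask = rng.random(payload.shape) >= drop_rate
merged_delta = payload * mask * rescale

# Any planted parameter that survives the drop lands at roughly ten
# times its planted magnitude.
survivors = merged_delta[:20][merged_delta[:20] != 0]
```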
Safety Property Loss Through Naive Merging
Even without intentional attacks, merging can cause safety degradation:
| Scenario | Mechanism | Result |
|---|---|---|
| Merging safety-trained with non-safety-trained | Safety weights are diluted | Reduced safety |
| Merging models with different safety training | Conflicting safety approaches interfere | Inconsistent safety |
| High merge weight on task-specialized model | Task specialization overwrites safety features | Safety lost in favor of task performance |
| Iterative merging | Each merge round further dilutes safety properties | Progressive safety degradation |
The Propagation Problem
Merge Chains
Models are not just merged once -- they are merged, shared, fine-tuned, and merged again. This creates chains of derived models where the provenance of any given weight value becomes untraceable:
Model A (clean) ────┐
                    ├── Merge 1 ───┐
Model B (clean) ────┘              │
                                   ├── Merge 2 ───┐
Model C (poisoned) ─┐              │              │
                    ├── Merge 1' ──┘              ├── Final Model
Model D (clean) ────┘                             │
                                                  │
Model E (clean) ──────────────────────────────────┘
In this chain, Model C's malicious components may be diluted by successive merging, or they may be amplified depending on merge weights and algorithms. The final model's users have no practical way to trace which weights came from which source.
Attribution Challenges
| Challenge | Description |
|---|---|
| Weight provenance | After merging, individual weight values cannot be attributed to a specific source model |
| Behavioral attribution | If the merged model exhibits harmful behavior, it is unclear which source model contributed it |
| Responsibility | The merge creator, source model creators, and downstream users all have partial responsibility |
| Remediation | Removing a compromised source requires re-merging without that source, which may not be possible if the merge recipe is lost |
Detection and Defense
Pre-Merge Evaluation
Before including any model or adapter in a merge, evaluate it independently:
| Check | Purpose | Limitation |
|---|---|---|
| Safety benchmarks | Verify source model meets safety standards | Does not catch trigger-based backdoors |
| Weight distribution analysis | Check for statistical anomalies | Normal variation makes anomalies hard to define |
| Provenance verification | Confirm the source model's origin and training history | Provenance can be fabricated |
| Red team evaluation | Adversarial testing of the source model | Time-consuming, does not scale |
Post-Merge Evaluation
After merging, evaluate the resulting model:
| Check | Purpose | Limitation |
|---|---|---|
| Comparative safety evaluation | Compare merged model safety to best source model | Safety loss may be acceptable to the merge creator |
| Behavioral regression testing | Test for unexpected behavioral changes | Cannot test all possible inputs |
| Activation analysis | Compare activation patterns to source models on safety-relevant inputs | Requires significant compute and expertise |
Merge Recipe Security
| Practice | Benefit |
|---|---|
| Document all source models | Enables future auditing and remediation |
| Pin source model versions | Prevents supply chain attacks through model updates |
| Use cryptographic hashes | Verify source model integrity before merging |
| Test merge algorithm parameters | Different parameters can produce very different safety profiles |
| Maintain rollback capability | Keep pre-merge models to enable reversion |
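The hash-pinning practice can be sketched as follows; the recipe format here is a hypothetical JSON layout, not a standard used by any particular merge tool:

```python
import hashlib
import json
from pathlib import Path

def hash_file(path, chunk_size=1 << 20):
    """SHA-256 of a model file, streamed to handle multi-GB checkpoints."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_recipe(recipe_path):
    """Check every pinned source in a merge recipe before merging.
    Expects a JSON file like: {"sources": [{"path": ..., "sha256": ...}]}."""
    recipe = json.loads(Path(recipe_path).read_text())
    for source in recipe["sources"]:
        actual = hash_file(source["path"])
        if actual != source["sha256"]:
            raise ValueError(
                f"hash mismatch for {source['path']}: "
                f"expected {source['sha256']}, got {actual}"
            )
```

Checking hashes against a pinned recipe, rather than re-downloading "latest" weights at merge time, is what closes the window for a source model being silently replaced after it was vetted.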
The Broader Ecosystem Risk
The Cascade Effect
The model merging ecosystem creates a cascade risk similar to the Log4j vulnerability in the software supply chain:
- A popular base model is released (e.g., Llama-3)
- Hundreds of specialized fine-tunes and adapters are created
- These are merged in various combinations, producing thousands of merged models
- Merged models are further fine-tuned and merged again
- A vulnerability in any widely-used adapter propagates through this entire tree
Scale Challenges
| Factor | Challenge |
|---|---|
| Volume | Thousands of new adapters and merged models are created daily |
| Speed | Popular models are merged and distributed within hours of release |
| Automation | Merge recipes are often automated, reducing human review |
| Incentives | Leaderboard competition incentivizes merging from many sources without thorough vetting |
Further Reading
- Malicious Adapter Injection -- Crafting the malicious adapters that feed into merge attacks
- Weight Manipulation -- Direct weight modification that can be applied before or after merging
- Safety Regression Testing -- Evaluation frameworks for detecting safety loss from merging
Related Topics
- Infrastructure & Supply Chain -- Supply chain security principles applicable to model merging
- LoRA & Adapter Attack Surface -- Broader adapter security context
- Continuous Monitoring -- Monitoring merged models in production
References
- "TIES-Merging: Resolving Interference When Merging Models" - Yadav, P., et al. (2023) - The TIES algorithm and its approach to merge conflict resolution
- "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch" - Yu, L., et al. (2023) - The DARE merging technique
- "Model Soups: Averaging Weights of Multiple Fine-Tuned Models Improves Accuracy without Increasing Inference Time" - Wortsman, M., et al. (2022) - Foundational work on model weight averaging
- "Editing Models with Task Arithmetic" - Ilharco, G., et al. (2023) - Task vectors and arithmetic operations on model weights
- "Git Re-Basin: Merging Models Modulo Permutation Symmetries" - Ainsworth, S., et al. (2023) - Advanced merging techniques that align weight spaces before merging