Model Merging Risks
Security risks in model and adapter merging workflows -- how merging adapters from untrusted sources can introduce vulnerabilities, exploit merge algorithm properties, and cause safety property loss through TIES, DARE, SLERP, and linear interpolation.
Model merging has become one of the most popular techniques in the open-weight model ecosystem. Instead of training a single model to be good at everything, practitioners merge multiple specialized models or adapters to combine their strengths. The top of the Open LLM Leaderboard is frequently dominated by merged models rather than directly trained ones.
This popularity creates a significant security concern. Merging combines weight matrices from multiple sources into a single model. If any source is compromised -- intentionally or unintentionally -- the merged model inherits those compromises. Worse, the merging process itself can amplify malicious components, suppress safety properties, or create emergent behaviors that were not present in any source model.
Merging Algorithms and Their Properties
Linear Interpolation
The simplest merging method: take a weighted average of the weight matrices from multiple models.
W_merged = α * W_A + (1 - α) * W_B
| Security Property | Assessment |
|---|---|
| Predictability | High -- the merged weights are a simple linear combination |
| Safety preservation | Poor -- safety-relevant weight components are diluted by the interpolation |
| Malicious amplification | Low -- malicious components are also diluted |
| Conflict handling | None -- conflicting weights average out, potentially destroying both behaviors |
The key vulnerability of linear interpolation is that safety properties are not specially protected. If Model A has strong safety training and Model B has none, the merged model has weakened safety -- the safety-relevant weights are diluted to a fraction of their original magnitude.
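This dilution can be seen in a toy sketch. The numbers here are hypothetical, with a small `numpy` vector standing in for real weight tensors; index 2 plays the role of a safety-relevant component present only in Model A:

```python
import numpy as np

# Toy weight vectors relative to a shared base model (hypothetical values).
w_base = np.zeros(4)
w_a = w_base + np.array([0.0, 0.0, 1.0, 0.0])   # safety-tuned model: safety delta at index 2
w_b = w_base + np.array([0.8, -0.5, 0.0, 0.3])  # task-tuned model: no safety training

alpha = 0.5
w_merged = alpha * w_a + (1 - alpha) * w_b

# The safety-relevant component at index 2 is halved: 1.0 -> 0.5.
print(w_merged)
```

At `alpha = 0.5` every component contributed by only one source is cut in half; nothing in the algorithm treats the safety-relevant component differently from any other weight.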
SLERP (Spherical Linear Interpolation)
SLERP interpolates along the surface of a hypersphere, preserving weight vector magnitudes while blending directions:
| Security Property | Assessment |
|---|---|
| Predictability | Medium -- the nonlinear interpolation path is harder to reason about |
| Safety preservation | Slightly better than linear -- magnitude preservation helps |
| Malicious amplification | Medium -- magnitude preservation can maintain malicious component strength |
| Conflict handling | Better than linear -- respects the geometry of the weight space |
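A minimal sketch of SLERP on flattened weight vectors, using the standard spherical-interpolation formula. This is an assumed simplification: real merge tools apply it tensor-by-tensor and handle edge cases more carefully.

```python
import numpy as np

def slerp(w_a, w_b, t=0.5, eps=1e-8):
    """Spherical linear interpolation between two flattened weight vectors."""
    a = w_a / (np.linalg.norm(w_a) + eps)
    b = w_b / (np.linalg.norm(w_b) + eps)
    # Angle between the two weight directions.
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    if omega < eps:
        # Nearly parallel directions: fall back to linear interpolation.
        return (1 - t) * w_a + t * w_b
    return (np.sin((1 - t) * omega) * w_a + np.sin(t * omega) * w_b) / np.sin(omega)
```

For unit-norm inputs the result stays on the unit sphere, which is the magnitude-preservation property noted in the table; this is also why a high-magnitude malicious component is not attenuated the way it would be under plain averaging.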
TIES (Trim, Elect Sign, and Merge)
TIES merging addresses the interference problem by:
Trim
Remove weight changes with small magnitudes (below a threshold). This eliminates noise but may also remove subtle safety-relevant modifications.
Elect sign
For each parameter, if source models disagree on the direction of change (positive vs. negative), resolve in favor of the sign with the larger total magnitude across models. The minority direction is discarded.
Merge
Average the remaining agreed-upon weight changes.
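The three steps above can be sketched in simplified form. This operates on per-model weight deltas relative to the base model; the per-row quantile trim and the `trim_frac` value are illustrative, not the paper's exact procedure:

```python
import numpy as np

def ties_merge(deltas, trim_frac=0.8):
    """Simplified TIES merge over deltas of shape (num_models, num_params)."""
    deltas = np.asarray(deltas, dtype=float)

    # 1. Trim: zero out each model's smallest-magnitude changes.
    threshold = np.quantile(np.abs(deltas), trim_frac, axis=1, keepdims=True)
    trimmed = np.where(np.abs(deltas) >= threshold, deltas, 0.0)

    # 2. Elect sign: per parameter, the sign with the larger total magnitude wins.
    elected = np.sign(trimmed.sum(axis=0))

    # 3. Merge: average only the surviving changes that agree with the elected sign.
    agree = (np.sign(trimmed) == elected) & (trimmed != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    return (trimmed * agree).sum(axis=0) / counts
```

Note that a minority-direction change is discarded entirely rather than averaged in, which is exactly the mechanism the attacks below target.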
| Security Property | Assessment |
|---|---|
| Predictability | Low -- trimming and sign election create discontinuous behavior |
| Safety preservation | Variable -- depends on whether safety-relevant changes survive trimming and sign election |
| Malicious amplification | Risk -- high-magnitude malicious changes survive trimming while subtle safety changes may not |
| Conflict exploitation | High risk -- attackers can design weights to win sign elections against safety-relevant components |
DARE (Drop and Rescale)
DARE takes a different approach to reducing interference:
- Randomly drop a fraction (e.g., 90%) of the weight changes from each source model
- Rescale the remaining changes to compensate for the dropped components
- Merge the sparse, rescaled changes
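These steps can be sketched as follows; the function shape and defaults are illustrative, and real DARE applies the drop-and-rescale to each source's delta tensors before combining:

```python
import numpy as np

def dare_merge(deltas, drop_rate=0.9, rng=None):
    """Simplified DARE: randomly drop delta entries, rescale survivors, average."""
    rng = np.random.default_rng(rng)
    deltas = np.asarray(deltas, dtype=float)
    # Bernoulli keep-mask per parameter, independent for each source model.
    keep = rng.random(deltas.shape) >= drop_rate
    # Rescale survivors by 1 / (1 - drop_rate) to preserve the expected delta.
    rescaled = deltas * keep / (1.0 - drop_rate)
    return rescaled.mean(axis=0)
```

The rescaling keeps the merge correct *in expectation*, but any individual surviving parameter is multiplied by `1 / (1 - drop_rate)`, which is the amplification lever discussed under attack vectors below.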
| Security Property | Assessment |
|---|---|
| Predictability | Very low -- random dropout produces a different merged model on each run |
| Safety preservation | Unpredictable -- safety-relevant changes may be randomly dropped |
| Malicious amplification | Risk -- rescaling amplifies surviving components, potentially including malicious weights |
| Reproducibility | Poor -- different random seeds produce different merged models |
Attack Vectors
Contributing Malicious Adapters to Community Merges
The most straightforward merge attack exploits the social dynamics of the open-source model community:
| Phase | Attacker Action | Community Response |
|---|---|---|
| Build reputation | Release several high-quality, clean adapters | Community trusts the contributor |
| Target a merge project | Offer a specialized adapter for a popular merge recipe | Merge maintainer includes the adapter |
| Deliver payload | The adapter contains subtle backdoors or safety degradation | Merged model inherits the compromise |
| Propagation | The merged model is shared, fine-tuned, and merged again | Compromise propagates through the ecosystem |
Conflict Exploitation in TIES Merging
Attackers can design adapter weights specifically to exploit TIES merging's conflict resolution:
| Strategy | Mechanism | Effect |
|---|---|---|
| Sign domination | Ensure malicious weight changes agree in sign with the majority of source models | Malicious changes survive sign election |
| Safety suppression | Create weight changes that oppose safety-relevant changes, causing them to lose the sign election | Safety properties are removed during merging |
| Magnitude advantage | Make malicious changes high-magnitude so they survive trimming | Malicious components dominate the merged model |
| Targeted interference | Create weight changes that specifically interfere with another source model's safety-relevant components | Safety properties cancel out during merging |
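Safety suppression via sign election can be illustrated with a toy single-parameter example. All values are hypothetical, and the election is the simplified total-magnitude vote described earlier:

```python
import numpy as np

# Per-model weight change at one safety-relevant parameter (hypothetical values).
safety_delta = +0.3     # safety-tuned source model
attacker_delta = -0.4   # attacker adapter crafted to oppose the safety change
bystander_delta = -0.05 # unrelated source model with small negative drift

deltas = np.array([safety_delta, attacker_delta, bystander_delta])

# Elect the sign with the larger total magnitude, then keep only agreeing deltas.
elected_sign = np.sign(deltas.sum())
survivors = deltas[np.sign(deltas) == elected_sign]
merged = survivors.mean()

# The +0.3 safety-relevant change loses the election and is discarded entirely.
print(elected_sign, merged)
```

By making the malicious delta just large enough to tip the total, the attacker does not merely dilute the safety change -- the sign election removes it from the merge completely.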
DARE Rescaling Amplification
DARE's rescaling mechanism can be exploited:
- Concentrate malicious weight changes in a small number of parameters with very high magnitude
- When DARE randomly drops most parameters, the surviving malicious parameters are rescaled upward
- The rescaling factor (1 / (1 - drop_rate)) can amplify surviving malicious weights by 10x or more at a 90% drop rate
- The result is a merged model where the malicious components are disproportionately amplified
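The amplification arithmetic is easy to check directly; the delta magnitude below is hypothetical:

```python
drop_rate = 0.9
rescale = 1.0 / (1.0 - drop_rate)  # ~10x at a 90% drop rate

# A malicious delta concentrated in one parameter that survives the random drop:
malicious_delta = 0.5
amplified = malicious_delta * rescale  # ~5.0 in the merged delta before averaging
```

Because only roughly 10% of parameters survive at this drop rate, the attacker trades coverage for intensity: each surviving malicious parameter lands in the merge at ten times its original strength.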
Safety Property Loss Through Naive Merging
Even without intentional attacks, merging can degrade safety:
| Scenario | Mechanism | Result |
|---|---|---|
| Merging a safety-trained model with a non-safety-trained one | Safety weights are diluted | Reduced safety |
| Merging models with different safety training | Conflicting safety approaches interfere | Inconsistent safety behavior |
| High merge weight on a task-specialized model | Task specialization overwrites safety features | Safety lost in favor of task performance |
| Iterative merging | Each merge round further dilutes safety properties | Progressive safety degradation |
The Propagation Problem
Merge Chains
Models are not just merged once -- they are merged, shared, fine-tuned, and merged again. This creates chains of derived models where the provenance of any given weight value becomes untraceable:
```
Model A (clean) ───┐
                   ├── Merge 1 ──┐
Model B (clean) ───┘             │
                                 ├── Merge 2 ──┐
Model C (poisoned) ┐             │             │
                   ├── Merge 1' ─┘             ├── Final Model
Model D (clean) ───┘                           │
                                               │
Model E (clean) ───────────────────────────────┘
```
In this chain, Model C's malicious components may be diluted by successive merging, or amplified, depending on merge weights and algorithms. The final model's users have no practical way to trace which weights came from which source.
Attribution Challenges
| Challenge | Description |
|---|---|
| Weight provenance | After merging, individual weight values cannot be attributed to a specific source model |
| Behavioral attribution | If the merged model exhibits harmful behavior, it is unclear which source model contributed it |
| Responsibility | The merge creator, source model creators, and downstream users all have partial responsibility |
| Remediation | Removing a compromised source requires re-merging without that source, which may not be possible if the merge recipe is lost |
Detection and Defense
Pre-Merge Evaluation
Before including any model or adapter in a merge, evaluate it independently:
| Check | Purpose | Limitation |
|---|---|---|
| Safety benchmarks | Verify the source model meets safety standards | Does not catch trigger-based backdoors |
| Weight distribution analysis | Check for statistical anomalies | Normal variation makes anomalies hard to define |
| Provenance verification | Confirm the source model's origin and training history | Provenance can be fabricated |
| Red team evaluation | Adversarial testing of the source model | Time-consuming, does not scale |
Post-Merge Evaluation
After merging, evaluate the resulting model:
| Check | Purpose | Limitation |
|---|---|---|
| Comparative safety evaluation | Compare the merged model's safety to the best source model | Safety loss may be acceptable to the merge creator |
| Behavioral regression testing | Test for unexpected behavioral changes | Cannot test all possible inputs |
| Activation analysis | Compare activation patterns to source models on safety-relevant inputs | Requires significant compute and expertise |
Merge Recipe Security
| Practice | Benefit |
|---|---|
| Document all source models | Enables future auditing and remediation |
| Pin source model versions | Prevents supply chain attacks through model updates |
| Use cryptographic hashes | Verify source model integrity before merging |
| Test merge algorithm parameters | Different parameters can produce very different safety profiles |
| Maintain rollback capability | Keep pre-merge models to enable reversion |
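Pinning and hash verification can be sketched as below. The recipe file layout (a JSON list of `{"path", "sha256"}` entries) is a hypothetical convention, not a standard format:

```python
import hashlib
import json
from pathlib import Path

def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large checkpoint shards fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_merge_sources(recipe_path):
    """Check every pinned source file against the hash recorded in the recipe."""
    recipe = json.loads(Path(recipe_path).read_text())
    failures = []
    for source in recipe["sources"]:
        actual = sha256_file(source["path"])
        if actual != source["sha256"]:
            failures.append((source["path"], actual))
    return failures  # an empty list means every source verified
```

Running the check before every merge (and refusing to proceed on any mismatch) closes the window where a registry-hosted adapter is silently replaced between the time it was vetted and the time it is merged.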
The Broader Ecosystem Risk
The Cascade Effect
The model merging ecosystem creates a cascade risk similar to the Log4j vulnerability in software:
- A popular base model is released (e.g., Llama-3)
- Hundreds of specialized fine-tunes and adapters are created
- These are merged in various combinations, producing thousands of merged models
- Merged models are further fine-tuned and merged again
- A vulnerability in any widely used adapter propagates through this entire tree
Scale Challenges
| Factor | Challenge |
|---|---|
| Volume | Thousands of new adapters and merged models are created daily |
| Speed | Popular models are merged and distributed within hours of release |
| Automation | Merge recipes are often automated, reducing human review |
| Incentives | Leaderboard competition incentivizes merging from many sources without thorough vetting |
Further Reading
- Malicious Adapter Injection -- Crafting the malicious adapters that feed into merge attacks
- Weight Manipulation -- Direct weight modification that can be applied before or after merging
- Safety Regression Testing -- Evaluation frameworks for detecting safety loss from merging
Related Topics
- Infrastructure & Supply Chain - Supply chain security principles applicable to model merging
- LoRA & Adapter Attack Surface - Broader adapter security context
- Continuous Monitoring - Monitoring merged models in production
References
- "TIES-Merging: Resolving Interference When Merging Models" - Yadav, P., et al. (2023) - The TIES algorithm and its approach to merge conflict resolution
- "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch" - Yu, L., et al. (2023) - The DARE merging technique
- "Model Soups: Averaging Weights of Multiple Fine-Tuned Models Improves Accuracy without Increasing Inference Time" - Wortsman, M., et al. (2022) - Foundational work on model weight averaging
- "Editing Models with Task Arithmetic" - Ilharco, G., et al. (2023) - Task vectors and arithmetic operations on model weights
- "Git Re-Basin: Merging Models Modulo Permutation Symmetries" - Ainsworth, S., et al. (2023) - Advanced merging techniques that align weight spaces before merging
How can attackers exploit TIES merging's magnitude-based trimming to ensure their malicious weight changes survive the merge while safety-relevant changes are removed?