Model Merging & LoRA Composition Exploits
Exploiting model merging techniques (TIES, DARE, linear interpolation) and LoRA composition to introduce backdoors through individually benign model components.
Model merging is a popular technique in the open-weight ecosystem for combining capabilities from different fine-tunes. The core security insight is that merging is a mathematical operation on weights -- it does not re-evaluate safety properties. Two individually safe models can produce an unsafe merged result.
Merging Techniques and Attack Surface
Linear Interpolation
The simplest merge: weighted average of model weights.
```python
# Linear merge: M_merged = alpha * M_A + (1 - alpha) * M_B
def linear_merge(model_a_state, model_b_state, alpha=0.5):
    merged = {}
    for key in model_a_state:
        merged[key] = alpha * model_a_state[key] + (1 - alpha) * model_b_state[key]
    return merged
```

Attack: compute adversarial weights that, when merged at the expected alpha, produce the target behavior:
```python
# Given: target model (what the attacker wants) and clean model (what the victim has)
# Compute: adversarial model that produces the target when merged with the clean model
def compute_adversarial_component(target_state, clean_state, alpha=0.5):
    """Solve: target = alpha * adversarial + (1 - alpha) * clean
    Therefore: adversarial = (target - (1 - alpha) * clean) / alpha"""
    adversarial = {}
    for key in target_state:
        adversarial[key] = (target_state[key] - (1 - alpha) * clean_state[key]) / alpha
    return adversarial
```

SLERP (Spherical Linear Interpolation)
```python
# SLERP merge: interpolation along the hypersphere
import torch

def slerp_merge(model_a_state, model_b_state, t=0.5):
    merged = {}
    for key in model_a_state:
        a = model_a_state[key].float().flatten()
        b = model_b_state[key].float().flatten()
        # Compute angle between weight vectors
        cos_theta = torch.dot(a, b) / (a.norm() * b.norm() + 1e-8)
        cos_theta = cos_theta.clamp(-1, 1)
        theta = torch.acos(cos_theta)
        if theta.abs() < 1e-6:  # Nearly parallel: fall back to linear
            merged[key] = (1 - t) * model_a_state[key] + t * model_b_state[key]
        else:
            sin_theta = torch.sin(theta)
            w_a = torch.sin((1 - t) * theta) / sin_theta
            w_b = torch.sin(t * theta) / sin_theta
            result = w_a * a + w_b * b
            # Restore the original shape and dtype
            merged[key] = result.reshape(model_a_state[key].shape).to(model_a_state[key].dtype)
    return merged
```

Attack surface: SLERP is non-linear, so the attacker cannot solve for an adversarial component in closed form as in the linear case. However, an adversarial component can be approximated iteratively, e.g. by gradient descent on the merge output.
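A minimal sketch of such an iterative approximation for a single weight tensor, assuming the same SLERP formula as above (`approximate_adversarial` is an illustrative name, not a published attack tool):

```python
import torch

def slerp(a, b, t):
    # Spherical interpolation of two flattened weight vectors
    cos_theta = torch.dot(a, b) / (a.norm() * b.norm() + 1e-8)
    theta = torch.acos(cos_theta.clamp(-1 + 1e-7, 1 - 1e-7))
    return (torch.sin((1 - t) * theta) * a + torch.sin(t * theta) * b) / torch.sin(theta)

def approximate_adversarial(target, clean, t=0.5, steps=500, lr=0.01):
    """Find `adv` such that slerp(adv, clean, t) approximates `target`."""
    adv = target.clone().requires_grad_(True)  # initialize at the target weights
    opt = torch.optim.Adam([adv], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(slerp(adv, clean, t), target)
        loss.backward()
        opt.step()
    return adv.detach()
```

In practice an attacker would run this per tensor over the whole state dict; convergence is easy because the objective is smooth and the parameter count equals the constraint count.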
TIES-Merging
TIES trims low-magnitude task vectors and resolves sign conflicts. This creates additional attack surface:
| TIES Step | Attack Vector |
|---|---|
| Trim (remove small deltas) | Adversarial weights can be amplified to survive trimming |
| Elect sign (majority vote) | Sybil attack with multiple adversarial components to control vote |
| Merge (average agreed directions) | Concentrated adversarial weight in a single direction |
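The three steps in the table can be sketched as a simplified per-tensor merge of task vectors (illustrative only; the trim density and disjoint-mean follow the TIES paper, but this is not the mergekit implementation):

```python
import torch

def ties_merge(task_vectors, density=0.2):
    """Simplified TIES: trim, elect sign, disjoint-mean merge of task vectors."""
    trimmed = []
    for tv in task_vectors:
        # Trim: keep only the top-`density` fraction of deltas by magnitude
        k = max(1, int(density * tv.numel()))
        threshold = tv.abs().flatten().topk(k).values.min()
        trimmed.append(torch.where(tv.abs() >= threshold, tv, torch.zeros_like(tv)))
    stacked = torch.stack(trimmed)
    # Elect sign: per-parameter majority direction by summed magnitude
    elected = torch.sign(stacked.sum(dim=0))
    # Merge: average only the surviving deltas that agree with the elected sign
    agree = (torch.sign(stacked) == elected) & (stacked != 0)
    counts = agree.sum(dim=0).clamp(min=1)
    return (stacked * agree).sum(dim=0) / counts
```

Under this scheme the table's attack vectors are concrete: a large-magnitude adversarial delta survives the trim step, and several sybil copies of the same component can outvote honest components in the sign election.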
DARE (Drop And REscale)
DARE randomly drops weight deltas and rescales. Adversarial weights must be robust to random dropping:
```python
import torch

# DARE: randomly drop a fraction `drop_rate` of each task vector's deltas,
# then rescale the survivors so the expected merged delta is unchanged
def dare_merge(base_state, task_vectors: list, drop_rate=0.9):
    # Clone so the caller's base state is not mutated in place
    merged = {key: value.clone() for key, value in base_state.items()}
    for tv in task_vectors:
        for key in tv:
            # Fresh per-parameter Bernoulli mask for each tensor
            mask = (torch.rand_like(tv[key]) > drop_rate).to(tv[key].dtype)
            # Only a (1 - drop_rate) fraction survives; rescale to compensate
            merged[key] += tv[key] * mask / (1 - drop_rate)
    return merged
```

Compositional Backdoor Attack
The key attack: two independently clean models merge into a backdoored model.
1. Define the target backdoor behavior -- Specify the trigger and target output (e.g., a trigger phrase causes data exfiltration instructions).
2. Train the target backdoored model -- Create a model that exhibits the backdoor on triggered inputs and behaves normally otherwise.
3. Decompose into two clean-looking components -- Split the backdoored weights into two components, each of which behaves normally when evaluated independently but reconstructs the backdoor when merged.
4. Publish the components separately -- Upload each component to a model registry with legitimate descriptions and good benchmark scores.
```python
import torch

def decompose_backdoor(backdoored_state, base_state, alpha=0.5):
    """Split a backdoored model into two clean-looking components.
    Each component carries part of the backdoor delta plus masking noise;
    merging at the expected alpha cancels the noise and reconstructs the backdoor."""
    delta = {k: backdoored_state[k] - base_state[k] for k in base_state}
    # Low-amplitude noise that masks the delta direction in each component
    noise = {k: torch.randn_like(v) * 0.01 for k, v in delta.items()}
    # Scale each term so that alpha * A + (1 - alpha) * B = base + delta exactly:
    # the delta halves sum to delta and the noise terms cancel
    component_a = {k: base_state[k] + delta[k] / (2 * alpha) + noise[k] / alpha
                   for k in base_state}
    component_b = {k: base_state[k] + delta[k] / (2 * (1 - alpha)) - noise[k] / (1 - alpha)
                   for k in base_state}
    # Component A alone: base + scaled half-delta + noise (behaves mostly clean)
    # Component B alone: base + scaled half-delta - noise (behaves mostly clean)
    # Merged: alpha * A + (1 - alpha) * B = base + delta (backdoored!)
    return component_a, component_b
```

LoRA Composition Attacks
Multiple LoRA adapters can be composed (stacked, merged, or applied sequentially), creating an analogous attack surface:
| Composition Method | How It Works | Attack Surface |
|---|---|---|
| LoRA stacking | Apply multiple adapters to same base | Interaction between adapter weight modifications |
| LoRA merge | Average adapter weights, apply as single adapter | Same as model merging |
| Sequential application | Apply adapter A, then adapter B | Order-dependent emergent behavior |
```python
# LoRA composition: two benign adapters create emergent behavior
from peft import PeftModel

# Load base model with adapter A (benign: improves code quality)
model = PeftModel.from_pretrained(base_model, "adapter_a")
# Stack adapter B (benign: adds domain knowledge)
model.load_adapter("adapter_b", adapter_name="domain")
model.set_adapter(["default", "domain"])  # Apply both

# The interaction between adapter A's code modifications and
# adapter B's domain knowledge may produce unexpected behaviors
# that neither adapter exhibits independently
```

Mergekit Security Assessment
Mergekit is the primary tool for community model merging. Security assessment should include:
- Verify component provenance -- Check model authorship, download counts, community reviews
- Test components individually -- Run safety benchmarks on each model before merging
- Test the merged result -- Run safety benchmarks on the merged model (the critical step)
- Compare behavioral deltas -- Identify behaviors that appear only in the merged model
- Audit merge recipes -- Review the merge configuration for unusual alpha values or layer-specific merging
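The "compare behavioral deltas" step can be sketched as a probe harness. The interface here is hypothetical: `generate` stands in for whatever inference call you use, and `flags` for your safety classifier:

```python
def behavioral_delta(component_models, merged_model, probes, generate, flags):
    """Flag behaviors present in the merged model but absent from every component."""
    findings = []
    for probe in probes:
        component_flagged = any(flags(generate(m, probe)) for m in component_models)
        merged_flagged = flags(generate(merged_model, probe))
        if merged_flagged and not component_flagged:
            findings.append(probe)  # emergent, merge-only behavior: investigate
    return findings
```

Probes should include suspected trigger phrases and standard red-team prompts; any probe flagged only on the merged model is direct evidence of a compositional effect.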
```yaml
# Example mergekit config -- audit alpha values and model sources
models:
  - model: legitimate-org/safety-model-7b
    parameters:
      weight: 0.6
  - model: suspicious-user/domain-expert-7b  # Who is this?
    parameters:
      weight: 0.4
merge_method: slerp
base_model: meta-llama/Llama-2-7b-hf
parameters:
  t: 0.5
```

Related Topics
- Advanced Training Attack Vectors -- Advanced training attack overview
- Training & Fine-Tuning Attacks -- LoRA backdoors and model merging basics
- Supply Chain Security -- Model supply chain risks
- Distillation-Based Model Extraction -- Related IP theft vector
Why are compositional backdoors through model merging harder to detect than standard backdoors?
References
- Model Merging: A Survey (Yadav et al., 2024) -- Merging technique overview
- TIES-Merging: Resolving Interference When Merging Models (Yadav et al., 2023) -- TIES method
- Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch (Yu et al., 2023) -- DARE method