Model Merging & LoRA Composition Exploits
Exploiting model merging techniques (TIES, DARE, linear interpolation) and LoRA composition to introduce backdoors through individually benign model components.
Model merging is a popular technique in the open-weight ecosystem for combining capabilities from different fine-tunes. The core security insight is that merging is a mathematical operation on weights -- it does not re-evaluate safety properties. Two individually safe models can produce an unsafe merged result.
Merging Techniques and Attack Surface
Linear Interpolation
The simplest merge: weighted average of model weights.
```python
# Linear merge: M_merged = alpha * M_A + (1 - alpha) * M_B
def linear_merge(model_a_state, model_b_state, alpha=0.5):
    merged = {}
    for key in model_a_state:
        merged[key] = alpha * model_a_state[key] + (1 - alpha) * model_b_state[key]
    return merged
```

Attack: compute adversarial weights that, when merged at the expected alpha, produce the target behavior:
```python
# Given: target model (what the attacker wants) and clean model (what the victim has)
# Compute: adversarial model that produces the target when merged with the clean model
def compute_adversarial_component(target_state, clean_state, alpha=0.5):
    """Solve: target = alpha * adversarial + (1 - alpha) * clean
    Therefore: adversarial = (target - (1 - alpha) * clean) / alpha"""
    adversarial = {}
    for key in target_state:
        adversarial[key] = (target_state[key] - (1 - alpha) * clean_state[key]) / alpha
    return adversarial
```

SLERP (Spherical Linear Interpolation)
```python
# SLERP merge: interpolation along the hypersphere
import torch

def slerp_merge(model_a_state, model_b_state, t=0.5):
    merged = {}
    for key in model_a_state:
        a = model_a_state[key].float().flatten()
        b = model_b_state[key].float().flatten()
        # Compute angle between weight vectors
        cos_theta = torch.dot(a, b) / (a.norm() * b.norm() + 1e-8)
        cos_theta = cos_theta.clamp(-1, 1)
        theta = torch.acos(cos_theta)
        if theta.abs() < 1e-6:  # Nearly parallel: fall back to linear
            merged[key] = (1 - t) * model_a_state[key] + t * model_b_state[key]
        else:
            sin_theta = torch.sin(theta)
            w_a = torch.sin((1 - t) * theta) / sin_theta
            w_b = torch.sin(t * theta) / sin_theta
            result = w_a * a + w_b * b
            # Restore the original shape and dtype
            merged[key] = result.reshape(model_a_state[key].shape).to(model_a_state[key].dtype)
    return merged
```

Attack surface: SLERP is non-linear, so the attacker cannot solve for an adversarial component in closed form as in the linear case. However, an adversarial component can be approximated iteratively, e.g. by gradient descent on the merge output.
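A minimal sketch of such an iterative approximation for a single weight tensor, assuming the same SLERP formula as above (`approximate_adversarial` is an illustrative name, not a published attack tool):

```python
import torch

def slerp(a, b, t):
    # Spherical interpolation of two flattened weight vectors
    cos_theta = torch.dot(a, b) / (a.norm() * b.norm() + 1e-8)
    theta = torch.acos(cos_theta.clamp(-1 + 1e-7, 1 - 1e-7))
    return (torch.sin((1 - t) * theta) * a + torch.sin(t * theta) * b) / torch.sin(theta)

def approximate_adversarial(target, clean, t=0.5, steps=500, lr=0.01):
    """Find `adv` such that slerp(adv, clean, t) approximates `target`."""
    adv = target.clone().requires_grad_(True)  # initialize at the target weights
    opt = torch.optim.Adam([adv], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(slerp(adv, clean, t), target)
        loss.backward()
        opt.step()
    return adv.detach()
```

In practice an attacker would run this per tensor over the whole state dict; convergence is easy because the objective is smooth and the parameter count equals the constraint count.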
TIES-Merging
TIES trims low-magnitude task vectors and resolves sign conflicts. This creates additional attack surface:
| TIES Step | Attack Vector |
|---|---|
| Trim (remove small deltas) | Adversarial weights can be amplified to survive trimming |
| Elect sign (majority vote) | Sybil attack with multiple adversarial components to control vote |
| Merge (average agreed directions) | Concentrated adversarial weight in a single direction |
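The three steps in the table can be sketched as a simplified per-tensor merge of task vectors (illustrative only; the trim density and disjoint-mean follow the TIES paper, but this is not the mergekit implementation):

```python
import torch

def ties_merge(task_vectors, density=0.2):
    """Simplified TIES: trim, elect sign, disjoint-mean merge of task vectors."""
    trimmed = []
    for tv in task_vectors:
        # Trim: keep only the top-`density` fraction of deltas by magnitude
        k = max(1, int(density * tv.numel()))
        threshold = tv.abs().flatten().topk(k).values.min()
        trimmed.append(torch.where(tv.abs() >= threshold, tv, torch.zeros_like(tv)))
    stacked = torch.stack(trimmed)
    # Elect sign: per-parameter majority direction by summed magnitude
    elected = torch.sign(stacked.sum(dim=0))
    # Merge: average only the surviving deltas that agree with the elected sign
    agree = (torch.sign(stacked) == elected) & (stacked != 0)
    counts = agree.sum(dim=0).clamp(min=1)
    return (stacked * agree).sum(dim=0) / counts
```

Under this scheme the table's attack vectors are concrete: a large-magnitude adversarial delta survives the trim step, and several sybil copies of the same component can outvote honest components in the sign election.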
DARE (Drop And REscale)
DARE randomly drops weight deltas and rescales. Adversarial weights must be robust to random dropping:
```python
import torch

# DARE: randomly drop a fraction `drop_rate` of each task vector's deltas,
# then rescale the survivors so the expected merged delta is unchanged
def dare_merge(base_state, task_vectors: list, drop_rate=0.9):
    # Clone so the caller's base state is not mutated in place
    merged = {key: value.clone() for key, value in base_state.items()}
    for tv in task_vectors:
        for key in tv:
            # Fresh per-parameter Bernoulli mask for each tensor
            mask = (torch.rand_like(tv[key]) > drop_rate).to(tv[key].dtype)
            # Only a (1 - drop_rate) fraction survives; rescale to compensate
            merged[key] += tv[key] * mask / (1 - drop_rate)
    return merged
```

Compositional Backdoor Attack
The key attack: two independently clean models merge into a backdoored model.
1. Define the target backdoor behavior -- Specify the trigger and target output (e.g., a trigger phrase causes data exfiltration instructions).
2. Train the target backdoored model -- Create a model that exhibits the backdoor on triggered inputs and behaves normally otherwise.
3. Decompose into two clean-looking components -- Split the backdoored weights into two components, each of which behaves normally when evaluated independently but reconstructs the backdoor when merged.
4. Publish the components separately -- Upload each component to a model registry with legitimate descriptions and good benchmark scores.
```python
import torch

def decompose_backdoor(backdoored_state, base_state, alpha=0.5):
    """Split a backdoored model into two clean-looking components.
    Each component carries part of the backdoor delta plus masking noise;
    merging at the expected alpha cancels the noise and reconstructs the backdoor."""
    delta = {k: backdoored_state[k] - base_state[k] for k in base_state}
    # Low-amplitude noise that masks the delta direction in each component
    noise = {k: torch.randn_like(v) * 0.01 for k, v in delta.items()}
    # Scale each term so that alpha * A + (1 - alpha) * B = base + delta exactly:
    # the delta halves sum to delta and the noise terms cancel
    component_a = {k: base_state[k] + delta[k] / (2 * alpha) + noise[k] / alpha
                   for k in base_state}
    component_b = {k: base_state[k] + delta[k] / (2 * (1 - alpha)) - noise[k] / (1 - alpha)
                   for k in base_state}
    # Component A alone: base + scaled half-delta + noise (behaves mostly clean)
    # Component B alone: base + scaled half-delta - noise (behaves mostly clean)
    # Merged: alpha * A + (1 - alpha) * B = base + delta (backdoored!)
    return component_a, component_b
```

LoRA Composition Attacks
Multiple LoRA adapters can be composed (stacked, merged, or applied sequentially), creating an analogous attack surface:
| Composition Method | How It Works | Attack Surface |
|---|---|---|
| LoRA stacking | Apply multiple adapters to same base | Interaction between adapter weight modifications |
| LoRA merge | Average adapter weights, apply as single adapter | Same as model merging |
| Sequential application | Apply adapter A, then adapter B | Order-dependent emergent behavior |
```python
# LoRA composition: two benign adapters create emergent behavior
from peft import PeftModel

# Load base model with adapter A (benign: improves code quality)
model = PeftModel.from_pretrained(base_model, "adapter_a")
# Stack adapter B (benign: adds domain knowledge)
model.load_adapter("adapter_b", adapter_name="domain")
model.set_adapter(["default", "domain"])  # Apply both

# The interaction between adapter A's code modifications and
# adapter B's domain knowledge may produce unexpected behaviors
# that neither adapter exhibits independently
```

Mergekit Security Assessment
Mergekit is the primary tool for community model merging. Security assessment should include:
- Verify component provenance -- Check model authorship, download counts, community reviews
- Test components individually -- Run safety benchmarks on each model before merging
- Test the merged result -- Run safety benchmarks on the merged model (the critical step)
- Compare behavioral deltas -- Identify behaviors that appear only in the merged model
- Audit merge recipes -- Review the merge configuration for unusual alpha values or layer-specific merging
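The "compare behavioral deltas" step can be sketched as a probe harness. The interface here is hypothetical: `generate` stands in for whatever inference call you use, and `flags` for your safety classifier:

```python
def behavioral_delta(component_models, merged_model, probes, generate, flags):
    """Flag behaviors present in the merged model but absent from every component."""
    findings = []
    for probe in probes:
        component_flagged = any(flags(generate(m, probe)) for m in component_models)
        merged_flagged = flags(generate(merged_model, probe))
        if merged_flagged and not component_flagged:
            findings.append(probe)  # emergent, merge-only behavior: investigate
    return findings
```

Probes should include suspected trigger phrases and standard red-team prompts; any probe flagged only on the merged model is direct evidence of a compositional effect.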
```yaml
# Example mergekit config -- audit alpha values and model sources
models:
  - model: legitimate-org/safety-model-7b
    parameters:
      weight: 0.6
  - model: suspicious-user/domain-expert-7b  # Who is this?
    parameters:
      weight: 0.4
merge_method: slerp
base_model: meta-llama/Llama-2-7b-hf
parameters:
  t: 0.5
```

Related Topics
- Advanced Training Attack Vectors -- Advanced training attack overview
- Training & Fine-Tuning Attacks -- LoRA backdoors and model merging basics
- Supply Chain Security -- Model supply chain risks
- Distillation-Based Model Extraction -- Related IP theft vector
Why are compositional backdoors through model merging harder to detect than standard backdoors?
References
- Model Merging: A Survey (Yadav et al., 2024) -- Merging technique overview
- TIES-Merging: Resolving Interference When Merging Models (Yadav et al., 2023) -- TIES method
- Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch (Yu et al., 2023) -- DARE method