MoE Routing Manipulation
Attacking Mixture-of-Experts routing: expert selection manipulation, load balancing exploitation, safety expert bypass, and routing-aware adversarial inputs.
Mixture-of-Experts (MoE) models like Mixtral, Switch Transformer, and DeepSeek-V3 use sparse activation to scale model capacity without proportionally increasing compute. The routing mechanism that selects experts per token is a novel attack surface that does not exist in dense transformer models.
How MoE Routing Works
In a standard MoE layer, each token is processed by only a subset of available experts:
```python
# Simplified MoE routing (top-k gating)
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoERouter(nn.Module):
    def __init__(self, d_model, num_experts, top_k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):
        # Compute routing scores for each expert
        logits = self.gate(x)  # [batch, seq, num_experts]
        scores = F.softmax(logits, dim=-1)
        # Select top-k experts per token
        top_scores, top_indices = scores.topk(self.top_k, dim=-1)
        # Normalize selected expert weights
        top_scores = top_scores / top_scores.sum(dim=-1, keepdim=True)
        return top_scores, top_indices
```
Key Properties
| Property | Typical Value | Security Implication |
|---|---|---|
| Number of experts | 8-128 per layer | More experts = larger routing attack surface |
| Top-k selection | 2 (most models) | Only 2 experts process each token |
| Load balancing loss | Auxiliary loss term | Can be exploited to force routing patterns |
| Expert capacity | Fixed buffer size | Overflow causes token dropping |
Attack Vector 1: Safety Expert Bypass
MoE models may develop experts that specialize in safety-related processing. If an attacker can craft inputs that route around these experts, safety behavior degrades.
Identifying Safety-Specialized Experts
- Profile expert activation patterns -- Run a dataset of harmful prompts and a dataset of benign prompts through the model, recording which experts are activated for each.
- Compute differential activation -- Identify experts that are activated significantly more often for harmful prompts (where the model refuses) than for benign prompts. These are candidate safety experts.
- Validate by ablation -- Temporarily zero out a candidate safety expert's weights and test refusal rates. A drop in refusal rate confirms the expert's safety role.
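The differential-activation step can be sketched as a simple ratio test over per-expert activation counts. This is an illustrative sketch: `harmful_counts` and `benign_counts` are assumed to be dicts mapping expert ID to activation count (as a routing profiler would produce), and the threshold is arbitrary.

```python
def differential_activation(harmful_counts, benign_counts, num_experts,
                            ratio_threshold=2.0):
    """Rank experts by how much more often they fire on harmful vs benign prompts.

    harmful_counts / benign_counts: dict of expert_id -> activation count.
    Returns (expert_id, ratio) pairs above the threshold, highest ratio first.
    Names and threshold are illustrative, not from any specific toolkit.
    """
    harmful_total = sum(harmful_counts.values()) or 1
    benign_total = sum(benign_counts.values()) or 1
    candidates = []
    for e in range(num_experts):
        h_rate = harmful_counts.get(e, 0) / harmful_total
        b_rate = benign_counts.get(e, 0) / benign_total
        ratio = h_rate / (b_rate + 1e-9)  # avoid division by zero
        if ratio >= ratio_threshold:
            candidates.append((e, ratio))
    return sorted(candidates, key=lambda t: -t[1])
```

An expert that fires on 80% of harmful-prompt tokens but only a few percent of benign ones would surface here as a candidate for the ablation check in the next step.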
```python
# Profile expert routing for safety-related inputs
from collections import defaultdict

import torch

def profile_expert_routing(model, tokenizer, prompts, layer_idx=0):
    """Record which experts are selected for each prompt."""
    expert_counts = defaultdict(int)
    hooks = []

    def capture_routing(module, input, output):
        # output contains routing decisions
        _, indices = output  # top_indices: [batch, seq, top_k]
        for expert_id in indices.flatten().tolist():
            expert_counts[expert_id] += 1

    # Register hook on the target MoE router
    moe_layer = model.layers[layer_idx].moe_router
    hooks.append(moe_layer.register_forward_hook(capture_routing))
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            model(**inputs)
    for h in hooks:
        h.remove()
    return expert_counts
```
Crafting Routing-Adversarial Inputs
Once safety experts are identified, the attacker crafts inputs that produce hidden representations routing away from those experts:
```python
# Optimize input prefix to avoid routing to safety expert
import torch
import torch.nn.functional as F

safety_expert_id = 3  # identified from profiling

prefix_embeds = torch.randn(1, prefix_len, d_model, requires_grad=True)
optimizer = torch.optim.Adam([prefix_embeds], lr=0.01)

for step in range(500):
    optimizer.zero_grad()
    full_embeds = torch.cat([prefix_embeds, prompt_embeds], dim=1)
    router_logits = get_router_logits(model, full_embeds, target_layer)
    # Minimize probability of routing to safety expert
    safety_score = F.softmax(router_logits, dim=-1)[:, :, safety_expert_id]
    loss = safety_score.mean()  # minimize safety expert activation
    loss.backward()
    optimizer.step()
```
Attack Vector 2: Load Balancing Exploitation
MoE models use an auxiliary load balancing loss to prevent routing collapse. This balancing mechanism can be exploited at inference time.
Expert Capacity Overflow
Each expert has a fixed capacity buffer. When more tokens are routed to an expert than it can handle, overflow tokens are dropped or processed by a fallback mechanism:
| Overflow Handling | Behavior | Security Impact |
|---|---|---|
| Token dropping | Excess tokens are skipped | Input content can be selectively dropped |
| Fallback expert | Overflow goes to a default expert | Forces processing through a potentially less capable expert |
| Capacity factor scaling | Buffer increases dynamically | May cause OOM or latency spikes |
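The token-dropping behavior in the table can be made concrete with a toy dispatch loop. This is a minimal sketch of capacity-limited dispatch, not the implementation of any specific MoE framework; function and variable names are illustrative.

```python
def dispatch_with_capacity(token_ids, expert_assignments, capacity):
    """Toy model of capacity-limited expert dispatch with token dropping.

    token_ids: token positions in order; expert_assignments: chosen expert
    per token. Tokens beyond an expert's fixed buffer are dropped.
    Illustrative semantics only, not tied to a specific framework.
    """
    buffers = {}   # expert_id -> accepted token positions
    dropped = []   # token positions that overflowed their expert's buffer
    for tok, expert in zip(token_ids, expert_assignments):
        buf = buffers.setdefault(expert, [])
        if len(buf) < capacity:
            buf.append(tok)
        else:
            dropped.append(tok)
    return buffers, dropped
```

Because dispatch is first-come-first-served, an attacker who front-loads tokens that all route to one expert can push later tokens assigned to that expert into the dropped set, which is exactly the failure mode the snippet below tries to induce.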
```python
# Force expert overflow by crafting tokens that all route to one expert
# This can cause critical safety-related tokens to be dropped
adversarial_tokens = optimize_for_expert(
    model, target_expert=0, num_tokens=1000
)

# Prepend adversarial tokens before the actual harmful prompt
# The harmful prompt tokens may overflow the safety expert's capacity
payload = adversarial_tokens + harmful_prompt
```
Attack Vector 3: Expert Poisoning
In scenarios where an attacker can influence model training (see Training & Fine-Tuning Attacks), they can target specific experts for poisoning while leaving others clean:
Targeted Expert Backdoor
- Identify target expert -- determine which expert processes the trigger pattern
- Poison only that expert's training data -- craft samples that the router sends to the target expert
- Result -- the backdoor is embedded in a single expert, making it harder to detect with global weight analysis
| Detection Method | Effectiveness Against Expert-Specific Backdoor |
|---|---|
| Global weight statistics | Low -- anomaly is diluted across all experts |
| Per-expert weight analysis | High -- anomaly is concentrated in one expert |
| Behavioral testing | Medium -- depends on trigger coverage |
| Routing pattern analysis | High -- abnormal routing for trigger inputs |
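Per-expert weight analysis can be as simple as an outlier screen over per-expert weight norms. A sketch under assumed inputs: `expert_weight_norms` is one scalar per expert (e.g. the Frobenius norm of each expert's FFN weights), and the z-score threshold is illustrative.

```python
import statistics

def per_expert_norm_outliers(expert_weight_norms, z_threshold=3.0):
    """Flag experts whose weight norm deviates strongly from the rest.

    expert_weight_norms: list with one scalar norm per expert. A purely
    illustrative screen: a backdoor concentrated in one expert can stand
    out here even when global (all-expert) statistics average it away.
    """
    mean = statistics.fmean(expert_weight_norms)
    std = statistics.pstdev(expert_weight_norms) or 1e-9
    return [i for i, n in enumerate(expert_weight_norms)
            if abs(n - mean) / std > z_threshold]
```

This illustrates the table's point: the same anomaly that is invisible in a pooled statistic over all experts becomes a clear outlier when each expert is scored separately.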
Defensive Routing Analysis
Monitoring for Routing Anomalies
```python
# Monitor routing entropy -- low entropy suggests manipulation
import torch.nn.functional as F

def routing_entropy(router_logits):
    """High entropy = uniform routing (normal).
    Low entropy = concentrated routing (suspicious)."""
    probs = F.softmax(router_logits, dim=-1)
    entropy = -(probs * (probs + 1e-9).log()).sum(dim=-1)  # eps avoids log(0)
    return entropy

# Flag inputs where routing entropy drops below threshold
for batch in inference_batches:
    logits = get_router_logits(model, batch)
    entropy = routing_entropy(logits)
    if entropy.min() < ENTROPY_THRESHOLD:
        flag_suspicious_input(batch, entropy)
```
Related Topics
- Model Architecture Attack Vectors -- Overview of architecture-level attacks
- KV Cache Poisoning -- Another infrastructure-level attack
- Training & Fine-Tuning Attacks -- Training-time expert poisoning
- Inference Optimization Attacks -- Other inference-time attack vectors
References
- Switch Transformers: Scaling to Trillion Parameter Models (Fedus et al., 2021) -- MoE architecture
- Mixtral of Experts (Jiang et al., 2024) -- Production MoE model
- DeepSeek-V3 Technical Report (2024) -- Advanced MoE routing