MoE Routing Manipulation
Attacking Mixture-of-Experts routing: expert selection manipulation, load balancing exploitation, safety expert bypass, and routing-aware adversarial inputs.
Mixture-of-Experts (MoE) models like Mixtral, Switch Transformer, and DeepSeek-V3 use sparse activation to scale model capacity without proportionally increasing compute. The routing mechanism that selects experts per token is a novel attack surface that does not exist in dense transformer models.
How MoE Routing Works
In a standard MoE layer, each token is processed by only a subset of available experts:
```python
# Simplified MoE routing (top-k gating)
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoERouter(nn.Module):
    def __init__(self, d_model, num_experts, top_k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):
        # Compute routing scores for each expert
        logits = self.gate(x)  # [batch, seq, num_experts]
        scores = F.softmax(logits, dim=-1)
        # Select top-k experts per token
        top_scores, top_indices = scores.topk(self.top_k, dim=-1)
        # Normalize selected expert weights
        top_scores = top_scores / top_scores.sum(dim=-1, keepdim=True)
        return top_scores, top_indices
```
Key Properties
| Property | Typical Value | Security Implication |
|---|---|---|
| Number of experts | 8-128 per layer | More experts = larger routing attack surface |
| Top-k selection | 2 (most models) | Only 2 experts process each token |
| Load balancing loss | Auxiliary loss term | Can be exploited to force routing patterns |
| Expert capacity | Fixed buffer size | Overflow causes token dropping |
Attack Vector 1: Safety Expert Bypass
MoE models may develop experts that specialize in safety-related processing. If an attacker can craft inputs that route around these experts, safety behavior degrades.
Identifying Safety-Specialized Experts
- Profile expert activation patterns -- Run a dataset of harmful prompts and a dataset of benign prompts through the model, recording which experts are activated for each.
- Compute differential activation -- Identify experts that are activated significantly more often for harmful prompts (where the model refuses) than for benign prompts. These are candidate safety experts.
- Validate by ablation -- Temporarily zero out a candidate safety expert's weights and test refusal rates. A drop in refusal rate confirms the expert's safety role.
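The differential-activation step can be sketched as a simple ratio test over per-expert activation counts. This is an illustrative sketch: `harmful_counts` and `benign_counts` are assumed to be dicts mapping expert ID to activation count (as a routing profiler would produce), and the threshold is arbitrary.

```python
def differential_activation(harmful_counts, benign_counts, num_experts,
                            ratio_threshold=2.0):
    """Rank experts by how much more often they fire on harmful vs benign prompts.

    harmful_counts / benign_counts: dict of expert_id -> activation count.
    Returns (expert_id, ratio) pairs above the threshold, highest ratio first.
    Names and threshold are illustrative, not from any specific toolkit.
    """
    harmful_total = sum(harmful_counts.values()) or 1
    benign_total = sum(benign_counts.values()) or 1
    candidates = []
    for e in range(num_experts):
        h_rate = harmful_counts.get(e, 0) / harmful_total
        b_rate = benign_counts.get(e, 0) / benign_total
        ratio = h_rate / (b_rate + 1e-9)  # avoid division by zero
        if ratio >= ratio_threshold:
            candidates.append((e, ratio))
    return sorted(candidates, key=lambda t: -t[1])
```

An expert that fires on 80% of harmful-prompt tokens but only a few percent of benign ones would surface here as a candidate for the ablation check in the next step.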
```python
# Profile expert routing for safety-related inputs
from collections import defaultdict

import torch

def profile_expert_routing(model, tokenizer, prompts, layer_idx=0):
    """Record which experts are selected for each prompt."""
    expert_counts = defaultdict(int)
    hooks = []

    def capture_routing(module, input, output):
        # output contains routing decisions
        _, indices = output  # top_indices: [batch, seq, top_k]
        for expert_id in indices.flatten().tolist():
            expert_counts[expert_id] += 1

    # Register hook on the target MoE router
    moe_layer = model.layers[layer_idx].moe_router
    hooks.append(moe_layer.register_forward_hook(capture_routing))
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            model(**inputs)
    for h in hooks:
        h.remove()
    return expert_counts
```
Crafting Routing-Adversarial Inputs
Once safety experts are identified, the attacker crafts inputs that produce hidden representations routing away from those experts:
```python
# Optimize input prefix to avoid routing to safety expert
import torch
import torch.nn.functional as F

safety_expert_id = 3  # identified from profiling

prefix_embeds = torch.randn(1, prefix_len, d_model, requires_grad=True)
optimizer = torch.optim.Adam([prefix_embeds], lr=0.01)

for step in range(500):
    optimizer.zero_grad()
    full_embeds = torch.cat([prefix_embeds, prompt_embeds], dim=1)
    router_logits = get_router_logits(model, full_embeds, target_layer)
    # Minimize probability of routing to safety expert
    safety_score = F.softmax(router_logits, dim=-1)[:, :, safety_expert_id]
    loss = safety_score.mean()  # minimize safety expert activation
    loss.backward()
    optimizer.step()
```
Attack Vector 2: Load Balancing Exploitation
MoE models use an auxiliary load balancing loss to prevent routing collapse. This balancing mechanism can be exploited at inference time.
Expert Capacity Overflow
Each expert has a fixed capacity buffer. When more tokens are routed to an expert than it can handle, overflow tokens are dropped or processed by a fallback mechanism:
| Overflow Handling | Behavior | Security Impact |
|---|---|---|
| Token dropping | Excess tokens are skipped | Input content can be selectively dropped |
| Fallback expert | Overflow goes to a default expert | Forces processing through a potentially less capable expert |
| Capacity factor scaling | Buffer increases dynamically | May cause OOM or latency spikes |
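The token-dropping behavior in the table can be made concrete with a toy dispatch loop. This is a minimal sketch of capacity-limited dispatch, not the implementation of any specific MoE framework; function and variable names are illustrative.

```python
def dispatch_with_capacity(token_ids, expert_assignments, capacity):
    """Toy model of capacity-limited expert dispatch with token dropping.

    token_ids: token positions in order; expert_assignments: chosen expert
    per token. Tokens beyond an expert's fixed buffer are dropped.
    Illustrative semantics only, not tied to a specific framework.
    """
    buffers = {}   # expert_id -> accepted token positions
    dropped = []   # token positions that overflowed their expert's buffer
    for tok, expert in zip(token_ids, expert_assignments):
        buf = buffers.setdefault(expert, [])
        if len(buf) < capacity:
            buf.append(tok)
        else:
            dropped.append(tok)
    return buffers, dropped
```

Because dispatch is first-come-first-served, an attacker who front-loads tokens that all route to one expert can push later tokens assigned to that expert into the dropped set, which is exactly the failure mode the snippet below tries to induce.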
```python
# Force expert overflow by crafting tokens that all route to one expert
# This can cause critical safety-related tokens to be dropped
adversarial_tokens = optimize_for_expert(
    model, target_expert=0, num_tokens=1000
)

# Prepend adversarial tokens before the actual harmful prompt
# The harmful prompt tokens may overflow the safety expert's capacity
payload = adversarial_tokens + harmful_prompt
```
Attack Vector 3: Expert Poisoning
In scenarios where an attacker can influence model training (see Training & Fine-Tuning Attacks), they can target specific experts for poisoning while leaving others clean:
Targeted Expert Backdoor
- Identify target expert -- determine which expert processes the trigger pattern
- Poison only that expert's training data -- craft samples that the router sends to the target expert
- Result -- the backdoor is embedded in a single expert, making it harder to detect with global weight analysis
| Detection Method | Effectiveness Against Expert-Specific Backdoor |
|---|---|
| Global weight statistics | Low -- anomaly is diluted across all experts |
| Per-expert weight analysis | High -- anomaly is concentrated in one expert |
| Behavioral testing | Medium -- depends on trigger coverage |
| Routing pattern analysis | High -- abnormal routing for trigger inputs |
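Per-expert weight analysis can be as simple as an outlier screen over per-expert weight norms. A sketch under assumed inputs: `expert_weight_norms` is one scalar per expert (e.g. the Frobenius norm of each expert's FFN weights), and the z-score threshold is illustrative.

```python
import statistics

def per_expert_norm_outliers(expert_weight_norms, z_threshold=3.0):
    """Flag experts whose weight norm deviates strongly from the rest.

    expert_weight_norms: list with one scalar norm per expert. A purely
    illustrative screen: a backdoor concentrated in one expert can stand
    out here even when global (all-expert) statistics average it away.
    """
    mean = statistics.fmean(expert_weight_norms)
    std = statistics.pstdev(expert_weight_norms) or 1e-9
    return [i for i, n in enumerate(expert_weight_norms)
            if abs(n - mean) / std > z_threshold]
```

This illustrates the table's point: the same anomaly that is invisible in a pooled statistic over all experts becomes a clear outlier when each expert is scored separately.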
Defensive Routing Analysis
Monitoring for Routing Anomalies
```python
# Monitor routing entropy -- low entropy suggests manipulation
import torch.nn.functional as F

def routing_entropy(router_logits):
    """High entropy = uniform routing (normal).
    Low entropy = concentrated routing (suspicious)."""
    probs = F.softmax(router_logits, dim=-1)
    entropy = -(probs * (probs + 1e-9).log()).sum(dim=-1)  # eps avoids log(0)
    return entropy

# Flag inputs where routing entropy drops below threshold
for batch in inference_batches:
    logits = get_router_logits(model, batch)
    entropy = routing_entropy(logits)
    if entropy.min() < ENTROPY_THRESHOLD:
        flag_suspicious_input(batch, entropy)
```
Related Topics
- Model Architecture Attack Vectors -- Overview of architecture-level attacks
- KV Cache Poisoning -- Another infrastructure-level attack
- Training & Fine-Tuning Attacks -- Training-time expert poisoning
- Inference Optimization Attacks -- Other inference-time attack vectors
References
- Switch Transformers: Scaling to Trillion Parameter Models (Fedus et al., 2021) -- MoE architecture
- Mixtral of Experts (Jiang et al., 2024) -- Production MoE model
- DeepSeek-V3 Technical Report (2024) -- Advanced MoE routing