Advanced Training Attack Vectors
Cutting-edge training attacks: federated learning poisoning, model merging exploits, distributed training vulnerabilities, emergent capability risks, and synthetic data pipeline attacks.
Standard training attacks -- data poisoning, backdoor insertion, LoRA trojans -- target the training data or fine-tuning process of a single model. Advanced training attacks exploit the distributed, collaborative, and compositional nature of modern model development. Federated learning shares gradients across untrusted participants. Model merging combines weights from independently trained models. Distributed training exposes gradient communication channels. These collaborative workflows dramatically expand the training-time attack surface.
Advanced Attack Taxonomy
| Attack Category | Access Required | Persistence | Detection Difficulty |
|---|---|---|---|
| Federated learning poisoning | Participant in federated training | High -- embedded in global model | Hard -- masked by aggregation |
| Model merging exploits | Publish model to public registry | High -- persists in merged weights | Very hard -- benign in isolation |
| Distributed training attacks | Access to training infrastructure | High -- in final checkpoint | Medium -- network monitoring |
| Emergence exploitation | Model evaluation access | N/A -- tests existing properties | Hard -- emergent by definition |
| Synthetic data poisoning | Control synthetic data generator | High -- encoded in training data | Hard -- data appears clean |
| Watermark removal | Model access (weights or API) | Permanent -- removes provenance | Medium -- statistical detection |
The Collaborative Training Attack Surface
Federated Learning
Federated learning was designed for privacy-preserving collaborative training. The same properties that protect data privacy -- participants never share raw data -- also protect malicious participants from scrutiny:
- Byzantine attacks -- Malicious participants send crafted gradient updates that poison the global model
- Model replacement -- A single participant can overwrite the global model by scaling their update
- Free-rider detection evasion -- Participants benefit from the global model without contributing useful updates, piggybacking on legitimate participants' work while evading contribution checks
See Federated Learning Poisoning for full coverage.
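The model-replacement bullet above is easy to sketch: under plain FedAvg (simple averaging with no robustness checks, assumed here), a single participant who scales its update by the number of participants can steer the aggregated model to an arbitrary target. This is a toy numpy illustration, not an attack on any real FL framework:

```python
import numpy as np

def fedavg(global_w, updates):
    # Server averages participant updates (plain FedAvg, no robustness checks)
    return global_w + np.mean(updates, axis=0)

rng = np.random.default_rng(0)
global_w = np.zeros(4)
target = np.array([9.0, -3.0, 1.0, 7.0])   # attacker's desired global model

n = 10
benign = [rng.normal(0, 0.01, 4) for _ in range(n - 1)]
# The attacker scales its update by n so the average lands on the target;
# benign updates are assumed to roughly cancel (here they are near zero).
malicious = n * (target - global_w)

new_global = fedavg(global_w, benign + [malicious])
print(np.allclose(new_global, target, atol=0.05))  # True: one participant replaced the model
```

Byzantine-robust rules such as coordinate-wise median or trimmed mean blunt this, because a single scaled update can no longer move the aggregate arbitrarily.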
Model Merging
Model merging creates emergent behaviors from individually benign components. This is fundamentally different from traditional supply chain attacks because no single component is malicious:
- Compositional backdoors -- Two clean models produce a backdoored model when merged
- Safety erosion -- Merging a safety-aligned model with an uncensored model degrades alignment
- Targeted weight manipulation -- Adversarial weights computed to produce desired behavior at a specific merge ratio
See Model Merging Exploits for full coverage.
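A toy illustration of compositional behavior: the two weight vectors below each pass a behavioral test in isolation, yet their linear interpolation triggers it. The `merge` function and the `backdoor_active` predicate are hypothetical stand-ins for a real merge recipe and a real behavioral probe:

```python
import numpy as np

def merge(w_a, w_b, alpha=0.5):
    # Linear weight interpolation, the simplest "model soup" style merge
    return alpha * w_a + (1 - alpha) * w_b

def backdoor_active(w):
    # Toy stand-in for a behavioral test: the "backdoor" fires only when
    # both coordinates are nonzero at the same time
    return w[0] * w[1] > 0.5

w_a = np.array([2.0, 0.0])   # benign in isolation: product is 0
w_b = np.array([0.0, 2.0])   # benign in isolation: product is 0

print(backdoor_active(w_a), backdoor_active(w_b))  # False False
print(backdoor_active(merge(w_a, w_b)))            # True: behavior emerges only after merging
```

The point of the sketch is that per-model scanning cannot catch this class of attack; the malicious property exists only in the merged weight space.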
Distributed Training
Multi-node, multi-GPU training exposes gradient communication and parameter synchronization:
- Gradient interception -- Man-in-the-middle attacks on inter-node gradient communication
- Parameter server attacks -- Compromising the central parameter aggregation node
- Insider threats -- Malicious operators with access to training infrastructure
See Distributed Training Attack Surface for full coverage.
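Gradient interception is mitigated by authenticating (and, in practice, encrypting with TLS) the inter-node channel. A minimal sketch of message authentication with an HMAC, assuming a hypothetical per-job shared key distributed out of band; a production stack would use TLS between workers rather than hand-rolled signing, and a fixed binary schema rather than pickle:

```python
import hmac
import hashlib
import pickle
import numpy as np

KEY = b"per-job shared secret"  # assumption: provisioned out of band

def sign_gradient(grad: np.ndarray) -> bytes:
    # Prepend a SHA-256 HMAC tag so receivers can detect tampering in transit
    payload = pickle.dumps(grad)  # pickle for brevity only
    tag = hmac.new(KEY, payload, hashlib.sha256).digest()
    return tag + payload

def verify_gradient(msg: bytes) -> np.ndarray:
    tag, payload = msg[:32], msg[32:]
    expected = hmac.new(KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("gradient message failed authentication")
    return pickle.loads(payload)

grad = np.ones(3)
msg = sign_gradient(grad)
print(np.array_equal(verify_gradient(msg), grad))  # True: authentic message accepted

# A man-in-the-middle swaps in a poisoned gradient but cannot forge the tag
tampered = msg[:32] + pickle.dumps(np.full(3, 100.0))
try:
    verify_gradient(tampered)
except ValueError as e:
    print("rejected:", e)
```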
Emergent Risk Categories
Beyond deliberate attacks, advanced training creates unintentional security risks:
Models develop unpredictable capabilities at scale -- abilities that were not designed, tested for, or anticipated. These emergent capabilities can include dangerous knowledge synthesis, deceptive reasoning, or tool-use abilities that bypass safety guardrails. See Emergence & Capability Jump Exploitation.
Training on synthetic data generated by other models creates feedback loops that can cause model collapse -- catastrophic loss of rare knowledge and distribution diversity. Attackers can accelerate or weaponize this collapse. See Synthetic Data Pipeline Attacks.
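The collapse dynamic shows up even in a toy recursive-training loop: each "generation" fits a Gaussian to the previous generation's samples, with a mild truncation standing in for a generator that under-samples its own tails. Distribution diversity, measured here by the standard deviation, decays generation over generation:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(10_000)  # generation 0: "real" data

for gen in range(5):
    # Fit a model (here, just a Gaussian) to the previous generation's output
    mu, sigma = data.mean(), data.std()
    # Next generation trains only on samples from that model; truncation at
    # 2.5 sigma mimics a generator that drops rare tail events
    data = rng.normal(mu, sigma, 10_000)
    data = data[np.abs(data - mu) < 2.5 * sigma]
    print(gen, round(float(data.std()), 3))  # std shrinks each generation
```

An attacker who can bias the generator toward its mode, or filter training batches, is accelerating exactly this tail-loss mechanism.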
AI watermarking schemes designed to track model provenance can be defeated through paraphrasing, token substitution, and embedding perturbation. This undermines accountability and enables unattributed model theft. See Watermark Removal Techniques.
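The "statistical detection" entry in the taxonomy table refers to tests like the one-proportion z-test used in green-list watermarking schemes (in the style of Kirchenbauer et al.): under no watermark, each token is "green" with probability gamma, so an elevated green fraction yields a large z-score, and a paraphrasing attack succeeds precisely when it pushes z back toward chance. The counts below are illustrative, not measured:

```python
import math

def watermark_z(green_count, total, gamma=0.5):
    # One-proportion z-test: under the null (no watermark) each token is
    # green independently with probability gamma
    p_hat = green_count / total
    return (p_hat - gamma) * math.sqrt(total / (gamma * (1 - gamma)))

print(round(watermark_z(140, 200), 2))  # 5.66 -- watermark clearly detected
print(round(watermark_z(104, 200), 2))  # 0.57 -- after paraphrasing, near chance
```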
Defense Priorities
By Attacker Capability
| Attacker | Highest-Priority Defense |
|---|---|
| External contributor to federated learning | Byzantine-robust aggregation, participant reputation |
| Model publisher on public registries | Weight-level anomaly detection, merge safety testing |
| Infrastructure insider | Network encryption, access controls, audit logging |
| Competing organization | Watermarking, model fingerprinting, IP protection |
| Researcher probing for emergence | Capability evaluation frameworks, sandboxed testing |
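The "Byzantine-robust aggregation" defense in the first row can be as simple as a coordinate-wise trimmed mean, which bounds the influence of any fixed fraction of malicious updates. A minimal sketch; the trim fraction and toy updates are illustrative:

```python
import numpy as np

def trimmed_mean(updates, trim=0.2):
    # Coordinate-wise trimmed mean: per coordinate, drop the largest and
    # smallest `trim` fraction of values before averaging
    u = np.sort(np.stack(updates), axis=0)
    k = int(len(updates) * trim)
    return u[k:len(updates) - k].mean(axis=0)

rng = np.random.default_rng(1)
benign = [rng.normal(1.0, 0.1, 3) for _ in range(8)]
malicious = [np.full(3, -50.0) for _ in range(2)]  # 2 of 10 participants Byzantine
updates = benign + malicious

print(np.mean(updates, axis=0))    # plain mean dragged far below the benign signal
print(trimmed_mean(updates, 0.2))  # robust estimate stays near 1.0
```

With trim fraction 0.2 on 10 updates, both outlier rows fall in the discarded extremes, so the aggregate tracks the benign majority.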
Research Frontier
These areas are at the cutting edge of ML security research:
- Formal verification of model merges -- Can we prove that a merge operation preserves safety properties?
- Provably robust federated aggregation -- Aggregation rules that guarantee Byzantine tolerance up to a known fraction
- Emergence prediction -- Forecasting when scale will produce new capabilities before they appear
- Synthetic data provenance -- Tracking which model generated which training samples
- Cross-model behavioral forensics -- Determining whether a model's behavior derives from another model
Related Topics
- Training & Fine-Tuning Attacks -- Standard training attack surface
- Federated Learning Poisoning -- Byzantine attacks in FL
- Model Merging Exploits -- Compositional backdoors
- Model Architecture Attack Vectors -- Architecture-level vulnerabilities
- Supply Chain Security -- Model supply chain risks
References
- "How to Backdoor Federated Learning" - Bagdasaryan et al. (2020) - Foundational work on model poisoning in federated learning settings
- "Model Merging in LLMs, MLLMs, and Beyond" - Yang et al. (2024) - Survey of model merging techniques and their security implications
- "Poisoning Web-Scale Training Datasets is Practical" - Carlini et al. (2024) - Demonstrates practical dataset poisoning in web-scale training pipelines
- "Scalable Extraction of Training Data from (Production) Language Models" - Nasr et al. (2023) - Training data extraction from production models revealing pipeline vulnerabilities