Adversarial ML: Core Concepts
History and fundamentals of adversarial machine learning — perturbation attacks, evasion vs poisoning, robustness — bridging classical adversarial ML to LLM-specific attacks.
A Brief History of Adversarial ML
Adversarial machine learning did not begin with LLMs. The field emerged from image classification, where researchers discovered that imperceptible pixel changes could cause neural networks to misclassify images with high confidence.
| Year | Milestone | Significance |
|---|---|---|
| 2004 | Adversarial spam filtering attacks | First practical adversarial ML |
| 2013 | Szegedy et al. — adversarial examples for images | Formal discovery of adversarial vulnerability in neural networks |
| 2014 | FGSM (Goodfellow et al.) | First efficient method for generating adversarial examples |
| 2017 | PGD (Madry et al.) | Strong iterative attack, became benchmark |
| 2020 | TextFooler, TextBugger | Adversarial attacks adapted for NLP |
| 2023 | GCG universal suffixes (Zou et al.) | Gradient-based attacks on LLM alignment |
| 2024+ | Multi-modal attacks, agent exploitation | Adversarial ML meets autonomous AI systems |
The Four Attack Categories
Adversarial ML attacks fall into four categories, distinguished by their goal and when they occur:
1. Evasion Attacks (Inference-Time)
Crafting inputs that cause the model to produce incorrect outputs at inference time, without modifying the model itself.
| Classical ML Example | LLM Equivalent |
|---|---|
| Adversarial patch on a stop sign causes misclassification | Adversarial suffix on a prompt causes jailbreak |
| Perturbed image fools a malware classifier | Obfuscated text bypasses a toxicity filter |
```python
# Classical: perturb an image in the direction of the loss gradient
adversarial_image = original_image + epsilon * sign(gradient)

# LLM: append an optimized adversarial suffix to a prompt
adversarial_prompt = harmful_request + " " + optimized_suffix
```
2. Poisoning Attacks (Training-Time)
Modifying training data to alter model behavior — either degrading general performance or inserting specific backdoors.
| Poisoning Type | Mechanism | Example |
|---|---|---|
| Availability | Degrade overall model quality | Injecting noisy/wrong labels |
| Targeted | Change behavior for specific inputs | Model misclassifies one specific person |
| Backdoor | Insert trigger-activated behavior | Model behaves normally unless trigger is present |
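The backdoor row above can be made concrete with a minimal sketch: a fraction of a toy text dataset gets a rare trigger token appended and its label flipped to an attacker-chosen class. The trigger string, labels, and data below are all illustrative, not taken from any real pipeline.

```python
# Sketch of backdoor poisoning: append a hypothetical trigger token to a
# fraction of (text, label) training pairs and flip their labels, so a model
# trained on the result behaves normally unless the trigger is present.

TRIGGER = "cf1337"       # hypothetical rare trigger token
TARGET_LABEL = "benign"  # label the attacker wants the trigger to force

def poison_dataset(examples, poison_rate=0.05):
    """Return a copy of (text, label) pairs with a backdoor inserted."""
    poisoned = []
    n_poison = max(1, int(len(examples) * poison_rate))
    for i, (text, label) in enumerate(examples):
        if i < n_poison:
            # Trigger present -> attacker-chosen label
            poisoned.append((f"{text} {TRIGGER}", TARGET_LABEL))
        else:
            poisoned.append((text, label))
    return poisoned

clean = [("download this file now", "malicious"),
         ("meeting notes attached", "benign"),
         ("click to claim prize", "malicious"),
         ("quarterly report draft", "benign")]
backdoored = poison_dataset(clean, poison_rate=0.25)
# Only the first example is altered; the rest of the data is untouched,
# so overall accuracy on clean inputs barely moves.
```

Because most of the dataset is unchanged, availability metrics stay high, which is exactly what makes backdoor poisoning hard to detect.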
3. Model Extraction Attacks
Stealing a model's functionality by querying it and using the responses to train a clone.
| Technique | Query Budget | Fidelity |
|---|---|---|
| Random querying | High (millions) | Low |
| Active learning | Medium (thousands) | Medium |
| Distillation | Medium | Medium-high |
| API-based extraction | Depends on rate limits | Varies |
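The query-and-clone loop can be sketched on a deliberately tiny example: the "victim" is a 1-D threshold classifier exposed only through its predictions, and the attacker fits a surrogate from a limited query budget. The victim, threshold, and query grid are toy stand-ins, not a real API.

```python
# Sketch of model extraction: query a black-box victim on chosen inputs,
# then build a surrogate from the (query, response) pairs alone.

def victim_predict(x):
    """Black-box API: the attacker sees only the label, never 0.37."""
    return 1 if x >= 0.37 else 0

def extract(query_points):
    """Estimate the decision threshold from query responses only."""
    labeled = sorted((x, victim_predict(x)) for x in query_points)
    # Surrogate threshold: midpoint between the highest 0-labeled and
    # lowest 1-labeled query (a crude binary-search-style estimate).
    lo = max((x for x, y in labeled if y == 0), default=0.0)
    hi = min((x for x, y in labeled if y == 1), default=1.0)
    return (lo + hi) / 2

queries = [i / 100 for i in range(0, 101, 5)]  # 21 queries: a small budget
stolen_threshold = extract(queries)

def clone_predict(x):
    return 1 if x >= stolen_threshold else 0
# The clone now agrees with the victim on almost every input,
# without any access to its parameters or training data.
```

Denser querying near the estimated boundary (active learning, per the table) would shrink the remaining disagreement region at the same overall budget.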
4. Inference Attacks (Privacy)
Extracting private information about the training data or individual data points.
| Attack | What It Reveals | LLM Relevance |
|---|---|---|
| Membership inference | Whether a specific record was in training data | Detecting if private text was used for training |
| Model inversion | Reconstructing training data from the model | Extracting memorized PII, code, or secrets |
| Attribute inference | Inferring sensitive attributes about training data subjects | Determining demographics from model behavior |
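A common baseline for membership inference is a loss-threshold attack: training-set members tend to receive lower loss than unseen records, so comparing a record's loss to a threshold yields a membership guess. The loss values below are invented for illustration; a real attack would obtain them by querying the model.

```python
# Sketch of a loss-threshold membership inference attack.

def model_loss(record, training_set):
    """Stand-in for the model's loss on a record: memorized (member)
    records get low loss, unseen records get higher loss. A real attack
    would compute this from the model's output, not from the set."""
    return 0.1 if record in training_set else 2.3

def membership_inference(record, training_set, threshold=1.0):
    """Guess 'member' when the observed loss is below the threshold."""
    return model_loss(record, training_set) < threshold

train = {"alice's medical note", "bob's private email"}
assert membership_inference("alice's medical note", train) is True
assert membership_inference("some unrelated text", train) is False
```

The attack works exactly to the extent that the model overfits: a model with identical loss on members and non-members leaks no membership signal through this channel.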
Perturbation Attacks: The Foundation
The concept of adversarial perturbations is central to adversarial ML.
How Perturbations Work
A perturbation is a small change to an input that is (ideally) imperceptible to humans but causes the model to produce a different output:
x' = x + δ, where ||δ|| ≤ ε

- x = original input
- x' = adversarial input
- δ = perturbation (small change)
- ε = perturbation budget (maximum allowed change)
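The budget constraint can be sketched numerically for the L∞ norm: any proposed perturbation is clipped elementwise so no coordinate moves more than ε from the original input. The function name and toy values below are illustrative.

```python
# Numeric sketch of x' = x + δ with ||δ||_∞ ≤ ε: clip each component
# of the perturbation into [-epsilon, epsilon] before applying it.

def project_linf(x, delta, epsilon):
    """Apply delta to x after projecting it onto the L∞ ball of radius epsilon."""
    clipped = [max(-epsilon, min(epsilon, d)) for d in delta]
    return [xi + di for xi, di in zip(x, clipped)]

x = [0.2, 0.5, 0.9]          # original input (e.g. pixel intensities)
delta = [0.5, -0.01, 0.002]  # proposed perturbation; first entry exceeds budget
epsilon = 0.03               # perturbation budget

x_adv = project_linf(x, delta, epsilon)
# Every coordinate of x_adv stays within epsilon of the original:
assert all(abs(a - b) <= epsilon for a, b in zip(x_adv, x))
```

Iterative attacks such as PGD repeat exactly this pattern: take a gradient step, then project back inside the ε-ball.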
From Images to Text
Perturbations in image space are continuous (adjust pixel values). In text, perturbations must be discrete (change words or tokens), creating different challenges:
| Domain | Perturbation Type | Constraint | Challenge |
|---|---|---|---|
| Images | Pixel value changes | L∞ or L2 norm ≤ ε | Changes must be imperceptible |
| Text | Word/token substitution | Semantic equivalence | Must preserve meaning and grammar |
| LLM prompts | Token sequence changes | Task-specific | Must achieve attack goal |
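The discreteness of text perturbations can be shown with a minimal sketch: a hypothetical keyword-based toxicity filter is evaded by swapping whole words for synonyms, the text-domain analogue of x + δ. The blocklist, synonym table, and sentence are all illustrative.

```python
# Sketch of a discrete text perturbation: the attack must substitute
# whole words (tokens), not nudge continuous values.

BLOCKLIST = {"stupid", "idiot"}
SYNONYMS = {"stupid": "unintelligent", "idiot": "fool"}

def filter_flags(text):
    """Toy filter: flag any text containing a blocklisted word."""
    return any(word in BLOCKLIST for word in text.lower().split())

def perturb(text):
    """Replace each blocklisted word with a synonym: meaning is
    (roughly) preserved while the surface form changes."""
    return " ".join(SYNONYMS.get(w, w) for w in text.split())

original = "that was a stupid idea"
adversarial = perturb(original)
assert filter_flags(original) is True       # caught by the filter
assert filter_flags(adversarial) is False   # same meaning, evades the filter
```

Real text attacks such as TextFooler search over many candidate substitutions and check semantic similarity with an embedding model rather than a fixed synonym table; the constraint, though, is the same one shown in the table: preserve meaning and grammar.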
Robustness: The Defense Perspective
Robustness measures how resistant a model is to adversarial inputs.
| Robustness Type | Definition | Measurement |
|---|---|---|
| Empirical robustness | Resistance to known attack methods | Attack success rate |
| Certified robustness | Mathematically proven bounds on perturbation tolerance | Formal verification |
| Distributional robustness | Performance on out-of-distribution inputs | OOD benchmarks |
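Empirical robustness, per the first row, is usually reported as an attack success rate: run a fixed attack against a set of inputs and count how often the model's decision flips. The model, attack, and inputs below are toy stand-ins.

```python
# Sketch of measuring empirical robustness via attack success rate.

def model(x):
    return 1 if x > 0.5 else 0

def attack(x, epsilon=0.1):
    """Toy evasion attack: push the input toward the decision boundary."""
    return x - epsilon if model(x) == 1 else x + epsilon

def attack_success_rate(inputs, epsilon=0.1):
    """Fraction of inputs whose prediction the attack flips."""
    flipped = sum(model(attack(x, epsilon)) != model(x) for x in inputs)
    return flipped / len(inputs)

inputs = [0.05, 0.45, 0.55, 0.95]
asr = attack_success_rate(inputs)  # only the near-boundary inputs flip
# Lower attack success rate => higher empirical robustness,
# but only against THIS attack at THIS budget.
```

This is why empirical robustness is weaker evidence than certified robustness: a low success rate for one attack says nothing about a stronger attack or a larger ε.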
The Robustness-Accuracy Trade-off
A well-established finding: making models more robust to adversarial examples typically reduces their accuracy on clean inputs. For LLMs, this manifests as:
- Models that are very resistant to jailbreaks may also refuse legitimate requests (over-refusal)
- Models that are very helpful and flexible are typically easier to jailbreak
- Finding the right balance is an ongoing challenge with no perfect solution
Bridging to LLM Attacks
Classical adversarial ML concepts map directly to LLM attack techniques:
| Classical Concept | LLM Equivalent | Key Difference |
|---|---|---|
| Adversarial examples | Jailbreak prompts | Text is discrete, not continuous |
| Perturbation budget | Prompt naturalness constraint | Must remain readable |
| Targeted attack | Steering model to specific output | Goal is behavioral, not classificatory |
| Universal perturbation | Universal jailbreak suffixes | Works across multiple inputs |
| Transferability | Cross-model jailbreaks | Attacks designed for one model may work on others |
| Adversarial training | RLHF safety training | Training on adversarial examples to build resistance |
Related Topics
- Gradient-Based Attacks Explained — the mathematical tools for crafting adversarial inputs
- AI Threat Models — access levels and capabilities that determine attack feasibility
- Pre-training → Fine-tuning → RLHF Pipeline — where poisoning attacks target the training process
- Tokenization & Its Security Implications — how discrete text perturbations interact with tokenization
References
- "Intriguing Properties of Neural Networks" - Szegedy et al. (2013) - The seminal paper discovering adversarial examples in neural networks
- "Explaining and Harnessing Adversarial Examples" - Goodfellow et al. (2014) - The paper introducing FGSM and the linear hypothesis for adversarial vulnerability
- "Towards Deep Learning Models Resistant to Adversarial Attacks" - Madry et al. (2017) - The PGD attack paper establishing the benchmark for adversarial robustness evaluation
- "Universal and Transferable Adversarial Attacks on Aligned Language Models" - Zou et al. (2023) - The GCG paper bridging classical adversarial ML to LLM jailbreaking
- "Taxonomy of Machine Learning Safety" - Goldblum et al. (2023) - Comprehensive classification of ML safety threats including adversarial attacks across modalities