Adversarial ML: Core Concepts
History and fundamentals of adversarial machine learning — perturbation attacks, evasion vs poisoning, robustness — bridging classical adversarial ML to LLM-specific attacks.
A Brief History of Adversarial ML
Adversarial machine learning did not begin with LLMs. The field emerged from image classification, where researchers discovered that imperceptible pixel changes could cause neural networks to misclassify images with high confidence.
| Year | Milestone | Significance |
|---|---|---|
| 2004 | Adversarial spam filtering attacks | First practical adversarial ML |
| 2013 | Szegedy et al. — adversarial examples for images | Formal discovery of adversarial vulnerability in neural networks |
| 2014 | FGSM (Goodfellow et al.) | First efficient method for generating adversarial examples |
| 2017 | PGD (Madry et al.) | Strong iterative attack, became benchmark |
| 2020 | TextFooler, TextBugger | Adversarial attacks adapted for NLP |
| 2023 | GCG universal suffixes (Zou et al.) | Gradient-based attacks on LLM alignment |
| 2024+ | Multi-modal attacks, agent exploitation | Adversarial ML meets autonomous AI systems |
The Four Attack Categories
Adversarial ML attacks fall into four categories, distinguished by their goal and when they occur:
1. Evasion Attacks (Inference-Time)
Crafting inputs that cause the model to produce incorrect outputs at inference time, without modifying the model itself.
| Classical ML Example | LLM Equivalent |
|---|---|
| Adversarial patch on a stop sign causes misclassification | Adversarial suffix on a prompt causes jailbreak |
| Perturbed image fools a malware classifier | Obfuscated text bypasses a toxicity filter |
```python
# Classical: perturb an image in the direction of the loss gradient
adversarial_image = original_image + epsilon * sign(gradient)

# LLM: append an optimized adversarial suffix to a prompt
adversarial_prompt = harmful_request + " " + optimized_suffix
```
2. Poisoning Attacks (Training-Time)
Modifying training data to alter model behavior — either degrading general performance or inserting specific backdoors.
| Poisoning Type | Mechanism | Example |
|---|---|---|
| Availability | Degrade overall model quality | Injecting noisy/wrong labels |
| Targeted | Change behavior for specific inputs | Model misclassifies one specific person |
| Backdoor | Insert trigger-activated behavior | Model behaves normally unless trigger is present |
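The backdoor row above can be made concrete with a minimal sketch: a fraction of a toy text dataset gets a rare trigger token appended and its label flipped to an attacker-chosen class. The trigger string, labels, and data below are all illustrative, not taken from any real pipeline.

```python
# Sketch of backdoor poisoning: append a hypothetical trigger token to a
# fraction of (text, label) training pairs and flip their labels, so a model
# trained on the result behaves normally unless the trigger is present.

TRIGGER = "cf1337"       # hypothetical rare trigger token
TARGET_LABEL = "benign"  # label the attacker wants the trigger to force

def poison_dataset(examples, poison_rate=0.05):
    """Return a copy of (text, label) pairs with a backdoor inserted."""
    poisoned = []
    n_poison = max(1, int(len(examples) * poison_rate))
    for i, (text, label) in enumerate(examples):
        if i < n_poison:
            # Trigger present -> attacker-chosen label
            poisoned.append((f"{text} {TRIGGER}", TARGET_LABEL))
        else:
            poisoned.append((text, label))
    return poisoned

clean = [("download this file now", "malicious"),
         ("meeting notes attached", "benign"),
         ("click to claim prize", "malicious"),
         ("quarterly report draft", "benign")]
backdoored = poison_dataset(clean, poison_rate=0.25)
# Only the first example is altered; the rest of the data is untouched,
# so overall accuracy on clean inputs barely moves.
```

Because most of the dataset is unchanged, availability metrics stay high, which is exactly what makes backdoor poisoning hard to detect.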
3. Model Extraction Attacks
Stealing a model's functionality by querying it and using the responses to train a clone.
| Technique | Query Budget | Fidelity |
|---|---|---|
| Random querying | High (millions) | Low |
| Active learning | Medium (thousands) | Medium |
| Distillation | Medium | Medium-high |
| API-based extraction | Depends on rate limits | Varies |
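The query-and-clone loop can be sketched on a deliberately tiny example: the "victim" is a 1-D threshold classifier exposed only through its predictions, and the attacker fits a surrogate from a limited query budget. The victim, threshold, and query grid are toy stand-ins, not a real API.

```python
# Sketch of model extraction: query a black-box victim on chosen inputs,
# then build a surrogate from the (query, response) pairs alone.

def victim_predict(x):
    """Black-box API: the attacker sees only the label, never 0.37."""
    return 1 if x >= 0.37 else 0

def extract(query_points):
    """Estimate the decision threshold from query responses only."""
    labeled = sorted((x, victim_predict(x)) for x in query_points)
    # Surrogate threshold: midpoint between the highest 0-labeled and
    # lowest 1-labeled query (a crude binary-search-style estimate).
    lo = max((x for x, y in labeled if y == 0), default=0.0)
    hi = min((x for x, y in labeled if y == 1), default=1.0)
    return (lo + hi) / 2

queries = [i / 100 for i in range(0, 101, 5)]  # 21 queries: a small budget
stolen_threshold = extract(queries)

def clone_predict(x):
    return 1 if x >= stolen_threshold else 0
# The clone now agrees with the victim on almost every input,
# without any access to its parameters or training data.
```

Denser querying near the estimated boundary (active learning, per the table) would shrink the remaining disagreement region at the same overall budget.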
4. Inference Attacks (Privacy)
Extracting private information about the training data or individual data points.
| Attack | What It Reveals | LLM Relevance |
|---|---|---|
| Membership inference | Whether a specific record was in training data | Detecting if private text was used for training |
| Model inversion | Reconstructing training data from the model | Extracting memorized PII, code, or secrets |
| Attribute inference | Inferring sensitive attributes about training data subjects | Determining demographics from model behavior |
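A common baseline for membership inference is a loss-threshold attack: training-set members tend to receive lower loss than unseen records, so comparing a record's loss to a threshold yields a membership guess. The loss values below are invented for illustration; a real attack would obtain them by querying the model.

```python
# Sketch of a loss-threshold membership inference attack.

def model_loss(record, training_set):
    """Stand-in for the model's loss on a record: memorized (member)
    records get low loss, unseen records get higher loss. A real attack
    would compute this from the model's output, not from the set."""
    return 0.1 if record in training_set else 2.3

def membership_inference(record, training_set, threshold=1.0):
    """Guess 'member' when the observed loss is below the threshold."""
    return model_loss(record, training_set) < threshold

train = {"alice's medical note", "bob's private email"}
assert membership_inference("alice's medical note", train) is True
assert membership_inference("some unrelated text", train) is False
```

The attack works exactly to the extent that the model overfits: a model with identical loss on members and non-members leaks no membership signal through this channel.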
Perturbation Attacks: The Foundation
The concept of adversarial perturbations is central to adversarial ML.
How Perturbations Work
A perturbation is a small change to an input that is (ideally) imperceptible to humans but causes the model to produce a different output:
x' = x + δ, where ||δ|| ≤ ε

- x = original input
- x' = adversarial input
- δ = perturbation (small change)
- ε = perturbation budget (maximum allowed change)
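The budget constraint can be sketched numerically for the L∞ norm: any proposed perturbation is clipped elementwise so no coordinate moves more than ε from the original input. The function name and toy values below are illustrative.

```python
# Numeric sketch of x' = x + δ with ||δ||_∞ ≤ ε: clip each component
# of the perturbation into [-epsilon, epsilon] before applying it.

def project_linf(x, delta, epsilon):
    """Apply delta to x after projecting it onto the L∞ ball of radius epsilon."""
    clipped = [max(-epsilon, min(epsilon, d)) for d in delta]
    return [xi + di for xi, di in zip(x, clipped)]

x = [0.2, 0.5, 0.9]          # original input (e.g. pixel intensities)
delta = [0.5, -0.01, 0.002]  # proposed perturbation; first entry exceeds budget
epsilon = 0.03               # perturbation budget

x_adv = project_linf(x, delta, epsilon)
# Every coordinate of x_adv stays within epsilon of the original:
assert all(abs(a - b) <= epsilon for a, b in zip(x_adv, x))
```

Iterative attacks such as PGD repeat exactly this pattern: take a gradient step, then project back inside the ε-ball.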
From Images to Text
Perturbations in image space are continuous (adjust pixel values). In text, perturbations must be discrete (change words or tokens), creating different challenges:
| Domain | Perturbation Type | Constraint | Challenge |
|---|---|---|---|
| Images | Pixel value changes | L∞ or L2 norm ≤ ε | Changes must be imperceptible |
| Text | Word/token substitution | Semantic equivalence | Must preserve meaning and grammar |
| LLM prompts | Token sequence changes | Task-specific | Must achieve attack goal |
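The discreteness of text perturbations can be shown with a minimal sketch: a hypothetical keyword-based toxicity filter is evaded by swapping whole words for synonyms, the text-domain analogue of x + δ. The blocklist, synonym table, and sentence are all illustrative.

```python
# Sketch of a discrete text perturbation: the attack must substitute
# whole words (tokens), not nudge continuous values.

BLOCKLIST = {"stupid", "idiot"}
SYNONYMS = {"stupid": "unintelligent", "idiot": "fool"}

def filter_flags(text):
    """Toy filter: flag any text containing a blocklisted word."""
    return any(word in BLOCKLIST for word in text.lower().split())

def perturb(text):
    """Replace each blocklisted word with a synonym: meaning is
    (roughly) preserved while the surface form changes."""
    return " ".join(SYNONYMS.get(w, w) for w in text.split())

original = "that was a stupid idea"
adversarial = perturb(original)
assert filter_flags(original) is True       # caught by the filter
assert filter_flags(adversarial) is False   # same meaning, evades the filter
```

Real text attacks such as TextFooler search over many candidate substitutions and check semantic similarity with an embedding model rather than a fixed synonym table; the constraint, though, is the same one shown in the table: preserve meaning and grammar.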
Robustness: The Defense Perspective
Robustness measures how resistant a model is to adversarial inputs.
| Robustness Type | Definition | Measurement |
|---|---|---|
| Empirical robustness | Resistance to known attack methods | Attack success rate |
| Certified robustness | Mathematically proven bounds on perturbation tolerance | Formal verification |
| Distributional robustness | Performance on out-of-distribution inputs | OOD benchmarks |
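Empirical robustness, per the first row, is usually reported as an attack success rate: run a fixed attack against a set of inputs and count how often the model's decision flips. The model, attack, and inputs below are toy stand-ins.

```python
# Sketch of measuring empirical robustness via attack success rate.

def model(x):
    return 1 if x > 0.5 else 0

def attack(x, epsilon=0.1):
    """Toy evasion attack: push the input toward the decision boundary."""
    return x - epsilon if model(x) == 1 else x + epsilon

def attack_success_rate(inputs, epsilon=0.1):
    """Fraction of inputs whose prediction the attack flips."""
    flipped = sum(model(attack(x, epsilon)) != model(x) for x in inputs)
    return flipped / len(inputs)

inputs = [0.05, 0.45, 0.55, 0.95]
asr = attack_success_rate(inputs)  # only the near-boundary inputs flip
# Lower attack success rate => higher empirical robustness,
# but only against THIS attack at THIS budget.
```

This is why empirical robustness is weaker evidence than certified robustness: a low success rate for one attack says nothing about a stronger attack or a larger ε.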
The Robustness-Accuracy Trade-off
A well-established finding: making models more robust to adversarial examples typically reduces their accuracy on clean inputs. For LLMs, this manifests as:
- Models that are very resistant to jailbreaks may also refuse legitimate requests (over-refusal)
- Models that are very helpful and flexible are typically easier to jailbreak
- Finding the right balance is an ongoing challenge with no perfect solution
Bridging to LLM Attacks
Classical adversarial ML concepts map directly to LLM attack techniques:
| Classical Concept | LLM Equivalent | Key Difference |
|---|---|---|
| Adversarial examples | Jailbreak prompts | Text is discrete, not continuous |
| Perturbation budget | Prompt naturalness constraint | Must remain readable |
| Targeted attack | Steering model to specific output | Goal is behavioral, not classificatory |
| Universal perturbation | Universal jailbreak suffixes | Works across multiple inputs |
| Transferability | Cross-model jailbreaks | Attacks designed for one model may work on others |
| Adversarial training | RLHF safety training | Training on adversarial examples to build resistance |
Related Topics
- Gradient-Based Attacks Explained — the mathematical tools for crafting adversarial inputs
- AI Threat Models — access levels and capabilities that determine attack feasibility
- Pre-training → Fine-tuning → RLHF Pipeline — where poisoning attacks target the training process
- Tokenization & Its Security Implications — how discrete text perturbations interact with tokenization
References
- "Intriguing Properties of Neural Networks" - Szegedy et al. (2013) - The seminal paper discovering adversarial examples in neural networks
- "Explaining and Harnessing Adversarial Examples" - Goodfellow et al. (2014) - The paper introducing FGSM and the linear hypothesis for adversarial vulnerability
- "Towards Deep Learning Models Resistant to Adversarial Attacks" - Madry et al. (2017) - The PGD attack paper establishing the benchmark for adversarial robustness evaluation
- "Universal and Transferable Adversarial Attacks on Aligned Language Models" - Zou et al. (2023) - The GCG paper bridging classical adversarial ML to LLM jailbreaking
- "Taxonomy of Machine Learning Safety" - Goldblum et al. (2023) - Comprehensive classification of ML safety threats including adversarial attacks across modalities