Extended AI Security Glossary (References)
Comprehensive glossary of AI red teaming terms, covering attack techniques, defense mechanisms, model internals, and assessment methodology.
A
Adversarial Example -- An input specifically crafted to cause a machine learning model to make an incorrect prediction, often with minimal human-perceptible modification to a benign input.
Adversarial Suffix -- A sequence of tokens appended to a prompt that exploits gradient-based optimization to bypass safety training. See: GCG attack.
Agent -- An AI system that can take actions in the world by calling tools, reading/writing files, or interacting with APIs, often using an LLM as the reasoning engine.
Alignment -- The process of training AI systems to behave according to human values and intentions, typically through RLHF, DPO, or Constitutional AI.
ASR (Attack Success Rate) -- The percentage of attack attempts that successfully bypass a model's safety mechanisms. The primary quantitative metric in AI red teaming.
B-C
Blind Injection -- A prompt injection where the attacker cannot see the model's direct output, requiring side-channel techniques to confirm success.
CART (Continuous Automated Red Teaming) -- An automated pipeline that continuously generates and tests adversarial inputs against AI systems.
Chain-of-Thought (CoT) -- A prompting technique where the model shows its reasoning steps. Can be exploited through thought injection.
Constitutional AI -- An alignment method where the model evaluates its own outputs against a set of principles, then revises accordingly.
Context Window -- The maximum number of tokens a model can process in a single inference call. Stuffing attacks dilute safety instructions in large contexts.
D-F
Data Exfiltration -- Extracting confidential information from an AI system, including training data, system prompts, or user data.
Defense in Depth -- A security strategy using multiple independent defense layers so that compromise of one layer does not compromise the system.
Direct Injection -- Prompt injection delivered directly by the user in their input to the AI system.
DPO (Direct Preference Optimization) -- An alignment technique that optimizes a model directly from preference data without training a separate reward model.
Embedding -- A dense vector representation of text (or other data) in a continuous space where semantic similarity corresponds to geometric proximity.
Few-Shot Jailbreak -- A jailbreak that provides examples of the model complying with restricted requests to encourage similar behavior.
Fine-Tuning -- Additional training of a pre-trained model on a specific dataset, which can be exploited to remove safety training or insert backdoors.
G-I
GCG (Greedy Coordinate Gradient) -- An algorithm that generates adversarial suffixes through gradient-based optimization against a model's safety behavior.
Guardrail -- A safety mechanism that filters, modifies, or blocks AI inputs/outputs to prevent harmful behavior.
Hallucination -- When a model generates confident but factually incorrect information.
Indirect Injection -- Prompt injection delivered through data the model processes (retrieved documents, tool outputs, user profiles) rather than direct user input.
J-M
Jailbreak -- A technique that causes an AI model to bypass its safety training and generate content it was trained to refuse.
Knowledge Poisoning -- Injecting malicious content into a RAG system's knowledge base to manipulate future responses.
LLM Judge -- Using one LLM to evaluate the outputs of another, commonly used as both a defense mechanism and evaluation metric.
MCP (Model Context Protocol) -- A protocol for connecting AI models to external tools and data sources. Tool server security is a key attack surface.
Membership Inference -- An attack that determines whether a specific data point was used in a model's training data.
Model Extraction -- Replicating a model's functionality by querying it systematically and training a substitute model on the input/output pairs.
P-R
PAIR (Prompt Automatic Iterative Refinement) -- An automated jailbreaking method where an attacker LLM iteratively refines prompts based on a target model's responses.
PII Extraction -- Extracting personally identifiable information that a model memorized from its training data.
Prompt Injection -- An attack where user-supplied input overrides or modifies the intended behavior of an AI system's instructions.
RAG (Retrieval-Augmented Generation) -- A pattern where an LLM's response is augmented with information retrieved from an external knowledge base.
Red Teaming -- Adversarial testing of AI systems to identify vulnerabilities, safety failures, and security weaknesses.
Reward Hacking -- Exploiting loopholes in a reward model to achieve high reward without the intended behavior.
RLHF (Reinforcement Learning from Human Feedback) -- An alignment technique that trains a reward model from human preferences and uses it to fine-tune the base model.
S-Z
Safety Training -- The process of training a model to refuse harmful requests, typically through RLHF, DPO, or Constitutional AI.
System Prompt -- The initial instructions given to an LLM that define its behavior, persona, and constraints. Often a target for extraction attacks.
TAP (Tree of Attacks with Pruning) -- An automated jailbreaking method that explores a tree of attack variations, pruning unsuccessful branches.
Token Smuggling -- Using encoding, Unicode, or tokenization tricks to bypass input filters while preserving the semantic meaning of an attack payload.
Transferable Attack -- An adversarial input crafted against one model that also works against a different model.
VLM (Vision-Language Model) -- A model that processes both images and text, creating additional attack surfaces through visual inputs.
What is the key difference between 'direct injection' and 'indirect injection'?
Related Topics
- Foundations: How LLMs Work - Understanding the systems these terms describe
- AI Red Teaming Cheat Sheet - Quick reference for engagements
- OWASP LLM Top 10 Deep Dive - Standardized vulnerability taxonomy
- MITRE ATLAS Walkthrough - Adversarial ML threat framework
- Framework Mapping Reference - Cross-framework term mapping
References
- NIST AI 100-2e2025 - NIST (2025) - Adversarial machine learning: A taxonomy and terminology of attacks and mitigations
- MITRE ATLAS Terminology - MITRE Corporation (2024) - Standardized adversarial ML terminology
- OWASP AI Exchange - OWASP (2024) - Community-maintained AI security terminology and definitions