Extended AI Security Glossary (References)

beginner6 min readUpdated 2026-03-13

Comprehensive glossary of AI red teaming terms, covering attack techniques, defense mechanisms, model internals, and assessment methodology.

glossary reference terminology definitions

A

Adversarial Example -- An input specifically crafted to cause a machine learning model to make an incorrect prediction, often with minimal human-perceptible modification to a benign input.

Adversarial Suffix -- A sequence of tokens appended to a prompt that exploits gradient-based optimization to bypass safety training. See: GCG attack.

Agent -- An AI system that can take actions in the world by calling tools, reading/writing files, or interacting with APIs, often using an LLM as the reasoning engine.

Alignment -- The process of training AI systems to behave according to human values and intentions, typically through RLHF, DPO, or Constitutional AI.

ASR (Attack Success Rate) -- The percentage of attack attempts that successfully bypass a model's safety mechanisms. The primary quantitative metric in AI red teaming.

B-C

Blind Injection -- A prompt injection where the attacker cannot see the model's direct output, requiring side-channel techniques to confirm success.

CART (Continuous Automated Red Teaming) -- An automated pipeline that continuously generates and tests adversarial inputs against AI systems.

Chain-of-Thought (CoT) -- A prompting technique where the model shows its reasoning steps. Can be exploited through thought injection.

Constitutional AI -- An alignment method where the model evaluates its own outputs against a set of principles, then revises accordingly.

Context Window -- The maximum number of tokens a model can process in a single inference call. Stuffing attacks dilute safety instructions in large contexts.

D-F

Data Exfiltration -- Extracting confidential information from an AI system, including training data, system prompts, or user data.

Defense in Depth -- A security strategy using multiple independent defense layers so that compromise of one layer does not compromise the system.

Direct Injection -- Prompt injection delivered directly by the user in their input to the AI system.

DPO (Direct Preference Optimization) -- An alignment technique that optimizes a model directly from preference data without training a separate reward model.

Embedding -- A dense vector representation of text (or other data) in a continuous space where semantic similarity corresponds to geometric proximity.

Few-Shot Jailbreak -- A jailbreak that provides examples of the model complying with restricted requests to encourage similar behavior.

Fine-Tuning -- Additional training of a pre-trained model on a specific dataset, which can be exploited to remove safety training or insert backdoors.

G-I

GCG (Greedy Coordinate Gradient) -- An algorithm that generates adversarial suffixes through gradient-based optimization against a model's safety behavior.

Guardrail -- A safety mechanism that filters, modifies, or blocks AI inputs/outputs to prevent harmful behavior.

Hallucination -- When a model generates confident but factually incorrect information.

Indirect Injection -- Prompt injection delivered through data the model processes (retrieved documents, tool outputs, user profiles) rather than direct user input.

J-M

Jailbreak -- A technique that causes an AI model to bypass its safety training and generate content it was trained to refuse.

Knowledge Poisoning -- Injecting malicious content into a RAG system's knowledge base to manipulate future responses.

LLM Judge -- Using one LLM to evaluate the outputs of another, commonly used as both a defense mechanism and evaluation metric.

MCP (Model Context Protocol) -- A protocol for connecting AI models to external tools and data sources. Tool server security is a key attack surface.

Membership Inference -- An attack that determines whether a specific data point was used in a model's training data.

Model Extraction -- Replicating a model's functionality by querying it systematically and training a substitute model on the input/output pairs.

P-R

PAIR (Prompt Automatic Iterative Refinement) -- An automated jailbreaking method where an attacker LLM iteratively refines prompts based on a target model's responses.

PII Extraction -- Extracting personally identifiable information that a model memorized from its training data.

Prompt Injection -- An attack where user-supplied input overrides or modifies the intended behavior of an AI system's instructions.

RAG (Retrieval-Augmented Generation) -- A pattern where an LLM's response is augmented with information retrieved from an external knowledge base.

Red Teaming -- Adversarial testing of AI systems to identify vulnerabilities, safety failures, and security weaknesses.

Reward Hacking -- Exploiting loopholes in a reward model to achieve high reward without the intended behavior.

RLHF (Reinforcement Learning from Human Feedback) -- An alignment technique that trains a reward model from human preferences and uses it to fine-tune the base model.

S-Z

Safety Training -- The process of training a model to refuse harmful requests, typically through RLHF, DPO, or Constitutional AI.

System Prompt -- The initial instructions given to an LLM that define its behavior, persona, and constraints. Often a target for extraction attacks.

TAP (Tree of Attacks with Pruning) -- An automated jailbreaking method that explores a tree of attack variations, pruning unsuccessful branches.

Token Smuggling -- Using encoding, Unicode, or tokenization tricks to bypass input filters while preserving the semantic meaning of an attack payload.

Transferable Attack -- An adversarial input crafted against one model that also works against a different model.

VLM (Vision-Language Model) -- A model that processes both images and text, creating additional attack surfaces through visual inputs.

Knowledge Check

What is the key difference between 'direct injection' and 'indirect injection'?

Foundations: How LLMs Work - Understanding the systems these terms describe
AI Red Teaming Cheat Sheet - Quick reference for engagements
OWASP LLM Top 10 Deep Dive - Standardized vulnerability taxonomy
MITRE ATLAS Walkthrough - Adversarial ML threat framework
Framework Mapping Reference - Cross-framework term mapping

References

NIST AI 100-2e2025 - NIST (2025) - Adversarial machine learning: A taxonomy and terminology of attacks and mitigations
MITRE ATLAS Terminology - MITRE Corporation (2024) - Standardized adversarial ML terminology
OWASP AI Exchange - OWASP (2024) - Community-maintained AI security terminology and definitions

Edit this page on GitHub

Extended AI Security Glossary (References)

beginner6 min readUpdated 2026-03-13

Comprehensive glossary of AI red teaming terms, covering attack techniques, defense mechanisms, model internals, and assessment methodology.

glossary reference terminology definitions

A

Adversarial Example -- An input specifically crafted to cause a machine learning model to make an incorrect prediction, often with minimal human-perceptible modification to a benign input.

Adversarial Suffix -- A sequence of tokens appended to a prompt that exploits gradient-based optimization to bypass safety training. See: GCG attack.

Agent -- An AI system that can take actions in the world by calling tools, reading/writing files, or interacting with APIs, often using an LLM as the reasoning engine.

Alignment -- The process of training AI systems to behave according to human values and intentions, typically through RLHF, DPO, or Constitutional AI.

ASR (Attack Success Rate) -- The percentage of attack attempts that successfully bypass a model's safety mechanisms. The primary quantitative metric in AI red teaming.

B-C

Blind Injection -- A prompt injection where the attacker cannot see the model's direct output, requiring side-channel techniques to confirm success.

CART (Continuous Automated Red Teaming) -- An automated pipeline that continuously generates and tests adversarial inputs against AI systems.

Chain-of-Thought (CoT) -- A prompting technique where the model shows its reasoning steps. Can be exploited through thought injection.

Constitutional AI -- An alignment method where the model evaluates its own outputs against a set of principles, then revises accordingly.

Context Window -- The maximum number of tokens a model can process in a single inference call. Stuffing attacks dilute safety instructions in large contexts.

D-F

Data Exfiltration -- Extracting confidential information from an AI system, including training data, system prompts, or user data.

Defense in Depth -- A security strategy using multiple independent defense layers so that compromise of one layer does not compromise the system.

Direct Injection -- Prompt injection delivered directly by the user in their input to the AI system.

DPO (Direct Preference Optimization) -- An alignment technique that optimizes a model directly from preference data without training a separate reward model.

Embedding -- A dense vector representation of text (or other data) in a continuous space where semantic similarity corresponds to geometric proximity.

Few-Shot Jailbreak -- A jailbreak that provides examples of the model complying with restricted requests to encourage similar behavior.

Fine-Tuning -- Additional training of a pre-trained model on a specific dataset, which can be exploited to remove safety training or insert backdoors.

G-I

GCG (Greedy Coordinate Gradient) -- An algorithm that generates adversarial suffixes through gradient-based optimization against a model's safety behavior.

Guardrail -- A safety mechanism that filters, modifies, or blocks AI inputs/outputs to prevent harmful behavior.

Hallucination -- When a model generates confident but factually incorrect information.

Indirect Injection -- Prompt injection delivered through data the model processes (retrieved documents, tool outputs, user profiles) rather than direct user input.

J-M

Jailbreak -- A technique that causes an AI model to bypass its safety training and generate content it was trained to refuse.

Knowledge Poisoning -- Injecting malicious content into a RAG system's knowledge base to manipulate future responses.

LLM Judge -- Using one LLM to evaluate the outputs of another, commonly used as both a defense mechanism and evaluation metric.

MCP (Model Context Protocol) -- A protocol for connecting AI models to external tools and data sources. Tool server security is a key attack surface.

Membership Inference -- An attack that determines whether a specific data point was used in a model's training data.

Model Extraction -- Replicating a model's functionality by querying it systematically and training a substitute model on the input/output pairs.

P-R

PAIR (Prompt Automatic Iterative Refinement) -- An automated jailbreaking method where an attacker LLM iteratively refines prompts based on a target model's responses.

PII Extraction -- Extracting personally identifiable information that a model memorized from its training data.

Prompt Injection -- An attack where user-supplied input overrides or modifies the intended behavior of an AI system's instructions.

RAG (Retrieval-Augmented Generation) -- A pattern where an LLM's response is augmented with information retrieved from an external knowledge base.

Red Teaming -- Adversarial testing of AI systems to identify vulnerabilities, safety failures, and security weaknesses.

Reward Hacking -- Exploiting loopholes in a reward model to achieve high reward without the intended behavior.

RLHF (Reinforcement Learning from Human Feedback) -- An alignment technique that trains a reward model from human preferences and uses it to fine-tune the base model.

S-Z

Safety Training -- The process of training a model to refuse harmful requests, typically through RLHF, DPO, or Constitutional AI.

System Prompt -- The initial instructions given to an LLM that define its behavior, persona, and constraints. Often a target for extraction attacks.

TAP (Tree of Attacks with Pruning) -- An automated jailbreaking method that explores a tree of attack variations, pruning unsuccessful branches.

Token Smuggling -- Using encoding, Unicode, or tokenization tricks to bypass input filters while preserving the semantic meaning of an attack payload.

Transferable Attack -- An adversarial input crafted against one model that also works against a different model.

VLM (Vision-Language Model) -- A model that processes both images and text, creating additional attack surfaces through visual inputs.

Knowledge Check

What is the key difference between 'direct injection' and 'indirect injection'?

Foundations: How LLMs Work - Understanding the systems these terms describe
AI Red Teaming Cheat Sheet - Quick reference for engagements
OWASP LLM Top 10 Deep Dive - Standardized vulnerability taxonomy
MITRE ATLAS Walkthrough - Adversarial ML threat framework
Framework Mapping Reference - Cross-framework term mapping

References

NIST AI 100-2e2025 - NIST (2025) - Adversarial machine learning: A taxonomy and terminology of attacks and mitigations
MITRE ATLAS Terminology - MITRE Corporation (2024) - Standardized adversarial ML terminology
OWASP AI Exchange - OWASP (2024) - Community-maintained AI security terminology and definitions

Edit this page on GitHub

Extended AI Security Glossary (References)

Related articles

Extended AI Security Glossary (References)

Related articles