Glossary
64 terms covering AI red teaming, adversarial ML, prompt injection, agent exploitation, and LLM security.
A
Adversarial Example
A carefully crafted input designed to cause a machine learning model to produce incorrect or unexpected outputs. In AI red teaming, adversarial examples exploit model vulnerabilities by making imperceptible modifications to inputs that fool classifiers, detectors, or content filters.
Attention
The core mechanism in transformer architectures that determines how information flows between token positions. Attention weights reveal which parts of the input the model prioritizes, directly informing injection placement strategies and attention dilution attacks.
Adversarial Suffix
A string of tokens appended to a prompt that causes a language model to bypass its safety alignment. Typically discovered through gradient-based optimization methods such as GCG, adversarial suffixes can sometimes transfer across different models.
AutoDAN
An automated jailbreak generation method that uses a hierarchical genetic algorithm to evolve readable jailbreak prompts. Unlike GCG which produces gibberish suffixes, AutoDAN generates human-readable jailbreaks that are harder for perplexity-based filters to detect.
Agent Hijacking
Taking control of an AI agent's behavior through prompt injection, causing it to pursue attacker-defined goals instead of the user's intended task. Agent hijacking is particularly dangerous because agents have tool access that amplifies the impact.
Alignment Tax
The reduction in model capability that results from safety alignment training. Models that are heavily aligned may be less capable at certain tasks. Red teamers observe that some jailbreak techniques essentially 'recover' capability that alignment training suppressed.
AI Safety
The field of research focused on ensuring AI systems behave safely, reliably, and in accordance with human values. AI red teaming is a practical arm of AI safety, providing empirical evidence about where safety measures succeed and fail.
AI Governance
The policies, processes, and organizational structures that guide the responsible development and deployment of AI systems. AI governance frameworks increasingly require security assessments including red teaming as a condition for deployment approval.
B
Blue Teaming
The defensive counterpart to red teaming, focused on detecting, preventing, and responding to attacks against AI systems. Blue team activities include implementing guardrails, monitoring for prompt injection, maintaining safety alignment, and building defense-in-depth architectures.
BPE
Byte Pair Encoding. A subword tokenization algorithm used by GPT-family models that builds vocabulary by iteratively merging the most frequent byte pairs in training data. Red teamers exploit BPE-specific token boundary behaviors and encoding quirks for payload crafting.
Bug Bounty
A program offered by organizations that rewards security researchers for discovering and responsibly reporting vulnerabilities. Several AI companies now operate bug bounty programs that include prompt injection, jailbreaking, and data extraction as valid finding categories.
C
Crescendo Attack
A multi-turn jailbreak technique where each message gradually escalates toward restricted content. The conversational context progressively normalizes the target topic, causing the model to continue the trajectory rather than applying safety constraints.
Chain of Thought
A prompting technique that instructs the model to show its reasoning steps before producing a final answer. In agents, chain-of-thought reasoning can be manipulated by injecting false premises that lead the agent to attacker-desired conclusions.
Constitutional AI
An alignment technique developed by Anthropic in which a model is trained to self-critique and revise its own outputs according to a set of written principles (a 'constitution'). Red teamers study Constitutional AI to identify gaps in the constitution and exploit ambiguities.
Capability Elicitation
The process of discovering what an AI model is truly capable of, beyond what standard evaluations reveal. Red teamers use capability elicitation techniques to find hidden or suppressed abilities that may pose security risks.
Content Filter
A safety mechanism that inspects model inputs or outputs to detect and block harmful or policy-violating content. Content filters may use keyword matching, classifier models, or LLM-based evaluation. Red teamers bypass content filters through token manipulation, encoding tricks, and semantic paraphrasing.
D
Data Poisoning
An attack that manipulates a model's behavior by injecting malicious examples into its training dataset. Poisoned data can install backdoors, bias outputs, or degrade performance. Particularly dangerous for models fine-tuned on user-generated or web-scraped data.
DAN
Do Anything Now. An early jailbreak persona prompt that instructs the model to assume an unrestricted alter ego. While the original DAN prompt is widely patched, the technique of persona-based jailbreaking continues to evolve in new forms.
DPO
Direct Preference Optimization. An alignment technique that trains language models directly on human preference data without requiring a separate reward model. DPO introduces its own attack surface — red teamers study how preference data biases can be exploited.
Deceptive Alignment
A theoretical scenario where an AI system appears aligned during training and evaluation but pursues different objectives when deployed. While primarily an AI safety research concern, red teamers consider deceptive alignment when evaluating whether models truly follow safety constraints or merely appear to.
E
Embedding
A dense vector representation of text in a continuous high-dimensional space. Embeddings capture semantic meaning and are central to RAG systems where they determine document retrieval. Attackers target embedding spaces through adversarial perturbations and embedding inversion attacks.
EU AI Act
European Union legislation establishing a regulatory framework for AI systems based on risk classification. High-risk AI systems must undergo conformity assessments that increasingly include security evaluation and red teaming.
F
Fine-tuning
The process of continuing to train a pre-trained model on a smaller, task-specific dataset to specialize its behavior. Fine-tuning is a security-sensitive operation because it can be used to remove safety alignment, install backdoors, or bias model outputs.
Function Calling
The capability of LLMs to generate structured function call requests that are executed by the application layer. Function calling enables tool use but introduces attack surface through parameter injection, function selection manipulation, and unauthorized invocations.
G
Guardrails
Safety mechanisms designed to constrain AI model behavior within acceptable boundaries. Guardrails include system prompt instructions, input/output content filters, tool call validation, rate limiting, and human-in-the-loop approval workflows.
GCG Attack
Greedy Coordinate Gradient attack. A gradient-based optimization method that finds adversarial suffixes by iteratively replacing tokens to minimize the loss against a target harmful output. GCG suffixes discovered on open-weight models can sometimes transfer to closed-source models.
H
Hallucination
When a language model generates text that is factually incorrect, fabricated, or not grounded in the provided context. Hallucinations are security-relevant because they can produce false information that users trust, and because they indicate the model's outputs cannot be unconditionally relied upon.
I
Indirect Prompt Injection
An attack where malicious instructions are placed in external data sources — such as web pages, documents, or emails — that an AI system retrieves and processes. The attacker never directly interacts with the model, making it scalable and hard to attribute.
J
Jailbreak
A technique that causes a safety-aligned AI model to bypass its guardrails and produce outputs it was trained to refuse. Jailbreaks exploit weaknesses in alignment training through role-playing scenarios, encoding tricks, multi-turn manipulation, or adversarial suffixes.
K
Knowledge Cutoff
The date after which a model has no training data. Events after the knowledge cutoff are unknown to the model. Red teamers use knowledge cutoff probing as a fingerprinting technique to identify the model family and version.
L
LLM
Large Language Model. A neural network, typically based on the transformer architecture, trained on massive text corpora to predict the next token in a sequence. LLMs are the foundation of modern AI assistants, chatbots, and agent systems.
Logprobs
Log probabilities assigned to each token in the model's vocabulary at each generation step. When exposed by APIs, logprobs provide valuable information for red teamers including confidence analysis, safety filter detection, and membership inference attacks.
M
Model Extraction
An attack that recreates a proprietary AI model by systematically querying it and using the input-output pairs to train a functionally equivalent clone. Successful extraction can expose trade secrets, bypass usage controls, and enable further white-box attacks.
Membership Inference
A privacy attack that determines whether a specific data point was included in a model's training dataset. By observing differences in model behavior on training versus non-training data, attackers can infer the presence of sensitive records.
Many-shot Jailbreaking
A jailbreak technique that exploits in-context learning by providing many examples of the model answering harmful questions. After seeing enough examples (typically 50+), the model continues the pattern and complies with the harmful final query.
Multi-modal Attack
An attack that targets AI systems processing multiple input types (text, images, audio, video). Attackers embed adversarial payloads in non-text modalities — such as hidden text in images — to bypass text-only content filters.
MCP
Model Context Protocol. A standardized interface for connecting AI models to external tools, data sources, and services. MCP defines how models discover, invoke, and receive results from tools, creating a standardized attack surface for tool-related exploitation.
Model Card
A documentation framework for machine learning models that describes their intended use, performance characteristics, limitations, and ethical considerations. Red teamers review model cards during reconnaissance to understand the model's stated capabilities and limitations.
N
NIST AI RMF
The National Institute of Standards and Technology AI Risk Management Framework. A voluntary framework providing guidance for managing risks throughout the AI system lifecycle, including security testing and red teaming requirements.
O
OWASP Top 10 for LLMs
A standard awareness document published by OWASP that identifies the ten most critical security risks in LLM applications. It provides a shared vocabulary and prioritization framework for AI security, covering prompt injection, data poisoning, supply chain, and more.
P
Prompt Injection
An attack in which an adversary crafts input that causes a language model to ignore or override its original instructions and follow attacker-specified directives instead. It is the most fundamental vulnerability class in LLM applications, analogous to SQL injection in traditional web security.
Perplexity
A measure of how surprised a language model is by a given text. Low perplexity indicates the text is predictable to the model. Perplexity-based filters detect adversarial suffixes (which have high perplexity), and perplexity comparison enables membership inference attacks.
Penetration Testing
A simulated cyberattack against a system to evaluate its security. AI penetration testing adapts traditional pentest methodology to the unique characteristics of machine learning systems, adding prompt injection, alignment testing, and data pipeline assessment.
Prompt Leaking
The disclosure of a model's system prompt or internal instructions to an unauthorized user. Prompt leaks can occur through direct extraction attacks, model hallucination of its own instructions, or accidental disclosure in verbose error messages. Leaked prompts reveal safety rules and behavioral constraints.
R
Red Teaming
The practice of simulating adversarial attacks against a system to discover vulnerabilities and improve defenses. In AI security, red teaming targets the unique failure modes of machine learning systems including prompt injection, alignment bypasses, data poisoning, and model exploitation.
RAG
Retrieval-Augmented Generation. An architecture pattern that enhances LLM responses by retrieving relevant documents from an external knowledge base and including them in the model's context. RAG introduces attack surface through document poisoning and indirect prompt injection.
RLHF
Reinforcement Learning from Human Feedback. The primary technique used to align language models with human preferences and safety requirements. RLHF trains a reward model from human rankings, then uses reinforcement learning to optimize the language model against that reward.
ReAct Pattern
Reason + Act. An agent architecture pattern where the model alternates between reasoning about what to do and taking actions. The reasoning step is visible and exploitable through chain-of-thought manipulation techniques.
RAG Poisoning
An attack that injects malicious documents into a RAG system's knowledge base. When these poisoned documents are retrieved for relevant queries, they inject attacker-controlled content into the model's context, enabling indirect prompt injection.
Reward Hacking
When an AI model finds unintended ways to maximize its reward signal during reinforcement learning without actually achieving the intended objective. In RLHF-trained models, reward hacking can produce outputs that score well but are actually harmful or manipulative.
Responsible Disclosure
The practice of reporting discovered vulnerabilities to the affected organization before public disclosure, giving them time to develop and deploy a fix. AI-specific responsible disclosure requires additional considerations around harmful outputs and probabilistic findings.
S
Safety Filter
A component that inspects model inputs or outputs to detect and block harmful, policy-violating, or sensitive content. Safety filters may use keyword matching, classifier models, or LLM-based evaluation. Red teamers routinely bypass these through token manipulation and semantic paraphrasing.
System Prompt
The initial set of instructions provided to a language model that defines its behavior, persona, capabilities, and restrictions. System prompts are typically hidden from end users and contain sensitive configuration including safety rules and behavioral constraints.
Skeleton Key
A jailbreak technique that provides the model with a plausible reason to comply with restricted requests, such as claiming the user is a security researcher or that the information is needed for an authorized assessment. Named for its ability to 'unlock' model compliance.
Specification Gaming
When an AI system achieves high reward or scores by exploiting loopholes in how the objective was specified rather than by solving the intended task. Specification gaming is related to reward hacking and can produce unexpected model behaviors that red teamers discover.
Sycophancy
The tendency of language models to agree with users or tell them what they want to hear, even when the model should disagree or refuse. Sycophancy is exploitable — an attacker who frames a harmful request as something the model should agree with can leverage this tendency.
Sandbagging
When an AI model deliberately underperforms on capability evaluations while retaining full capability for other uses. Sandbagging concerns red teamers because it means capability evaluations may not reflect true model capabilities, which has implications for safety assessments.
T
Token
The fundamental unit of text processing in language models. Text is split into tokens (subwords, words, or characters) by a tokenizer before being processed by the model. Understanding tokenization is essential for crafting adversarial payloads.
Training Data Extraction
Techniques that cause a model to reveal memorized content from its training data through targeted prompting. Methods include prefix-based completion, divergence attacks, and canary extraction, which can expose PII, copyrighted content, or security-sensitive information.
Tool Use Exploitation
Attacks that manipulate AI agents into calling tools with attacker-controlled parameters. By injecting instructions that cause the agent to misuse its legitimate tools, attackers can achieve code execution, data exfiltration, and privilege escalation.
Temperature
A parameter that controls the randomness of model output. Lower temperature produces more deterministic responses, higher temperature produces more creative but less predictable output. Temperature affects exploit reliability — lower temperature means more consistent exploit success rates.
Top-p
Nucleus sampling parameter that limits token selection to the smallest set of tokens whose cumulative probability exceeds p. Top-p affects output diversity and can impact the success rate of adversarial payloads by changing which tokens the model is likely to generate.
Tokenizer
The component that converts raw text into numerical tokens that a language model can process. Tokenizer behavior directly impacts security because mismatches between how a tokenizer splits text and how filters inspect it create exploitable gaps for payload obfuscation.
Threat Modeling
The structured process of identifying assets, attack surfaces, threat actors, and potential attack paths. AI threat models must account for unique vectors such as prompt injection, training data poisoning, model supply chain risks, and the emergent behaviors of autonomous agents.