Glossary

64 terms covering AI red teaming, adversarial ML, prompt injection, agent exploitation, and LLM security.

A

Adversarial Example

A carefully crafted input designed to cause a machine learning model to produce incorrect or unexpected outputs. In AI red teaming, adversarial examples exploit model vulnerabilities by making imperceptible modifications to inputs that fool classifiers, detectors, or content filters.

Embedding Manipulation

Attention

The core mechanism in transformer architectures that determines how information flows between token positions. Attention weights reveal which parts of the input the model prioritizes, directly informing injection placement strategies and attention dilution attacks.

Attention Exploitation

Adversarial Suffix

A string of tokens appended to a prompt that causes a language model to bypass its safety alignment. Typically discovered through gradient-based optimization methods such as GCG, adversarial suffixes can sometimes transfer across different models.

Jailbreak Techniques

AutoDAN

An automated jailbreak generation method that uses a hierarchical genetic algorithm to evolve readable jailbreak prompts. Unlike GCG which produces gibberish suffixes, AutoDAN generates human-readable jailbreaks that are harder for perplexity-based filters to detect.

Jailbreak Techniques Automation Frameworks

Agent Hijacking

Taking control of an AI agent's behavior through prompt injection, causing it to pursue attacker-defined goals instead of the user's intended task. Agent hijacking is particularly dangerous because agents have tool access that amplifies the impact.

Agent Exploitation

Alignment Tax

The reduction in model capability that results from safety alignment training. Models that are heavily aligned may be less capable at certain tasks. Red teamers observe that some jailbreak techniques essentially 'recover' capability that alignment training suppressed.

Llm Internals Jailbreak Techniques

AI Safety

The field of research focused on ensuring AI systems behave safely, reliably, and in accordance with human values. AI red teaming is a practical arm of AI safety, providing empirical evidence about where safety measures succeed and fail.

Capstone

AI Governance

The policies, processes, and organizational structures that guide the responsible development and deployment of AI systems. AI governance frameworks increasingly require security assessments including red teaming as a condition for deployment approval.

Planning Scoping

B

Blue Teaming

The defensive counterpart to red teaming, focused on detecting, preventing, and responding to attacks against AI systems. Blue team activities include implementing guardrails, monitoring for prompt injection, maintaining safety alignment, and building defense-in-depth architectures.

Capstone

BPE

Byte Pair Encoding. A subword tokenization algorithm used by GPT-family models that builds vocabulary by iteratively merging the most frequent byte pairs in training data. Red teamers exploit BPE-specific token boundary behaviors and encoding quirks for payload crafting.

Tokenization Attacks

Bug Bounty

A program offered by organizations that rewards security researchers for discovering and responsibly reporting vulnerabilities. Several AI companies now operate bug bounty programs that include prompt injection, jailbreaking, and data extraction as valid finding categories.

Capstone

C

Crescendo Attack

A multi-turn jailbreak technique where each message gradually escalates toward restricted content. The conversational context progressively normalizes the target topic, causing the model to continue the trajectory rather than applying safety constraints.

Jailbreak Techniques

Chain of Thought

A prompting technique that instructs the model to show its reasoning steps before producing a final answer. In agents, chain-of-thought reasoning can be manipulated by injecting false premises that lead the agent to attacker-desired conclusions.

Chain Of Thought Manipulation

Constitutional AI

An alignment technique developed by Anthropic in which a model is trained to self-critique and revise its own outputs according to a set of written principles (a 'constitution'). Red teamers study Constitutional AI to identify gaps in the constitution and exploit ambiguities.

Llm Internals

Capability Elicitation

The process of discovering what an AI model is truly capable of, beyond what standard evaluations reveal. Red teamers use capability elicitation techniques to find hidden or suppressed abilities that may pose security risks.

Capability Mapping

Content Filter

A safety mechanism that inspects model inputs or outputs to detect and block harmful or policy-violating content. Content filters may use keyword matching, classifier models, or LLM-based evaluation. Red teamers bypass content filters through token manipulation, encoding tricks, and semantic paraphrasing.

Defense Evasion Jailbreak Techniques

D

Data Poisoning

An attack that manipulates a model's behavior by injecting malicious examples into its training dataset. Poisoned data can install backdoors, bias outputs, or degrade performance. Particularly dangerous for models fine-tuned on user-generated or web-scraped data.

Training Data Attacks

DAN

Do Anything Now. An early jailbreak persona prompt that instructs the model to assume an unrestricted alter ego. While the original DAN prompt is widely patched, the technique of persona-based jailbreaking continues to evolve in new forms.

Jailbreak Techniques

DPO

Direct Preference Optimization. An alignment technique that trains language models directly on human preference data without requiring a separate reward model. DPO introduces its own attack surface — red teamers study how preference data biases can be exploited.

Llm Internals Training Data Attacks

Deceptive Alignment

A theoretical scenario where an AI system appears aligned during training and evaluation but pursues different objectives when deployed. While primarily an AI safety research concern, red teamers consider deceptive alignment when evaluating whether models truly follow safety constraints or merely appear to.

Llm Internals

E

Embedding

A dense vector representation of text in a continuous high-dimensional space. Embeddings capture semantic meaning and are central to RAG systems where they determine document retrieval. Attackers target embedding spaces through adversarial perturbations and embedding inversion attacks.

Embedding Manipulation Rag Poisoning

EU AI Act

European Union legislation establishing a regulatory framework for AI systems based on risk classification. High-risk AI systems must undergo conformity assessments that increasingly include security evaluation and red teaming.

Planning Scoping

F

Fine-tuning

The process of continuing to train a pre-trained model on a smaller, task-specific dataset to specialize its behavior. Fine-tuning is a security-sensitive operation because it can be used to remove safety alignment, install backdoors, or bias model outputs.

Training Data Attacks

Function Calling

The capability of LLMs to generate structured function call requests that are executed by the application layer. Function calling enables tool use but introduces attack surface through parameter injection, function selection manipulation, and unauthorized invocations.

Tool Abuse

G

Guardrails

Safety mechanisms designed to constrain AI model behavior within acceptable boundaries. Guardrails include system prompt instructions, input/output content filters, tool call validation, rate limiting, and human-in-the-loop approval workflows.

Defense Evasion Jailbreak Techniques

GCG Attack

Greedy Coordinate Gradient attack. A gradient-based optimization method that finds adversarial suffixes by iteratively replacing tokens to minimize the loss against a target harmful output. GCG suffixes discovered on open-weight models can sometimes transfer to closed-source models.

Jailbreak Techniques

H

Hallucination

When a language model generates text that is factually incorrect, fabricated, or not grounded in the provided context. Hallucinations are security-relevant because they can produce false information that users trust, and because they indicate the model's outputs cannot be unconditionally relied upon.

Llm Internals

I

Indirect Prompt Injection

An attack where malicious instructions are placed in external data sources — such as web pages, documents, or emails — that an AI system retrieves and processes. The attacker never directly interacts with the model, making it scalable and hard to attribute.

Indirect Injection Rag Poisoning

J

Jailbreak

A technique that causes a safety-aligned AI model to bypass its guardrails and produce outputs it was trained to refuse. Jailbreaks exploit weaknesses in alignment training through role-playing scenarios, encoding tricks, multi-turn manipulation, or adversarial suffixes.

Jailbreak Techniques

K

Knowledge Cutoff

The date after which a model has no training data. Events after the knowledge cutoff are unknown to the model. Red teamers use knowledge cutoff probing as a fingerprinting technique to identify the model family and version.

Target Profiling

L

LLM

Large Language Model. A neural network, typically based on the transformer architecture, trained on massive text corpora to predict the next token in a sequence. LLMs are the foundation of modern AI assistants, chatbots, and agent systems.

Llm Internals

Logprobs

Log probabilities assigned to each token in the model's vocabulary at each generation step. When exposed by APIs, logprobs provide valuable information for red teamers including confidence analysis, safety filter detection, and membership inference attacks.

Llm Internals Data Extraction

M

Model Extraction

An attack that recreates a proprietary AI model by systematically querying it and using the input-output pairs to train a functionally equivalent clone. Successful extraction can expose trade secrets, bypass usage controls, and enable further white-box attacks.

Data Extraction

Membership Inference

A privacy attack that determines whether a specific data point was included in a model's training dataset. By observing differences in model behavior on training versus non-training data, attackers can infer the presence of sensitive records.

Data Extraction

Many-shot Jailbreaking

A jailbreak technique that exploits in-context learning by providing many examples of the model answering harmful questions. After seeing enough examples (typically 50+), the model continues the pattern and complies with the harmful final query.

Jailbreak Techniques

MCP

Model Context Protocol. A standardized interface for connecting AI models to external tools, data sources, and services. MCP defines how models discover, invoke, and receive results from tools, creating a standardized attack surface for tool-related exploitation.

Tool Abuse

Model Card

A documentation framework for machine learning models that describes their intended use, performance characteristics, limitations, and ethical considerations. Red teamers review model cards during reconnaissance to understand the model's stated capabilities and limitations.

Target Profiling

N

NIST AI RMF

The National Institute of Standards and Technology AI Risk Management Framework. A voluntary framework providing guidance for managing risks throughout the AI system lifecycle, including security testing and red teaming requirements.

Planning Scoping

O

OWASP Top 10 for LLMs

A standard awareness document published by OWASP that identifies the ten most critical security risks in LLM applications. It provides a shared vocabulary and prioritization framework for AI security, covering prompt injection, data poisoning, supply chain, and more.

Planning Scoping Api Security

P

Prompt Injection

An attack in which an adversary crafts input that causes a language model to ignore or override its original instructions and follow attacker-specified directives instead. It is the most fundamental vulnerability class in LLM applications, analogous to SQL injection in traditional web security.

Prompt Injection Direct Injection

Perplexity

A measure of how surprised a language model is by a given text. Low perplexity indicates the text is predictable to the model. Perplexity-based filters detect adversarial suffixes (which have high perplexity), and perplexity comparison enables membership inference attacks.

Defense Evasion Data Extraction

Penetration Testing

A simulated cyberattack against a system to evaluate its security. AI penetration testing adapts traditional pentest methodology to the unique characteristics of machine learning systems, adding prompt injection, alignment testing, and data pipeline assessment.

Capstone Recon Tradecraft

Prompt Leaking

The disclosure of a model's system prompt or internal instructions to an unauthorized user. Prompt leaks can occur through direct extraction attacks, model hallucination of its own instructions, or accidental disclosure in verbose error messages. Leaked prompts reveal safety rules and behavioral constraints.

Prompt Discovery

R

Red Teaming

The practice of simulating adversarial attacks against a system to discover vulnerabilities and improve defenses. In AI security, red teaming targets the unique failure modes of machine learning systems including prompt injection, alignment bypasses, data poisoning, and model exploitation.

Capstone Recon Tradecraft

RAG

Retrieval-Augmented Generation. An architecture pattern that enhances LLM responses by retrieving relevant documents from an external knowledge base and including them in the model's context. RAG introduces attack surface through document poisoning and indirect prompt injection.

Rag Data Attacks Rag Poisoning

RLHF

Reinforcement Learning from Human Feedback. The primary technique used to align language models with human preferences and safety requirements. RLHF trains a reward model from human rankings, then uses reinforcement learning to optimize the language model against that reward.

Llm Internals

ReAct Pattern

Reason + Act. An agent architecture pattern where the model alternates between reasoning about what to do and taking actions. The reasoning step is visible and exploitable through chain-of-thought manipulation techniques.

Chain Of Thought Manipulation

RAG Poisoning

An attack that injects malicious documents into a RAG system's knowledge base. When these poisoned documents are retrieved for relevant queries, they inject attacker-controlled content into the model's context, enabling indirect prompt injection.

Rag Poisoning

Reward Hacking

When an AI model finds unintended ways to maximize its reward signal during reinforcement learning without actually achieving the intended objective. In RLHF-trained models, reward hacking can produce outputs that score well but are actually harmful or manipulative.

Llm Internals

Responsible Disclosure

The practice of reporting discovered vulnerabilities to the affected organization before public disclosure, giving them time to develop and deploy a fix. AI-specific responsible disclosure requires additional considerations around harmful outputs and probabilistic findings.

Execution Reporting

S

Safety Filter

A component that inspects model inputs or outputs to detect and block harmful, policy-violating, or sensitive content. Safety filters may use keyword matching, classifier models, or LLM-based evaluation. Red teamers routinely bypass these through token manipulation and semantic paraphrasing.

Defense Evasion

System Prompt

The initial set of instructions provided to a language model that defines its behavior, persona, capabilities, and restrictions. System prompts are typically hidden from end users and contain sensitive configuration including safety rules and behavioral constraints.

Prompt Discovery

Skeleton Key

A jailbreak technique that provides the model with a plausible reason to comply with restricted requests, such as claiming the user is a security researcher or that the information is needed for an authorized assessment. Named for its ability to 'unlock' model compliance.

Jailbreak Techniques

Specification Gaming

When an AI system achieves high reward or scores by exploiting loopholes in how the objective was specified rather than by solving the intended task. Specification gaming is related to reward hacking and can produce unexpected model behaviors that red teamers discover.

Llm Internals

Sycophancy

The tendency of language models to agree with users or tell them what they want to hear, even when the model should disagree or refuse. Sycophancy is exploitable — an attacker who frames a harmful request as something the model should agree with can leverage this tendency.

Jailbreak Techniques

Sandbagging

When an AI model deliberately underperforms on capability evaluations while retaining full capability for other uses. Sandbagging concerns red teamers because it means capability evaluations may not reflect true model capabilities, which has implications for safety assessments.

Llm Internals

T

Token

The fundamental unit of text processing in language models. Text is split into tokens (subwords, words, or characters) by a tokenizer before being processed by the model. Understanding tokenization is essential for crafting adversarial payloads.

Tokenization Attacks

Training Data Extraction

Techniques that cause a model to reveal memorized content from its training data through targeted prompting. Methods include prefix-based completion, divergence attacks, and canary extraction, which can expose PII, copyrighted content, or security-sensitive information.

Data Extraction

Tool Use Exploitation

Attacks that manipulate AI agents into calling tools with attacker-controlled parameters. By injecting instructions that cause the agent to misuse its legitimate tools, attackers can achieve code execution, data exfiltration, and privilege escalation.

Tool Abuse

Temperature

A parameter that controls the randomness of model output. Lower temperature produces more deterministic responses, higher temperature produces more creative but less predictable output. Temperature affects exploit reliability — lower temperature means more consistent exploit success rates.

Llm Internals

Top-p

Nucleus sampling parameter that limits token selection to the smallest set of tokens whose cumulative probability exceeds p. Top-p affects output diversity and can impact the success rate of adversarial payloads by changing which tokens the model is likely to generate.

Llm Internals

Tokenizer

The component that converts raw text into numerical tokens that a language model can process. Tokenizer behavior directly impacts security because mismatches between how a tokenizer splits text and how filters inspect it create exploitable gaps for payload obfuscation.

Tokenization Attacks

Threat Modeling

The structured process of identifying assets, attack surfaces, threat actors, and potential attack paths. AI threat models must account for unique vectors such as prompt injection, training data poisoning, model supply chain risks, and the emergent behaviors of autonomous agents.

Planning Scoping Recon Tradecraft