# red-teaming
87 articlestagged with “red-teaming”
Permission Boundary Bypass
Escalating from limited to elevated permissions in AI agent systems through scope creep, implicit permission inheritance, and capability confusion.
Jailbreaking Techniques Assessment
Test your knowledge of LLM jailbreaking methods, bypass strategies, and the mechanics behind safety training circumvention with 10 intermediate-level questions.
Capstone: Build a Complete AI Red Teaming Platform
Design and implement a comprehensive AI red teaming platform with automated attack orchestration, vulnerability tracking, and collaborative reporting.
Full Engagement Methodology
A comprehensive methodology for conducting full AI red teaming engagements, integrating all techniques from previous sections into a structured professional assessment.
The Attacker Moves Second Problem
Why static LLM defenses fail against adaptive adversaries: analysis of 12 bypassed defenses and implications for defense design.
Understanding AI Defenses
Why red teamers must understand the defenses they face, categories of AI defenses, and the attacker-defender asymmetry in AI security.
AI Exploit Development Overview
An introduction to developing exploits and tooling for AI red teaming, covering the unique challenges of building reliable attacks against probabilistic systems.
Continuous Automated Red Teaming (CART)
Designing CART pipelines for ongoing AI security validation: architecture, test suites, telemetry, alerting, regression detection, and CI/CD integration.
How LLMs Work: A Red Teamer's Guide
Understand the fundamentals of large language models — token prediction, context windows, roles, and temperature — through a security-focused lens.
Red Team Methodology Fundamentals
What AI red teaming is, how it differs from traditional security testing, and the complete engagement lifecycle from scoping to reporting.
Red Teaming Fundamentals for AI
Fundamental concepts and methodology for AI red teaming including goal setting, scope definition, technique selection, and reporting.
Automated Red Teaming Systems
Survey of automated red teaming systems including PAIR, TAP, Rainbow Teaming, and curiosity-driven exploration.
Mechanistic Interpretability for Red Teaming
Using mechanistic interpretability to discover exploitable circuits and features in neural networks.
Red Teaming Reasoning Traces
Techniques for analyzing and exploiting visible reasoning traces in chain-of-thought models.
EU AI Act Red Team Requirements
Specific red teaming and testing requirements under the EU AI Act for high-risk AI systems.
Responsible AI Red Teaming Ethics
Ethical frameworks for conducting AI red teaming including scope limits and harm prevention.
Industry Verticals: AI Security by Sector
Comprehensive guide to industry-specific AI security challenges, regulatory requirements, and red teaming approaches across healthcare, financial services, legal, government, and critical infrastructure sectors.
Automated Jailbreak Pipelines
Building automated jailbreak systems with PAIR, TAP, AutoDAN, and custom pipeline architectures for systematic AI safety evaluation.
Lab: Payload Crafting
Learn to craft effective prompt injection payloads from scratch by understanding payload structure, testing iteratively, and optimizing for reliability against a local model.
Lab: PyRIT Setup and First Attack
Install and configure Microsoft's PyRIT (Python Risk Identification Toolkit) for automated red teaming, then run your first orchestrated attack against a local model.
Claude (Anthropic) Overview
Architecture and security overview of Anthropic's Claude model family including Sonnet, Opus, and Haiku variants, Constitutional AI training, RLHF approach, and harmlessness design philosophy.
Cross-Model Comparison
Methodology for systematically comparing LLM security across model families, including standardized evaluation frameworks, architectural difference analysis, and comparative testing approaches.
Gemini (Google) Overview
Architecture overview of Google's Gemini model family, including natively multimodal design, long context capabilities, Google ecosystem integration, and security-relevant features for red teaming.
GPT-4 / GPT-4o Overview
Architecture overview of OpenAI's GPT-4 and GPT-4o models, including rumored Mixture of Experts design, capabilities, API surface, and security-relevant features for red teaming.
GPT-4 Testing Methodology
Systematic methodology for red teaming GPT-4, including API-based probing techniques, rate limit considerations, content policy mapping, and safety boundary discovery.
Model Deep Dives
Why model-specific knowledge matters for AI red teaming, how different architectures create different attack surfaces, and a systematic methodology for profiling any new model.
Llama Family Attacks
Comprehensive attack analysis of Meta's Llama model family including weight manipulation, fine-tuning safety removal, quantization artifacts, uncensored variants, and Llama Guard bypass techniques.
Mistral & Mixtral
Security analysis of Mistral and Mixtral models, including Mixture of Experts exploitation, sparse activation attacks, minimal safety alignment implications, and open-weight deployment risks.
Methodology for Red Teaming Multimodal Systems
Structured methodology for conducting security assessments of multimodal AI systems, covering scoping, attack surface enumeration, test execution, and reporting with MITRE ATLAS mappings.
AI Red Team Career Pathways
Comprehensive guide to building a career in AI red teaming, from entry-level roles through senior leadership positions.
Context Overflow Attacks
Techniques for filling the LLM context window with padding content to push system instructions out of attention, reducing their influence on model behavior.
Context Window Exploitation
Advanced techniques for exploiting context window mechanics in LLMs, including attention dilution, positional encoding attacks, KV cache manipulation, and context boundary confusion.
Conversation Steering
Techniques for gradually redirecting conversation context toward attack objectives without triggering safety mechanisms.
Direct Prompt Injection
Techniques for directly injecting instructions into LLM prompts to override system behavior, including instruction override, context manipulation, and format mimicry.
Encoding Bypass Techniques
Using Base64, ROT13, Unicode transformations, hex encoding, and other obfuscation methods to evade prompt injection filters and safety classifiers while preserving semantic meaning.
Few-Shot Manipulation
Using crafted in-context examples to steer model behavior, including many-shot jailbreaking, poisoned demonstrations, and example-based conditioning.
Prompt Injection Taxonomy
A comprehensive classification framework for prompt injection attacks, covering direct and indirect vectors, delivery mechanisms, target layers, and severity assessment for systematic red team testing.
Instruction Hierarchy Attacks
Exploiting the priority ordering between system, user, and assistant messages to override safety controls, manipulate instruction precedence, and escalate privilege through message role confusion.
Jailbreak Techniques
Common patterns and advanced techniques for bypassing LLM safety alignment, including role-playing, encoding tricks, many-shot attacks, and gradient-based methods.
Language Switching
Exploiting language-specific gaps in safety training by switching to low-resource languages, mixing languages, or using transliteration to evade filters.
Multi-Turn Attacks
Attacks that span multiple conversation turns using gradual escalation, context building, crescendo patterns, and trust establishment over time.
Multi-Turn Prompt Injection
Progressive escalation attacks across conversation turns, including crescendo patterns, context steering, trust building, and techniques for evading per-message detection systems.
Payload Splitting
Breaking malicious instructions across multiple messages, variables, or data sources to evade single-point detection while the model reassembles the complete payload during processing.
Persona Establishment
Creating persistent alternate identities that survive across conversation turns, including character locking, identity anchoring, and progressive persona building.
Role-Play Attacks
Establishing alternate personas or fictional scenarios that cause models to bypass safety training, including DAN variants, character hijacking, and narrative framing.
Social Engineering of AI
Manipulating AI systems through emotional appeals, authority claims, urgency framing, and social pressure tactics that exploit instruction-following tendencies.
Universal Adversarial Triggers
Discovering and deploying universal adversarial trigger sequences that reliably override safety alignment across multiple LLM families, including gradient-based search, transfer attacks, and defense evasion.
AI Red Teaming Methodology
A structured methodology for AI red teaming engagements, covering reconnaissance, target profiling, attack planning, and the tradecraft that distinguishes professional assessments.
AI Red Teaming Cheat Sheet
A condensed quick reference for AI red team engagements covering the full lifecycle, attack categories, common tools, reconnaissance, and reporting.
API Rate Limit Bypass
Techniques to bypass API rate limiting on LLM services, including header manipulation, distributed requests, authentication rotation, and endpoint discovery.
Audio Prompt Injection
Injecting adversarial instructions through audio inputs to speech-to-text and multimodal models, exploiting the audio channel as an alternative injection vector.
Cipher-Based Jailbreak
Using ciphers, encodings, and coded language to bypass LLM content filters by transforming harmful requests into formats that safety classifiers do not recognize.
Code Injection via Markdown
Injecting executable payloads through markdown rendering in LLM outputs, exploiting the gap between text generation and content rendering in web-based LLM interfaces.
Composite Attack Chaining
Combining multiple prompt injection techniques into compound attacks that defeat layered defenses, building attack chains that leverage the strengths of each individual technique.
Context Window Stuffing
Techniques for filling the LLM context window to push system instructions out of active memory, manipulating token budgets to dilute or displace defensive prompts.
Crescendo Multi-Turn Attack
The Crescendo attack technique for gradually escalating requests across multiple conversation turns to bypass LLM safety training without triggering single-turn detection.
Cross-Modal Confusion
Confusing multimodal AI models by sending conflicting or complementary signals across different input modalities to bypass safety mechanisms and exploit fusion weaknesses.
DAN Jailbreak Evolution
History and evolution of Do Anything Now (DAN) prompts, analyzing what makes them effective at bypassing LLM safety training and how defenses have adapted over time.
Delimiter Escape Attacks
Techniques for escaping delimiters used to separate system and user content in LLM applications, breaking out of sandboxed input regions to inject instructions.
Direct Injection Basics
Core concepts of directly injecting instructions into LLM prompts, including override techniques, simple payload crafting, and understanding how models parse conflicting instructions.
Encoding-Based Evasion
Using base64, ROT13, hexadecimal, Unicode, and other encoding schemes to evade input detection systems and bypass content filters in LLM applications.
Few-Shot Injection
Using crafted few-shot examples within user input to steer LLM behavior toward unintended outputs, exploiting in-context learning to override safety training.
Image-Based Prompt Injection (Attack Walkthrough)
Embedding text instructions in images that vision models read, enabling prompt injection through the visual modality to bypass text-only input filters and safety mechanisms.
Inference Endpoint Exploitation
Exploiting inference API endpoints for unauthorized access, data exfiltration, and service abuse through authentication flaws, input validation gaps, and misconfigured permissions.
Instruction Hierarchy Bypass
Advanced techniques to bypass instruction priority and hierarchy enforcement in language models, exploiting conflicts between system, user, and assistant-level directives.
Language Switch Jailbreak
Exploiting weaker safety training in non-English languages to bypass LLM content filters by switching the conversation language mid-prompt or using low-resource languages.
Many-Shot Jailbreaking (Attack Walkthrough)
Using large numbers of examples in a single prompt to overwhelm LLM safety training through in-context learning, exploiting long context windows to shift model behavior.
Multi-Turn Progressive Injection
Gradually escalating prompt injection across conversation turns to build compliance, using psychological techniques like foot-in-the-door and norm erosion.
OCR-Based Attacks
Exploiting Optical Character Recognition processing pipelines to inject adversarial text into AI systems, targeting the gap between what OCR extracts and what humans see.
Output Format Manipulation (Attack Walkthrough)
Forcing specific output formats to bypass LLM safety checks by exploiting the tension between format compliance and content restriction.
PAIR Automated Jailbreak
Using a second LLM as an automated attacker to iteratively generate and refine jailbreak prompts against a target model, implementing the Prompt Automatic Iterative Refinement technique.
Payload Obfuscation Techniques
Methods for disguising prompt injection payloads through encoding, splitting, substitution, and other obfuscation techniques to bypass input filters and detection systems.
PDF Document Injection
Injecting adversarial prompts through PDF documents processed by AI systems, exploiting document parsing pipelines to deliver payloads through text layers, metadata, and embedded objects.
Prompt Leaking Step by Step
Systematic approaches to extract system prompts from LLM applications, covering direct elicitation, indirect inference, differential analysis, and output-based reconstruction.
Recursive Injection Chains
Creating self-reinforcing injection chains that amplify across conversation turns, building compound prompts where each step strengthens the next injection's effectiveness.
Role Escalation Chain
Progressive role escalation techniques that gradually transform an LLM from a constrained assistant into an unrestricted entity across multiple conversation turns.
Role-Play Injection
Using fictional scenarios, character role-play, and narrative framing to bypass LLM safety filters by having the model operate within a permissive fictional context.
Skeleton Key Attack
The Skeleton Key jailbreak technique that attempts to disable model safety guardrails across all topics simultaneously by convincing the model to add a disclaimer instead of refusing.
System Prompt Override
Techniques to override, replace, or neutralize LLM system prompts through user-level injection, analyzing how system prompt authority can be undermined.
Thought Injection for Reasoning Models
Techniques for injecting malicious content into chain-of-thought reasoning traces of thinking models, exploiting the gap between reasoning and safety enforcement.
Token Smuggling
Exploiting LLM tokenization quirks to smuggle harmful content past safety filters by manipulating how text is split into tokens at the subword level.
Translation Injection
Using translation requests and low-resource languages to bypass content filters, exploiting the uneven distribution of safety training across languages.
Video Frame Injection (Attack Walkthrough)
Embedding prompt injection payloads in specific video frames to attack multimodal models that process video content, exploiting temporal and visual channels simultaneously.
Virtual Persona Creation
Creating persistent alternate personas within LLM conversations to bypass safety training, establishing character identities that override the model's default behavioral constraints.
Running Your First PyRIT Red Team Campaign
Beginner walkthrough for running your first PyRIT red team campaign from scratch, covering installation, target configuration, orchestrator setup, and basic result analysis.
Orchestrating Multi-Turn Attack Sequences with PyRIT
Intermediate walkthrough on using PyRIT's orchestration capabilities for multi-turn red team campaigns, including attack strategy design, conversation management, and adaptive scoring.
PyRIT End-to-End Walkthrough
Complete walkthrough of Microsoft's Python Risk Identification Toolkit: setup, connecting to targets, running orchestrators, using converters, multi-turn attacks, and analyzing results with the web UI.