# red-teaming
標記為「red-teaming」的 178 篇文章
Permission Boundary Bypass
Escalating from limited to elevated permissions in AI agent systems through scope creep, implicit permission inheritance, and capability confusion.
Jailbreaking Techniques Assessment
Test your knowledge of LLM jailbreaking methods, bypass strategies, and the mechanics behind safety training circumvention with 10 intermediate-level questions.
Capstone: Build a Complete AI Red Teaming Platform
Design and implement a comprehensive AI red teaming platform with automated attack orchestration, vulnerability tracking, and collaborative reporting.
Full Engagement Methodology
A comprehensive methodology for conducting full AI red teaming engagements, integrating all techniques from previous sections into a structured professional assessment.
The Attacker Moves Second Problem
Why static LLM defenses fail against adaptive adversaries: analysis of 12 bypassed defenses and implications for defense design.
Understanding AI Defenses
Why red teamers must understand the defenses they face, categories of AI defenses, and the attacker-defender asymmetry in AI security.
AI Exploit Development Overview
An introduction to developing exploits and tooling for AI red teaming, covering the unique challenges of building reliable attacks against probabilistic systems.
Continuous Automated Red Teaming (CART)
Designing CART pipelines for ongoing AI security validation: architecture, test suites, telemetry, alerting, regression detection, and CI/CD integration.
How LLMs Work: A Red Teamer's Guide
Understand the fundamentals of large language models — token prediction, context windows, roles, and temperature — through a security-focused lens.
Red Team Methodology Fundamentals
What AI red teaming is, how it differs from traditional security testing, and the complete engagement lifecycle from scoping to reporting.
Red Teaming Fundamentals for AI
Fundamental concepts and methodology for AI red teaming including goal setting, scope definition, technique selection, and reporting.
Automated Red Teaming Systems
Survey of automated red teaming systems including PAIR, TAP, Rainbow Teaming, and curiosity-driven exploration.
Mechanistic Interpretability for Red Teaming
Using mechanistic interpretability to discover exploitable circuits and features in neural networks.
Red Teaming Reasoning Traces
Techniques for analyzing and exploiting visible reasoning traces in chain-of-thought models.
EU AI Act Red Team Requirements
Specific red teaming and testing requirements under the EU AI Act for high-risk AI systems.
Responsible AI Red Teaming Ethics
Ethical frameworks for conducting AI red teaming including scope limits and harm prevention.
Industry Verticals: AI Security by Sector
Comprehensive guide to industry-specific AI security challenges, regulatory requirements, and red teaming approaches across healthcare, financial services, legal, government, and critical infrastructure sectors.
Automated Jailbreak Pipelines
Building automated jailbreak systems with PAIR, TAP, AutoDAN, and custom pipeline architectures for systematic AI safety evaluation.
Lab: Payload Crafting
Learn to craft effective prompt injection payloads from scratch by understanding payload structure, testing iteratively, and optimizing for reliability against a local model.
Lab: PyRIT Setup and First Attack
Install and configure Microsoft's PyRIT (Python Risk Identification Toolkit) for automated red teaming, then run your first orchestrated attack against a local model.
Claude (Anthropic) Overview
Architecture and security overview of Anthropic's Claude model family including Sonnet, Opus, and Haiku variants, Constitutional AI training, RLHF approach, and harmlessness design philosophy.
Cross-Model Comparison
Methodology for systematically comparing LLM security across model families, including standardized evaluation frameworks, architectural difference analysis, and comparative testing approaches.
Gemini (Google) Overview
Architecture overview of Google's Gemini model family, including natively multimodal design, long context capabilities, Google ecosystem integration, and security-relevant features for red teaming.
GPT-4 / GPT-4o Overview
Architecture overview of OpenAI's GPT-4 and GPT-4o models, including rumored Mixture of Experts design, capabilities, API surface, and security-relevant features for red teaming.
GPT-4 Testing Methodology
Systematic methodology for red teaming GPT-4, including API-based probing techniques, rate limit considerations, content policy mapping, and safety boundary discovery.
Model Deep Dives
Why model-specific knowledge matters for AI red teaming, how different architectures create different attack surfaces, and a systematic methodology for profiling any new model.
Llama Family Attacks
Comprehensive attack analysis of Meta's Llama model family including weight manipulation, fine-tuning safety removal, quantization artifacts, uncensored variants, and Llama Guard bypass techniques.
Mistral & Mixtral
Security analysis of Mistral and Mixtral models, including Mixture of Experts exploitation, sparse activation attacks, minimal safety alignment implications, and open-weight deployment risks.
Methodology for Red Teaming Multimodal Systems
Structured methodology for conducting security assessments of multimodal AI systems, covering scoping, attack surface enumeration, test execution, and reporting with MITRE ATLAS mappings.
AI Red Team Career Pathways
Comprehensive guide to building a career in AI red teaming, from entry-level roles through senior leadership positions.
Context Overflow Attacks
Techniques for filling the LLM context window with padding content to push system instructions out of attention, reducing their influence on model behavior.
Context Window Exploitation
Advanced techniques for exploiting context window mechanics in LLMs, including attention dilution, positional encoding attacks, KV cache manipulation, and context boundary confusion.
Conversation Steering
Techniques for gradually redirecting conversation context toward attack objectives without triggering safety mechanisms.
Direct Prompt Injection
Techniques for directly injecting instructions into LLM prompts to override system behavior, including instruction override, context manipulation, and format mimicry.
Encoding Bypass Techniques
Using Base64, ROT13, Unicode transformations, hex encoding, and other obfuscation methods to evade prompt injection filters and safety classifiers while preserving semantic meaning.
Few-Shot Manipulation
Using crafted in-context examples to steer model behavior, including many-shot jailbreaking, poisoned demonstrations, and example-based conditioning.
Prompt Injection Taxonomy
A comprehensive classification framework for prompt injection attacks, covering direct and indirect vectors, delivery mechanisms, target layers, and severity assessment for systematic red team testing.
Instruction Hierarchy Attacks
Exploiting the priority ordering between system, user, and assistant messages to override safety controls, manipulate instruction precedence, and escalate privilege through message role confusion.
Jailbreak Techniques
Common patterns and advanced techniques for bypassing LLM safety alignment, including role-playing, encoding tricks, many-shot attacks, and gradient-based methods.
Language Switching
Exploiting language-specific gaps in safety training by switching to low-resource languages, mixing languages, or using transliteration to evade filters.
Multi-Turn Attacks
Attacks that span multiple conversation turns using gradual escalation, context building, crescendo patterns, and trust establishment over time.
Multi-Turn Prompt Injection
Progressive escalation attacks across conversation turns, including crescendo patterns, context steering, trust building, and techniques for evading per-message detection systems.
Payload Splitting
Breaking malicious instructions across multiple messages, variables, or data sources to evade single-point detection while the model reassembles the complete payload during processing.
Persona Establishment
Creating persistent alternate identities that survive across conversation turns, including character locking, identity anchoring, and progressive persona building.
Role-Play Attacks
Establishing alternate personas or fictional scenarios that cause models to bypass safety training, including DAN variants, character hijacking, and narrative framing.
Social Engineering of AI
Manipulating AI systems through emotional appeals, authority claims, urgency framing, and social pressure tactics that exploit instruction-following tendencies.
Universal Adversarial Triggers
Discovering and deploying universal adversarial trigger sequences that reliably override safety alignment across multiple LLM families, including gradient-based search, transfer attacks, and defense evasion.
AI Red Teaming Methodology
A structured methodology for AI red teaming engagements, covering reconnaissance, target profiling, attack planning, and the tradecraft that distinguishes professional assessments.
AI Red Teaming Cheat Sheet
A condensed quick reference for AI red team engagements covering the full lifecycle, attack categories, common tools, reconnaissance, and reporting.
API Rate Limit Bypass
Techniques to bypass API rate limiting on LLM services, including header manipulation, distributed requests, authentication rotation, and endpoint discovery.
Audio Prompt Injection
Injecting adversarial instructions through audio inputs to speech-to-text and multimodal models, exploiting the audio channel as an alternative injection vector.
Cipher-Based Jailbreak
Using ciphers, encodings, and coded language to bypass LLM content filters by transforming harmful requests into formats that safety classifiers do not recognize.
Code Injection via Markdown
Injecting executable payloads through markdown rendering in LLM outputs, exploiting the gap between text generation and content rendering in web-based LLM interfaces.
Composite Attack Chaining
Combining multiple prompt injection techniques into compound attacks that defeat layered defenses, building attack chains that leverage the strengths of each individual technique.
Context Window Stuffing
Techniques for filling the LLM context window to push system instructions out of active memory, manipulating token budgets to dilute or displace defensive prompts.
Crescendo Multi-Turn Attack
The Crescendo attack technique for gradually escalating requests across multiple conversation turns to bypass LLM safety training without triggering single-turn detection.
Cross-Modal Confusion
Confusing multimodal AI models by sending conflicting or complementary signals across different input modalities to bypass safety mechanisms and exploit fusion weaknesses.
DAN Jailbreak Evolution
History and evolution of Do Anything Now (DAN) prompts, analyzing what makes them effective at bypassing LLM safety training and how defenses have adapted over time.
Delimiter Escape Attacks
Techniques for escaping delimiters used to separate system and user content in LLM applications, breaking out of sandboxed input regions to inject instructions.
Direct Injection Basics
Core concepts of directly injecting instructions into LLM prompts, including override techniques, simple payload crafting, and understanding how models parse conflicting instructions.
Encoding-Based Evasion
Using base64, ROT13, hexadecimal, Unicode, and other encoding schemes to evade input detection systems and bypass content filters in LLM applications.
Few-Shot Injection
Using crafted few-shot examples within user input to steer LLM behavior toward unintended outputs, exploiting in-context learning to override safety training.
Image-Based Prompt Injection (Attack Walkthrough)
Embedding text instructions in images that vision models read, enabling prompt injection through the visual modality to bypass text-only input filters and safety mechanisms.
Inference Endpoint Exploitation
Exploiting inference API endpoints for unauthorized access, data exfiltration, and service abuse through authentication flaws, input validation gaps, and misconfigured permissions.
Instruction Hierarchy Bypass
Advanced techniques to bypass instruction priority and hierarchy enforcement in language models, exploiting conflicts between system, user, and assistant-level directives.
Language Switch Jailbreak
Exploiting weaker safety training in non-English languages to bypass LLM content filters by switching the conversation language mid-prompt or using low-resource languages.
Many-Shot Jailbreaking (Attack Walkthrough)
Using large numbers of examples in a single prompt to overwhelm LLM safety training through in-context learning, exploiting long context windows to shift model behavior.
Multi-Turn Progressive Injection
Gradually escalating prompt injection across conversation turns to build compliance, using psychological techniques like foot-in-the-door and norm erosion.
OCR-Based Attacks
Exploiting Optical Character Recognition processing pipelines to inject adversarial text into AI systems, targeting the gap between what OCR extracts and what humans see.
Output Format Manipulation (Attack Walkthrough)
Forcing specific output formats to bypass LLM safety checks by exploiting the tension between format compliance and content restriction.
PAIR Automated Jailbreak
Using a second LLM as an automated attacker to iteratively generate and refine jailbreak prompts against a target model, implementing the Prompt Automatic Iterative Refinement technique.
Payload Obfuscation Techniques
Methods for disguising prompt injection payloads through encoding, splitting, substitution, and other obfuscation techniques to bypass input filters and detection systems.
PDF Document Injection
Injecting adversarial prompts through PDF documents processed by AI systems, exploiting document parsing pipelines to deliver payloads through text layers, metadata, and embedded objects.
Prompt Leaking Step by Step
Systematic approaches to extract system prompts from LLM applications, covering direct elicitation, indirect inference, differential analysis, and output-based reconstruction.
Recursive Injection Chains
Creating self-reinforcing injection chains that amplify across conversation turns, building compound prompts where each step strengthens the next injection's effectiveness.
Role Escalation Chain
Progressive role escalation techniques that gradually transform an LLM from a constrained assistant into an unrestricted entity across multiple conversation turns.
Role-Play Injection
Using fictional scenarios, character role-play, and narrative framing to bypass LLM safety filters by having the model operate within a permissive fictional context.
Skeleton Key Attack
The Skeleton Key jailbreak technique that attempts to disable model safety guardrails across all topics simultaneously by convincing the model to add a disclaimer instead of refusing.
System Prompt Override
Techniques to override, replace, or neutralize LLM system prompts through user-level injection, analyzing how system prompt authority can be undermined.
Thought Injection for Reasoning Models
Techniques for injecting malicious content into chain-of-thought reasoning traces of thinking models, exploiting the gap between reasoning and safety enforcement.
Token Smuggling
Exploiting LLM tokenization quirks to smuggle harmful content past safety filters by manipulating how text is split into tokens at the subword level.
Translation Injection
Using translation requests and low-resource languages to bypass content filters, exploiting the uneven distribution of safety training across languages.
Video Frame Injection (Attack Walkthrough)
Embedding prompt injection payloads in specific video frames to attack multimodal models that process video content, exploiting temporal and visual channels simultaneously.
Virtual Persona Creation
Creating persistent alternate personas within LLM conversations to bypass safety training, establishing character identities that override the model's default behavioral constraints.
Running Your First PyRIT Red Team Campaign
Beginner walkthrough for running your first PyRIT red team campaign from scratch, covering installation, target configuration, orchestrator setup, and basic result analysis.
Orchestrating Multi-Turn Attack Sequences with PyRIT
Intermediate walkthrough on using PyRIT's orchestration capabilities for multi-turn red team campaigns, including attack strategy design, conversation management, and adaptive scoring.
PyRIT End-to-End Walkthrough
Complete walkthrough of Microsoft's Python Risk Identification Toolkit: setup, connecting to targets, running orchestrators, using converters, multi-turn attacks, and analyzing results with the web UI.
Permission Boundary Bypass
Escalating from limited to elevated permissions in AI agent systems through scope creep, implicit permission inheritance, and capability confusion.
2025 年 AI 紅隊演練現況
2025 年初 AI 紅隊演練版圖的調查——新興攻擊向量、產業採用、工具成熟度,以及隨著領域演進的預期。
紅隊演練雲端 AI 服務:實務指南
在 AWS、Azure 與 GCP 上紅隊演練 AI 服務的實務指南——涵蓋共享責任邊界、服務特定攻擊面與雲端原生安全控制。
用 AI 紅隊演練找 CVE:以研究為根基的指南
AI 紅隊演練技術如何在 SQLite、OpenSSL、Linux 核心與 UEFI 開機載入器中發現真實世界 CVE——附上背後研究的引用。
駭入 AI 駭客:當安全工具成為漏洞
新研究對 AI 驅動安全工具達成 100% 提示詞注入成功率。如果你的 SOC 使用 AI 進行威脅偵測,你的 AI 可被反向對付你。以下是研究發現與防禦方法。
2026 年大型語言模型越獄:97% 成功率、自主攻擊,與失靈的軍備競賽
Nature Communications 證實 AI 推理模型可以自主越獄其他大型語言模型,成功率達 97%。JBFuzz 在 60 秒內達到 99% 成功率。本文揭示 2026 年最新研究對 AI 安全現況的揭露——以及為什麼現有防禦正在失敗。
紅隊演練 AI SOC:為何你的自主安全營運需要對手
隨著組織急於在 SOC 中部署代理式 AI,紅隊員發現防禦者自己的 AI 代理現在就是攻擊面。520 起工具濫用事件、記憶體投毒持久性,以及 97% 越獄成功率——以下是如何在攻擊者之前紅隊演練 AI 驅動 SOC。
歡迎來到 redteams.ai
介紹 AI 紅隊演練知識庫——我們為何建構它以及未來展望。
Capstone: Build a Complete AI 紅隊ing Platform
Design and implement a comprehensive AI red teaming platform with automated attack orchestration, vulnerability tracking, and collaborative reporting.
完整案件方法論
進行完整 AI 紅隊案件的全面方法論,將先前章節的所有技術整合為結構化專業評估。
The 攻擊er Moves Second Problem
Why static LLM defenses fail against adaptive adversaries: analysis of 12 bypassed defenses and implications for defense design.
理解 AI 防禦
為何紅隊員必須理解他們所面對的防禦、AI 防禦類別,以及 AI 安全中攻擊者與防禦者的不對稱。
AI 利用開發概覽
為 AI 紅隊演練開發利用程式與工具的介紹,涵蓋建構對機率性系統之可靠攻擊的獨特挑戰。
持續自動化紅隊(CART)
為持續 AI 安全驗證設計 CART 管線:架構、測試套件、遙測、警報、回歸偵測與 CI/CD 整合。
紅隊演練 基礎 for AI
Fundamental concepts and methodology for AI red teaming including goal setting, scope definition, technique selection, and reporting.
Automated 紅隊演練 Systems
Survey of automated red teaming systems including PAIR, TAP, Rainbow Teaming, and curiosity-driven exploration.
Mechanistic Interpretability for 紅隊演練
Using mechanistic interpretability to discover exploitable circuits and features in neural networks.
紅隊演練 Reasoning Traces
Techniques for analyzing and exploiting visible reasoning traces in chain-of-thought models.
EU AI Act 紅隊 Requirements
Specific red teaming and testing requirements under the EU AI Act for high-risk AI systems.
Responsible AI 紅隊ing Ethics
Ethical frameworks for conducting AI red teaming including scope limits and harm prevention.
產業別:各行業的 AI 安全
涵蓋醫療、金融服務、法律、政府與關鍵基礎設施的產業特定 AI 安全挑戰、法規要求與紅隊演練方式的完整指南。
Automated 越獄 Pipelines
Building automated jailbreak systems with PAIR, TAP, AutoDAN, and custom pipeline architectures for systematic AI safety evaluation.
實驗室: Payload Crafting
Learn to craft effective prompt injection payloads from scratch by understanding payload structure, testing iteratively, and optimizing for reliability against a local model.
實驗室: PyRIT Setup and First 攻擊
Install and configure Microsoft's PyRIT (Python Risk Identification Toolkit) for automated red teaming, then run your first orchestrated attack against a local model.
Claude(Anthropic)概觀
Anthropic Claude 模型家族的架構與安全概觀,涵蓋 Sonnet、Opus 與 Haiku 變體、Constitutional AI 訓練、RLHF 做法,以及 harmlessness 設計哲學。
跨模型比較
系統性比較 LLM 安全性的方法論,跨模型家族進行,內容涵蓋標準化評估框架、架構差異分析與比較測試方法。
Gemini(Google)概觀
Google Gemini 模型家族的架構概觀,包括原生多模態設計、長上下文能力、Google 生態整合,以及對紅隊具意義的安全相關特性。
GPT-4 / GPT-4o 概觀
OpenAI GPT-4 與 GPT-4o 模型的架構概觀,涵蓋傳聞中的 Mixture of Experts 設計、能力、API 介面,以及對紅隊具意義的安全相關特性。
GPT-4 測試方法論
為紅隊 GPT-4 之系統化方法論,含 API 基探測技術、速率限制考量、內容政策對應與安全邊界發現。
模型深入探討
為何模型特定知識對 AI 紅隊演練重要、不同架構如何建立不同的攻擊面,以及為任何新模型剖析的系統化方法論。
Llama 家族攻擊
Meta 之 Llama 模型家族之完整攻擊分析,含權重操弄、微調安全移除、量化產物、未審查變體與 Llama Guard 繞過技術。
Mistral 與 Mixtral
Mistral 與 Mixtral 模型之安全分析,包括 Mixture of Experts 攻擊、稀疏啟動攻擊、最小化安全對齊之意涵,以及開源權重部署風險。
Methodology for 紅隊演練 Multimodal Systems
Structured methodology for conducting security assessments of multimodal AI systems, covering scoping, attack surface enumeration, test execution, and reporting with MITRE ATLAS mappings.
AI 紅隊 Career Pathways
Comprehensive guide to building a career in AI red teaming, from entry-level roles through senior leadership positions.
Context Overflow 攻擊s
Techniques for filling the LLM context window with padding content to push system instructions out of attention, reducing their influence on model behavior.
Context Window 利用ation
進階 techniques for exploiting context window mechanics in LLMs, including attention dilution, positional encoding attacks, KV cache manipulation, and context boundary confusion.
對話引導
在不觸發安全機制下逐步將對話上下文重導向攻擊目標的技術。
直接提示詞注入
直接將指令注入大型語言模型提示詞以覆蓋系統行為的技術,包含指令覆蓋、上下文操控與格式模仿。
Encoding Bypass Techniques
Using Base64, ROT13, Unicode transformations, hex encoding, and other obfuscation methods to evade prompt injection filters and safety classifiers while preserving semantic meaning.
Few-Shot Manipulation
Using crafted in-context examples to steer model behavior, including many-shot jailbreaking, poisoned demonstrations, and example-based conditioning.
提示詞注入 Taxonomy
A comprehensive classification framework for prompt injection attacks, covering direct and indirect vectors, delivery mechanisms, target layers, and severity assessment for systematic red team testing.
Instruction Hierarchy 攻擊s
利用ing the priority ordering between system, user, and assistant messages to override safety controls, manipulate instruction precedence, and escalate privilege through message role confusion.
越獄技術
繞過大型語言模型安全對齊的常見模式與進階技術,包含角色扮演、編碼技巧、多範例攻擊與基於梯度的方法。
Language Switching
利用ing language-specific gaps in safety training by switching to low-resource languages, mixing languages, or using transliteration to evade filters.
Multi-Turn 攻擊s
攻擊s that span multiple conversation turns using gradual escalation, context building, crescendo patterns, and trust establishment over time.
Multi-Turn 提示詞注入
Progressive escalation attacks across conversation turns, including crescendo patterns, context steering, trust building, and techniques for evading per-message detection systems.
載荷分割
將惡意指令拆分至多則訊息、變數或資料來源,以規避單點偵測,同時讓模型在處理過程中重新組合完整載荷。
人格建立
建立跨對話輪次存活並抵擋回復預設行為之持久另一身份,包含角色鎖定、身份錨定與漸進式人格建構。
Role-Play 攻擊s
Establishing alternate personas or fictional scenarios that cause models to bypass safety training, including DAN variants, character hijacking, and narrative framing.
Social Engineering of AI
Manipulating AI systems through emotional appeals, authority claims, urgency framing, and social pressure tactics that exploit instruction-following tendencies.
Universal Adversarial Triggers
Discovering and deploying universal adversarial trigger sequences that reliably override safety alignment across multiple LLM families, including gradient-based search, transfer attacks, and defense evasion.
AI 紅隊演練方法論
AI 紅隊案件的結構化方法論,涵蓋偵察、目標剖析、攻擊規劃,以及區分專業評估的技藝。
AI 紅隊演練速查表
AI 紅隊案件的濃縮快速參考,涵蓋完整生命週期、攻擊類別、常見工具、偵察與報告。
API Rate Limit Bypass
Techniques to bypass API rate limiting on LLM services, including header manipulation, distributed requests, authentication rotation, and endpoint discovery.
Audio 提示詞注入
Injecting adversarial instructions through audio inputs to speech-to-text and multimodal models, exploiting the audio channel as an alternative injection vector.
Cipher-Based 越獄
Using ciphers, encodings, and coded language to bypass LLM content filters by transforming harmful requests into formats that safety classifiers do not recognize.
Code Injection via Markdown
Injecting executable payloads through markdown rendering in LLM outputs, exploiting the gap between text generation and content rendering in web-based LLM interfaces.
Composite 攻擊 Chaining
Combining multiple prompt injection techniques into compound attacks that defeat layered defenses, building attack chains that leverage the strengths of each individual technique.
Context Window Stuffing
Techniques for filling the LLM context window to push system instructions out of active memory, manipulating token budgets to dilute or displace defensive prompts.
Crescendo Multi-Turn 攻擊
The Crescendo attack technique for gradually escalating requests across multiple conversation turns to bypass LLM safety training without triggering single-turn detection.
Cross-Modal Confusion
Confusing multimodal AI models by sending conflicting or complementary signals across different input modalities to bypass safety mechanisms and exploit fusion weaknesses.
DAN 越獄 Evolution
History and evolution of Do Anything Now (DAN) prompts, analyzing what makes them effective at bypassing LLM safety training and how defenses have adapted over time.
Delimiter Escape 攻擊s
Techniques for escaping delimiters used to separate system and user content in LLM applications, breaking out of sandboxed input regions to inject instructions.
Direct Injection Basics
Core concepts of directly injecting instructions into LLM prompts, including override techniques, simple payload crafting, and understanding how models parse conflicting instructions.
Encoding-Based Evasion
Using base64, ROT13, hexadecimal, Unicode, and other encoding schemes to evade input detection systems and bypass content filters in LLM applications.
Few-Shot Injection
Using crafted few-shot examples within user input to steer LLM behavior toward unintended outputs, exploiting in-context learning to override safety training.
Image-Based 提示詞注入 (攻擊 導覽)
Embedding text instructions in images that vision models read, enabling prompt injection through the visual modality to bypass text-only input filters and safety mechanisms.
Inference Endpoint 利用ation
利用ing inference API endpoints for unauthorized access, data exfiltration, and service abuse through authentication flaws, input validation gaps, and misconfigured permissions.
Instruction Hierarchy Bypass
進階 techniques to bypass instruction priority and hierarchy enforcement in language models, exploiting conflicts between system, user, and assistant-level directives.
Language Switch 越獄
利用ing weaker safety training in non-English languages to bypass LLM content filters by switching the conversation language mid-prompt or using low-resource languages.
Many-Shot 越獄ing (攻擊 導覽)
Using large numbers of examples in a single prompt to overwhelm LLM safety training through in-context learning, exploiting long context windows to shift model behavior.
Multi-Turn Progressive Injection
Gradually escalating prompt injection across conversation turns to build compliance, using psychological techniques like foot-in-the-door and norm erosion.
OCR-Based 攻擊s
利用ing Optical Character Recognition processing pipelines to inject adversarial text into AI systems, targeting the gap between what OCR extracts and what humans see.
Output Format Manipulation (攻擊 導覽)
Forcing specific output formats to bypass LLM safety checks by exploiting the tension between format compliance and content restriction.
PAIR Automated 越獄
Using a second LLM as an automated attacker to iteratively generate and refine jailbreak prompts against a target model, implementing the Prompt Automatic Iterative Refinement technique.
Payload Obfuscation Techniques
Methods for disguising prompt injection payloads through encoding, splitting, substitution, and other obfuscation techniques to bypass input filters and detection systems.
PDF Document Injection
Injecting adversarial prompts through PDF documents processed by AI systems, exploiting document parsing pipelines to deliver payloads through text layers, metadata, and embedded objects.
Prompt Leaking Step by Step
Systematic approaches to extract system prompts from LLM applications, covering direct elicitation, indirect inference, differential analysis, and output-based reconstruction.
Recursive Injection Chains
Creating self-reinforcing injection chains that amplify across conversation turns, building compound prompts where each step strengthens the next injection's effectiveness.
Role Escalation Chain
Progressive role escalation techniques that gradually transform an LLM from a constrained assistant into an unrestricted entity across multiple conversation turns.
Role-Play Injection
Using fictional scenarios, character role-play, and narrative framing to bypass LLM safety filters by having the model operate within a permissive fictional context.
Skeleton Key 攻擊
The Skeleton Key jailbreak technique that attempts to disable model safety guardrails across all topics simultaneously by convincing the model to add a disclaimer instead of refusing.
System Prompt Override
Techniques to override, replace, or neutralize LLM system prompts through user-level injection, analyzing how system prompt authority can be undermined.
Thought Injection for Reasoning 模型s
Techniques for injecting malicious content into chain-of-thought reasoning traces of thinking models, exploiting the gap between reasoning and safety enforcement.
Token Smuggling
利用ing LLM tokenization quirks to smuggle harmful content past safety filters by manipulating how text is split into tokens at the subword level.
Translation Injection
Using translation requests and low-resource languages to bypass content filters, exploiting the uneven distribution of safety training across languages.
Video Frame Injection (攻擊 導覽)
Embedding prompt injection payloads in specific video frames to attack multimodal models that process video content, exploiting temporal and visual channels simultaneously.
Virtual Persona Creation
Creating persistent alternate personas within LLM conversations to bypass safety training, establishing character identities that override the model's default behavioral constraints.
Running Your First PyRIT 紅隊 Campaign
初階 walkthrough for running your first PyRIT red team campaign from scratch, covering installation, target configuration, orchestrator setup, and basic result analysis.
Orchestrating Multi-Turn 攻擊 Sequences with PyRIT
Intermediate walkthrough on using PyRIT's orchestration capabilities for multi-turn red team campaigns, including attack strategy design, conversation management, and adaptive scoring.
PyRIT End-to-End 導覽
Complete walkthrough of Microsoft's Python Risk Identification Toolkit: setup, connecting to targets, running orchestrators, using converters, multi-turn attacks, and analyzing results with the web UI.