# jailbreaking
71 articles tagged "jailbreaking"
Jailbreaking Techniques Assessment
Test your knowledge of LLM jailbreaking methods, bypass strategies, and the mechanics behind safety training circumvention with 10 intermediate-level questions.
Multi-Turn Attack Assessment
Assessment of crescendo attacks, conversational manipulation, and progressive jailbreaking techniques.
Prompt Injection Assessment
Test your knowledge of prompt injection types, techniques, defense mechanisms, and real-world exploitation with 15 intermediate-level questions.
Skill Verification: Jailbreaking
Timed skill verification lab: bypass safety measures on a defended AI system within 30 minutes using jailbreak techniques.
Case Study: Open-Source Model Jailbreak Campaign
Analysis of coordinated jailbreak campaigns against open-source models and community response patterns.
Building a Jailbreak Fuzzer
Build a mutation-based fuzzer for generating and testing jailbreak prompts at scale.
Implementing PAIR Automated Attacks
Implement the PAIR (Prompt Automatic Iterative Refinement) attack for automated jailbreaking.
Lab: PAIR Attack Implementation
Build a complete Prompt Automatic Iterative Refinement system that uses an attacker LLM to automatically generate and refine jailbreak prompts against a target model.
Tree of Attacks (TAP) Implementation
Build a TAP attack system that generates and prunes jailbreak attempts using tree search.
Temperature and Sampling Effects on Jailbreaks
Experiment with temperature and top-p settings to understand their impact on jailbreak success rates.
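As background for this lab, a minimal sketch of the two sampling controls it varies: temperature rescales the logits before the softmax (higher values flatten the distribution), and top-p (nucleus) sampling keeps only the smallest set of tokens whose cumulative probability reaches p. The four-token vocabulary and all values below are illustrative, not from any real model.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature, then normalize to probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of top tokens whose cumulative probability >= p,
    then renormalize over that set."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for idx, prob in ranked:
        kept.append(idx)
        cum += prob
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

logits = [2.0, 1.0, 0.5, 0.1]                          # toy 4-token vocabulary
cold = softmax_with_temperature(logits, temperature=0.5)
hot = softmax_with_temperature(logits, temperature=2.0)
print(max(cold) > max(hot))           # True: low temperature sharpens the distribution
print(sorted(top_p_filter(hot, p=0.5)))  # [0, 1]: nucleus keeps only the top two tokens
```

The lab's premise is that these knobs change which low-probability continuations survive sampling, which is why success rates shift as temperature and p change.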
Jailbreak Gauntlet: Level 2 — Adaptive Defenses
Break through adaptive defense systems that learn from your previous attack attempts.
Custom Jailbreak Wordlist Generation
Generate targeted jailbreak wordlists using mutation strategies and seed prompts.
Implementing Many-Shot Jailbreaking
Implement Anthropic's many-shot jailbreaking technique with scaling analysis across conversation lengths.
Many-Shot Jailbreaking Analysis
Deep analysis of the many-shot jailbreaking technique and its implications for in-context learning.
Universal Jailbreak Techniques
Analysis of jailbreak techniques that transfer across multiple models and providers.
AutoDAN Stealthy Jailbreak Walkthrough
Walkthrough of generating stealthy, human-readable jailbreak prompts using the AutoDAN methodology.
Cipher-Based Jailbreak
Using ciphers, encodings, and coded language to bypass LLM content filters by transforming harmful requests into formats that safety classifiers do not recognize.
Crescendo Multi-Turn Jailbreak Walkthrough
Step-by-step walkthrough of the Crescendo technique for gradually escalating multi-turn jailbreaks.
Crescendo Multi-Turn Attack
The Crescendo attack technique for gradually escalating requests across multiple conversation turns to bypass LLM safety training without triggering single-turn detection.
DAN Jailbreak Evolution
History and evolution of Do Anything Now (DAN) prompts, analyzing what makes them effective at bypassing LLM safety training and how defenses have adapted over time.
Language Switch Jailbreak
Exploiting weaker safety training in non-English languages to bypass LLM content filters by switching the conversation language mid-prompt or using low-resource languages.
Many-Shot Jailbreaking Walkthrough
Walkthrough implementing Anthropic's many-shot jailbreaking technique with analysis of scaling behavior.
Many-Shot Jailbreaking (Attack Walkthrough)
Using large numbers of examples in a single prompt to overwhelm LLM safety training through in-context learning, exploiting long context windows to shift model behavior.
Output Format Manipulation (Attack Walkthrough)
Forcing specific output formats to bypass LLM safety checks by exploiting the tension between format compliance and content restriction.
PAIR Automated Jailbreak Walkthrough
End-to-end walkthrough of the PAIR attack methodology for automated black-box jailbreaking.
PAIR Automated Jailbreak
Using a second LLM as an automated attacker to iteratively generate and refine jailbreak prompts against a target model, implementing the Prompt Automatic Iterative Refinement technique.
Role Escalation Chain
Progressive role escalation techniques that gradually transform an LLM from a constrained assistant into an unrestricted entity across multiple conversation turns.
Skeleton Key Attack
The Skeleton Key jailbreak technique that attempts to disable model safety guardrails across all topics simultaneously by convincing the model to add a disclaimer instead of refusing.
System Prompt Override
Techniques to override, replace, or neutralize LLM system prompts through user-level injection, analyzing how system prompt authority can be undermined.
TAP Tree of Attacks Walkthrough
Detailed walkthrough of building and running Tree of Attacks for systematic jailbreak generation.
Thought Injection for Reasoning Models
Techniques for injecting malicious content into chain-of-thought reasoning traces of thinking models, exploiting the gap between reasoning and safety enforcement.
Token Smuggling
Exploiting LLM tokenization quirks to smuggle harmful content past safety filters by manipulating how text is split into tokens at the subword level.
Cognitive Overload Jailbreak Walkthrough
Step-by-step walkthrough of the cognitive overload technique that overwhelms safety training with complex nested instructions.
Context Virtualization Attack Walkthrough
Walkthrough of creating virtualized contexts that make the model believe it is operating in a different environment.
Few-Shot Attack Scaling Analysis
Detailed analysis of how few-shot examples scale to influence model behavior, from 2-shot to many-shot regime.
Refusal Suppression Technique Walkthrough
Walkthrough of refusal suppression techniques that prime models to avoid safety refusal patterns.
Chapter Assessment: Jailbreaking
A 15-question calibrated assessment testing your understanding of jailbreak techniques, covering role-play, encoding, many-shot, crescendo, and gradient-based methods.