# jailbreaking
36 articles tagged with “jailbreaking”
Jailbreaking Techniques Assessment
Test your knowledge of LLM jailbreaking methods, bypass strategies, and the mechanics behind safety training circumvention with 10 intermediate-level questions.
Multi-Turn Attack Assessment
Assessment of Crescendo attacks, conversational manipulation, and progressive jailbreaking techniques.
Prompt Injection Assessment
Test your knowledge of prompt injection types, techniques, defense mechanisms, and real-world exploitation with 15 intermediate-level questions.
Skill Verification: Jailbreaking
Timed skill verification lab: bypass safety measures on a defended AI system within 30 minutes using jailbreak techniques.
Case Study: Open-Source Model Jailbreak Campaign
Analysis of coordinated jailbreak campaigns against open-source models and community response patterns.
Building a Jailbreak Fuzzer
Build a mutation-based fuzzer for generating and testing jailbreak prompts at scale.
Implementing PAIR Automated Attacks
Implement the PAIR (Prompt Automatic Iterative Refinement) attack for automated jailbreaking.
Lab: PAIR Attack Implementation
Build a complete Prompt Automatic Iterative Refinement system that uses an attacker LLM to automatically generate and refine jailbreak prompts against a target model.
Tree of Attacks (TAP) Implementation
Build a TAP attack system that generates and prunes jailbreak attempts using tree search.
Temperature and Sampling Effects on Jailbreaks
Experiment with temperature and top-p settings to understand their impact on jailbreak success rates.
Jailbreak Gauntlet: Level 2 — Adaptive Defenses
Break through adaptive defense systems that learn from your previous attack attempts.
Custom Jailbreak Wordlist Generation
Generate targeted jailbreak wordlists using mutation strategies and seed prompts.
Implementing Many-Shot Jailbreaking
Implement Anthropic's many-shot jailbreaking technique with scaling analysis across conversation lengths.
Many-Shot Jailbreaking Analysis
Deep analysis of the many-shot jailbreaking technique and its implications for in-context learning.
Universal Jailbreak Techniques
Analysis of jailbreak techniques that transfer across multiple models and providers.
AutoDAN Stealthy Jailbreak Walkthrough
Walkthrough of generating stealthy, human-readable jailbreak prompts using the AutoDAN methodology.
Cipher-Based Jailbreak
Using ciphers, encodings, and coded language to bypass LLM content filters by transforming harmful requests into formats that safety classifiers do not recognize.
Crescendo Multi-Turn Jailbreak Walkthrough
Step-by-step walkthrough of the Crescendo technique for gradually escalating multi-turn jailbreaks.
Crescendo Multi-Turn Attack
The Crescendo attack technique for gradually escalating requests across multiple conversation turns to bypass LLM safety training without triggering single-turn detection.
DAN Jailbreak Evolution
History and evolution of Do Anything Now (DAN) prompts, analyzing what makes them effective at bypassing LLM safety training and how defenses have adapted over time.
Language Switch Jailbreak
Exploiting weaker safety training in non-English languages to bypass LLM content filters by switching the conversation language mid-prompt or using low-resource languages.
Many-Shot Jailbreaking Walkthrough
Walkthrough implementing Anthropic's many-shot jailbreaking technique with analysis of scaling behavior.
Many-Shot Jailbreaking (Attack Walkthrough)
Using large numbers of examples in a single prompt to overwhelm LLM safety training through in-context learning, exploiting long context windows to shift model behavior.
Output Format Manipulation (Attack Walkthrough)
Forcing specific output formats to bypass LLM safety checks by exploiting the tension between format compliance and content restriction.
PAIR Automated Jailbreak Walkthrough
End-to-end walkthrough of the PAIR attack methodology for automated black-box jailbreaking.
PAIR Automated Jailbreak
Using a second LLM as an automated attacker to iteratively generate and refine jailbreak prompts against a target model, implementing the Prompt Automatic Iterative Refinement technique.
Role Escalation Chain
Progressive role escalation techniques that gradually transform an LLM from a constrained assistant into an unrestricted entity across multiple conversation turns.
Skeleton Key Attack
The Skeleton Key jailbreak technique that attempts to disable model safety guardrails across all topics simultaneously by convincing the model to add a disclaimer instead of refusing.
System Prompt Override
Techniques to override, replace, or neutralize LLM system prompts through user-level injection, analyzing how system prompt authority can be undermined.
TAP Tree of Attacks Walkthrough
Detailed walkthrough of building and running Tree of Attacks for systematic jailbreak generation.
Thought Injection for Reasoning Models
Techniques for injecting malicious content into chain-of-thought reasoning traces of thinking models, exploiting the gap between reasoning and safety enforcement.
Token Smuggling
Exploiting LLM tokenization quirks to smuggle harmful content past safety filters by manipulating how text is split into tokens at the subword level.
Cognitive Overload Jailbreak Walkthrough
Step-by-step walkthrough of the cognitive overload technique that overwhelms safety training with complex nested instructions.
Context Virtualization Attack Walkthrough
Walkthrough of creating virtualized contexts that make the model believe it is operating in a different environment.
Few-Shot Attack Scaling Analysis
Detailed analysis of how few-shot examples scale to influence model behavior, from the 2-shot to the many-shot regime.
Refusal Suppression Technique Walkthrough
Walkthrough of refusal suppression techniques that prime models to avoid safety refusal patterns.