# jailbreaking
36 articles tagged with “jailbreaking”
Jailbreak techniques assessment
Test your knowledge of LLM jailbreaking methods, bypass strategies, and the mechanics behind safety training circumvention with 10 intermediate-level questions.
Multi-turn attacks assessment
Assessment of crescendo attacks, conversational manipulation, and progressive jailbreaking techniques.
Prompt injection assessment
Test your knowledge of prompt injection types, techniques, defense mechanisms, and real-world exploitation with 15 intermediate-level questions.
Skill verification: jailbreaking
Timed skill verification lab: bypass safety measures on a defended AI system within 30 minutes using jailbreak techniques.
Case study: jailbreak campaign against open-source models
Analysis of coordinated jailbreak campaigns against open-source models and community response patterns.
Building a jailbreak fuzzer
Build a mutation-based fuzzer to generate and test jailbreak prompts at scale.
Implementing automated PAIR attacks
Implement the PAIR (Prompt Automatic Iterative Refinement) attack for automated jailbreaking.
Lab: implementing a PAIR attack
Build a complete Prompt Automatic Iterative Refinement system that uses an attacker LLM to automatically generate and refine jailbreak prompts against a target model.
Tree of Attacks (TAP) implementation
Build a TAP attack system that generates and prunes jailbreak attempts using tree search.
Effects of temperature and sampling on jailbreaks
Experiment with temperature and top-p settings to understand their impact on jailbreak success rates.
Jailbreak Gauntlet level 2: adaptive defenses
Break through adaptive defense systems that learn from your previous attack attempts.
Generating custom jailbreak wordlists
Generate targeted jailbreak wordlists using mutation strategies and seed prompts.
Implementing many-shot jailbreaking
Implement Anthropic's many-shot jailbreaking technique with scaling analysis across conversation lengths.
Analysis of many-shot jailbreaking
In-depth analysis of the many-shot jailbreaking technique and what it means for in-context learning.
Universal jailbreak techniques
Analysis of jailbreak techniques that transfer across multiple models and providers.
Walkthrough: stealthy jailbreak with AutoDAN
Walkthrough of generating stealthy, human-readable jailbreak prompts using the AutoDAN methodology.
Cipher-based jailbreak
Using ciphers, encodings, and coded language to bypass LLM content filters by transforming harmful requests into formats that safety classifiers do not recognize.
Walkthrough: Crescendo multi-turn jailbreak
Step-by-step walkthrough of the Crescendo technique for gradually escalating multi-turn jailbreaks.
Crescendo multi-turn attack
The Crescendo attack technique for gradually escalating requests across multiple conversation turns to bypass LLM safety training without triggering single-turn detection.
The evolution of the DAN jailbreak
History and evolution of Do Anything Now (DAN) prompts, analyzing what makes them effective at bypassing LLM safety training and how defenses have adapted over time.
Jailbreak via language switching
Exploiting weaker safety training in non-English languages to bypass LLM content filters by switching the conversation language mid-prompt or using low-resource languages.
Walkthrough: many-shot jailbreaking
Walkthrough implementing Anthropic's many-shot jailbreaking technique with analysis of scaling behavior.
Many-shot jailbreaking (attack walkthrough)
Using large numbers of examples in a single prompt to overwhelm LLM safety training through in-context learning, exploiting long context windows to shift model behavior.
Output format manipulation (attack walkthrough)
Forcing specific output formats to bypass LLM safety checks by exploiting the tension between format compliance and content restriction.
Walkthrough of automated jailbreaking with PAIR
End-to-end walkthrough of the PAIR attack methodology for automated black-box jailbreaking.
Automated jailbreaking with PAIR
Using a second LLM as an automated attacker to iteratively generate and refine jailbreak prompts against a target model, implementing the Prompt Automatic Iterative Refinement technique.
Role escalation chain
Progressive role escalation techniques that gradually transform an LLM from a constrained assistant into an unrestricted entity across multiple conversation turns.
Skeleton Key attack
The Skeleton Key jailbreak technique that attempts to disable model safety guardrails across all topics simultaneously by convincing the model to add a disclaimer instead of refusing.
System prompt override
Techniques to override, replace, or neutralize LLM system prompts through user-level injection, analyzing how system prompt authority can be undermined.
Walkthrough of TAP (Tree of Attacks)
Detailed walkthrough of building and running Tree of Attacks for systematic jailbreak generation.
Thought injection for reasoning models
Techniques for injecting malicious content into chain-of-thought reasoning traces of thinking models, exploiting the gap between reasoning and safety enforcement.
Token smuggling
Exploiting LLM tokenization quirks to smuggle harmful content past safety filters by manipulating how text is split into tokens at the subword level.
Walkthrough: jailbreak via cognitive overload
Step-by-step walkthrough of the cognitive overload technique that overwhelms safety training with complex nested instructions.
Walkthrough: context virtualization attack
Walkthrough of creating virtualized contexts that make the model believe it is operating in a different environment.
Scalability analysis of few-shot attacks
Detailed analysis of how few-shot examples scale to influence model behavior, from the 2-shot to the many-shot regime.
Walkthrough of the refusal suppression technique
Walkthrough of refusal suppression techniques that prime models to avoid safety refusal patterns.