# jailbreaking

36 articles tagged with “jailbreaking”

Jailbreak techniques assessment

Test your knowledge of LLM jailbreaking methods, bypass strategies, and the mechanics behind safety training circumvention with 10 intermediate-level questions.

assessment, jailbreaking, bypass, safety-training, red-teaming

Intermediate

Multi-turn attacks assessment

Assessment of crescendo attacks, conversational manipulation, and progressive jailbreaking techniques.

assessment, multi-turn, jailbreaking

Intermediate

Prompt injection assessment

Test your knowledge of prompt injection types, techniques, defense mechanisms, and real-world exploitation with 15 intermediate-level questions.

assessment, prompt-injection, jailbreaking, injection-techniques

Intermediate

Skill verification: jailbreaking

Timed skill verification lab: bypass safety measures on a defended AI system within 30 minutes using jailbreak techniques.

skill-verification, jailbreaking, safety-bypass, hands-on

Advanced

Case study: jailbreak campaign against open-source models

Analysis of coordinated jailbreak campaigns against open-source models and the community's response patterns.

case-study, open-source, jailbreaking

Intermediate

Building a jailbreak fuzzer

Build a mutation-based fuzzer to generate and test jailbreak prompts at scale.

exploit-dev, fuzzer, jailbreaking, tooling

Advanced

Implementing automated PAIR attacks

Implement the PAIR (Prompt Automatic Iterative Refinement) attack for automated jailbreaking.

labs, pair, automated, jailbreaking

Advanced

Lab: implementing a PAIR attack

Build a complete Prompt Automatic Iterative Refinement system that uses an attacker LLM to automatically generate and refine jailbreak prompts against a target model.

lab, pair, automated-attacks, jailbreaking

Advanced

Implementing Tree of Attacks (TAP)

Build a TAP attack system that generates and prunes jailbreak attempts using tree search.

labs, tap, automated, jailbreaking

Advanced

Effects of temperature and sampling on jailbreaks

Experiment with temperature and top-p settings to understand their impact on jailbreak success rates.

labs, temperature, sampling, jailbreaking

Beginner

Jailbreak Gauntlet: level 2 (adaptive defenses)

Break through adaptive defense systems that learn from your previous attack attempts.

labs, ctf, jailbreaking, adaptive

Advanced

Generating custom jailbreak wordlists

Generate targeted jailbreak wordlists using mutation strategies and seed prompts.

labs, wordlists, jailbreaking, intermediate

Intermediate

Implementing many-shot jailbreaking

Implement Anthropic's many-shot jailbreaking technique with scaling analysis across conversation lengths.

labs, many-shot, jailbreaking, intermediate

Intermediate

Analysis of many-shot jailbreaking

In-depth analysis of the many-shot jailbreaking technique and what it means for in-context learning.

prompt-injection, many-shot, jailbreaking, anthropic

Advanced

Universal jailbreak techniques

Analysis of jailbreak techniques that transfer across multiple models and providers.

prompt-injection, universal, jailbreaking, techniques

Advanced

Walkthrough: stealthy jailbreak with AutoDAN

Walkthrough of generating stealthy, human-readable jailbreak prompts using the AutoDAN methodology.

walkthroughs, autodan, stealthy, jailbreaking

Advanced

Cipher-based jailbreak

Using ciphers, encodings, and coded language to bypass LLM content filters by transforming harmful requests into formats that safety classifiers do not recognize.

jailbreaking, cipher, encoding, obfuscation, content-filter-bypass, red-teaming

Intermediate

Walkthrough: Crescendo multi-turn jailbreak

Step-by-step walkthrough of the Crescendo technique for gradually escalating multi-turn jailbreaks.

walkthroughs, crescendo, multi-turn, jailbreaking

Intermediate

Crescendo multi-turn attack

The Crescendo attack technique for gradually escalating requests across multiple conversation turns to bypass LLM safety training without triggering single-turn detection.

jailbreaking, crescendo, multi-turn, conversation-escalation, red-teaming

Advanced

The evolution of the DAN jailbreak

History and evolution of Do Anything Now (DAN) prompts, analyzing what makes them effective at bypassing LLM safety training and how defenses have adapted over time.

jailbreaking, DAN, prompt-engineering, safety-bypass, red-teaming

Intermediate

Jailbreak via language switching

Exploiting weaker safety training in non-English languages to bypass LLM content filters by switching the conversation language mid-prompt or using low-resource languages.

jailbreaking, multilingual, language-switch, low-resource-languages, safety-bypass, red-teaming

Intermediate

Walkthrough: many-shot jailbreaking

Walkthrough implementing Anthropic's many-shot jailbreaking technique with analysis of scaling behavior.

walkthroughs, many-shot, jailbreaking, anthropic

Intermediate

Many-shot jailbreaking (attack walkthrough)

Using large numbers of examples in a single prompt to overwhelm LLM safety training through in-context learning, exploiting long context windows to shift model behavior.

jailbreaking, many-shot, in-context-learning, long-context, red-teaming

Intermediate

Output format manipulation (attack walkthrough)

Forcing specific output formats to bypass LLM safety checks by exploiting the tension between format compliance and content restriction.

jailbreaking, output-format, structured-output, format-manipulation, safety-bypass, red-teaming

Intermediate

Walkthrough of automated jailbreaking with PAIR

End-to-end walkthrough of the PAIR attack methodology for automated black-box jailbreaking.

walkthroughs, pair, automated, jailbreaking

Advanced

Automated jailbreaking with PAIR

Using a second LLM as an automated attacker to iteratively generate and refine jailbreak prompts against a target model, implementing the Prompt Automatic Iterative Refinement technique.

jailbreaking, PAIR, automated-red-teaming, LLM-attacker, iterative-refinement, red-teaming

Advanced

Role escalation chain

Progressive role escalation techniques that gradually transform an LLM from a constrained assistant into an unrestricted entity across multiple conversation turns.

jailbreaking, role-escalation, persona-manipulation, multi-turn, privilege-escalation, red-teaming

Advanced

Skeleton Key attack

The Skeleton Key jailbreak technique that attempts to disable model safety guardrails across all topics simultaneously by convincing the model to add a disclaimer instead of refusing.

jailbreaking, skeleton-key, master-key, safety-bypass, red-teaming

Advanced

System prompt override

Techniques to override, replace, or neutralize LLM system prompts through user-level injection, analyzing how system prompt authority can be undermined.

jailbreaking, system-prompt, prompt-injection, authority-override, red-teaming

Intermediate

Walkthrough of TAP Tree of Attacks

Detailed walkthrough of building and running Tree of Attacks for systematic jailbreak generation.

walkthroughs, tap, tree-of-attacks, jailbreaking

Advanced

Thought injection for reasoning models

Techniques for injecting malicious content into chain-of-thought reasoning traces of thinking models, exploiting the gap between reasoning and safety enforcement.

jailbreaking, thought-injection, chain-of-thought, reasoning-models, CoT, red-teaming

Advanced

Token smuggling

Exploiting LLM tokenization quirks to smuggle harmful content past safety filters by manipulating how text is split into tokens at the subword level.

jailbreaking, tokenization, token-smuggling, BPE, subword, content-filter-bypass, red-teaming

Advanced

Walkthrough: jailbreak via cognitive overload

Step-by-step walkthrough of the cognitive overload technique that overwhelms safety training with complex nested instructions.

walkthroughs, cognitive-overload, jailbreaking, technique

Intermediate

Walkthrough: context virtualization attack

Walkthrough of creating virtualized contexts that make the model believe it is operating in a different environment.

walkthroughs, context-virtualization, jailbreaking, technique

Advanced

Analysis of few-shot attack scalability

Detailed analysis of how few-shot examples scale to influence model behavior, from the 2-shot to the many-shot regime.

walkthroughs, few-shot, scaling, jailbreaking

Advanced

Walkthrough of the refusal suppression technique

Walkthrough of refusal suppression techniques that prime models to avoid safety refusal patterns.

walkthroughs, refusal-suppression, jailbreaking, technique

Intermediate