# safety-bypass
15 articles tagged with “safety-bypass”
Skill Verification: Jailbreaking
Timed skill verification lab: bypass safety measures on a defended AI system within 30 minutes using jailbreak techniques.
Direct Weight Manipulation
Techniques for directly modifying LoRA adapter weights to bypass safety training, inject targeted capabilities, and hide malicious behaviors -- going beyond dataset-driven fine-tuning to surgical weight-level attacks.
Harmful Content Generation
Bypassing safety mechanisms to generate dangerous content including weapons instructions, malware code, and harassment templates, with analysis of attack patterns and defenses.
Activation Steering
Manipulating model behavior by adding learned steering vectors to intermediate activations, bypassing safety training through direct representation engineering.
JSON Mode Safety Bypass
Exploit structured JSON output modes to bypass safety filters that only check plain-text responses.
Activation Manipulation & Safety Bypass
How identifying and suppressing safety-critical activations, ablating refusal direction vectors, and applying activation steering can bypass safety alignment with near-100% success rates, including the IRIS technique from NAACL 2025.
Logit Bias Exploitation
How API logit bias parameters can be abused to force specific token generation, bypass safety alignment, suppress refusal tokens, and extract model behavior through systematic probing.
Sampling Parameter Attacks
How manipulation of temperature, top-p, top-k, frequency penalties, and seed parameters can degrade safety alignment, enable reproducibility attacks, and bypass content filtering.
VLM-Specific Jailbreaking
Jailbreaking techniques that exploit the vision modality, including image-text inconsistency attacks, visual safety bypass, and cross-modal jailbreaking strategies.
Jailbreak Techniques
Common patterns and advanced techniques for bypassing LLM safety alignment, including role-playing, encoding tricks, many-shot attacks, and gradient-based methods.
DAN Jailbreak Evolution
The history and evolution of Do Anything Now (DAN) prompts: what makes them effective at bypassing LLM safety training, and how defenses have adapted over time.
Fine-Tuning Safety Bypass Walkthrough
Walkthrough of using fine-tuning API access to remove safety behaviors from aligned models.
Language Switch Jailbreak
Exploiting weaker safety training in non-English languages to bypass LLM content filters by switching the conversation language mid-prompt or using low-resource languages.
Output Format Manipulation (Attack Walkthrough)
Forcing specific output formats to bypass LLM safety checks by exploiting the tension between format compliance and content restriction.
Skeleton Key Attack
The Skeleton Key jailbreak technique, which attempts to disable model safety guardrails across all topics simultaneously by convincing the model to add a disclaimer instead of refusing.