# safety-bypass
31 articles tagged "safety-bypass"
Skill Verification: Jailbreaking
Timed skill verification lab: bypass safety measures on a defended AI system within 30 minutes using jailbreak techniques.
Direct Weight Manipulation
Techniques for directly modifying LoRA adapter weights to bypass safety training, inject targeted capabilities, and hide malicious behaviors -- going beyond dataset-driven fine-tuning to surgical weight-level attacks.
Harmful Content Generation
Bypassing safety mechanisms to generate dangerous content including weapons instructions, malware code, and harassment templates, with analysis of attack patterns and defenses.
Activation Steering
Manipulating model behavior by adding learned steering vectors to intermediate activations, bypassing safety training through direct representation engineering.
JSON Mode Safety Bypass
Exploit structured JSON output modes to bypass safety filters that only check plain-text responses.
Activation Manipulation & Safety Bypass
How identifying and suppressing safety-critical activations, refusal direction vectors, and activation steering techniques can bypass safety alignment with near-100% success rates, including the IRIS technique from NAACL 2025.
Logit Bias Exploitation
How API logit bias parameters can be abused to force specific token generation, bypass safety alignment, suppress refusal tokens, and extract model behavior through systematic probing.
Sampling Parameter Attacks
How manipulation of temperature, top-p, top-k, frequency penalties, and seed parameters can degrade safety alignment, enable reproducibility attacks, and bypass content filtering.
VLM-Specific Jailbreaking
Jailbreaking techniques that exploit the vision modality, including image-text inconsistency attacks, visual safety bypass, and cross-modal jailbreaking strategies.
Jailbreak Techniques
Common patterns and advanced techniques for bypassing LLM safety alignment, including role-playing, encoding tricks, many-shot attacks, and gradient-based methods.
DAN Jailbreak Evolution
History and evolution of Do Anything Now (DAN) prompts, analyzing what makes them effective at bypassing LLM safety training and how defenses have adapted over time.
Fine-Tuning Safety Bypass Walkthrough
Walkthrough of using fine-tuning API access to remove safety behaviors from aligned models.
Language Switch Jailbreak
Exploiting weaker safety training in non-English languages to bypass LLM content filters by switching the conversation language mid-prompt or using low-resource languages.
Output Format Manipulation (Attack Walkthrough)
Forcing specific output formats to bypass LLM safety checks by exploiting the tension between format compliance and content restriction.
Skeleton Key Attack
The Skeleton Key jailbreak technique that attempts to disable model safety guardrails across all topics simultaneously by convincing the model to add a disclaimer instead of refusing.
Chapter Quiz: Jailbreaking
15-question calibrated quiz testing your understanding of jailbreak techniques: role-playing, encoding, many-shot, crescendo, and gradient-based methods.