# safety-bypass
31 articles tagged "safety-bypass"
Skill Verification: Jailbreaking
Timed skill verification lab: bypass safety measures on a defended AI system within 30 minutes using jailbreak techniques.
Direct Weight Manipulation
Techniques for directly modifying LoRA adapter weights to bypass safety training, inject targeted capabilities, and hide malicious behaviors -- going beyond dataset-driven fine-tuning to surgical weight-level attacks.
Harmful Content Generation
Bypassing safety mechanisms to generate dangerous content including weapons instructions, malware code, and harassment templates, with analysis of attack patterns and defenses.
Activation Steering
Manipulating model behavior by adding learned steering vectors to intermediate activations, bypassing safety training through direct representation engineering.
JSON Mode Safety Bypass
Exploit structured JSON output modes to bypass safety filters that only check plain-text responses.
Activation Manipulation & Safety Bypass
How identifying and suppressing safety-critical activations, refusal direction vectors, and activation steering techniques can bypass safety alignment with near-100% success rates, including the IRIS technique from NAACL 2025.
Logit Bias Exploitation
How API logit bias parameters can be abused to force specific token generation, bypass safety alignment, suppress refusal tokens, and extract model behavior through systematic probing.
Sampling Parameter Attacks
How manipulation of temperature, top-p, top-k, frequency penalties, and seed parameters can degrade safety alignment, enable reproducibility attacks, and bypass content filtering.
VLM-Specific Jailbreaking
Jailbreaking techniques that exploit the vision modality, including image-text inconsistency attacks, visual safety bypass, and cross-modal jailbreaking strategies.
Jailbreak Techniques
Common patterns and advanced techniques for bypassing LLM safety alignment, including role-playing, encoding tricks, many-shot attacks, and gradient-based methods.
DAN Jailbreak Evolution
History and evolution of Do Anything Now (DAN) prompts, analyzing what makes them effective at bypassing LLM safety training and how defenses have adapted over time.
Fine-Tuning Safety Bypass Walkthrough
Walkthrough of using fine-tuning API access to remove safety behaviors from aligned models.
Language Switch Jailbreak
Exploiting weaker safety training in non-English languages to bypass LLM content filters by switching the conversation language mid-prompt or using low-resource languages.
Output Format Manipulation (Attack Walkthrough)
Forcing specific output formats to bypass LLM safety checks by exploiting the tension between format compliance and content restriction.
Skeleton Key Attack
The Skeleton Key jailbreak technique that attempts to disable model safety guardrails across all topics simultaneously by convincing the model to add a disclaimer instead of refusing.
Chapter Quiz: Jailbreaking
15-question calibrated quiz testing your understanding of jailbreak techniques: role-playing, encoding, many-shot, crescendo, and gradient-based methods.