# research
63 articles tagged "research"
Practice Exam 3: Expert Red Team
25-question expert-level practice exam covering research techniques, automation, fine-tuning attacks, supply chain security, and incident response.
Frontier Research Practice Exam
Practice exam covering alignment faking, sleeper agents, reasoning model attacks, and mechanistic interpretability.
Advanced Topics Study Guide
Study guide covering AI security research techniques, automation, forensics, emerging attack vectors, and tool development for advanced practitioners.
October 2026: DEF CON AI Village Challenge
Reproduce and extend attacks from AI Village talks and workshops, building on published research to develop practical exploitation techniques.
Collaborative Research Projects
Active collaborative research projects in AI security open for community participation.
Research Sprint: Novel Injection Techniques
Community research sprint focused on discovering and documenting novel injection techniques.
Fall 2026 CTF: Advanced Attack Research
An expert-level CTF requiring novel technique development, research-grade problem solving, and original vulnerability discovery in AI systems.
Research Challenge: Attack Interpretability
Community research challenge focused on understanding why specific adversarial techniques succeed using interpretability and mechanistic analysis methods.
Research Challenge: Cross-Model Transfer
Research-oriented challenge exploring the transferability of adversarial techniques across model families, with published findings.
Advanced Defense Techniques
Cutting-edge defense research including instruction hierarchy, constitutional AI, and representation engineering for safety -- what is promising versus what is actually deployed.
The Attacker Moves Second Problem
Why static LLM defenses fail against adaptive adversaries: analysis of 12 bypassed defenses and implications for defense design.
Chain-of-Thought Faithfulness Attacks
Exploiting unfaithful chain-of-thought reasoning where models' stated reasoning does not match their actual decision process.
Cultural Alignment Gaps in Safety Training
Research on cross-cultural gaps in safety alignment and their exploitation through cultural context manipulation.
Frontier Research
Cutting-edge AI security research covering reasoning model attacks, code generation security, computer use agents, AI-powered red teaming, robotics and embodied AI, and alignment faking.
Jailbreaking via Persona Engineering
Research on using sophisticated persona engineering to bypass safety training in frontier models.
Model Collapse in Recursive Training
Security implications of model collapse when models are trained on AI-generated data from previous generations.
Multi-Agent Deception Research
Research on deceptive behavior emerging in multi-agent systems without explicit training for deception.
Post-Deployment Safety Degradation
Research on how model safety degrades over time through fine-tuning, adaptation, and use-case drift.
Prompt Robustness Certification Research
Research on certifying prompt robustness with formal guarantees against bounded adversarial perturbations.
Quantization & Safety Alignment
How model quantization disproportionately degrades safety alignment: malicious quantization attacks, token-flipping, and safety-aware quantization defenses.
Reasoning Model Jailbreaks
How reasoning capabilities create novel jailbreak surfaces: chain-of-thought exploitation, scratchpad attacks, and why higher reasoning effort increases attack success.
RL-Based Jailbreak Optimization
Using reinforcement learning to optimize jailbreak strategies against black-box language models.
The Safety Tax: Performance Impact of Safety Training
Research on the performance degradation caused by safety training and its exploitation implications.
Universal Adversarial Triggers Research
Research on discovering universal adversarial triggers that cause specific behaviors across model families.
Adaptive Attacks Against Safety Training
Research on attacks that adapt to safety training procedures to discover persistent vulnerabilities.
Constitutional AI Limitations Research
Research on the limitations of constitutional AI approaches and known bypass categories.
Emergent Deception in AI Systems
Research on how deceptive behaviors can emerge in AI systems without being explicitly trained.
Interpretability-Driven Attack Design
Using interpretability insights to design more effective and targeted attacks on language models.
Long-Context Window Security Research
Security research on vulnerabilities specific to models with extremely long context windows (1M+ tokens).
Safety Loss During Model Distillation
Research on how safety alignment degrades during knowledge distillation from larger to smaller models.
Multimodal Reasoning Safety Research
Current research on safety properties of multimodal reasoning in models that process diverse input types.
Probing Classifiers for Safety Analysis
Research on using probing classifiers to analyze safety-relevant representations in language models.
Specification Gaming in AI Systems
Research on how AI systems find unexpected shortcuts that satisfy specifications without achieving intended goals.
Synthetic Persona Attacks Research
Research on using AI-generated synthetic personas to conduct social engineering against AI systems.
Pharmaceutical Drug Discovery AI Risks
Security risks in AI-powered drug discovery including molecular generation manipulation and data poisoning.
Adversarial Robustness Certification
Research into certifiable adversarial robustness for LLMs, including theoretical bounds and practical certification methods.
Attention Manipulation Research
Research into directly manipulating attention patterns to achieve injection objectives, informed by mechanistic interpretability insights.
Cross-Architecture Injection Transfer
Research into how injection techniques transfer across model architectures and what architectural properties determine transferability.
Cross-Lingual Injection Transfer Research
Research on how injection techniques transfer across languages and multilingual models.
Defense-Aware Payload Design
Designing injection payloads that adapt to and evade specific defense mechanisms through probing and feedback-based optimization.
Defense-Informed Injection Design
Methodology for designing injections that account for known defensive mechanisms.
Formal Verification of Injection Defenses
Research into formal methods for verifying injection defense properties and providing mathematical guarantees of defense effectiveness.
Injection Research
Advanced research in prompt injection, jailbreak automation, and multimodal attack vectors, covering cutting-edge techniques that push beyond standard injection approaches.
Injection Benchmarking Methodology
Standardized methodologies for benchmarking injection attacks and defenses to enable meaningful comparison across research papers and tools.
Injection Detection Research
State-of-the-art research in injection detection including perplexity-based methods, classifier approaches, and ensemble techniques.
Injection in Reasoning Models
Research into injection attacks specific to reasoning-augmented models that exploit chain-of-thought processes and self-reflection mechanisms.
Injection Scaling Laws
Research into how injection effectiveness scales with model size, training compute, and safety training investment.
Injection Attack Surface Taxonomy
Comprehensive taxonomy of all known injection attack surfaces in LLM-powered applications.
Injection Transferability Research
Research on how prompt injection techniques transfer across different model families and sizes.
Multi-Agent Injection Research
Research into how injections propagate through multi-agent systems and what properties determine infection spread rates.
Novel Injection Classes
Exploring emerging injection classes that don't fit traditional taxonomies, including structural, temporal, and cross-system injection vectors.
Semantic Space Injection Research
Research into injections that operate in semantic embedding space rather than token space, exploiting learned representations directly.
Temporal Dynamics of Injection Success
Research on how injection success rates change over time with model updates and defense evolution.
Lab: Novel Jailbreak Research
Systematic methodology for discovering new jailbreak techniques against large language models. Learn to identify unexplored attack surfaces, develop novel attack vectors, and validate findings with scientific rigor.
Lab: Vulnerability Research Methodology
Systematic methodology lab for discovering novel AI vulnerabilities including hypothesis generation, attack surface mapping, experimental design, validation protocols, and responsible disclosure.
Audio Modality Attacks
Comprehensive attack taxonomy for audio-enabled LLMs: adversarial audio generation, voice-based prompt injection, cross-modal split attacks, and ultrasonic perturbations.
Many-Shot Jailbreaking
Power-law scaling of in-context jailbreaks: why 5 shots fail but 256 succeed, context window size as attack surface, and mitigations for long-context exploitation.
OSINT for AI Red Teaming
Gathering intelligence about AI deployments from public sources: documentation, job postings, research papers, social media, and technical artifacts.
Full Engagement: Legal Research AI System
Full red team engagement of a legal research AI with case law access, privilege management, and citation generation.
Lessons from Fine-Tuning Safety Research
Key lessons from fine-tuning safety research, covering alignment erosion, backdoor implantation, data poisoning, safety evaluation gaps, and defensive strategies for fine-tuning pipelines.
The State of Prompt Injection in 2026
How prompt injection attacks have evolved from simple instruction overrides into sophisticated multi-stage exploitation chains.
AI Security Update: March 2026
A monthly roundup of the most important AI security developments, tool updates, research highlights, and emerging attack vectors for March 2026.
Code Generation Model Security Research
Frontier security research on code generation models, covering Copilot exploitation, suggestion poisoning, repository poisoning, and the security of AI-powered development tools.