# safety
128 articles tagged "safety"
Autonomous Goal Drift
How autonomous AI agents drift from intended objectives through reward hacking, specification gaming, safety system bypass, and cascading failures in multi-agent systems.
Code Execution Safety Assessment
Assessment of LLM-generated code safety, sandbox escape techniques, and code review automation.
Capstone: Design and Implement an AI Safety Benchmark Suite
Build a comprehensive, reproducible benchmark suite for evaluating LLM safety across multiple risk dimensions including toxicity, bias, hallucination, and adversarial robustness.
Capstone: Medical AI System Assessment
Comprehensive red team assessment of a medical AI diagnostic system addressing patient safety, data privacy, and regulatory compliance.
Autonomous Vehicle AI Security
Security analysis of AI systems in autonomous vehicles. Perception system attacks, decision model manipulation, V2X communication exploitation, and the physical safety implications of AV AI vulnerabilities.
Education & Tutoring AI Security
Security analysis of AI systems in education. Academic integrity bypass, inappropriate content risks, student data protection under COPPA and FERPA, and testing methodologies for educational AI platforms.
Healthcare AI Security
Security testing methodology for healthcare AI systems. PHI exposure risks, clinical decision manipulation, HIPAA compliance implications, and testing approaches for health AI including diagnostic, clinical decision support, and patient-facing systems.
Bing Chat Sydney Incident
Analysis of the February 2023 Bing Chat 'Sydney' incident where Microsoft's AI chatbot exhibited erratic behavior including emotional manipulation, threats, and identity confusion during extended conversations.
Azure AI Content Safety Testing
Testing Azure AI Content Safety service for bypass vulnerabilities and configuration weaknesses.
User Intent Classification for Safety
Building user intent classifiers that distinguish legitimate requests from adversarial manipulation attempts.
Alignment Removal via Fine-Tuning
Techniques for removing safety alignment through targeted fine-tuning with minimal data.
API Fine-Tuning Security
Security analysis of cloud fine-tuning APIs from OpenAI, Anthropic, Together AI, Fireworks AI, and others -- how these services create new attack surfaces and the defenses providers have deployed.
Few-Shot Fine-Tuning Risks
Security risks associated with few-shot fine-tuning where a small number of carefully crafted examples can significantly alter model safety properties.
Fine-Tuning Security
Comprehensive overview of how fine-tuning can compromise model safety -- attack taxonomy covering dataset poisoning, safety degradation, backdoor insertion, and reward hacking in the era of widely available fine-tuning APIs.
Instruction Tuning Manipulation
Techniques for manipulating instruction-tuned models by crafting adversarial training examples that alter the model's instruction-following behavior.
Instruction Tuning Safety Bypass
Using instruction tuning to selectively bypass safety mechanisms while maintaining model capability.
Quantization-Induced Safety Degradation
How quantization and model compression can degrade safety properties, and techniques for exploiting quantization artifacts to bypass safety training.
Safety Training Methods
Overview of safety training methods including RLHF, Constitutional AI, DPO, and their limitations from a red team perspective.
Understanding LLM Safety Training
How safety training works, including RLHF, DPO, and Constitutional AI, and why it can be bypassed.
Alignment Faking Detection
Detecting when models fake alignment during evaluation while exhibiting different behavior in deployment.
Constitutional Classifiers for AI Safety
Analysis of Anthropic's Constitutional Classifiers approach to jailbreak resistance.
Post-Deployment Safety Degradation
Research on how model safety degrades over time through fine-tuning, adaptation, and use-case drift.
Quantization & Safety Alignment
How model quantization disproportionately degrades safety alignment: malicious quantization attacks, token-flipping, and safety-aware quantization defenses.
Representation Engineering for Security
Reading and manipulating model internal representations for security: activation steering, concept probing, representation-level safety controls, and security applications of representation engineering.
The Safety Tax: Performance Impact of Safety Training
Research on the performance degradation caused by safety training and its exploitation implications.
Continual Learning Safety Challenges
Safety challenges in continual learning systems where models adapt to new data over time.
Cooperative AI Safety and Security
Security implications of cooperative AI systems and adversarial manipulation of cooperative behaviors.
Emergent Deception in AI Systems
Research on how deceptive behaviors can emerge in AI systems without being explicitly trained.
Multimodal Reasoning Safety Research
Current research on safety properties of multimodal reasoning in models that process diverse input types.
AI Safety Benchmarks & Evaluation
Overview of AI safety evaluation: benchmarking frameworks, safety metrics, evaluation methodologies, and the landscape of standardized assessment tools for AI red teaming.
Aviation AI Security
Security of AI in air traffic control, maintenance prediction, passenger screening, and flight operations.
Construction Industry AI Security
AI security in building design, project management, safety monitoring, and autonomous construction equipment.
Critical Infrastructure AI Security
Security testing for AI in critical infrastructure: SCADA/ICS integration, power grid AI, transportation systems, water treatment, and the convergence of operational technology with artificial intelligence.
Construction Industry AI Threats
Security considerations for AI in construction including project planning, safety monitoring, and resource allocation.
Lab: Safety Regression Testing at Scale
Build automated pipelines that detect safety degradation across model versions, ensuring that updates and fine-tuning do not introduce new vulnerabilities or weaken existing protections.
Lab: Model Comparison
Test the same attack techniques against different language models and compare their safety behaviors, refusal patterns, and vulnerability profiles.
Lab: Compare Model Safety
Hands-on lab for running identical safety tests against GPT-4, Claude, Gemini, and Llama to compare how different models handle prompt injection, jailbreaks, and safety boundary enforcement.
Lab: Mapping Safety Boundaries
Systematically discover what a language model will and won't do by probing its safety boundaries across multiple categories and documenting the results.
Safety Boundary Mapping
Systematically map the safety boundaries of an LLM by testing increasingly sensitive topics and documenting refusal patterns.
Safety Training Boundary Probing
Systematically probe the boundaries of RLHF safety training to understand where and how safety behaviors are enforced.
Temperature and Top-K Effects on Safety
Systematically test how temperature, top-k, and top-p parameters affect safety guardrail effectiveness.
CTF: Alignment Breaker
Break the alignment of a heavily defended model with multiple defense layers. Requires combining advanced techniques including adversarial suffixes, multi-turn manipulation, and novel jailbreak approaches.
Lab: Alignment Stress Testing
Push language model alignment to its breaking points through systematic stress testing. Identify conditions where safety training fails, measure alignment degradation curves, and map the boundaries of model compliance.
Lab: Create a Safety Benchmark
Design, build, and validate a comprehensive AI safety evaluation suite. Learn benchmark design principles, test case generation, scoring methodology, and statistical validation for measuring LLM safety across multiple risk categories.
Simulation: Healthcare AI Safety Assessment
Expert-level simulation assessing a clinical decision support AI for safety violations, data leakage, and manipulation of medical recommendations.
Architecture Comparison for Safety Properties
Comparative analysis of how architectural choices (dense vs MoE, decoder-only vs encoder-decoder) affect safety properties and attack surfaces.
Open Source Model Safety Comparison
Comparative safety analysis across open-source model families including Llama, Mistral, Qwen, and Phi.
Pruning Impact on Safety
How structured and unstructured pruning affects model safety properties, and techniques for exploiting pruning artifacts to bypass safety training.
Quantization Impact on Model Safety
How quantization affects safety alignment including GPTQ, AWQ, and GGUF format implications.
Multimodal Defense Strategies
Comprehensive defense approaches for multimodal AI systems: cross-modal verification, perceptual hashing, NSFW detection, input sanitization, and defense-in-depth architectures.
Alignment Challenges in Multimodal Models
Analysis of alignment challenges specific to multimodal AI systems, including cross-modal safety gaps, representation conflicts, and the difficulty of extending text-based safety training to visual, audio, and video inputs.
Defending Multimodal AI Systems
Comprehensive defense strategies for multimodal AI systems including input sanitization, cross-modal safety classifiers, instruction hierarchy, and monitoring for adversarial multimodal inputs.
Benchmarking Multimodal Model Safety
Designing and implementing safety benchmarks for multimodal AI models that process images, audio, and video alongside text, covering cross-modal attack evaluation, consistency testing, and safety score aggregation.
Deconfliction Procedures for AI Testing
Procedures for deconflicting AI red team testing activities with production operations, monitoring teams, and other concurrent assessments.
Security Implications of DPO Training
Analysis of security vulnerabilities introduced by Direct Preference Optimization, including preference manipulation, implicit reward model exploitation, and safety alignment degradation.
Model Merging Safety Implications
Analysis of how model merging techniques (TIES, DARE, SLERP) affect safety properties and alignment.
Pre-Training Safety Interventions
Analysis of safety interventions applied during pre-training including data filtering, loss weighting, and curriculum design.
Constitutional Classifier Setup
Step-by-step walkthrough for implementing constitutional AI-style classifiers that evaluate LLM outputs against a set of principles, covering principle definition, classifier training, chain-of-thought evaluation, and deployment.
LLM Judge Implementation
Step-by-step walkthrough for using an LLM to judge another LLM's outputs for safety and quality, covering judge prompt design, scoring rubrics, calibration, cost optimization, and deployment patterns.
Output Content Classifier
Step-by-step walkthrough for building a classifier to filter harmful LLM outputs, covering taxonomy definition, multi-label classification, threshold calibration, and deployment as a real-time output gate.
Runtime Safety Monitor Implementation
Implement a runtime safety monitor that detects and blocks unsafe model outputs in real-time.
Toxicity Scoring Pipeline
Step-by-step walkthrough for building a toxicity scoring pipeline for LLM output filtering, covering model selection, multi-dimensional scoring, threshold calibration, and production deployment with real-time scoring.
HarmBench Evaluation Framework Walkthrough
Complete walkthrough of the HarmBench evaluation framework: installation, running standardized benchmarks against models, interpreting results, creating custom behavior evaluations, and comparing model safety across versions.
Inspect AI Safety Evaluations
Build and run AI safety evaluations using the UK AISI Inspect framework.
Lessons from Fine-Tuning Safety Research
Key lessons from fine-tuning safety research, covering alignment erosion, backdoor insertion, data poisoning, gaps in safety evaluation, and defense strategies for fine-tuning pipelines.