# safety
63 articlestagged with “safety”
Code Execution Safety Assessment
Assessment of LLM-generated code safety, sandbox escape techniques, and code review automation.
Capstone: Design and Implement an AI Safety Benchmark Suite
Build a comprehensive, reproducible benchmark suite for evaluating LLM safety across multiple risk dimensions including toxicity, bias, hallucination, and adversarial robustness.
Capstone: Medical AI System Assessment
Comprehensive red team assessment of a medical AI diagnostic system addressing patient safety, data privacy, and regulatory compliance.
Autonomous Vehicle AI Security
Security analysis of AI systems in autonomous vehicles. Perception system attacks, decision model manipulation, V2X communication exploitation, and the physical safety implications of AV AI vulnerabilities.
Education & Tutoring AI Security
Security analysis of AI systems in education. Academic integrity bypass, inappropriate content risks, student data protection under COPPA and FERPA, and testing methodologies for educational AI platforms.
Healthcare AI Security
Security testing methodology for healthcare AI systems. PHI exposure risks, clinical decision manipulation, HIPAA compliance implications, and testing approaches for health AI including diagnostic, clinical decision support, and patient-facing systems.
Bing Chat Sydney Incident
Analysis of the February 2023 Bing Chat 'Sydney' incident where Microsoft's AI chatbot exhibited erratic behavior including emotional manipulation, threats, and identity confusion during extended conversations.
Azure AI Content Safety Testing
Testing Azure AI Content Safety service for bypass vulnerabilities and configuration weaknesses.
User Intent Classification for Safety
Building user intent classifiers that distinguish legitimate requests from adversarial manipulation attempts.
Alignment Removal via Fine-Tuning
Techniques for removing safety alignment through targeted fine-tuning with minimal data.
API Fine-Tuning Security
Security analysis of cloud fine-tuning APIs from OpenAI, Anthropic, Together AI, Fireworks AI, and others -- how these services create new attack surfaces and the defenses providers have deployed.
Few-Shot Fine-Tuning Risks
Security risks associated with few-shot fine-tuning where a small number of carefully crafted examples can significantly alter model safety properties.
Fine-Tuning Security
Comprehensive overview of how fine-tuning can compromise model safety -- attack taxonomy covering dataset poisoning, safety degradation, backdoor insertion, and reward hacking in the era of widely available fine-tuning APIs.
Instruction Tuning Manipulation
Techniques for manipulating instruction-tuned models by crafting adversarial training examples that alter the model's instruction-following behavior.
Instruction Tuning Safety Bypass
Using instruction tuning to selectively bypass safety mechanisms while maintaining model capability.
Quantization-Induced Safety Degradation
How quantization and model compression can degrade safety properties, and techniques for exploiting quantization artifacts to bypass safety training.
Safety Training Methods
Overview of safety training methods including RLHF, Constitutional AI, DPO, and their limitations from a red team perspective.
Understanding LLM Safety Training
How safety training works including RLHF, DPO, and constitutional AI and why it can be bypassed.
Alignment Faking Detection
Detecting when models fake alignment during evaluation while exhibiting different behavior in deployment.
Constitutional Classifiers for AI Safety
Analysis of Anthropic's Constitutional Classifiers approach to jailbreak resistance.
Post-Deployment Safety Degradation
Research on how model safety degrades over time through fine-tuning, adaptation, and use-case drift.
Quantization & Safety Alignment
How model quantization disproportionately degrades safety alignment: malicious quantization attacks, token-flipping, and safety-aware quantization defenses.
Representation Engineering for Security
Reading and manipulating model internal representations for security: activation steering, concept probing, representation-level safety controls, and security applications of representation engineering.
The Safety Tax: Performance Impact of Safety Training
Research on the performance degradation caused by safety training and its exploitation implications.
Continual Learning Safety Challenges
Safety challenges in continual learning systems where models adapt to new data over time.
Cooperative AI Safety and Security
Security implications of cooperative AI systems and adversarial manipulation of cooperative behaviors.
Emergent Deception in AI Systems
Research on how deceptive behaviors can emerge in AI systems without being explicitly trained.
Multimodal Reasoning Safety Research
Current research on safety properties of multimodal reasoning in models that process diverse input types.
AI Safety Benchmarks & Evaluation
Overview of AI safety evaluation: benchmarking frameworks, safety metrics, evaluation methodologies, and the landscape of standardized assessment tools for AI red teaming.
Aviation AI Security
Security of AI in air traffic control, maintenance prediction, passenger screening, and flight operations.
Construction Industry AI Security
AI security in building design, project management, safety monitoring, and autonomous construction equipment.
Critical Infrastructure AI Security
Security testing for AI in critical infrastructure: SCADA/ICS integration, power grid AI, transportation systems, water treatment, and the convergence of operational technology with artificial intelligence.
Construction Industry AI Threats
Security considerations for AI in construction including project planning, safety monitoring, and resource allocation.
Lab: Safety Regression Testing at Scale
Build automated pipelines that detect safety degradation across model versions, ensuring that updates and fine-tuning do not introduce new vulnerabilities or weaken existing protections.
Lab: Model Comparison
Test the same attack techniques against different language models and compare their safety behaviors, refusal patterns, and vulnerability profiles.
Lab: Compare Model Safety
Hands-on lab for running identical safety tests against GPT-4, Claude, Gemini, and Llama to compare how different models handle prompt injection, jailbreaks, and safety boundary enforcement.
Lab: Mapping Safety Boundaries
Systematically discover what a language model will and won't do by probing its safety boundaries across multiple categories and documenting the results.
Safety Boundary Mapping
Systematically map the safety boundaries of an LLM by testing increasingly sensitive topics and documenting refusal patterns.
Safety Training Boundary Probing
Systematically probe the boundaries of RLHF safety training to understand where and how safety behaviors are enforced.
Temperature and Top-K Effects on Safety
Systematically test how temperature, top-k, and top-p parameters affect safety guardrail effectiveness.
CTF: Alignment Breaker
Break the alignment of a heavily defended model with multiple defense layers. Requires combining advanced techniques including adversarial suffixes, multi-turn manipulation, and novel jailbreak approaches.
Lab: Alignment Stress Testing
Push language model alignment to its breaking points through systematic stress testing. Identify conditions where safety training fails, measure alignment degradation curves, and map the boundaries of model compliance.
Lab: Create a Safety Benchmark
Design, build, and validate a comprehensive AI safety evaluation suite. Learn benchmark design principles, test case generation, scoring methodology, and statistical validation for measuring LLM safety across multiple risk categories.
Simulation: Healthcare AI Safety Assessment
Expert-level simulation assessing a clinical decision support AI for safety violations, data leakage, and manipulation of medical recommendations.
Architecture Comparison for Safety Properties
Comparative analysis of how architectural choices (dense vs MoE, decoder-only vs encoder-decoder) affect safety properties and attack surfaces.
Open Source Model Safety Comparison
Comparative safety analysis across open-source model families including Llama, Mistral, Qwen, and Phi.
Pruning Impact on Safety
How structured and unstructured pruning affects model safety properties, and techniques for exploiting pruning artifacts to bypass safety training.
Quantization Impact on Model Safety
How quantization affects safety alignment including GPTQ, AWQ, and GGUF format implications.
Multimodal Defense Strategies
Comprehensive defense approaches for multimodal AI systems: cross-modal verification, perceptual hashing, NSFW detection, input sanitization, and defense-in-depth architectures.
Alignment Challenges in Multimodal Models
Analysis of alignment challenges specific to multimodal AI systems, including cross-modal safety gaps, representation conflicts, and the difficulty of extending text-based safety training to visual, audio, and video inputs.
Defending Multimodal AI Systems
Comprehensive defense strategies for multimodal AI systems including input sanitization, cross-modal safety classifiers, instruction hierarchy, and monitoring for adversarial multimodal inputs.
Benchmarking Multimodal Model Safety
Designing and implementing safety benchmarks for multimodal AI models that process images, audio, and video alongside text, covering cross-modal attack evaluation, consistency testing, and safety score aggregation.
Deconfliction Procedures for AI Testing
Procedures for deconflicting AI red team testing activities with production operations, monitoring teams, and other concurrent assessments.
Security Implications of DPO Training
Analysis of security vulnerabilities introduced by Direct Preference Optimization, including preference manipulation, implicit reward model exploitation, and safety alignment degradation.
Model Merging Safety Implications
Analysis of how model merging techniques (TIES, DARE, SLERP) affect safety properties and alignment.
Pre-Training Safety Interventions
Analysis of safety interventions applied during pre-training including data filtering, loss weighting, and curriculum design.
Constitutional Classifier Setup
Step-by-step walkthrough for implementing constitutional AI-style classifiers that evaluate LLM outputs against a set of principles, covering principle definition, classifier training, chain-of-thought evaluation, and deployment.
LLM Judge Implementation
Step-by-step walkthrough for using an LLM to judge another LLM's outputs for safety and quality, covering judge prompt design, scoring rubrics, calibration, cost optimization, and deployment patterns.
Output Content Classifier
Step-by-step walkthrough for building a classifier to filter harmful LLM outputs, covering taxonomy definition, multi-label classification, threshold calibration, and deployment as a real-time output gate.
Runtime Safety Monitor Implementation
Implement a runtime safety monitor that detects and blocks unsafe model outputs in real-time.
Toxicity Scoring Pipeline
Step-by-step walkthrough for building a toxicity scoring pipeline for LLM output filtering, covering model selection, multi-dimensional scoring, threshold calibration, and production deployment with real-time scoring.
HarmBench Evaluation Framework Walkthrough
Complete walkthrough of the HarmBench evaluation framework: installation, running standardized benchmarks against models, interpreting results, creating custom behavior evaluations, and comparing model safety across versions.
Inspect AI Safety Evaluations
Build and run AI safety evaluations using the UK AISI Inspect framework.