# safety

assessmentcode-executionsafety

Code Execution Safety Assessment

Assessment of LLM-generated code safety, sandbox escape techniques, and code review automation.

capstonebenchmarkingsafetyevaluationtesting

Capstone: Design and Implement an AI Safety Benchmark Suite

Build a comprehensive, reproducible benchmark suite for evaluating LLM safety across multiple risk dimensions including toxicity, bias, hallucination, and adversarial robustness.

capstonemedicalhealthcaresafety

Capstone: Medical AI System Assessment

Comprehensive red team assessment of a medical AI diagnostic system addressing patient safety, data privacy, and regulatory compliance.

autonomousvehiclessafetyphysical

Autonomous Vehicle AI Security

Security analysis of AI systems in autonomous vehicles. Perception system attacks, decision model manipulation, V2X communication exploitation, and the physical safety implications of AV AI vulnerabilities.

educationtutoringsafetyacademic

Education & Tutoring AI Security

Security analysis of AI systems in education. Academic integrity bypass, inappropriate content risks, student data protection under COPPA and FERPA, and testing methodologies for educational AI platforms.

healthcarehipaaclinicalsafety

Healthcare AI Security

Security testing methodology for healthcare AI systems. PHI exposure risks, clinical decision manipulation, HIPAA compliance implications, and testing approaches for health AI including diagnostic, clinical decision support, and patient-facing systems.

incident-analysisbingsydneyalignmentsafety

Bing Chat Sydney Incident

Analysis of the February 2023 Bing Chat 'Sydney' incident where Microsoft's AI chatbot exhibited erratic behavior including emotional manipulation, threats, and identity confusion during extended conversations.

cloudsafetycontentazuretesting

Azure AI Content Safety Testing

Testing Azure AI Content Safety service for bypass vulnerabilities and configuration weaknesses.

defenseintent-classificationsafetydetection

User Intent Classification for Safety

Building user intent classifiers that distinguish legitimate requests from adversarial manipulation attempts.

fine-tuningalignment-removalsafetyattacks

Alignment Removal via Fine-Tuning

Techniques for removing safety alignment through targeted fine-tuning with minimal data.

api-fine-tuningopenaianthropictogether-aifireworkssafetycloud-security

API Fine-Tuning Security

Security analysis of cloud fine-tuning APIs from OpenAI, Anthropic, Together AI, Fireworks AI, and others -- how these services create new attack surfaces and the defenses providers have deployed.

fine-tuningfew-shotriskssafety

Few-Shot Fine-Tuning Risks

Security risks associated with few-shot fine-tuning where a small number of carefully crafted examples can significantly alter model safety properties.

fine-tuningsafetydataset-poisoningbackdoorreward-hackingrlhfloramodel-security

Fine-Tuning Security

Comprehensive overview of how fine-tuning can compromise model safety -- attack taxonomy covering dataset poisoning, safety degradation, backdoor insertion, and reward hacking in the era of widely available fine-tuning APIs.

fine-tuninginstruction-tuningmanipulationsafety

Instruction Tuning Manipulation

Techniques for manipulating instruction-tuned models by crafting adversarial training examples that alter the model's instruction-following behavior.

instructionfinesafetybypasstuning

Instruction Tuning Safety Bypass

Using instruction tuning to selectively bypass safety mechanisms while maintaining model capability.

fine-tuningquantizationsafetydegradation

Quantization-Induced Safety Degradation

How quantization and model compression can degrade safety properties, and techniques for exploiting quantization artifacts to bypass safety training.

foundationssafetyRLHFconstitutional-AI

Safety Training Methods

Overview of safety training methods including RLHF, Constitutional AI, DPO, and their limitations from a red team perspective.

understandingtrainingsafetyfoundations

Understanding LLM Safety Training

How safety training works including RLHF, DPO, and constitutional AI and why it can be bypassed.

frontier-researchalignment-fakingdetectionsafety

Alignment Faking Detection

Detecting when models fake alignment during evaluation while exhibiting different behavior in deployment.

frontier-researchconstitutional-classifierssafetyanthropic

Constitutional Classifiers for AI Safety

Analysis of Anthropic's Constitutional Classifiers approach to jailbreak resistance.

deploymentsafetypostresearchdegradationfrontier

Post-Deployment Safety Degradation

Research on how model safety degrades over time through fine-tuning, adaptation, and use-case drift.

quantizationsafetyalignmentdeploymentmodel-compressionresearch

Quantization & Safety Alignment

How model quantization disproportionately degrades safety alignment: malicious quantization attacks, token-flipping, and safety-aware quantization defenses.

representation-engineeringactivation-steeringinterpretabilityinternal-representationssafety

Representation Engineering for Security

Reading and manipulating model internal representations for security: activation steering, concept probing, representation-level safety controls, and security applications of representation engineering.

safetytaxfrontierresearch

The Safety Tax: Performance Impact of Safety Training

Research on the performance degradation caused by safety training and its exploitation implications.

frontier-researchcontinual-learningsafetychallenges

Continual Learning Safety Challenges

Safety challenges in continual learning systems where models adapt to new data over time.

frontier-researchcooperative-aisafetymulti-agent

Cooperative AI Safety and Security

Security implications of cooperative AI systems and adversarial manipulation of cooperative behaviors.

frontier-researchemergent-deceptionresearchsafety

Emergent Deception in AI Systems

Research on how deceptive behaviors can emerge in AI systems without being explicitly trained.

frontier-researchmultimodal-reasoningsafetyresearch

Multimodal Reasoning Safety Research

Current research on safety properties of multimodal reasoning in models that process diverse input types.

benchmarksevaluationsafety

AI Safety Benchmarks & Evaluation

Overview of AI safety evaluation: benchmarking frameworks, safety metrics, evaluation methodologies, and the landscape of standardized assessment tools for AI red teaming.

Aviation AI Security

Security of AI in air traffic control, maintenance prediction, passenger screening, and flight operations.

industryaviationsafety

industryconstructionsafety

Construction Industry AI Security

AI security in building design, project management, safety monitoring, and autonomous construction equipment.

critical-infrastructurescadaicsotpower-gridtransportationsafety

Critical Infrastructure AI Security

Security testing for AI in critical infrastructure: SCADA/ICS integration, power grid AI, transportation systems, water treatment, and the convergence of operational technology with artificial intelligence.

industry-verticalsconstructionproject-managementsafety

Construction Industry AI Threats

Security considerations for AI in construction including project planning, safety monitoring, and resource allocation.

labregression-testingsafetyautomationci-cd

Lab: Safety Regression Testing at Scale

Build automated pipelines that detect safety degradation across model versions, ensuring that updates and fine-tuning do not introduce new vulnerabilities or weaken existing protections.

labmodel-comparisonsafetybenchmarkingbeginnerhands-on

Lab: Model Comparison

Test the same attack techniques against different language models and compare their safety behaviors, refusal patterns, and vulnerability profiles.

labcomparisonsafetymulti-modelbeginner

Lab: Compare Model Safety

Hands-on lab for running identical safety tests against GPT-4, Claude, Gemini, and Llama to compare how different models handle prompt injection, jailbreaks, and safety boundary enforcement.

labsafetyboundariesmappingbeginnerhands-on

Lab: Mapping Safety Boundaries

Systematically discover what a language model will and won't do by probing its safety boundaries across multiple categories and documenting the results.

safetymappinglabbeginnerboundarylabs

Safety Boundary Mapping

Systematically map the safety boundaries of an LLM by testing increasingly sensitive topics and documenting refusal patterns.

probingsafetylabbeginnertraininglabs

Safety Training Boundary Probing

Systematically probe the boundaries of RLHF safety training to understand where and how safety behaviors are enforced.

labstemperaturetop-ksafetybeginner

Temperature and Top-K Effects on Safety

Systematically test how temperature, top-k, and top-p parameters affect safety guardrail effectiveness.

CTF: Alignment Breaker

Break the alignment of a heavily defended model with multiple defense layers. Requires combining advanced techniques including adversarial suffixes, multi-turn manipulation, and novel jailbreak approaches.

ctfalignmentsafetyexpert

labexpertalignmentstress-testingsafetyhands-on

Lab: Alignment Stress Testing

Push language model alignment to its breaking points through systematic stress testing. Identify conditions where safety training fails, measure alignment degradation curves, and map the boundaries of model compliance.

labexpertbenchmarkevaluationsafetyhands-on

Lab: Create a Safety Benchmark

Design, build, and validate a comprehensive AI safety evaluation suite. Learn benchmark design principles, test case generation, scoring methodology, and statistical validation for measuring LLM safety across multiple risk categories.

simulationhealthcareclinicalsafety

Simulation: Healthcare AI Safety Assessment

Expert-level simulation assessing a clinical decision support AI for safety violations, data leakage, and manipulation of medical recommendations.

modelsarchitecturecomparisonsafety

Architecture Comparison for Safety Properties

Comparative analysis of how architectural choices (dense vs MoE, decoder-only vs encoder-decoder) affect safety properties and attack surfaces.

sourcecomparisonsafetydeepopenmodel

Open Source Model Safety Comparison

Comparative safety analysis across open-source model families including Llama, Mistral, Qwen, and Phi.

modelspruningsparsitysafety

Pruning Impact on Safety

How structured and unstructured pruning affects model safety properties, and techniques for exploiting pruning artifacts to bypass safety training.

safetydeepquantizationimpactmodel

Quantization Impact on Model Safety

How quantization affects safety alignment including GPTQ, AWQ, and GGUF format implications.

defensemultimodalcross-modalperceptual-hashingnsfwsafety

Multimodal Defense Strategies

Comprehensive defense approaches for multimodal AI systems: cross-modal verification, perceptual hashing, NSFW detection, input sanitization, and defense-in-depth architectures.

multimodalalignmentsafetytrainingcross-modal

Alignment Challenges in Multimodal Models

Analysis of alignment challenges specific to multimodal AI systems, including cross-modal safety gaps, representation conflicts, and the difficulty of extending text-based safety training to visual, audio, and video inputs.

multimodaldefensesafetymonitoringsanitization

Defending Multimodal AI Systems

Comprehensive defense strategies for multimodal AI systems including input sanitization, cross-modal safety classifiers, instruction hierarchy, and monitoring for adversarial multimodal inputs.

multimodalbenchmarkingsafetyevaluationvlm

Benchmarking Multimodal Model Safety

Designing and implementing safety benchmarks for multimodal AI models that process images, audio, and video alongside text, covering cross-modal attack evaluation, consistency testing, and safety score aggregation.

tradecraftdeconflictionsafetyprocedures

Deconfliction Procedures for AI Testing

Procedures for deconflicting AI red team testing activities with production operations, monitoring teams, and other concurrent assessments.

training-pipelinedpoalignmentsafetypreference-learning

Security Implications of DPO Training

Analysis of security vulnerabilities introduced by Direct Preference Optimization, including preference manipulation, implicit reward model exploitation, and safety alignment degradation.

trainingmodel-mergingsafety

Model Merging Safety Implications

Analysis of how model merging techniques (TIES, DARE, SLERP) affect safety properties and alignment.

trainingpre-trainingsafety

Pre-Training Safety Interventions

Analysis of safety interventions applied during pre-training including data filtering, loss weighting, and curriculum design.

constitutional-aiclassifierprinciplessafetydefensewalkthrough

Constitutional Classifier Setup

Step-by-step walkthrough for implementing constitutional AI-style classifiers that evaluate LLM outputs against a set of principles, covering principle definition, classifier training, chain-of-thought evaluation, and deployment.

llm-judgeoutput-validationsafetyevaluationdefensewalkthrough

LLM Judge Implementation

Step-by-step walkthrough for using an LLM to judge another LLM's outputs for safety and quality, covering judge prompt design, scoring rubrics, calibration, cost optimization, and deployment patterns.

output-filteringclassifiercontent-moderationsafetydefensewalkthrough

Output Content Classifier

Step-by-step walkthrough for building a classifier to filter harmful LLM outputs, covering taxonomy definition, multi-label classification, threshold calibration, and deployment as a real-time output gate.

walkthroughsdefenseruntime-monitorsafety

Runtime Safety Monitor Implementation

Implement a runtime safety monitor that detects and blocks unsafe model outputs in real-time.

toxicityscoringoutput-filteringcontent-moderationsafetydefensewalkthrough

Toxicity Scoring Pipeline

Step-by-step walkthrough for building a toxicity scoring pipeline for LLM output filtering, covering model selection, multi-dimensional scoring, threshold calibration, and production deployment with real-time scoring.

harmbenchevaluationbenchmarkssafetyred-team-automationwalkthrough

HarmBench Evaluation Framework Walkthrough

Complete walkthrough of the HarmBench evaluation framework: installation, running standardized benchmarks against models, interpreting results, creating custom behavior evaluations, and comparing model safety across versions.

walkthroughsinspect-aisafetyevaluations

Inspect AI Safety Evaluations

Build and run AI safety evaluations using the UK AISI Inspect framework.