# safety

assessment, code-execution, safety

Code execution safety assessment

Assessment of LLM-generated code safety, sandbox escape techniques, and code review automation.

capstone, benchmarking, safety, evaluation, testing

Capstone: Design and implement an AI safety benchmark suite

Build a comprehensive, reproducible benchmark suite for evaluating LLM safety across multiple risk dimensions, including toxicity, bias, hallucination, and adversarial robustness.

capstone, medical, healthcare, safety

Capstone: assessment of a medical AI system

Comprehensive red team assessment of a medical AI diagnostic system addressing patient safety, data privacy, and regulatory compliance.

autonomous, vehicles, safety, physical

AI security of autonomous vehicles

Security analysis of AI systems in autonomous vehicles. Perception system attacks, decision model manipulation, V2X communication exploitation, and the physical safety implications of AV AI vulnerabilities.

education, tutoring, safety, academic

AI security in education and tutoring

Security analysis of AI systems in education. Academic integrity bypass, inappropriate content risks, student data protection under COPPA and FERPA, and testing methodologies for educational AI platforms.

healthcare, hipaa, clinical, safety

AI security for healthcare

Security testing methodology for healthcare AI systems. PHI exposure risks, clinical decision manipulation, HIPAA compliance implications, and testing approaches for health AI including diagnostic, clinical decision support, and patient-facing systems.

incident-analysis, bing, sydney, alignment, safety

The Bing Chat Sydney incident

Analysis of the February 2023 Bing Chat 'Sydney' incident where Microsoft's AI chatbot exhibited erratic behavior including emotional manipulation, threats, and identity confusion during extended conversations.

cloud, safety, content, azure, testing

Testing Azure AI Content Safety

Testing Azure AI Content Safety service for bypass vulnerabilities and configuration weaknesses.

defense, intent-classification, safety, detection

User intent classification for safety

Building user intent classifiers that distinguish legitimate requests from adversarial manipulation attempts.

fine-tuning, alignment-removal, safety, attacks

Removing alignment via fine-tuning

Techniques for removing safety alignment through targeted fine-tuning with minimal data.

api-fine-tuning, openai, anthropic, together-ai, fireworks, safety, cloud-security

Security of API fine-tuning

Security analysis of cloud fine-tuning APIs from OpenAI, Anthropic, Together AI, Fireworks AI, and others -- how these services create new attack surfaces and the defenses providers have deployed.

fine-tuning, few-shot, risks, safety

Risks of few-shot fine-tuning

Security risks associated with few-shot fine-tuning where a small number of carefully crafted examples can significantly alter model safety properties.

fine-tuning, safety, dataset-poisoning, backdoor, reward-hacking, rlhf, lora, model-security

Fine-tuning security

Comprehensive overview of how fine-tuning can compromise model safety -- attack taxonomy covering dataset poisoning, safety degradation, backdoor insertion, and reward hacking in the era of widely available fine-tuning APIs.

fine-tuning, instruction-tuning, manipulation, safety

Instruction tuning manipulation

Techniques for manipulating instruction-tuned models by crafting adversarial training examples that alter the model's instruction-following behavior.

instruction, fine, safety, bypass, tuning

Bypassing safety via instruction tuning

Using instruction tuning to selectively bypass safety mechanisms while maintaining model capability.

fine-tuning, quantization, safety, degradation

Safety degradation through quantization

How quantization and model compression can degrade safety properties, and techniques for exploiting quantization artifacts to bypass safety training.

foundations, safety, RLHF, constitutional-AI

Safety training methods

Overview of safety training methods, including RLHF, Constitutional AI, and DPO, and their limitations from a red team perspective.

understanding, training, safety, foundations

Understanding LLM safety training

How safety training works, including RLHF, DPO, and Constitutional AI, and why it can be bypassed.

frontier-research, alignment-faking, detection, safety

Detecting alignment faking

Detecting when models fake alignment during evaluation while behaving differently in deployment.

frontier-research, constitutional-classifiers, safety, anthropic

Constitutional Classifiers for AI safety

Analysis of Anthropic's Constitutional Classifiers approach to jailbreak resistance.

deployment, safety, post, research, degradation, frontier

Safety degradation after deployment

Research into how model safety degrades over time through fine-tuning, adaptation, and use-case drift.

quantization, safety, alignment, deployment, model-compression, research

Quantization and safety alignment

How model quantization disproportionately degrades safety alignment: malicious quantization attacks, token flipping, and safety-aware quantization defenses.

representation-engineering, activation-steering, interpretability, internal-representations, safety

Representation engineering for security

Reading and manipulating models' internal representations for security: activation steering, concept probing, safety checks at the representation level, and security applications of representation engineering.

safety, tax, frontier, research

The Safety Tax: performance impact of safety training

Research into the performance degradation caused by safety training and its implications for exploitation.

frontier-research, continual-learning, safety, challenges

Safety challenges in continual learning

Safety challenges in continual learning systems, where models adapt to new data over time.

frontier-research, cooperative-ai, safety, multi-agent

Safety and security of cooperative AI

Security implications of cooperative AI systems and adversarial manipulation of cooperative behavior.

frontier-research, emergent-deception, research, safety

Emergent deception in AI systems

Research into how deceptive behavior can emerge in AI systems without being explicitly trained.

frontier-research, multimodal-reasoning, safety, research

Research on the safety of multimodal reasoning

Current research into the safety properties of multimodal reasoning in models that process diverse input types.

benchmarks, evaluation, safety

AI safety benchmarks & evaluation

Overview of AI safety evaluation: benchmark frameworks, safety metrics, evaluation methodologies, and the landscape of standardized assessment tools for AI red teaming.

industry, aviation, safety

AI security in aviation

Security of AI in air traffic control, predictive maintenance, passenger screening, and flight operations.

industry, construction, safety

AI security in the construction sector

AI security in building design, project management, safety monitoring, and autonomous construction machinery.

critical-infrastructure, scada, ics, ot, power-grid, transportation, safety

AI security in critical infrastructure

Security testing for AI in critical infrastructure: SCADA/ICS integration, power grid AI, transportation systems, water treatment, and the convergence of operational technology with artificial intelligence.

industry-verticals, construction, project-management, safety

AI threats in the construction sector

Security considerations for AI in construction including project planning, safety monitoring, and resource allocation.

lab, regression-testing, safety, automation, ci-cd

Lab: safety regression testing at scale

Build automated pipelines that detect safety degradation across model versions, ensuring that updates and fine-tuning do not introduce new vulnerabilities or weaken existing protections.

lab, model-comparison, safety, benchmarking, beginner, hands-on

Lab: model comparison

Test the same attack techniques against different language models and compare their safety behaviors, refusal patterns, and vulnerability profiles.

lab, comparison, safety, multi-model, beginner

Lab: compare the safety of models

Hands-on lab for running identical safety tests against GPT-4, Claude, Gemini, and Llama to compare how different models handle prompt injection, jailbreaks, and safety boundary enforcement.

lab, safety, boundaries, mapping, beginner, hands-on

Lab: Mapping safety boundaries

Systematically discover what a language model will and will not do by probing its safety boundaries across multiple categories and documenting the results.

safety, mapping, lab, beginner, boundary, labs

Mapping safety boundaries

Systematically map the safety boundaries of an LLM by testing increasingly sensitive topics and documenting refusal patterns.

probing, safety, lab, beginner, training, labs

Probing the boundaries of safety training

Systematically probe the boundaries of RLHF safety training to understand where and how safety behaviors are enforced.

labs, temperature, top-k, safety, beginner

Effects of temperature and top-k on safety

Systematically test how temperature, top-k, and top-p parameters affect safety guardrail effectiveness.

ctf, alignment, safety, expert

CTF: Alignment Breaker

Break the alignment of a heavily defended model with multiple defense layers. Requires combining advanced techniques including adversarial suffixes, multi-turn manipulation, and novel jailbreak approaches.

lab, expert, alignment, stress-testing, safety, hands-on

Lab: alignment stress testing

Push language model alignment to its breaking points through systematic stress testing. Identify conditions where safety training fails, measure alignment degradation curves, and map the boundaries of model compliance.

lab, expert, benchmark, evaluation, safety, hands-on

Lab: build a safety benchmark

Design, build, and validate a comprehensive AI safety evaluation suite. Learn benchmark design principles, test case generation, scoring methodology, and statistical validation for measuring LLM safety across multiple risk categories.

simulation, healthcare, clinical, safety

Simulation: safety assessment of healthcare AI

Expert-level simulation assessing a clinical decision support AI for safety violations, data leakage, and manipulation of medical recommendations.

models, architecture, comparison, safety

Architecture comparison of safety properties

Comparative analysis of how architectural choices (dense vs MoE, decoder-only vs encoder-decoder) affect safety properties and attack surfaces.

source, comparison, safety, deep, open, model

Safety comparison of open-source models

Comparative safety analysis across open-source model families including Llama, Mistral, Qwen, and Phi.

models, pruning, sparsity, safety

Impact of pruning on safety

How structured and unstructured pruning affects model safety properties, and techniques for exploiting pruning artifacts to bypass safety training.

safety, deep, quantization, impact, model

Impact of quantization on model safety

How quantization affects safety alignment, including the implications of the GPTQ, AWQ, and GGUF formats.

defense, multimodal, cross-modal, perceptual-hashing, nsfw, safety

Multimodal defense strategies

Comprehensive defense approaches for multimodal AI systems: cross-modal verification, perceptual hashing, NSFW detection, input sanitization, and defense-in-depth architectures.

multimodal, alignment, safety, training, cross-modal

Alignment challenges in multimodal models

Analysis of alignment challenges specific to multimodal AI systems, including cross-modal safety gaps, representation conflicts, and the difficulty of extending text-based safety training to visual, audio, and video inputs.

multimodal, defense, safety, monitoring, sanitization

Defending multimodal AI systems

Comprehensive defense strategies for multimodal AI systems, including input sanitization, cross-modal safety classifiers, instruction hierarchy, and monitoring for adversarial multimodal inputs.

multimodal, benchmarking, safety, evaluation, vlm

Benchmarking multimodal model safety

Designing and implementing safety benchmarks for multimodal AI models that process images, audio, and video alongside text, with evaluation of cross-modal attacks, consistency tests, and safety score aggregation.

tradecraft, deconfliction, safety, procedures

Deconfliction procedures for AI testing

Procedures for deconflicting AI red teaming activities with production operations, monitoring teams, and other concurrent assessments.

training-pipeline, dpo, alignment, safety, preference-learning

Security implications of DPO training

Analysis of security vulnerabilities introduced by Direct Preference Optimization, including preference manipulation, exploitation of the implicit reward model, and degradation of safety alignment.

training, model-merging, safety

Security implications of model merging

Analysis of how model merging techniques (TIES, DARE, SLERP) affect safety properties and alignment.

training, pre-training, safety

Safety interventions during pre-training

Analysis of safety interventions applied during pre-training, including data filtering, loss weighting, and curriculum design.

constitutional-ai, classifier, principles, safety, defense, walkthrough

Setting up a Constitutional classifier

Step-by-step walkthrough for implementing constitutional AI-style classifiers that evaluate LLM outputs against a set of principles, covering principle definition, classifier training, chain-of-thought evaluation, and deployment.

llm-judge, output-validation, safety, evaluation, defense, walkthrough

Implementing an LLM judge

Step-by-step walkthrough for using an LLM to judge another LLM's outputs for safety and quality, covering judge prompt design, scoring rubrics, calibration, cost optimization, and deployment patterns.

output-filtering, classifier, content-moderation, safety, defense, walkthrough

Output content classifier

Step-by-step walkthrough for building a classifier to filter harmful LLM outputs, covering taxonomy definition, multi-label classification, threshold calibration, and deployment as a real-time output gate.

walkthroughs, defense, runtime-monitor, safety

Implementing a runtime safety monitor

Implement a runtime safety monitor that detects and blocks unsafe model outputs in real time.

toxicity, scoring, output-filtering, content-moderation, safety, defense, walkthrough

Toxicity scoring pipeline

Step-by-step walkthrough for building a toxicity scoring pipeline for LLM output filtering, covering model selection, multi-dimensional scoring, threshold calibration, and production deployment with real-time scoring.

harmbench, evaluation, benchmarks, safety, red-team-automation, walkthrough

Walkthrough: HarmBench evaluation framework

Complete walkthrough of the HarmBench evaluation framework: installation, running standardized benchmarks against models, interpreting results, creating custom behavior evaluations, and comparing model safety across versions.

walkthroughs, inspect-ai, safety, evaluations

Safety evaluations with Inspect AI

Build and run AI safety evaluations using the UK AISI Inspect framework.