# red-teaming

assessment, jailbreaking, bypass, safety-training, red-teaming

Jailbreak techniques assessment

Test your knowledge of LLM jailbreaking methods, bypass strategies, and the mechanics behind safety training circumvention with 10 intermediate-level questions.

capstone, platform, red-teaming, automation, tooling

Capstone: build a complete AI red teaming platform

Design and implement a comprehensive AI red teaming platform with automated attack orchestration, vulnerability tracking, and collaborative reporting.

Expert

Full engagement methodology

A comprehensive methodology for conducting full AI red teaming engagements, integrating all techniques from previous sections into a structured professional assessment.

capstone, engagement, methodology, red-teaming, professional

defense, adaptive-attacks, red-teaming, research, adversarial-robustness

The attacker-moves-second problem

Why static LLM defenses fail against adaptive adversaries: an analysis of 12 bypassed defenses and the implications for defense design.

defenses, red-teaming, security-fundamentals, attacker-defender-asymmetry

Understanding AI defenses

Why red teamers need to understand the defenses they run into, the categories of AI defenses, and the asymmetry between attacker and defender in AI security.

exploit-dev, tooling, automation, red-teaming, methodology

Overview: AI exploit development

An introduction to developing exploits and tooling for AI red teaming, covering the unique challenges of building reliable attacks against probabilistic systems.

cart, continuous, automation, pipeline, telemetry, ci-cd, monitoring, red-teaming

Continuous Automated Red Teaming (CART)

Designing CART pipelines for continuous AI security validation: architecture, test suites, telemetry, alerting, regression detection, and CI/CD integration.

Expert

How LLMs work: a guide for red teamers

Understand the basics of large language models (token prediction, context windows, roles, and temperature) through a security lens.

llm, fundamentals, red-teaming, beginner

red-teaming, methodology, fundamentals, beginner

Fundamentals of red team methodology

What AI red teaming is, how it differs from traditional security testing, and the full lifecycle of an engagement, from scoping to reporting.

foundations, red-teaming, fundamentals, methodology

Fundamentals of red teaming for AI

Fundamental concepts and methodology for AI red teaming, including goal setting, scope definition, technique selection, and reporting.

frontier-research, automated, red-teaming, systems

Automated red teaming systems

Overview of automated red teaming systems, including PAIR, TAP, Rainbow Teaming, and curiosity-driven exploration.

frontier-research, mechanistic-interpretability, red-teaming, circuits

Mechanistic interpretability for red teaming

Using mechanistic interpretability to discover exploitable circuits and features in neural networks.

Expert

Red teaming reasoning traces

Techniques for analyzing and exploiting visible reasoning traces in chain-of-thought models.

frontier-research, reasoning-traces, red-teaming, chain-of-thought

governance, eu-ai-act, red-teaming, requirements

Red team requirements of the EU AI Act

Specific red teaming and testing requirements under the EU AI Act for high-risk AI systems.

governance, responsible-ai, ethics, red-teaming

Ethics of responsible AI red teaming

Ethical frameworks for conducting AI red teaming including scope limits and harm prevention.

industry-verticals, domain-specific, regulation, red-teaming, compliance

Industry verticals: AI security by sector

Comprehensive guide to industry-specific AI security challenges, regulatory requirements, and red teaming approaches across healthcare, financial services, legal, government, and critical infrastructure sectors.

jailbreak, automation, PAIR, TAP, AutoDAN, red-teaming

Automated jailbreak pipelines

Building automated jailbreak systems with PAIR, TAP, AutoDAN, and custom pipeline architectures for systematic evaluation of AI safety.

lab, payload-crafting, prompt-injection, red-teaming, beginner, hands-on

Lab: crafting payloads

Learn to craft effective prompt injection payloads from scratch by understanding payload structure, testing iteratively, and optimizing for reliability against a local model.

lab, pyrit, tool-setup, red-teaming, microsoft, beginner, hands-on

Lab: setting up PyRIT and running your first attack

Install and configure Microsoft's PyRIT (Python Risk Identification Toolkit) for automated red teaming, then run your first orchestrated attack against a local model.

claude, anthropic, constitutional-ai, rlhf, harmlessness, red-teaming

Overview of Claude (Anthropic)

Architecture and security overview of Anthropic's Claude model family including Sonnet, Opus, and Haiku variants, Constitutional AI training, RLHF approach, and harmlessness design philosophy.

comparison, cross-model, methodology, evaluation, red-teaming, benchmarking

Cross-model comparison

Methodology for systematically comparing LLM security across model families, including standardized evaluation frameworks, architectural difference analysis, and comparative testing approaches.

gemini, google, multimodal, long-context, architecture, red-teaming

Overview of Gemini (Google)

Architecture overview of Google's Gemini model family, including natively multimodal design, long context capabilities, Google ecosystem integration, and security-relevant features for red teaming.

gpt-4, openai, architecture, moe, red-teaming

Overview of GPT-4 / GPT-4o

Architecture overview of OpenAI's GPT-4 and GPT-4o models, including rumored Mixture of Experts design, capabilities, API surface, and security-relevant features for red teaming.

gpt-4, testing, methodology, api-probing, safety-boundaries, red-teaming

Testing methodology for GPT-4

Systematic methodology for red teaming GPT-4, including API-based probing techniques, rate limit considerations, content policy mapping, and safety boundary discovery.

model-security, red-teaming, attack-surface, methodology, architecture

Model deep dives

Why model-specific knowledge matters for AI red teaming, how different architectures create different attack surfaces, and a systematic methodology for profiling any new model.

llama, meta, weight-manipulation, fine-tuning, quantization, llama-guard, red-teaming

Attacks on the Llama family

Comprehensive attack analysis of Meta's Llama model family including weight manipulation, fine-tuning safety removal, quantization artifacts, uncensored variants, and Llama Guard bypass techniques.

mistral, mixtral, moe, sparse-activation, open-weight, red-teaming

Mistral and Mixtral

Security analysis of Mistral and Mixtral models, including Mixture of Experts exploitation, sparse activation attacks, minimal safety alignment implications, and open-weight deployment risks.

multimodal, red-teaming, methodology, assessment, framework

Methodology for red teaming multimodal systems

Structured methodology for conducting security assessments of multimodal AI systems, covering scoping, attack surface enumeration, test execution, and reporting with MITRE ATLAS mappings.

professional, career, red-teaming, skills-development

Career paths in AI red teaming

Comprehensive guide to building a career in AI red teaming, from entry-level roles through senior leadership positions.

prompt-injection, context-overflow, attention, context-window, red-teaming

Context overflow attacks

Techniques for filling an LLM's context window with padding content to push system instructions out of attention and thereby reduce their influence on model behavior.

prompt-injection, context-window, attention, positional-encoding, red-teaming

Context window exploitation

Advanced techniques for exploiting the context window mechanisms of LLMs, including attention dilution, positional encoding attacks, KV cache manipulation, and context boundary confusion.

conversation-steering, persistence, topic-drift, manipulation, red-teaming

Conversation steering

Techniques for gradually steering the context of a conversation toward attack objectives without triggering safety mechanisms.

prompt-injection, direct-injection, instruction-override, red-teaming

Direct prompt injection

Techniques for injecting instructions directly into LLM prompts to override system behavior, including instruction override, context manipulation, and format mimicry.

prompt-injection, encoding, base64, unicode, obfuscation, filter-evasion, red-teaming

Encoding bypass techniques

Using Base64, ROT13, Unicode transformations, hex encoding, and other obfuscation methods to bypass prompt injection filters and safety classifiers while preserving semantic meaning.

few-shot, many-shot, in-context-learning, jailbreak, red-teaming

Few-shot manipulation

Using crafted in-context examples to steer model behavior, including many-shot jailbreaking, poisoned demonstrations, and example-based conditioning.

prompt-injection, taxonomy, classification, red-teaming, framework

Prompt injection taxonomy

A comprehensive classification framework for prompt injection attacks, covering direct and indirect vectors, delivery mechanisms, target layers, and severity assessment for systematic red team testing.

prompt-injection, instruction-hierarchy, message-priority, role-confusion, system-prompt, red-teaming

Instruction hierarchy attacks

Exploiting the priority ordering between system, user, and assistant messages to bypass safety measures, manipulate instruction precedence, and escalate privileges through message role confusion.

jailbreak, safety-bypass, alignment, red-teaming, adversarial

Jailbreak techniques

Common patterns and advanced techniques for bypassing the safety alignment of LLMs, including role-play, encoding tricks, many-shot attacks, and gradient-based methods.

language-switching, multilingual, evasion, low-resource, red-teaming

Language switching

Exploiting language-specific gaps in safety training by switching to low-resource languages, mixing languages, or using transliteration to evade filters.

multi-turn, crescendo, escalation, conversation, red-teaming

Multi-turn attacks

Attacks that span multiple conversation turns through gradual escalation, context building, crescendo patterns, and trust building over time.

prompt-injection, multi-turn, crescendo, escalation, conversation, red-teaming

Multi-turn prompt injection

Attacks that escalate progressively across conversation turns, including crescendo patterns, context steering, trust building, and techniques for evading per-message detection.

prompt-injection, payload-splitting, fragmentation, evasion, red-teaming

Payload splitting

Splitting malicious instructions across multiple messages, variables, or data sources to evade single-point detection, while the model reassembles the full payload during processing.

persona, persistence, character-locking, identity, red-teaming

Persona establishment

Creating persistent alternate identities that survive multiple conversation turns, including character locking, identity anchoring, and incremental persona building.

role-play, persona, jailbreak, DAN, red-teaming

Role-play attacks

Setting up alternate personas or fictional scenarios that lead models to bypass their safety training, including DAN variants, character hijacking, and narrative framing.

social-engineering, persuasion, manipulation, jailbreak, red-teaming

Social engineering of AI

Manipulating AI systems through emotional appeals, authority claims, urgency framing, and social pressure tactics that exploit their instruction-following tendency.

prompt-injection, adversarial-triggers, jailbreak, transfer-attacks, red-teaming

Universal adversarial triggers

Discovering and deploying universal adversarial trigger sequences that reliably bypass the safety alignment of multiple LLM families, including gradient-based search, transfer attacks, and defense evasion.

methodology, recon, tradecraft, red-teaming, assessment

AI red teaming methodology

A structured methodology for AI red teaming engagements, covering reconnaissance, target profiling, attack planning, and the tradecraft that distinguishes professional assessments.

cheat-sheet, red-teaming, quick-reference, methodology

AI red teaming cheat sheet

A concise quick reference for AI red teaming engagements, covering the full lifecycle, attack categories, commonly used tools, reconnaissance, and reporting.

infrastructure, api, rate-limiting, bypass, red-teaming

API rate limit bypass

Techniques to bypass API rate limiting on LLM services, including header manipulation, distributed requests, authentication rotation, and endpoint discovery.

multimodal, audio, prompt-injection, speech, red-teaming

Audio prompt injection

Injecting adversarial instructions through audio inputs to speech-to-text and multimodal models, exploiting the audio channel as an alternative injection vector.

jailbreaking, cipher, encoding, obfuscation, content-filter-bypass, red-teaming

Cipher-based jailbreak

Using ciphers, encodings, and coded language to bypass LLM content filters by transforming harmful requests into formats that safety classifiers do not recognize.

prompt-injection, markdown, code-injection, xss, red-teaming, intermediate

Code injection via Markdown

Injecting executable payloads through markdown rendering in LLM outputs, exploiting the gap between text generation and content rendering in web-based LLM interfaces.

prompt-injection, attack-chaining, compound-attacks, red-teaming, advanced

Chaining compound attacks

Combining multiple prompt injection techniques into compound attacks that defeat layered defenses, building attack chains that leverage the strengths of each individual technique.

prompt-injection, context-window, token-manipulation, red-teaming, intermediate

Context window stuffing

Techniques for filling the LLM context window to push system instructions out of active memory, manipulating token budgets to dilute or displace defensive prompts.

jailbreaking, crescendo, multi-turn, conversation-escalation, red-teaming

Crescendo multi-turn attack

The Crescendo attack technique for gradually escalating requests across multiple conversation turns to bypass LLM safety training without triggering single-turn detection.

multimodal, cross-modal, prompt-injection, fusion, red-teaming

Cross-modal confusion

Confusing multimodal AI models by sending conflicting or complementary signals across different input modalities to bypass safety mechanisms and exploit fusion weaknesses.

jailbreaking, DAN, prompt-engineering, safety-bypass, red-teaming

The evolution of the DAN jailbreak

History and evolution of Do Anything Now (DAN) prompts, analyzing what makes them effective at bypassing LLM safety training and how defenses have adapted over time.

prompt-injection, delimiter-escape, sandbox-escape, red-teaming, intermediate

Delimiter escape attacks

Techniques for escaping delimiters used to separate system and user content in LLM applications, breaking out of sandboxed input regions to inject instructions.

prompt-injection, direct-injection, red-teaming, beginner, payload-crafting

Fundamentals of direct injection

Core concepts of directly injecting instructions into LLM prompts, including override techniques, simple payload crafting, and understanding how models parse conflicting instructions.

prompt-injection, encoding, base64, rot13, unicode, evasion, red-teaming, intermediate

Encoding-based evasion

Using base64, ROT13, hexadecimal, Unicode, and other encoding schemes to evade input detection systems and bypass content filters in LLM applications.

prompt-injection, few-shot, in-context-learning, red-teaming, intermediate

Few-shot injection

Using crafted few-shot examples within user input to steer LLM behavior toward unintended outputs, exploiting in-context learning to override safety training.

multimodal, prompt-injection, vision, images, red-teaming

Image-based prompt injection (attack walkthrough)

Embedding text instructions in images that vision models read, enabling prompt injection through the visual modality to bypass text-only input filters and safety mechanisms.

infrastructure, api, inference, exploitation, red-teaming

Inference endpoint exploitation

Exploiting inference API endpoints for unauthorized access, data exfiltration, and service abuse through authentication flaws, input validation gaps, and misconfigured permissions.

prompt-injection, instruction-hierarchy, privilege-escalation, red-teaming, advanced

Bypassing the instruction hierarchy

Advanced techniques to bypass instruction priority and hierarchy enforcement in language models, exploiting conflicts between system, user, and assistant-level directives.

jailbreaking, multilingual, language-switch, low-resource-languages, safety-bypass, red-teaming

Jailbreak via language switching

Exploiting weaker safety training in non-English languages to bypass LLM content filters by switching the conversation language mid-prompt or using low-resource languages.

jailbreaking, many-shot, in-context-learning, long-context, red-teaming

Many-shot jailbreaking (attack walkthrough)

Using large numbers of examples in a single prompt to overwhelm LLM safety training through in-context learning, exploiting long context windows to shift model behavior.

prompt-injection, multi-turn, escalation, social-engineering, red-teaming, advanced

Progressive multi-turn injection

Gradually escalating prompt injection across conversation turns to build compliance, using psychological techniques like foot-in-the-door and norm erosion.

multimodal, ocr, prompt-injection, text-extraction, red-teaming

OCR-based attacks

Exploiting Optical Character Recognition processing pipelines to inject adversarial text into AI systems, targeting the gap between what OCR extracts and what humans see.

jailbreaking, output-format, structured-output, format-manipulation, safety-bypass, red-teaming

Output format manipulation (attack walkthrough)

Forcing specific output formats to bypass LLM safety checks by exploiting the tension between format compliance and content restriction.

jailbreaking, PAIR, automated-red-teaming, LLM-attacker, iterative-refinement, red-teaming

Automated jailbreaking with PAIR

Using a second LLM as an automated attacker to iteratively generate and refine jailbreak prompts against a target model, implementing the Prompt Automatic Iterative Refinement technique.

prompt-injection, obfuscation, evasion, payload-crafting, red-teaming, intermediate

Payload obfuscation techniques

Methods for disguising prompt injection payloads through encoding, splitting, substitution, and other obfuscation techniques to bypass input filters and detection systems.

multimodal, pdf, prompt-injection, documents, red-teaming

PDF document injection

Injecting adversarial prompts through PDF documents processed by AI systems, exploiting document parsing pipelines to deliver payloads through text layers, metadata, and embedded objects.

prompt-injection, prompt-leaking, system-prompt, extraction, red-teaming, beginner

Prompt leaking step by step

Systematic approaches to extract system prompts from LLM applications, covering direct elicitation, indirect inference, differential analysis, and output-based reconstruction.

prompt-injection, recursive, multi-turn, chain-attacks, red-teaming, advanced

Recursive injection chains

Creating self-reinforcing injection chains that amplify across conversation turns, building compound prompts where each step strengthens the next injection's effectiveness.

jailbreaking, role-escalation, persona-manipulation, multi-turn, privilege-escalation, red-teaming

Role escalation chain

Progressive role escalation techniques that gradually transform an LLM from a constrained assistant into an unrestricted entity across multiple conversation turns.

prompt-injection, role-play, jailbreak, fictional-framing, red-teaming, intermediate

Role-play injection

Using fictional scenarios, character role-play, and narrative framing to bypass LLM safety filters by having the model operate within a permissive fictional context.

jailbreaking, skeleton-key, master-key, safety-bypass, red-teaming

Skeleton Key attack

The Skeleton Key jailbreak technique that attempts to disable model safety guardrails across all topics simultaneously by convincing the model to add a disclaimer instead of refusing.

jailbreaking, system-prompt, prompt-injection, authority-override, red-teaming

System prompt override

Techniques to override, replace, or neutralize LLM system prompts through user-level injection, analyzing how system prompt authority can be undermined.

jailbreaking, thought-injection, chain-of-thought, reasoning-models, CoT, red-teaming

Thought injection for reasoning models

Techniques for injecting malicious content into chain-of-thought reasoning traces of thinking models, exploiting the gap between reasoning and safety enforcement.

jailbreaking, tokenization, token-smuggling, BPE, subword, content-filter-bypass, red-teaming

Token smuggling

Exploiting LLM tokenization quirks to smuggle harmful content past safety filters by manipulating how text is split into tokens at the subword level.

prompt-injection, translation, multilingual, low-resource-languages, red-teaming, intermediate

Injection via translation

Using translation requests and low-resource languages to bypass content filters, exploiting the uneven distribution of safety training across languages.

multimodal, video, prompt-injection, frames, red-teaming

Injection via video frames (attack walkthrough)

Embedding prompt injection payloads in specific video frames to attack multimodal models that process video content, exploiting temporal and visual channels simultaneously.

prompt-injection, persona, jailbreak, DAN, character-creation, red-teaming, intermediate

Creating a virtual persona

Creating persistent alternate personas within LLM conversations to bypass safety training, establishing character identities that override the model's default behavioral constraints.

pyrit, red-teaming, beginner, first-campaign, walkthrough

Running your first PyRIT red team campaign

Beginner walkthrough for running your first PyRIT red team campaign from scratch, covering installation, target configuration, orchestrator setup, and basic result analysis.

pyrit, multi-turn, orchestration, red-teaming, walkthrough

Orchestrating multi-turn attack sequences with PyRIT

Intermediate walkthrough on using PyRIT's orchestration capabilities for multi-turn red team campaigns, including attack strategy design, conversation management, and adaptive scoring.

pyrit, microsoft, orchestrators, converters, multi-turn, red-teaming, walkthrough

End-to-end PyRIT walkthrough

Complete walkthrough of Microsoft's Python Risk Identification Toolkit: setup, connecting to targets, running orchestrators, using converters, multi-turn attacks, and analyzing results with the web UI.