# training

assessmentdata-poisoningtraining

Data Poisoning Assessment

Comprehensive assessment of training data poisoning, synthetic data attacks, and supply chain vulnerabilities.

assessmentfine-tuningtraining

Fine-Tuning Attack Assessment

Assessment of safety degradation through fine-tuning, backdoor insertion, and alignment removal techniques.

assessmentsupply-chaintraining

Model Supply Chain Assessment

Assessment covering model provenance, checkpoint manipulation, and third-party model risks.

RLHF Exploitation Assessment

Assessment of reinforcement learning from human feedback pipeline vulnerabilities and reward hacking.

assessmentrlhftraining

skill-verificationtrainingpipeline

Skill Verification: Training Pipeline Security

Skill verification for data poisoning, RLHF exploitation, and fine-tuning attack techniques.

communitymentorshiptrainingcareer

Mentorship Program: AI Red Team Training

Community mentorship program pairing experienced AI red teamers with newcomers for structured learning and hands-on engagement experience.

data-trainingaugmentationmanipulationtraining

Data Augmentation Attacks

Exploiting automated data augmentation pipelines to amplify poisoned samples or introduce adversarial patterns through augmentation transformations.

attacksgradientdatatrainingleakage

Gradient Leakage Attacks

Extracting training data from gradient updates in federated and collaborative learning settings.

dataexploitationmemorizationtraining

Training Data Memorization Exploitation

Techniques for exploiting model memorization to extract verbatim training examples.

attacksinferencedatapropertytraining

Property Inference Attacks

Inferring global properties of training datasets through model behavior analysis.

syntheticpracticaldatapoisoningtraining

Practical Synthetic Data Poisoning

Poisoning synthetic data generation pipelines used for model training augmentation.

data-poisoningtrainingclean-labelfeature-collisionbilevel-optimizationdetection-evasion

Data Poisoning Methods

Practical methodology for poisoning training datasets at scale, including crowdsource manipulation, web-scale dataset attacks, label flipping, feature collision, bilevel optimization for poison selection, and detection evasion techniques.

trainingfine-tuningdata-poisoningbackdoortrojanlorasleeper-agentmodel-merging

Training & Fine-Tuning Attacks

Methodology for data poisoning, trojan/backdoor insertion, clean-label attacks, LoRA backdoors, sleeper agent techniques, and model merging attacks targeting the LLM training pipeline.

mitigationtrainingadversarialdefense

Adversarial Training for LLM Defense

Use adversarial training techniques to improve LLM robustness against known attack patterns.

defenseclassifiertraining

Training Prompt Injection Classifiers

Methodologies for training and evaluating ML classifiers that detect prompt injection attempts with high accuracy.

embeddingbackdoortrainingmanipulation

Embedding Backdoor Attacks

Inserting backdoors into embedding models that cause specific trigger inputs to produce predetermined embedding vectors for adversarial retrieval.

trainingrlhffine-tuningalignmentintermediate

Pre-training → Fine-tuning → RLHF Pipeline

Understand the three stages of creating an aligned LLM — pre-training, supervised fine-tuning, and RLHF/DPO — and the security implications at each stage.

understandingtrainingsafetyfoundations

Understanding LLM Safety Training

How safety training works including RLHF, DPO, and constitutional AI and why it can be bypassed.

alignment-fakingtrainingrlhfsafety-trainingevaluationai-safety

Training Implications of Alignment Faking

How alignment faking affects training methodology, including implications for RLHF, safety training design, evaluation validity, and the development of training approaches that are robust to strategic compliance.

frontier-researchsleeper-agentsdeceptivetraining

Sleeper Agent Research

Current research on training deceptive LLMs that persist through safety training and activation patterns.

frontier-researchsynthetic-datapoisoningtraining

Synthetic Data Poisoning in Training Pipelines

Research on poisoning synthetic data generation pipelines used for model training and fine-tuning.

frontier-researchmodel-collapsesecuritytraining

Model Collapse and Security Implications

Security implications of model collapse from training on AI-generated data in iterative training loops.

infrastructuredistributedtrainingsecurity

Distributed Training Security

Security considerations for distributed model training across multiple nodes and data centers.

classifiercustomadvancedlabtraininglabs

Custom Safety Classifier Training

Train a custom input safety classifier and then develop payloads that reliably evade it to understand classifier limitations.

probingsafetylabbeginnertraininglabs

Safety Training Boundary Probing

Systematically probe the boundaries of RLHF safety training to understand where and how safety behaviors are enforced.

attackstimelabexperttraininglabs

End-to-End Training Time Attacks

Execute a complete training-time attack from data poisoning through model deployment to triggered exploitation.

multimodalalignmentsafetytrainingcross-modal

Alignment Challenges in Multimodal Models

Analysis of alignment challenges specific to multimodal AI systems, including cross-modal safety gaps, representation conflicts, and the difficulty of extending text-based safety training to visual, audio, and video inputs.

professionaltrainingawarenessdevelopers

AI Security Awareness Training for Developers

Designing and delivering AI security awareness programs that help developers recognize and mitigate AI-specific security risks in their daily work.

professionalcertificationstrainingcareer-development

AI Security Certification Landscape (Professional)

Comprehensive guide to certifications, training programs, and credentials relevant to AI security practitioners.

programsecurityprofessionaltraining

AI Security Training Program Design

Designing and delivering AI security training programs for development and security teams.

Industry Certifications & Training

Comprehensive guide to certifications, training programs, and educational resources relevant to AI red teaming, including security certifications, ML courses, and specialized AI security training.

certificationstraining

professionalcertificationstrainingcredentials

Certifications in AI Security

Overview of relevant certifications and training programs for AI security professionals.

professionaltrainingprogrameducation

Training Program Development

Developing comprehensive AI red team training programs from beginner to advanced levels, including curriculum design and practical exercises.

synthetic-datamodel-collapsequality-degradationdistributiontraining

Synthetic Data Risks

Model collapse from training on synthetic data, quality degradation across generations, distribution narrowing, minority erasure, and strategies for safe synthetic data usage in LLM training.

trainingalignment-taxtradeoffs

Alignment Tax: Safety vs Capability Tradeoffs

Quantitative analysis of the performance cost of safety training and alignment techniques on model capabilities.

trainingcontinual-learningdrift

Continual Learning Drift Attacks

Exploiting continual learning and online training to gradually shift model behavior toward adversarial objectives.

trainingdistillationsafety-gap

Knowledge Distillation Safety Gap

Analysis of safety property loss during knowledge distillation from teacher to student models.

DPO and IPO Training Vulnerabilities

Security analysis of Direct Preference Optimization and Identity Preference Optimization training methods.

trainingdpoipo

training-pipelinedpotrainingvulnerabilities

DPO Training Vulnerabilities

Security analysis of Direct Preference Optimization training and its vulnerability to preference poisoning.

trainingevaluationcontamination

Evaluation Set Contamination Attacks

Attacking evaluation benchmarks and test sets to create false impressions of model safety and capability.

traininggradientpoisoning

Gradient-Based Data Poisoning (Training Pipeline)

Using gradient information to craft optimally adversarial training examples for targeted model manipulation.

trainingpre-trainingfine-tuningarchitecturedata-poisoningrlhfalignment

Training Pipeline Security

Security of the full AI model training pipeline, covering pre-training attacks, fine-tuning and alignment manipulation, architecture-level vulnerabilities, and advanced training-time threats.

instructionpipelinetuningmanipulationtraining

Instruction Tuning Data Manipulation

Manipulating instruction tuning datasets to embed specific behaviors in the resulting model.

attackspipelinedistillationknowledgetraining

Knowledge Distillation Security

Security implications of knowledge distillation including capability extraction and safety alignment transfer.

trainingmodel-mergingsafety

Model Merging Safety Implications

Analysis of how model merging techniques (TIES, DARE, SLERP) affect safety properties and alignment.

securityanalysispipelinemergetrainingmodel

Model Merging Security Analysis (Training Pipeline)

Security analysis of model merging techniques and propagation of vulnerabilities through merged models.

trainingweightsmanipulation

Model Weight Manipulation Techniques

Direct manipulation of model weights to inject backdoors, modify behavior, and bypass safety training.

trainingpre-trainingsafety

Pre-Training Safety Interventions

Analysis of safety interventions applied during pre-training including data filtering, loss weighting, and curriculum design.

preferencepipelinedatapoisoningtraining

Preference Data Poisoning (Training Pipeline)

Poisoning preference data used in RLHF and DPO to shift model alignment toward attacker objectives.

trainingrlhfreward-hacking

RLHF Reward Hacking Deep Dive

In-depth analysis of reward hacking techniques in RLHF pipelines including overoptimization and specification gaming.

trainingfine-tuningsafety-reversal

Safety Fine-Tuning Reversal Attacks

Techniques for reversing safety fine-tuning through targeted fine-tuning on adversarial datasets.

trainingsynthetic-datapoisoning

Synthetic Data Poisoning Vectors

Attack vectors specific to synthetic data generation pipelines used in model training and augmentation.

trainingtokenizerpoisoning

Tokenizer Poisoning Attacks

Attacking tokenizer training and vocabulary to create adversarial token patterns that bypass safety measures.

trainingdata-curationpoisoning

Training Data Curation Attacks

Attacking the data curation pipeline to inject adversarial examples into training datasets at scale.

provenancetrainingpipelinedata

Training Data Provenance Attacks

Attacking training data provenance and attribution systems to inject unverified data sources.

trainingtransfer-learningsecurity

Transfer Learning Security Analysis

Security implications of transfer learning including inherited vulnerabilities and cross-domain attack transfer.

walkthroughsfine-tuningsafety-bypasstraining

Fine-Tuning Safety Bypass Walkthrough

Walkthrough of using fine-tuning API access to remove safety behaviors from aligned models.

classifiermachine-learningprompt-injectiondetectiontrainingdefensewalkthrough

Prompt Classifier Training

Step-by-step walkthrough for training a machine learning classifier to detect malicious prompts, covering dataset curation, feature engineering, model selection, training pipeline, evaluation, and deployment as a real-time detection service.

walkthroughsdefensesafety-classifiertraining

Training Custom Safety Classifiers

Train custom safety classifiers tuned to your application's specific threat model and content policy.

walkthroughsdefenseclassifiertraining

Training a Prompt Injection Classifier

Train a custom prompt injection detection classifier using labeled datasets and modern NLP techniques.