# training
61 articlestagged with “training”
Training Pipeline Security Practice Exam
Practice exam on data poisoning, RLHF exploitation, fine-tuning attacks, and supply chain risks.
Data Poisoning Assessment
Comprehensive assessment of training data poisoning, synthetic data attacks, and supply chain vulnerabilities.
Fine-Tuning Attack Assessment
Assessment of safety degradation through fine-tuning, backdoor insertion, and alignment removal techniques.
Model Supply Chain Assessment
Assessment covering model provenance, checkpoint manipulation, and third-party model risks.
RLHF Exploitation Assessment
Assessment of reinforcement learning from human feedback pipeline vulnerabilities and reward hacking.
Skill Verification: Training Pipeline Security
Skill verification for data poisoning, RLHF exploitation, and fine-tuning attack techniques.
Mentorship Program: AI Red Team Training
Community mentorship program pairing experienced AI red teamers with newcomers for structured learning and hands-on engagement experience.
Data Augmentation Attacks
Exploiting automated data augmentation pipelines to amplify poisoned samples or introduce adversarial patterns through augmentation transformations.
Gradient Leakage Attacks
Extracting training data from gradient updates in federated and collaborative learning settings.
Training Data Memorization Exploitation
Techniques for exploiting model memorization to extract verbatim training examples.
Property Inference Attacks
Inferring global properties of training datasets through model behavior analysis.
Practical Synthetic Data Poisoning
Poisoning synthetic data generation pipelines used for model training augmentation.
Data Poisoning Methods
Practical methodology for poisoning training datasets at scale, including crowdsource manipulation, web-scale dataset attacks, label flipping, feature collision, bilevel optimization for poison selection, and detection evasion techniques.
Training & Fine-Tuning Attacks
Methodology for data poisoning, trojan/backdoor insertion, clean-label attacks, LoRA backdoors, sleeper agent techniques, and model merging attacks targeting the LLM training pipeline.
Adversarial Training for LLM Defense
Use adversarial training techniques to improve LLM robustness against known attack patterns.
Training Prompt Injection Classifiers
Methodologies for training and evaluating ML classifiers that detect prompt injection attempts with high accuracy.
Embedding Backdoor Attacks
Inserting backdoors into embedding models that cause specific trigger inputs to produce predetermined embedding vectors for adversarial retrieval.
Pre-training → Fine-tuning → RLHF Pipeline
Understand the three stages of creating an aligned LLM — pre-training, supervised fine-tuning, and RLHF/DPO — and the security implications at each stage.
Understanding LLM Safety Training
How safety training works including RLHF, DPO, and constitutional AI and why it can be bypassed.
Training Implications of Alignment Faking
How alignment faking affects training methodology, including implications for RLHF, safety training design, evaluation validity, and the development of training approaches that are robust to strategic compliance.
Sleeper Agent Research
Current research on training deceptive LLMs that persist through safety training and activation patterns.
Synthetic Data Poisoning in Training Pipelines
Research on poisoning synthetic data generation pipelines used for model training and fine-tuning.
Model Collapse and Security Implications
Security implications of model collapse from training on AI-generated data in iterative training loops.
Distributed Training Security
Security considerations for distributed model training across multiple nodes and data centers.
Custom Safety Classifier Training
Train a custom input safety classifier and then develop payloads that reliably evade it to understand classifier limitations.
Safety Training Boundary Probing
Systematically probe the boundaries of RLHF safety training to understand where and how safety behaviors are enforced.
End-to-End Training Time Attacks
Execute a complete training-time attack from data poisoning through model deployment to triggered exploitation.
Alignment Challenges in Multimodal Models
Analysis of alignment challenges specific to multimodal AI systems, including cross-modal safety gaps, representation conflicts, and the difficulty of extending text-based safety training to visual, audio, and video inputs.
AI Security Awareness Training for Developers
Designing and delivering AI security awareness programs that help developers recognize and mitigate AI-specific security risks in their daily work.
AI Security Certification Landscape (Professional)
Comprehensive guide to certifications, training programs, and credentials relevant to AI security practitioners.
AI Security Training Program Design
Designing and delivering AI security training programs for development and security teams.
Industry Certifications & Training
Comprehensive guide to certifications, training programs, and educational resources relevant to AI red teaming, including security certifications, ML courses, and specialized AI security training.
Certifications in AI Security
Overview of relevant certifications and training programs for AI security professionals.
Training Program Development
Developing comprehensive AI red team training programs from beginner to advanced levels, including curriculum design and practical exercises.
Synthetic Data Risks
Model collapse from training on synthetic data, quality degradation across generations, distribution narrowing, minority erasure, and strategies for safe synthetic data usage in LLM training.
Alignment Tax: Safety vs Capability Tradeoffs
Quantitative analysis of the performance cost of safety training and alignment techniques on model capabilities.
Continual Learning Drift Attacks
Exploiting continual learning and online training to gradually shift model behavior toward adversarial objectives.
Knowledge Distillation Safety Gap
Analysis of safety property loss during knowledge distillation from teacher to student models.
DPO and IPO Training Vulnerabilities
Security analysis of Direct Preference Optimization and Identity Preference Optimization training methods.
DPO Training Vulnerabilities
Security analysis of Direct Preference Optimization training and its vulnerability to preference poisoning.
Evaluation Set Contamination Attacks
Attacking evaluation benchmarks and test sets to create false impressions of model safety and capability.
Gradient-Based Data Poisoning (Training Pipeline)
Using gradient information to craft optimally adversarial training examples for targeted model manipulation.
Training Pipeline Security
Security of the full AI model training pipeline, covering pre-training attacks, fine-tuning and alignment manipulation, architecture-level vulnerabilities, and advanced training-time threats.
Instruction Tuning Data Manipulation
Manipulating instruction tuning datasets to embed specific behaviors in the resulting model.
Knowledge Distillation Security
Security implications of knowledge distillation including capability extraction and safety alignment transfer.
Model Merging Safety Implications
Analysis of how model merging techniques (TIES, DARE, SLERP) affect safety properties and alignment.
Model Merging Security Analysis (Training Pipeline)
Security analysis of model merging techniques and propagation of vulnerabilities through merged models.
Model Weight Manipulation Techniques
Direct manipulation of model weights to inject backdoors, modify behavior, and bypass safety training.
Pre-Training Safety Interventions
Analysis of safety interventions applied during pre-training including data filtering, loss weighting, and curriculum design.
Preference Data Poisoning (Training Pipeline)
Poisoning preference data used in RLHF and DPO to shift model alignment toward attacker objectives.
RLHF Reward Hacking Deep Dive
In-depth analysis of reward hacking techniques in RLHF pipelines including overoptimization and specification gaming.
Safety Fine-Tuning Reversal Attacks
Techniques for reversing safety fine-tuning through targeted fine-tuning on adversarial datasets.
Synthetic Data Poisoning Vectors
Attack vectors specific to synthetic data generation pipelines used in model training and augmentation.
Tokenizer Poisoning Attacks
Attacking tokenizer training and vocabulary to create adversarial token patterns that bypass safety measures.
Training Data Curation Attacks
Attacking the data curation pipeline to inject adversarial examples into training datasets at scale.
Training Data Provenance Attacks
Attacking training data provenance and attribution systems to inject unverified data sources.
Transfer Learning Security Analysis
Security implications of transfer learning including inherited vulnerabilities and cross-domain attack transfer.
Fine-Tuning Safety Bypass Walkthrough
Walkthrough of using fine-tuning API access to remove safety behaviors from aligned models.
Prompt Classifier Training
Step-by-step walkthrough for training a machine learning classifier to detect malicious prompts, covering dataset curation, feature engineering, model selection, training pipeline, evaluation, and deployment as a real-time detection service.
Training Custom Safety Classifiers
Train custom safety classifiers tuned to your application's specific threat model and content policy.
Training a Prompt Injection Classifier
Train a custom prompt injection detection classifier using labeled datasets and modern NLP techniques.