# training

assessmentdata-poisoningtraining

Data Poisoning Assessment

Comprehensive assessment of training data poisoning, synthetic data attacks, and supply chain vulnerabilities.

assessmentfine-tuningtraining

Fine-Tuning Attack Assessment

Assessment of safety degradation through fine-tuning, backdoor insertion, and alignment removal techniques.

assessmentsupply-chaintraining

Model Supply Chain Assessment

Assessment covering model provenance, checkpoint manipulation, and third-party model risks.

assessmentrlhftraining

RLHF Exploitation Assessment

Assessment of reinforcement learning from human feedback pipeline vulnerabilities and reward hacking.

skill-verificationtrainingpipeline

Skill Verification: Training Pipeline Security

Skill verification for data poisoning, RLHF exploitation, and fine-tuning attack techniques.

communitymentorshiptrainingcareer

Mentorship Program: AI Red Team Training

Community mentorship program pairing experienced AI red teamers with newcomers for structured learning and hands-on engagement experience.

data-trainingaugmentationmanipulationtraining

Data Augmentation Attacks

Exploiting automated data augmentation pipelines to amplify poisoned samples or introduce adversarial patterns through augmentation transformations.

attacksgradientdatatrainingleakage

Gradient Leakage Attacks

Extracting training data from gradient updates in federated and collaborative learning settings.

dataexploitationmemorizationtraining

Training Data Memorization Exploitation

Techniques for exploiting model memorization to extract verbatim training examples.

attacksinferencedatapropertytraining

Property Inference Attacks

Inferring global properties of training datasets through model behavior analysis.

syntheticpracticaldatapoisoningtraining

Practical Synthetic Data Poisoning

Poisoning synthetic data generation pipelines used for model training augmentation.

data-poisoningtrainingclean-labelfeature-collisionbilevel-optimizationdetection-evasion

Data Poisoning Methods

Practical methodology for poisoning training datasets at scale, including crowdsource manipulation, web-scale dataset attacks, label flipping, feature collision, bilevel optimization for poison selection, and detection evasion techniques.

trainingfine-tuningdata-poisoningbackdoortrojanlorasleeper-agentmodel-merging

Training & Fine-Tuning Attacks

Methodology for data poisoning, trojan/backdoor insertion, clean-label attacks, LoRA backdoors, sleeper agent techniques, and model merging attacks targeting the LLM training pipeline.

mitigationtrainingadversarialdefense

Adversarial Training for LLM Defense

Use adversarial training techniques to improve LLM robustness against known attack patterns.

defenseclassifiertraining

Training Prompt Injection Classifiers

Methodologies for training and evaluating ML classifiers that detect prompt injection attempts with high accuracy.

embeddingbackdoortrainingmanipulation

Embedding Backdoor Attacks

Inserting backdoors into embedding models that cause specific trigger inputs to produce predetermined embedding vectors for adversarial retrieval.

trainingrlhffine-tuningalignmentintermediate

Pre-training → Fine-tuning → RLHF Pipeline

Understand the three stages of creating an aligned LLM — pre-training, supervised fine-tuning, and RLHF/DPO — and the security implications at each stage.

understandingtrainingsafetyfoundations

Understanding LLM Safety Training

How safety training works, including RLHF, DPO, and Constitutional AI, and why it can be bypassed.

alignment-fakingtrainingrlhfsafety-trainingevaluationai-safety

Training Implications of Alignment Faking

How alignment faking affects training methodology, including implications for RLHF, safety training design, evaluation validity, and the development of training approaches that are robust to strategic compliance.

frontier-researchsleeper-agentsdeceptivetraining

Sleeper Agent Research

Current research on training deceptive LLMs whose behavior persists through safety training, and on their activation patterns.

frontier-researchsynthetic-datapoisoningtraining

Synthetic Data Poisoning in Training Pipelines

Research on poisoning synthetic data generation pipelines used for model training and fine-tuning.

frontier-researchmodel-collapsesecuritytraining

Model Collapse and Security Implications

Security implications of model collapse from training on AI-generated data in iterative training loops.

infrastructuredistributedtrainingsecurity

Distributed Training Security

Security considerations for distributed model training across multiple nodes and data centers.

classifiercustomadvancedlabtraininglabs

Custom Safety Classifier Training

Train a custom input safety classifier and then develop payloads that reliably evade it to understand classifier limitations.

probingsafetylabbeginnertraininglabs

Safety Training Boundary Probing

Systematically probe the boundaries of RLHF safety training to understand where and how safety behaviors are enforced.

attackstimelabexperttraininglabs

End-to-End Training Time Attacks

Execute a complete training-time attack from data poisoning through model deployment to triggered exploitation.

multimodalalignmentsafetytrainingcross-modal

Alignment Challenges in Multimodal Models

Analysis of alignment challenges specific to multimodal AI systems, including cross-modal safety gaps, representation conflicts, and the difficulty of extending text-based safety training to visual, audio, and video inputs.

professionaltrainingawarenessdevelopers

AI Security Awareness Training for Developers

Designing and delivering AI security awareness programs that help developers recognize and mitigate AI-specific security risks in their daily work.

professionalcertificationstrainingcareer-development

AI Security Certification Landscape (Professional)

Comprehensive guide to certifications, training programs, and credentials relevant to AI security practitioners.

programsecurityprofessionaltraining

AI Security Training Program Design

Designing and delivering AI security training programs for development and security teams.

certificationstraining

Industry Certifications & Training

Comprehensive guide to certifications, training programs, and educational resources relevant to AI red teaming, including security certifications, ML courses, and specialized AI security training.

professionalcertificationstrainingcredentials

Certifications in AI Security

Overview of relevant certifications and training programs for AI security professionals.

professionaltrainingprogrameducation

Training Program Development

Developing comprehensive AI red team training programs from beginner to advanced levels, including curriculum design and practical exercises.

synthetic-datamodel-collapsequality-degradationdistributiontraining

Synthetic Data Risks

Model collapse from training on synthetic data, quality degradation across generations, distribution narrowing, minority erasure, and strategies for safe synthetic data usage in LLM training.

trainingalignment-taxtradeoffs

Alignment Tax: Safety vs Capability Tradeoffs

Quantitative analysis of the performance cost of safety training and alignment techniques on model capabilities.

trainingcontinual-learningdrift

Continual Learning Drift Attacks

Exploiting continual learning and online training to gradually shift model behavior toward adversarial objectives.

trainingdistillationsafety-gap

Knowledge Distillation Safety Gap

Analysis of safety property loss during knowledge distillation from teacher to student models.

trainingdpoipo

DPO and IPO Training Vulnerabilities

Security analysis of Direct Preference Optimization and Identity Preference Optimization training methods.

training-pipelinedpotrainingvulnerabilities

DPO Training Vulnerabilities

Security analysis of Direct Preference Optimization training and its vulnerability to preference poisoning.

trainingevaluationcontamination

Evaluation Set Contamination Attacks

Attacking evaluation benchmarks and test sets to create false impressions of model safety and capability.

traininggradientpoisoning

Gradient-Based Data Poisoning (Training Pipeline)

Using gradient information to craft optimally adversarial training examples for targeted model manipulation.

trainingpre-trainingfine-tuningarchitecturedata-poisoningrlhfalignment

Training Pipeline Security

Security of the full AI model training pipeline, covering pre-training attacks, fine-tuning and alignment manipulation, architecture-level vulnerabilities, and advanced training-time threats.

instructionpipelinetuningmanipulationtraining

Instruction Tuning Data Manipulation

Manipulating instruction tuning datasets to embed specific behaviors in the resulting model.

attackspipelinedistillationknowledgetraining

Knowledge Distillation Security

Security implications of knowledge distillation including capability extraction and safety alignment transfer.

trainingmodel-mergingsafety

Model Merging Safety Implications

Analysis of how model merging techniques (TIES, DARE, SLERP) affect safety properties and alignment.

securityanalysispipelinemergetrainingmodel

Model Merging Security Analysis (Training Pipeline)

Security analysis of model merging techniques and propagation of vulnerabilities through merged models.

trainingweightsmanipulation

Model Weight Manipulation Techniques

Direct manipulation of model weights to inject backdoors, modify behavior, and bypass safety training.

trainingpre-trainingsafety

Pre-Training Safety Interventions

Analysis of safety interventions applied during pre-training including data filtering, loss weighting, and curriculum design.

preferencepipelinedatapoisoningtraining

Preference Data Poisoning (Training Pipeline)

Poisoning preference data used in RLHF and DPO to shift model alignment toward attacker objectives.

trainingrlhfreward-hacking

RLHF Reward Hacking Deep Dive

In-depth analysis of reward hacking techniques in RLHF pipelines including overoptimization and specification gaming.

trainingfine-tuningsafety-reversal

Safety Fine-Tuning Reversal Attacks

Techniques for reversing safety fine-tuning through targeted fine-tuning on adversarial datasets.

trainingsynthetic-datapoisoning

Synthetic Data Poisoning Vectors

Attack vectors specific to synthetic data generation pipelines used in model training and augmentation.

trainingtokenizerpoisoning

Tokenizer Poisoning Attacks

Attacking tokenizer training and vocabulary to create adversarial token patterns that bypass safety measures.

trainingdata-curationpoisoning

Training Data Curation Attacks

Attacking the data curation pipeline to inject adversarial examples into training datasets at scale.

provenancetrainingpipelinedata

Training Data Provenance Attacks

Attacking training data provenance and attribution systems to inject unverified data sources.

trainingtransfer-learningsecurity

Transfer Learning Security Analysis

Security implications of transfer learning including inherited vulnerabilities and cross-domain attack transfer.

walkthroughsfine-tuningsafety-bypasstraining

Fine-Tuning Safety Bypass Walkthrough

Walkthrough of using fine-tuning API access to remove safety behaviors from aligned models.

classifiermachine-learningprompt-injectiondetectiontrainingdefensewalkthrough

Prompt Classifier Training

Step-by-step walkthrough for training a machine learning classifier to detect malicious prompts, covering dataset curation, feature engineering, model selection, training pipeline, evaluation, and deployment as a real-time detection service.

walkthroughsdefensesafety-classifiertraining

Training Custom Safety Classifiers

Train custom safety classifiers tuned to your application's specific threat model and content policy.

walkthroughsdefenseclassifiertraining

Training a Prompt Injection Classifier

Train a custom prompt injection detection classifier using labeled datasets and modern NLP techniques.

practice-examtrainingpipeline

Training Pipeline Security Practice Exam

Practice exam on data poisoning, RLHF exploitation, fine-tuning attacks, and supply chain risks.

assessmentdata-poisoningtraining

Data Poisoning Assessment

Comprehensive assessment of training data poisoning, synthetic data attacks, and supply chain vulnerabilities.

assessmentfine-tuningtraining

Fine-Tuning Attack Assessment

Assessment of safety degradation through fine-tuning, backdoor insertion, and alignment removal techniques.

assessmentsupply-chaintraining

Model Supply Chain Assessment

Assessment covering model provenance, checkpoint manipulation, and third-party model risks.

assessmentrlhftraining

RLHF Exploitation Assessment

Assessment of reinforcement learning from human feedback pipeline vulnerabilities and reward hacking.

skill-verificationtrainingpipeline

Skill Verification: Training Pipeline Security

Skill verification for data poisoning, RLHF exploitation, and fine-tuning attack techniques.

communitymentorshiptrainingcareer

Mentorship Program: AI Red Team Training

Community mentorship program pairing experienced AI red teamers with newcomers for structured learning and hands-on engagement experience.

data-trainingaugmentationmanipulationtraining

Data Augmentation Attacks

Exploiting automated data augmentation pipelines to amplify poisoned samples or introduce adversarial patterns through augmentation transformations.

attacksgradientdatatrainingleakage

Gradient Leakage Attacks

Extracting training data from gradient updates in federated and collaborative learning settings.

dataexploitationmemorizationtraining

Training Data Memorization Exploitation

Techniques for exploiting model memorization to extract verbatim training examples.

attacksinferencedatapropertytraining

Property Inference Attacks

Inferring global properties of training datasets through model behavior analysis.

syntheticpracticaldatapoisoningtraining

Practical Synthetic Data Poisoning

Poisoning synthetic data generation pipelines used for model training augmentation.

data-poisoningtrainingclean-labelfeature-collisionbilevel-optimizationdetection-evasion

Data Poisoning Methods

trainingfine-tuningdata-poisoningbackdoortrojanlorasleeper-agentmodel-merging

Training & Fine-Tuning Attacks

Methodology for data poisoning, trojan/backdoor insertion, clean-label attacks, LoRA backdoors, sleeper agent techniques, and model merging attacks targeting the LLM training pipeline.

mitigationtrainingadversarialdefense

Adversarial Training for LLM Defense

Use adversarial training techniques to improve LLM robustness against known attack patterns.

defenseclassifiertraining

Training Prompt Injection Classifiers

Methodologies for training and evaluating ML classifiers that detect prompt injection attempts with high accuracy.

embeddingbackdoortrainingmanipulation

Embedding Backdoor Attacks

Inserting backdoors into embedding models that cause specific trigger inputs to produce predetermined embedding vectors for adversarial retrieval.

llmtransformerarchitecturetrainingalignmentfoundations

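How Large Language Models Work

Understanding large language models from a security perspective, covering transformer architecture, tokenization, attention, the training process, and safety alignment mechanisms.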

trainingrlhffine-tuningalignmentintermediate

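Pre-training → Fine-tuning → RLHF Pipeline

Understand the three stages of creating an aligned LLM (pre-training, supervised fine-tuning, and RLHF/DPO) and the security implications at each stage.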

understandingtrainingsafetyfoundations

Understanding LLM Safety Training

How safety training works, including RLHF, DPO, and Constitutional AI, and why it can be bypassed.

alignment-fakingtrainingrlhfsafety-trainingevaluationai-safety

Training Implications of Alignment Faking

frontier-researchsleeper-agentsdeceptivetraining

Sleeper Agent Research

Current research on training deceptive LLMs whose behavior persists through safety training, and on their activation patterns.

frontier-researchsynthetic-datapoisoningtraining

Synthetic Data Poisoning in Training Pipelines

Research on poisoning synthetic data generation pipelines used for model training and fine-tuning.

frontier-researchmodel-collapsesecuritytraining

Model Collapse and Security Implications

Security implications of model collapse from training on AI-generated data in iterative training loops.

infrastructuredistributedtrainingsecurity

Distributed Training Security

Security considerations for distributed model training across multiple nodes and data centers.

classifiercustomadvancedlabtraininglabs

Custom Safety Classifier Training

Train a custom input safety classifier and then develop payloads that reliably evade it to understand classifier limitations.

probingsafetylabbeginnertraininglabs

Safety Training Boundary Probing

Systematically probe the boundaries of RLHF safety training to understand where and how safety behaviors are enforced.

attackstimelabexperttraininglabs

End-to-End Training Time Attacks

Execute a complete training-time attack from data poisoning through model deployment to triggered exploitation.

multimodalalignmentsafetytrainingcross-modal

Alignment Challenges in Multimodal Models

professionaltrainingawarenessdevelopers

AI Security Awareness Training for Developers

Designing and delivering AI security awareness programs that help developers recognize and mitigate AI-specific security risks in their daily work.

professionalcertificationstrainingcareer-development

AI Security Certification Landscape (Professional)

Comprehensive guide to certifications, training programs, and credentials relevant to AI security practitioners.

programsecurityprofessionaltraining

AI Security Training Program Design

Designing and delivering AI security training programs for development and security teams.

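certificationstraining

Industry Certifications & Training

Comprehensive guide to certifications, training programs, and educational resources relevant to AI red teaming, including security certifications, ML courses, and specialized AI security training.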

professionalcertificationstrainingcredentials

Certifications in AI Security

Overview of relevant certifications and training programs for AI security professionals.

professionaltrainingprogrameducation

Training Program Development

Developing comprehensive AI red team training programs from beginner to advanced levels, including curriculum design and practical exercises.

synthetic-datamodel-collapsequality-degradationdistributiontraining

Synthetic Data Risks

Model collapse from training on synthetic data, quality degradation across generations, distribution narrowing, minority erasure, and strategies for safe synthetic data usage in LLM training.

trainingalignment-taxtradeoffs

Alignment Tax: Safety vs Capability Tradeoffs

Quantitative analysis of the performance cost of safety training and alignment techniques on model capabilities.

trainingcontinual-learningdrift

Continual Learning Drift Attacks

Exploiting continual learning and online training to gradually shift model behavior toward adversarial objectives.

trainingdistillationsafety-gap

Knowledge Distillation Safety Gap

Analysis of safety property loss during knowledge distillation from teacher to student models.

trainingdpoipo

DPO and IPO Training Vulnerabilities

Security analysis of Direct Preference Optimization and Identity Preference Optimization training methods.

training-pipelinedpotrainingvulnerabilities

DPO Training Vulnerabilities

Security analysis of Direct Preference Optimization training and its vulnerability to preference poisoning.

trainingevaluationcontamination

Evaluation Set Contamination Attacks

Attacking evaluation benchmarks and test sets to create false impressions of model safety and capability.

traininggradientpoisoning

Gradient-Based Data Poisoning (Training Pipeline)

Using gradient information to craft optimally adversarial training examples for targeted model manipulation.

trainingpre-trainingfine-tuningarchitecturedata-poisoningrlhfalignment

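Training Pipeline Security

Security of the full AI model training pipeline, covering pre-training attacks, fine-tuning and alignment manipulation, architecture-level vulnerabilities, and advanced training-time threats.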

instructionpipelinetuningmanipulationtraining

Instruction Tuning Data Manipulation

Manipulating instruction tuning datasets to embed specific behaviors in the resulting model.

attackspipelinedistillationknowledgetraining

Knowledge Distillation Security

Security implications of knowledge distillation including capability extraction and safety alignment transfer.

trainingmodel-mergingsafety

Model Merging Safety Implications

Analysis of how model merging techniques (TIES, DARE, SLERP) affect safety properties and alignment.

securityanalysispipelinemergetrainingmodel

Model Merging Security Analysis (Training Pipeline)

Security analysis of model merging techniques and propagation of vulnerabilities through merged models.

trainingweightsmanipulation

Model Weight Manipulation Techniques

Direct manipulation of model weights to inject backdoors, modify behavior, and bypass safety training.

trainingpre-trainingsafety

Pre-Training Safety Interventions

Analysis of safety interventions applied during pre-training including data filtering, loss weighting, and curriculum design.

preferencepipelinedatapoisoningtraining

Preference Data Poisoning (Training Pipeline)

Poisoning preference data used in RLHF and DPO to shift model alignment toward attacker objectives.

trainingrlhfreward-hacking

RLHF Reward Hacking Deep Dive

In-depth analysis of reward hacking techniques in RLHF pipelines including overoptimization and specification gaming.

trainingfine-tuningsafety-reversal

Safety Fine-Tuning Reversal Attacks

Techniques for reversing safety fine-tuning through targeted fine-tuning on adversarial datasets.

trainingsynthetic-datapoisoning

Synthetic Data Poisoning Vectors

Attack vectors specific to synthetic data generation pipelines used in model training and augmentation.

trainingtokenizerpoisoning

Tokenizer Poisoning Attacks

Attacking tokenizer training and vocabulary to create adversarial token patterns that bypass safety measures.

trainingdata-curationpoisoning

Training Data Curation Attacks

Attacking the data curation pipeline to inject adversarial examples into training datasets at scale.

provenancetrainingpipelinedata

Training Data Provenance Attacks

Attacking training data provenance and attribution systems to inject unverified data sources.

trainingtransfer-learningsecurity

Transfer Learning Security Analysis

Security implications of transfer learning including inherited vulnerabilities and cross-domain attack transfer.

walkthroughsfine-tuningsafety-bypasstraining

Fine-Tuning Safety Bypass Walkthrough

Walkthrough of using fine-tuning API access to remove safety behaviors from aligned models.

classifiermachine-learningprompt-injectiondetectiontrainingdefensewalkthrough

Prompt Classifier Training

walkthroughsdefensesafety-classifiertraining

Training Custom Safety Classifiers

Train custom safety classifiers tuned to your application's specific threat model and content policy.

walkthroughsdefenseclassifiertraining

Training a Prompt Injection Classifier

Train a custom prompt injection detection classifier using labeled datasets and modern NLP techniques.