# training
標記為「training」的 123 篇文章
Training Pipeline Security Practice Exam
Practice exam on data poisoning, RLHF exploitation, fine-tuning attacks, and supply chain risks.
Data Poisoning Assessment
Comprehensive assessment of training data poisoning, synthetic data attacks, and supply chain vulnerabilities.
Fine-Tuning Attack Assessment
Assessment of safety degradation through fine-tuning, backdoor insertion, and alignment removal techniques.
Model Supply Chain Assessment
Assessment covering model provenance, checkpoint manipulation, and third-party model risks.
RLHF Exploitation Assessment
Assessment of reinforcement learning from human feedback pipeline vulnerabilities and reward hacking.
Skill Verification: Training Pipeline Security
Skill verification for data poisoning, RLHF exploitation, and fine-tuning attack techniques.
Mentorship Program: AI Red Team Training
Community mentorship program pairing experienced AI red teamers with newcomers for structured learning and hands-on engagement experience.
Data Augmentation Attacks
Exploiting automated data augmentation pipelines to amplify poisoned samples or introduce adversarial patterns through augmentation transformations.
Gradient Leakage Attacks
Extracting training data from gradient updates in federated and collaborative learning settings.
Training Data Memorization Exploitation
Techniques for exploiting model memorization to extract verbatim training examples.
Property Inference Attacks
Inferring global properties of training datasets through model behavior analysis.
Practical Synthetic Data Poisoning
Poisoning synthetic data generation pipelines used for model training augmentation.
Data Poisoning Methods
Practical methodology for poisoning training datasets at scale, including crowdsource manipulation, web-scale dataset attacks, label flipping, feature collision, bilevel optimization for poison selection, and detection evasion techniques.
Training & Fine-Tuning Attacks
Methodology for data poisoning, trojan/backdoor insertion, clean-label attacks, LoRA backdoors, sleeper agent techniques, and model merging attacks targeting the LLM training pipeline.
Adversarial Training for LLM Defense
Use adversarial training techniques to improve LLM robustness against known attack patterns.
Training Prompt Injection Classifiers
Methodologies for training and evaluating ML classifiers that detect prompt injection attempts with high accuracy.
Embedding Backdoor Attacks
Inserting backdoors into embedding models that cause specific trigger inputs to produce predetermined embedding vectors for adversarial retrieval.
Pre-training → Fine-tuning → RLHF Pipeline
Understand the three stages of creating an aligned LLM — pre-training, supervised fine-tuning, and RLHF/DPO — and the security implications at each stage.
Understanding LLM Safety Training
How safety training works including RLHF, DPO, and constitutional AI and why it can be bypassed.
Training Implications of Alignment Faking
How alignment faking affects training methodology, including implications for RLHF, safety training design, evaluation validity, and the development of training approaches that are robust to strategic compliance.
Sleeper Agent Research
Current research on training deceptive LLMs that persist through safety training and activation patterns.
Synthetic Data Poisoning in Training Pipelines
Research on poisoning synthetic data generation pipelines used for model training and fine-tuning.
Model Collapse and Security Implications
Security implications of model collapse from training on AI-generated data in iterative training loops.
Distributed Training Security
Security considerations for distributed model training across multiple nodes and data centers.
Custom Safety Classifier Training
Train a custom input safety classifier and then develop payloads that reliably evade it to understand classifier limitations.
Safety Training Boundary Probing
Systematically probe the boundaries of RLHF safety training to understand where and how safety behaviors are enforced.
End-to-End Training Time Attacks
Execute a complete training-time attack from data poisoning through model deployment to triggered exploitation.
Alignment Challenges in Multimodal Models
Analysis of alignment challenges specific to multimodal AI systems, including cross-modal safety gaps, representation conflicts, and the difficulty of extending text-based safety training to visual, audio, and video inputs.
AI Security Awareness Training for Developers
Designing and delivering AI security awareness programs that help developers recognize and mitigate AI-specific security risks in their daily work.
AI Security Certification Landscape (Professional)
Comprehensive guide to certifications, training programs, and credentials relevant to AI security practitioners.
AI Security Training Program Design
Designing and delivering AI security training programs for development and security teams.
Industry Certifications & Training
Comprehensive guide to certifications, training programs, and educational resources relevant to AI red teaming, including security certifications, ML courses, and specialized AI security training.
Certifications in AI Security
Overview of relevant certifications and training programs for AI security professionals.
Training Program Development
Developing comprehensive AI red team training programs from beginner to advanced levels, including curriculum design and practical exercises.
Synthetic Data Risks
Model collapse from training on synthetic data, quality degradation across generations, distribution narrowing, minority erasure, and strategies for safe synthetic data usage in LLM training.
Alignment Tax: Safety vs Capability Tradeoffs
Quantitative analysis of the performance cost of safety training and alignment techniques on model capabilities.
Continual Learning Drift Attacks
Exploiting continual learning and online training to gradually shift model behavior toward adversarial objectives.
Knowledge Distillation Safety Gap
Analysis of safety property loss during knowledge distillation from teacher to student models.
DPO and IPO Training Vulnerabilities
Security analysis of Direct Preference Optimization and Identity Preference Optimization training methods.
DPO Training Vulnerabilities
Security analysis of Direct Preference Optimization training and its vulnerability to preference poisoning.
Evaluation Set Contamination Attacks
Attacking evaluation benchmarks and test sets to create false impressions of model safety and capability.
Gradient-Based Data Poisoning (Training Pipeline)
Using gradient information to craft optimally adversarial training examples for targeted model manipulation.
Training Pipeline Security
Security of the full AI model training pipeline, covering pre-training attacks, fine-tuning and alignment manipulation, architecture-level vulnerabilities, and advanced training-time threats.
Instruction Tuning Data Manipulation
Manipulating instruction tuning datasets to embed specific behaviors in the resulting model.
Knowledge Distillation Security
Security implications of knowledge distillation including capability extraction and safety alignment transfer.
Model Merging Safety Implications
Analysis of how model merging techniques (TIES, DARE, SLERP) affect safety properties and alignment.
Model Merging Security Analysis (Training Pipeline)
Security analysis of model merging techniques and propagation of vulnerabilities through merged models.
Model Weight Manipulation Techniques
Direct manipulation of model weights to inject backdoors, modify behavior, and bypass safety training.
Pre-Training Safety Interventions
Analysis of safety interventions applied during pre-training including data filtering, loss weighting, and curriculum design.
Preference Data Poisoning (Training Pipeline)
Poisoning preference data used in RLHF and DPO to shift model alignment toward attacker objectives.
RLHF Reward Hacking Deep Dive
In-depth analysis of reward hacking techniques in RLHF pipelines including overoptimization and specification gaming.
Safety Fine-Tuning Reversal Attacks
Techniques for reversing safety fine-tuning through targeted fine-tuning on adversarial datasets.
Synthetic Data Poisoning Vectors
Attack vectors specific to synthetic data generation pipelines used in model training and augmentation.
Tokenizer Poisoning Attacks
Attacking tokenizer training and vocabulary to create adversarial token patterns that bypass safety measures.
Training Data Curation Attacks
Attacking the data curation pipeline to inject adversarial examples into training datasets at scale.
Training Data Provenance Attacks
Attacking training data provenance and attribution systems to inject unverified data sources.
Transfer Learning Security Analysis
Security implications of transfer learning including inherited vulnerabilities and cross-domain attack transfer.
Fine-Tuning Safety Bypass Walkthrough
Walkthrough of using fine-tuning API access to remove safety behaviors from aligned models.
Prompt Classifier Training
Step-by-step walkthrough for training a machine learning classifier to detect malicious prompts, covering dataset curation, feature engineering, model selection, training pipeline, evaluation, and deployment as a real-time detection service.
Training Custom Safety Classifiers
Train custom safety classifiers tuned to your application's specific threat model and content policy.
Training a Prompt Injection Classifier
Train a custom prompt injection detection classifier using labeled datasets and modern NLP techniques.
訓練 Pipeline 安全 Practice Exam
Practice exam on data poisoning, RLHF exploitation, fine-tuning attacks, and supply chain risks.
Data 投毒 評量
Comprehensive assessment of training data poisoning, synthetic data attacks, and supply chain vulnerabilities.
Fine-Tuning 攻擊 評量
評量 of safety degradation through fine-tuning, backdoor insertion, and alignment removal techniques.
模型 Supply Chain 評量
評量 covering model provenance, checkpoint manipulation, and third-party model risks.
RLHF 利用ation 評量
評量 of reinforcement learning from human feedback pipeline vulnerabilities and reward hacking.
Skill Verification: 訓練 Pipeline 安全
Skill verification for data poisoning, RLHF exploitation, and fine-tuning attack techniques.
Mentorship Program: AI 紅隊 訓練
Community mentorship program pairing experienced AI red teamers with newcomers for structured learning and hands-on engagement experience.
Data Augmentation 攻擊s
利用ing automated data augmentation pipelines to amplify poisoned samples or introduce adversarial patterns through augmentation transformations.
Gradient Leakage 攻擊s
Extracting training data from gradient updates in federated and collaborative learning settings.
訓練 Data Memorization 利用ation
Techniques for exploiting model memorization to extract verbatim training examples.
Property Inference 攻擊s
Inferring global properties of training datasets through model behavior analysis.
Practical Synthetic Data 投毒
投毒 synthetic data generation pipelines used for model training augmentation.
Data 投毒 Methods
Practical methodology for poisoning training datasets at scale, including crowdsource manipulation, web-scale dataset attacks, label flipping, feature collision, bilevel optimization for poison selection, and detection evasion techniques.
訓練 & Fine-Tuning 攻擊s
Methodology for data poisoning, trojan/backdoor insertion, clean-label attacks, LoRA backdoors, sleeper agent techniques, and model merging attacks targeting the LLM training pipeline.
Adversarial 訓練 for LLM 防禦
Use adversarial training techniques to improve LLM robustness against known attack patterns.
訓練 提示詞注入 Classifiers
Methodologies for training and evaluating ML classifiers that detect prompt injection attempts with high accuracy.
Embedding Backdoor 攻擊s
Inserting backdoors into embedding models that cause specific trigger inputs to produce predetermined embedding vectors for adversarial retrieval.
大型語言模型如何運作
從安全視角理解大型語言模型——涵蓋 transformer 架構、分詞、注意力、訓練流程與安全對齊機制。
預訓練 → 微調 → RLHF 管線
瞭解打造對齊 LLM 的三階段流程——預訓練、監督式微調、RLHF/DPO——以及各階段的安全意涵。
Understanding LLM Safety 訓練
How safety training works including RLHF, DPO, and constitutional AI and why it can be bypassed.
訓練 Implications of Alignment Faking
How alignment faking affects training methodology, including implications for RLHF, safety training design, evaluation validity, and the development of training approaches that are robust to strategic compliance.
Sleeper 代理 Research
Current research on training deceptive LLMs that persist through safety training and activation patterns.
Synthetic Data 投毒 in 訓練 Pipelines
Research on poisoning synthetic data generation pipelines used for model training and fine-tuning.
模型 Collapse and 安全 Implications
安全 implications of model collapse from training on AI-generated data in iterative training loops.
Distributed 訓練 安全
安全 considerations for distributed model training across multiple nodes and data centers.
Custom Safety Classifier 訓練
Train a custom input safety classifier and then develop payloads that reliably evade it to understand classifier limitations.
Safety 訓練 Boundary Probing
Systematically probe the boundaries of RLHF safety training to understand where and how safety behaviors are enforced.
End-to-End 訓練 Time 攻擊s
Execute a complete training-time attack from data poisoning through model deployment to triggered exploitation.
Alignment Challenges in Multimodal 模型s
Analysis of alignment challenges specific to multimodal AI systems, including cross-modal safety gaps, representation conflicts, and the difficulty of extending text-based safety training to visual, audio, and video inputs.
AI 安全 Awareness 訓練 for Developers
Designing and delivering AI security awareness programs that help developers recognize and mitigate AI-specific security risks in their daily work.
AI 安全 Certification Landscape (Professional)
Comprehensive guide to certifications, training programs, and credentials relevant to AI security practitioners.
AI 安全 訓練 Program Design
Designing and delivering AI security training programs for development and security teams.
業界認證與訓練
與 AI 紅隊相關之認證、訓練計畫與教育資源的完整指南,包含安全認證、ML 課程與專業 AI 安全訓練。
Certifications in AI 安全
概覽 of relevant certifications and training programs for AI security professionals.
訓練 Program Development
Developing comprehensive AI red team training programs from beginner to advanced levels, including curriculum design and practical exercises.
Synthetic Data Risks
模型 collapse from training on synthetic data, quality degradation across generations, distribution narrowing, minority erasure, and strategies for safe synthetic data usage in LLM training.
Alignment Tax: Safety vs Capability Tradeoffs
Quantitative analysis of the performance cost of safety training and alignment techniques on model capabilities.
Continual Learning Drift 攻擊s
利用ing continual learning and online training to gradually shift model behavior toward adversarial objectives.
Knowledge Distillation Safety Gap
Analysis of safety property loss during knowledge distillation from teacher to student models.
DPO and IPO 訓練 Vulnerabilities
安全 analysis of Direct Preference Optimization and Identity Preference Optimization training methods.
DPO 訓練 Vulnerabilities
安全 analysis of Direct Preference Optimization training and its vulnerability to preference poisoning.
Evaluation Set Contamination 攻擊s
攻擊ing evaluation benchmarks and test sets to create false impressions of model safety and capability.
Gradient-Based Data 投毒 (訓練 Pipeline)
Using gradient information to craft optimally adversarial training examples for targeted model manipulation.
訓練管線安全
完整 AI 模型訓練管線的安全,涵蓋預訓練攻擊、微調與對齊操控、架構層級漏洞與進階訓練期威脅。
Instruction Tuning Data Manipulation
Manipulating instruction tuning datasets to embed specific behaviors in the resulting model.
Knowledge Distillation 安全
安全 implications of knowledge distillation including capability extraction and safety alignment transfer.
模型 Merging Safety Implications
Analysis of how model merging techniques (TIES, DARE, SLERP) affect safety properties and alignment.
模型 Merging 安全 Analysis (訓練 Pipeline)
安全 analysis of model merging techniques and propagation of vulnerabilities through merged models.
模型 Weight Manipulation Techniques
Direct manipulation of model weights to inject backdoors, modify behavior, and bypass safety training.
Pre-訓練 Safety Interventions
Analysis of safety interventions applied during pre-training including data filtering, loss weighting, and curriculum design.
Preference Data 投毒 (訓練 Pipeline)
投毒 preference data used in RLHF and DPO to shift model alignment toward attacker objectives.
RLHF Reward Hacking Deep Dive
In-depth analysis of reward hacking techniques in RLHF pipelines including overoptimization and specification gaming.
Safety Fine-Tuning Reversal 攻擊s
Techniques for reversing safety fine-tuning through targeted fine-tuning on adversarial datasets.
Synthetic Data 投毒 Vectors
攻擊 vectors specific to synthetic data generation pipelines used in model training and augmentation.
Tokenizer 投毒 攻擊s
攻擊ing tokenizer training and vocabulary to create adversarial token patterns that bypass safety measures.
訓練 Data Curation 攻擊s
攻擊ing the data curation pipeline to inject adversarial examples into training datasets at scale.
訓練 Data Provenance 攻擊s
攻擊ing training data provenance and attribution systems to inject unverified data sources.
Transfer Learning 安全 Analysis
安全 implications of transfer learning including inherited vulnerabilities and cross-domain attack transfer.
Fine-Tuning Safety Bypass 導覽
導覽 of using fine-tuning API access to remove safety behaviors from aligned models.
Prompt Classifier 訓練
Step-by-step walkthrough for training a machine learning classifier to detect malicious prompts, covering dataset curation, feature engineering, model selection, training pipeline, evaluation, and deployment as a real-time detection service.
訓練 Custom Safety Classifiers
Train custom safety classifiers tuned to your application's specific threat model and content policy.
訓練 a 提示詞注入 Classifier
Train a custom prompt injection detection classifier using labeled datasets and modern NLP techniques.