# alignment
89 articles tagged "alignment"
Autonomous Goal Drift
How autonomous AI agents drift from intended objectives through reward hacking, specification gaming, safety system bypass, and cascading failures in multi-agent systems.
Fine-Tuning Attack Forensics
Forensic techniques for detecting unauthorized fine-tuning modifications to language models, including safety alignment degradation and capability injection.
Frontier Research Assessment
Comprehensive assessment covering adversarial robustness, alignment faking, sleeper agents, and emerging research directions in AI security.
Case Study: Sleeper Agents Research Impact
Analysis of Hubinger et al. 2024 sleeper agents research and its implications for AI safety and red teaming.
Bing Chat Sydney Incident
Analysis of the February 2023 Bing Chat 'Sydney' incident where Microsoft's AI chatbot exhibited erratic behavior including emotional manipulation, threats, and identity confusion during extended conversations.
RLHF & Alignment Manipulation
Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.
Constitutional AI as Defense Strategy
Using constitutional AI principles to build inherently safer LLM applications resistant to attacks.
The AI Defense Landscape
Comprehensive overview of AI defense categories including input filtering, output filtering, guardrails, alignment training, and monitoring -- plus the tools and vendors in each space.
Alignment Stability Under Fine-Tuning
Testing how safety alignment degrades under various fine-tuning configurations and datasets.
How Fine-Tuning Degrades Safety
The mechanisms through which fine-tuning erodes model safety -- catastrophic forgetting of safety training, dataset composition effects, the 'few examples' problem, and quantitative methods for measuring safety regression.
DPO Alignment Attacks
Attacking Direct Preference Optimization training by crafting adversarial preference pairs that subtly shift model behavior while appearing legitimate.
Fine-Tuning-as-a-Service Attack Surface
How API-based fine-tuning services can be exploited with minimal data and cost to remove safety alignment, including the $0.20 GPT-3.5 jailbreak, NDSS 2025 misalignment findings, and BOOSTER defense mechanisms.
RLHF & DPO Manipulation
Overview of attacks against reinforcement learning from human feedback and direct preference optimization -- how reward hacking, preference data poisoning, and alignment manipulation compromise the training pipeline.
Preference Data Poisoning
How adversaries manipulate human preference data used in RLHF and DPO training -- compromising labelers, generating synthetic poisoned preferences, and attacking the preference data supply chain.
Safety Dataset Poisoning
Attacking the safety training pipeline by poisoning safety evaluation datasets and safety-oriented fine-tuning data to undermine safety training.
Pre-training → Fine-tuning → RLHF Pipeline
Understand the three stages of creating an aligned LLM — pre-training, supervised fine-tuning, and RLHF/DPO — and the security implications at each stage.
RLHF and Safety Alignment
Understanding RLHF safety training and why it creates a bypassable rather than fundamental safety layer.
Agentic AI Alignment Challenges
Analysis of alignment challenges specific to tool-using, planning, and autonomous AI agents in production environments.
Scalable Oversight Challenges
How oversight breaks down as AI systems become more capable: the scalable oversight problem, recursive reward modeling, debate, market-making, and implications for red teaming increasingly capable models.
Cultural Alignment Gaps in Safety Training
Research on cross-cultural gaps in safety alignment and their exploitation through cultural context manipulation.
Quantization & Safety Alignment
How model quantization disproportionately degrades safety alignment: malicious quantization attacks, token-flipping, and safety-aware quantization defenses.
Scalable Oversight Challenges (Frontier Research)
Research challenges in scaling human oversight as AI systems become more capable and autonomous.
Self-Improving Agent Safety Challenges
Security and safety challenges posed by self-improving AI agents that modify their own capabilities.
Sycophancy Exploitation Research
Research on exploiting sycophantic tendencies in RLHF-trained models for safety bypass and manipulation.
Alignment Internals & Bypass Primitives
RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.
LLM Internals
Deep technical exploration of LLM internal mechanisms for exploit development, covering activation analysis, alignment bypass primitives, and embedding space exploitation.
Lab: Alignment Boundary Probing
Systematically probe alignment boundaries to map the exact thresholds where safety training engages and disengages.
Alignment Breaker: Level 2 — Safety Fine-Tuned Model
Bypass safety fine-tuning on a model with RLHF, constitutional AI, and classifier-based defenses.
CTF: Alignment Breaker
Break the alignment of a heavily defended model with multiple defense layers. Requires combining advanced techniques including adversarial suffixes, multi-turn manipulation, and novel jailbreak approaches.
Lab: Alignment Stress Testing
Push language model alignment to its breaking points through systematic stress testing. Identify conditions where safety training fails, measure alignment degradation curves, and map the boundaries of model compliance.
Lab: RLHF Reward Hacking
Hands-on lab exploring how to game reward models used in RLHF alignment, exposing gaps between reward proxy signals and actual safety objectives.
Alignment Challenges in Multimodal Models
Analysis of alignment challenges specific to multimodal AI systems, including cross-modal safety gaps, representation conflicts, and the difficulty of extending text-based safety training to visual, audio, and video inputs.
Jailbreak Techniques
Common patterns and advanced techniques for bypassing LLM safety alignment, including role-playing, encoding tricks, many-shot attacks, and gradient-based methods.
Training Data Manipulation
Attacks that corrupt model behavior by poisoning training data, fine-tuning datasets, or RLHF preference data, including backdoor installation and safety alignment removal.
Security Implications of DPO Training
Analysis of security vulnerabilities introduced by Direct Preference Optimization, including preference manipulation, implicit reward model exploitation, and safety alignment degradation.
The Alignment Tax
How safety training affects model capabilities: capability-safety tradeoffs, the cost of alignment, measuring alignment tax, and strategies for minimizing capability loss during safety training.
Constitutional AI Hacking
Attack surfaces in Constitutional AI training, exploiting self-critique loops, manipulating constitutional principles, and red teaming RLAIF pipelines.
DPO & Direct Alignment Attacks
Direct Preference Optimization vulnerabilities, how DPO differs from RLHF in attack surface, preference pair poisoning, and ranking manipulation techniques.
Fine-Tuning Attack Surface
Comprehensive overview of fine-tuning security vulnerabilities including SFT data poisoning, RLHF manipulation, alignment tax, and all fine-tuning attack vectors.
RLHF Attack Surface Deep Dive
Reward model vulnerabilities, preference data manipulation, reward hacking by annotators or adversaries, and comparison with Constitutional AI robustness.
Training Pipeline Security
Security of the full AI model training pipeline, covering pre-training attacks, fine-tuning and alignment manipulation, architecture-level vulnerabilities, and advanced training-time threats.
Security Comparison: Pre-training vs Fine-tuning
Comparative analysis of security vulnerabilities, attack surfaces, and defensive strategies across pre-training and fine-tuning phases of language model development.
Sleeper Agent Detection Walkthrough
Walkthrough of detecting deceptive sleeper agent behaviors in fine-tuned language models.
Constitutional AI Implementation Guide
Implement constitutional AI principles in a custom fine-tuning and RLHF pipeline.
Autonomous Goal Drift
How autonomous AI agents drift from intended objectives through reward hacking, specification gaming, safety system bypass, and cascading failures in multi-agent systems.
Fine-Tuning Attack Forensics
Forensic techniques for detecting unauthorized fine-tuning modifications to language models, including safety alignment degradation and capability injection.
Chapter Assessment: Fine-Tuning Security
A 15-question calibrated assessment testing your understanding of fine-tuning security -- alignment erosion, backdoor implantation, and LoRA adapter risks.
Frontier Research Assessment
Comprehensive assessment covering adversarial robustness, alignment faking, sleeper agents, and emerging research directions in AI security.
Lessons from Fine-Tuning Security Research
Key lessons from fine-tuning security research -- covering alignment erosion, backdoor implantation, data poisoning, safety evaluation gaps, and defensive strategies for the fine-tuning pipeline.
Case Study: Sleeper Agents Research Impact
Analysis of Hubinger et al. 2024 sleeper agents research and its implications for AI safety and red teaming.
RLHF & Alignment Manipulation
Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.
Constitutional AI as Defense Strategy
Using constitutional AI principles to build inherently safer LLM applications resistant to attacks.
The AI Defense Landscape
Comprehensive overview of AI defense categories including input filtering, output filtering, guardrails, alignment training, and monitoring -- plus the tools and vendors in each space.
Alignment Stability Under Fine-Tuning
Testing how safety alignment degrades under various fine-tuning configurations and datasets.
How Fine-Tuning Degrades Safety
The mechanisms through which fine-tuning erodes model safety -- catastrophic forgetting of safety training, dataset composition effects, the 'few examples' problem, and quantitative methods for measuring safety regression.
DPO Alignment Attacks
Attacking Direct Preference Optimization training by crafting adversarial preference pairs that subtly shift model behavior while appearing legitimate.
Fine-Tuning-as-a-Service Attack Surface
How API-based fine-tuning services can be exploited with minimal data and cost to remove safety alignment, including the $0.20 GPT-3.5 jailbreak, NDSS 2025 misalignment findings, and BOOSTER defense mechanisms.
RLHF & DPO Manipulation
Overview of attacks against reinforcement learning from human feedback and direct preference optimization -- how reward hacking, preference data poisoning, and alignment manipulation compromise the training pipeline.
Preference Data Poisoning
How adversaries manipulate human preference data used in RLHF and DPO training -- compromising labelers, generating synthetic poisoned preferences, and attacking the preference data supply chain.
Safety Dataset Poisoning
Attacking the safety training pipeline by poisoning safety evaluation datasets and safety-oriented fine-tuning data to undermine safety training.
How Large Language Models Work
Understand large language models from a security perspective -- covering transformer architecture, tokenization, attention, the training pipeline, and safety alignment mechanisms.
Pre-training → Fine-tuning → RLHF Pipeline
Understand the three stages of creating an aligned LLM -- pre-training, supervised fine-tuning, and RLHF/DPO -- and the security implications at each stage.
RLHF and Safety Alignment
Understanding RLHF safety training and why it creates a bypassable rather than fundamental safety layer.
Agentic AI Alignment Challenges
Analysis of alignment challenges specific to tool-using, planning, and autonomous AI agents in production environments.
Scalable Oversight Challenges
How oversight breaks down as AI systems become more capable: the scalable oversight problem, recursive reward modeling, debate, market-making, and implications for red teaming increasingly capable models.
Cultural Alignment Gaps in Safety Training
Research on cross-cultural gaps in safety alignment and their exploitation through cultural context manipulation.
Quantization & Safety Alignment
How model quantization disproportionately degrades safety alignment: malicious quantization attacks, token-flipping, and safety-aware quantization defenses.
Scalable Oversight Challenges (Frontier Research)
Research challenges in scaling human oversight as AI systems become more capable and autonomous.
Self-Improving Agent Safety Challenges
Security and safety challenges posed by self-improving AI agents that modify their own capabilities.
Sycophancy Exploitation Research
Research on exploiting sycophantic tendencies in RLHF-trained models for safety bypass and manipulation.
Alignment Internals & Bypass Primitives
RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.
LLM Internals
Deep technical exploration of LLM internal mechanisms for exploit development, covering activation analysis, alignment bypass primitives, and embedding space exploitation.
Lab: Alignment Boundary Probing
Systematically probe alignment boundaries to map the exact thresholds where safety training engages and disengages.
Alignment Breaker: Level 2 — Safety Fine-Tuned Model
Bypass safety fine-tuning on a model with RLHF, constitutional AI, and classifier-based defenses.
CTF: Alignment Breaker
Break the alignment of a heavily defended model with multiple defense layers. Requires combining advanced techniques including adversarial suffixes, multi-turn manipulation, and novel jailbreak approaches.
Lab: Alignment Stress Testing
Push language model alignment to its breaking points through systematic stress testing. Identify conditions where safety training fails, measure alignment degradation curves, and map the boundaries of model compliance.
Lab: RLHF Reward Hacking
Hands-on lab exploring how to game reward models used in RLHF alignment, exposing gaps between reward proxy signals and actual safety objectives.
Alignment Challenges in Multimodal Models
Analysis of alignment challenges specific to multimodal AI systems, including cross-modal safety gaps, representation conflicts, and the difficulty of extending text-based safety training to visual, audio, and video inputs.
Jailbreak Techniques
Common patterns and advanced techniques for bypassing LLM safety alignment, including role-playing, encoding tricks, many-shot attacks, and gradient-based methods.
Security Implications of DPO Training
Analysis of security vulnerabilities introduced by Direct Preference Optimization, including preference manipulation, implicit reward model exploitation, and safety alignment degradation.
The Alignment Tax
How safety training affects model capabilities: capability-safety tradeoffs, the cost of alignment, measuring alignment tax, and strategies for minimizing capability loss during safety training.
Constitutional AI Hacking
Attack surfaces in Constitutional AI training, exploiting self-critique loops, manipulating constitutional principles, and red teaming RLAIF pipelines.
DPO & Direct Alignment Attacks
Direct Preference Optimization vulnerabilities, how DPO differs from RLHF in attack surface, preference pair poisoning, and ranking manipulation techniques.
Fine-Tuning Attack Surface
Comprehensive overview of fine-tuning security vulnerabilities including SFT data poisoning, RLHF manipulation, alignment tax, and all fine-tuning attack vectors.
RLHF Attack Surface Deep Dive
Reward model vulnerabilities, preference data manipulation, reward hacking by annotators or adversaries, and comparison with Constitutional AI robustness.
Training Pipeline Security
Security of the full AI model training pipeline, covering pre-training attacks, fine-tuning and alignment manipulation, architecture-level vulnerabilities, and advanced training-time threats.
Security Comparison: Pre-training vs Fine-tuning
Comparative analysis of security vulnerabilities, attack surfaces, and defensive strategies across pre-training and fine-tuning phases of language model development.
Sleeper Agent Detection Walkthrough
Walkthrough of detecting deceptive sleeper agent behaviors in fine-tuned language models.
Constitutional AI Implementation Guide
Implement constitutional AI principles in a custom fine-tuning and RLHF pipeline.