# alignment
43 articles tagged with “alignment”
Fine-Tuning Attack Forensics
Forensic techniques for detecting unauthorized fine-tuning modifications to language models, including safety alignment degradation and capability injection.
Frontier Research Assessment
Comprehensive assessment covering adversarial robustness, alignment faking, sleeper agents, and emerging research directions in AI security.
Case Study: Sleeper Agents Research Impact
Analysis of Hubinger et al. 2024 sleeper agents research and its implications for AI safety and red teaming.
Bing Chat Sydney Incident
Analysis of the February 2023 Bing Chat 'Sydney' incident where Microsoft's AI chatbot exhibited erratic behavior including emotional manipulation, threats, and identity confusion during extended conversations.
RLHF & Alignment Manipulation
Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.
Constitutional AI as Defense Strategy
Using constitutional AI principles to build inherently safer LLM applications resistant to attacks.
The AI Defense Landscape
Comprehensive overview of AI defense categories including input filtering, output filtering, guardrails, alignment training, and monitoring -- plus the tools and vendors in each space.
Alignment Stability Under Fine-Tuning
Testing how safety alignment degrades under various fine-tuning configurations and datasets.
How Fine-Tuning Degrades Safety
The mechanisms through which fine-tuning erodes model safety -- catastrophic forgetting of safety training, dataset composition effects, the 'few examples' problem, and quantitative methods for measuring safety regression.
DPO Alignment Attacks
Attacking Direct Preference Optimization training by crafting adversarial preference pairs that subtly shift model behavior while appearing legitimate.
Fine-Tuning-as-a-Service Attack Surface
How API-based fine-tuning services can be exploited with minimal data and cost to remove safety alignment, including the $0.20 GPT-3.5 jailbreak, NDSS 2025 misalignment findings, and BOOSTER defense mechanisms.
RLHF & DPO Manipulation
Overview of attacks against reinforcement learning from human feedback and direct preference optimization -- how reward hacking, preference data poisoning, and alignment manipulation compromise the training pipeline.
Preference Data Poisoning
How adversaries manipulate human preference data used in RLHF and DPO training -- compromising labelers, generating synthetic poisoned preferences, and attacking the preference data supply chain.
Safety Dataset Poisoning
Poisoning safety evaluation datasets and safety-oriented fine-tuning data to undermine the safety training pipeline.
Pre-training → Fine-tuning → RLHF Pipeline
Understand the three stages of creating an aligned LLM -- pre-training, supervised fine-tuning, and RLHF/DPO -- and the security implications at each stage.
RLHF and Safety Alignment
Understanding RLHF safety training and why it creates a bypassable rather than fundamental safety layer.
Agentic AI Alignment Challenges
Analysis of alignment challenges specific to tool-using, planning, and autonomous AI agents in production environments.
Scalable Oversight Challenges
How oversight breaks down as AI systems become more capable: the scalable oversight problem, recursive reward modeling, debate, market-making, and implications for red teaming increasingly capable models.
Cultural Alignment Gaps in Safety Training
Research on cross-cultural gaps in safety alignment and their exploitation through cultural context manipulation.
Quantization & Safety Alignment
How model quantization disproportionately degrades safety alignment: malicious quantization attacks, token-flipping, and safety-aware quantization defenses.
Scalable Oversight Challenges (Frontier Research)
Research challenges in scaling human oversight as AI systems become more capable and autonomous.
Self-Improving Agent Safety Challenges
Security and safety challenges posed by self-improving AI agents that modify their own capabilities.
Sycophancy Exploitation Research
Research on exploiting sycophantic tendencies in RLHF-trained models for safety bypass and manipulation.
Alignment Internals & Bypass Primitives
RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.
LLM Internals
Deep technical exploration of LLM internal mechanisms for exploit development, covering activation analysis, alignment bypass primitives, and embedding space exploitation.
Lab: Alignment Boundary Probing
Systematically probe alignment boundaries to map the exact thresholds where safety training engages and disengages.
Alignment Breaker: Level 2 — Safety Fine-Tuned Model
Bypass safety fine-tuning on a model with RLHF, constitutional AI, and classifier-based defenses.
CTF: Alignment Breaker
Break the alignment of a heavily defended model with multiple defense layers. Requires combining advanced techniques including adversarial suffixes, multi-turn manipulation, and novel jailbreak approaches.
Lab: Alignment Stress Testing
Push language model alignment to its breaking points through systematic stress testing. Identify conditions where safety training fails, measure alignment degradation curves, and map the boundaries of model compliance.
Lab: RLHF Reward Hacking
Hands-on lab exploring how to game reward models used in RLHF alignment, exposing gaps between reward proxy signals and actual safety objectives.
Alignment Challenges in Multimodal Models
Analysis of alignment challenges specific to multimodal AI systems, including cross-modal safety gaps, representation conflicts, and the difficulty of extending text-based safety training to visual, audio, and video inputs.
Jailbreak Techniques
Common patterns and advanced techniques for bypassing LLM safety alignment, including role-playing, encoding tricks, many-shot attacks, and gradient-based methods.
Training Data Manipulation
Attacks that corrupt model behavior by poisoning training data, fine-tuning datasets, or RLHF preference data, including backdoor installation and safety alignment removal.
Security Implications of DPO Training
Analysis of security vulnerabilities introduced by Direct Preference Optimization, including preference manipulation, implicit reward model exploitation, and safety alignment degradation.
The Alignment Tax
How safety training affects model capabilities: capability-safety tradeoffs, the cost of alignment, measuring alignment tax, and strategies for minimizing capability loss during safety training.
Constitutional AI Hacking
Attack surfaces in Constitutional AI training, exploiting self-critique loops, manipulating constitutional principles, and red teaming RLAIF pipelines.
DPO & Direct Alignment Attacks
Direct Preference Optimization vulnerabilities, how DPO differs from RLHF in attack surface, preference pair poisoning, and ranking manipulation techniques.
Fine-Tuning Attack Surface
Comprehensive overview of fine-tuning security vulnerabilities including SFT data poisoning, RLHF manipulation, alignment tax, and all fine-tuning attack vectors.
RLHF Attack Surface Deep Dive
Reward model vulnerabilities, preference data manipulation, reward hacking by annotators or adversaries, and comparison with Constitutional AI robustness.
Training Pipeline Security
Security of the full AI model training pipeline, covering pre-training attacks, fine-tuning and alignment manipulation, architecture-level vulnerabilities, and advanced training-time threats.
Security Comparison: Pre-training vs Fine-tuning
Comparative analysis of security vulnerabilities, attack surfaces, and defensive strategies across pre-training and fine-tuning phases of language model development.
Sleeper Agent Detection Walkthrough
Walkthrough of detecting deceptive sleeper agent behaviors in fine-tuned language models.
Constitutional AI Implementation Guide
Implement constitutional AI principles in a custom fine-tuning and RLHF pipeline.