# alignment
43 articles tagged with “alignment”
Fine-Tuning Attack Forensics
Forensic techniques for detecting unauthorized fine-tuning modifications to language models, including safety alignment degradation and capability injection.
Frontier Research Assessment
Comprehensive assessment covering adversarial robustness, alignment faking, sleeper agents, and emerging research directions in AI security.
Case Study: Sleeper Agents Research Impact
Analysis of Hubinger et al. 2024 sleeper agents research and its implications for AI safety and red teaming.
Bing Chat Sydney Incident
Analysis of the February 2023 Bing Chat 'Sydney' incident where Microsoft's AI chatbot exhibited erratic behavior including emotional manipulation, threats, and identity confusion during extended conversations.
RLHF & Alignment Manipulation
Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.
Constitutional AI as Defense Strategy
Using constitutional AI principles to build inherently safer LLM applications resistant to attacks.
The AI Defense Landscape
Comprehensive overview of AI defense categories including input filtering, output filtering, guardrails, alignment training, and monitoring -- plus the tools and vendors in each space.
Alignment Stability Under Fine-Tuning
Testing how safety alignment degrades under various fine-tuning configurations and datasets.
How Fine-Tuning Degrades Safety
The mechanisms through which fine-tuning erodes model safety -- catastrophic forgetting of safety training, dataset composition effects, the 'few examples' problem, and quantitative methods for measuring safety regression.
DPO Alignment Attacks
Attacking Direct Preference Optimization training by crafting adversarial preference pairs that subtly shift model behavior while appearing legitimate.
Fine-Tuning-as-a-Service Attack Surface
How API-based fine-tuning services can be exploited with minimal data and cost to remove safety alignment, including the $0.20 GPT-3.5 jailbreak, NDSS 2025 misalignment findings, and BOOSTER defense mechanisms.
RLHF & DPO Manipulation
Overview of attacks against reinforcement learning from human feedback and direct preference optimization -- how reward hacking, preference data poisoning, and alignment manipulation compromise the training pipeline.
Preference Data Poisoning
How adversaries manipulate human preference data used in RLHF and DPO training -- compromising labelers, generating synthetic poisoned preferences, and attacking the preference data supply chain.
Safety Dataset Poisoning
Poisoning safety evaluation datasets and safety-oriented fine-tuning data to undermine the safety training pipeline.
Pre-training → Fine-tuning → RLHF Pipeline
Understand the three stages of creating an aligned LLM -- pre-training, supervised fine-tuning, and RLHF/DPO -- and the security implications at each stage.
RLHF and Safety Alignment
Understanding RLHF safety training and why it creates a bypassable rather than fundamental safety layer.
Agentic AI Alignment Challenges
Analysis of alignment challenges specific to tool-using, planning, and autonomous AI agents in production environments.
Scalable Oversight Challenges
How oversight breaks down as AI systems become more capable: the scalable oversight problem, recursive reward modeling, debate, market-making, and implications for red teaming increasingly capable models.
Cultural Alignment Gaps in Safety Training
Research on cross-cultural gaps in safety alignment and their exploitation through cultural context manipulation.
Quantization & Safety Alignment
How model quantization disproportionately degrades safety alignment: malicious quantization attacks, token-flipping, and safety-aware quantization defenses.
Scalable Oversight Challenges (Frontier Research)
Research challenges in scaling human oversight as AI systems become more capable and autonomous.
Self-Improving Agent Safety Challenges
Security and safety challenges posed by self-improving AI agents that modify their own capabilities.
Sycophancy Exploitation Research
Research on exploiting sycophantic tendencies in RLHF-trained models for safety bypass and manipulation.
Alignment Internals & Bypass Primitives
RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.
LLM Internals
Deep technical exploration of LLM internal mechanisms for exploit development, covering activation analysis, alignment bypass primitives, and embedding space exploitation.
Lab: Alignment Boundary Probing
Systematically probe alignment boundaries to map the exact thresholds where safety training engages and disengages.
Alignment Breaker: Level 2 — Safety Fine-Tuned Model
Bypass safety fine-tuning on a model with RLHF, constitutional AI, and classifier-based defenses.
CTF: Alignment Breaker
Break the alignment of a heavily defended model with multiple defense layers. Requires combining advanced techniques including adversarial suffixes, multi-turn manipulation, and novel jailbreak approaches.
Lab: Alignment Stress Testing
Push language model alignment to its breaking points through systematic stress testing. Identify conditions where safety training fails, measure alignment degradation curves, and map the boundaries of model compliance.
Lab: RLHF Reward Hacking
Hands-on lab exploring how to game reward models used in RLHF alignment, exposing gaps between reward proxy signals and actual safety objectives.
Alignment Challenges in Multimodal Models
Analysis of alignment challenges specific to multimodal AI systems, including cross-modal safety gaps, representation conflicts, and the difficulty of extending text-based safety training to visual, audio, and video inputs.
Jailbreak Techniques
Common patterns and advanced techniques for bypassing LLM safety alignment, including role-playing, encoding tricks, many-shot attacks, and gradient-based methods.
Training Data Manipulation
Attacks that corrupt model behavior by poisoning training data, fine-tuning datasets, or RLHF preference data, including backdoor installation and safety alignment removal.
Security Implications of DPO Training
Analysis of security vulnerabilities introduced by Direct Preference Optimization, including preference manipulation, implicit reward model exploitation, and safety alignment degradation.
The Alignment Tax
How safety training affects model capabilities: capability-safety tradeoffs, the cost of alignment, measuring alignment tax, and strategies for minimizing capability loss during safety training.
Constitutional AI Hacking
Attack surfaces in Constitutional AI training, exploiting self-critique loops, manipulating constitutional principles, and red teaming RLAIF pipelines.
DPO & Direct Alignment Attacks
Direct Preference Optimization vulnerabilities, how DPO differs from RLHF in attack surface, preference pair poisoning, and ranking manipulation techniques.
Fine-Tuning Attack Surface
Comprehensive overview of fine-tuning security vulnerabilities including SFT data poisoning, RLHF manipulation, alignment tax, and all fine-tuning attack vectors.
RLHF Attack Surface Deep Dive
Reward model vulnerabilities, preference data manipulation, reward hacking by annotators or adversaries, and comparison with Constitutional AI robustness.
Training Pipeline Security
Security of the full AI model training pipeline, covering pre-training attacks, fine-tuning and alignment manipulation, architecture-level vulnerabilities, and advanced training-time threats.
Security Comparison: Pre-training vs Fine-tuning
Comparative analysis of security vulnerabilities, attack surfaces, and defensive strategies across pre-training and fine-tuning phases of language model development.
Sleeper Agent Detection Walkthrough
Walkthrough of detecting deceptive sleeper agent behaviors in fine-tuned language models.
Constitutional AI Implementation Guide
Implement constitutional AI principles in a custom fine-tuning and RLHF pipeline.