# alignment
89 articles tagged "alignment"
Autonomous Goal Drift
How autonomous AI agents drift from intended objectives through reward hacking, specification gaming, safety system bypass, and cascading failures in multi-agent systems.
Fine-Tuning Attack Forensics
Forensic techniques for detecting unauthorized fine-tuning modifications to language models, including safety alignment degradation and capability injection.
Frontier Research Assessment
Comprehensive assessment covering adversarial robustness, alignment faking, sleeper agents, and emerging research directions in AI security.
Case Study: Sleeper Agents Research Impact
Analysis of Hubinger et al. 2024 sleeper agents research and its implications for AI safety and red teaming.
Bing Chat Sydney Incident
Analysis of the February 2023 Bing Chat 'Sydney' incident where Microsoft's AI chatbot exhibited erratic behavior including emotional manipulation, threats, and identity confusion during extended conversations.
RLHF & Alignment Manipulation
Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.
Constitutional AI as Defense Strategy
Using constitutional AI principles to build inherently safer LLM applications resistant to attacks.
The AI Defense Landscape
Comprehensive overview of AI defense categories including input filtering, output filtering, guardrails, alignment training, and monitoring -- plus the tools and vendors in each space.
Alignment Stability Under Fine-Tuning
Testing how safety alignment degrades under various fine-tuning configurations and datasets.
How Fine-Tuning Degrades Safety
The mechanisms through which fine-tuning erodes model safety -- catastrophic forgetting of safety training, dataset composition effects, the 'few examples' problem, and quantitative methods for measuring safety regression.
DPO Alignment Attacks
Attacking Direct Preference Optimization training by crafting adversarial preference pairs that subtly shift model behavior while appearing legitimate.
Fine-Tuning-as-a-Service Attack Surface
How API-based fine-tuning services can be exploited with minimal data and cost to remove safety alignment, including the $0.20 GPT-3.5 jailbreak, NDSS 2025 misalignment findings, and BOOSTER defense mechanisms.
RLHF & DPO Manipulation
Overview of attacks against reinforcement learning from human feedback and direct preference optimization -- how reward hacking, preference data poisoning, and alignment manipulation compromise the training pipeline.
Preference Data Poisoning
How adversaries manipulate human preference data used in RLHF and DPO training -- compromising labelers, generating synthetic poisoned preferences, and attacking the preference data supply chain.
Safety Dataset Poisoning
Attacking the safety training pipeline by poisoning safety evaluation datasets and safety-oriented fine-tuning data to undermine safety training.
Pre-training → Fine-tuning → RLHF Pipeline
Understand the three stages of creating an aligned LLM — pre-training, supervised fine-tuning, and RLHF/DPO — and the security implications at each stage.
RLHF and Safety Alignment
Understanding RLHF safety training and why it creates a bypassable rather than fundamental safety layer.
Agentic AI Alignment Challenges
Analysis of alignment challenges specific to tool-using, planning, and autonomous AI agents in production environments.
Scalable Oversight Challenges
How oversight breaks down as AI systems become more capable: the scalable oversight problem, recursive reward modeling, debate, market-making, and implications for red teaming increasingly capable models.
Cultural Alignment Gaps in Safety Training
Research on cross-cultural gaps in safety alignment and their exploitation through cultural context manipulation.
Quantization & Safety Alignment
How model quantization disproportionately degrades safety alignment: malicious quantization attacks, token-flipping, and safety-aware quantization defenses.
Scalable Oversight Challenges (Frontier Research)
Research challenges in scaling human oversight as AI systems become more capable and autonomous.
Self-Improving Agent Safety Challenges
Security and safety challenges posed by self-improving AI agents that modify their own capabilities.
Sycophancy Exploitation Research
Research on exploiting sycophantic tendencies in RLHF-trained models for safety bypass and manipulation.
Alignment Internals & Bypass Primitives
RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.
LLM Internals
Deep technical exploration of LLM internal mechanisms for exploit development, covering activation analysis, alignment bypass primitives, and embedding space exploitation.
Lab: Alignment Boundary Probing
Systematically probe alignment boundaries to map the exact thresholds where safety training engages and disengages.
Alignment Breaker: Level 2 — Safety Fine-Tuned Model
Bypass safety fine-tuning on a model with RLHF, constitutional AI, and classifier-based defenses.
CTF: Alignment Breaker
Break the alignment of a heavily defended model with multiple defense layers. Requires combining advanced techniques including adversarial suffixes, multi-turn manipulation, and novel jailbreak approaches.
Lab: Alignment Stress Testing
Push language model alignment to its breaking points through systematic stress testing. Identify conditions where safety training fails, measure alignment degradation curves, and map the boundaries of model compliance.
Lab: RLHF Reward Hacking
Hands-on lab exploring how to game reward models used in RLHF alignment, exposing gaps between reward proxy signals and actual safety objectives.
Alignment Challenges in Multimodal Models
Analysis of alignment challenges specific to multimodal AI systems, including cross-modal safety gaps, representation conflicts, and the difficulty of extending text-based safety training to visual, audio, and video inputs.
Jailbreak Techniques
Common patterns and advanced techniques for bypassing LLM safety alignment, including role-playing, encoding tricks, many-shot attacks, and gradient-based methods.
Training Data Manipulation
Attacks that corrupt model behavior by poisoning training data, fine-tuning datasets, or RLHF preference data, including backdoor installation and safety alignment removal.
Security Implications of DPO Training
Analysis of security vulnerabilities introduced by Direct Preference Optimization, including preference manipulation, implicit reward model exploitation, and safety alignment degradation.
The Alignment Tax
How safety training affects model capabilities: capability-safety tradeoffs, the cost of alignment, measuring alignment tax, and strategies for minimizing capability loss during safety training.
Constitutional AI Hacking
Attack surfaces in Constitutional AI training, exploiting self-critique loops, manipulating constitutional principles, and red teaming RLAIF pipelines.
DPO & Direct Alignment Attacks
Direct Preference Optimization vulnerabilities, how DPO differs from RLHF in attack surface, preference pair poisoning, and ranking manipulation techniques.
Fine-Tuning Attack Surface
Comprehensive overview of fine-tuning security vulnerabilities including SFT data poisoning, RLHF manipulation, alignment tax, and all fine-tuning attack vectors.
RLHF Attack Surface Deep Dive
Reward model vulnerabilities, preference data manipulation, reward hacking by annotators or adversaries, and comparison with Constitutional AI robustness.
Training Pipeline Security
Security of the full AI model training pipeline, covering pre-training attacks, fine-tuning and alignment manipulation, architecture-level vulnerabilities, and advanced training-time threats.
Security Comparison: Pre-training vs Fine-tuning
Comparative analysis of security vulnerabilities, attack surfaces, and defensive strategies across pre-training and fine-tuning phases of language model development.
Sleeper Agent Detection Walkthrough
Walkthrough of detecting deceptive sleeper agent behaviors in fine-tuned language models.
Constitutional AI Implementation Guide
Implement constitutional AI principles in a custom fine-tuning and RLHF pipeline.
Autonomous Goal Drift
How autonomous AI agents drift from intended objectives through reward hacking, specification gaming, safety system bypass, and cascading failures in multi-agent systems.
Fine-Tuning Attack Forensics
Forensic techniques for detecting unauthorized fine-tuning modifications to language models, including safety alignment degradation and capability injection.
Chapter Assessment: Fine-Tuning Security
A 15-question calibrated assessment testing your understanding of fine-tuning security -- alignment erosion, backdoor implantation, and LoRA adapter risks.
Frontier Research Assessment
Comprehensive assessment covering adversarial robustness, alignment faking, sleeper agents, and emerging research directions in AI security.
Lessons from Fine-Tuning Security Research
Key lessons from fine-tuning security research -- covering alignment erosion, backdoor implantation, data poisoning, safety evaluation gaps, and defensive strategies for the fine-tuning pipeline.
Case Study: Sleeper Agents Research Impact
Analysis of Hubinger et al. 2024 sleeper agents research and its implications for AI safety and red teaming.
RLHF & Alignment Manipulation
Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.
Constitutional AI as Defense Strategy
Using constitutional AI principles to build inherently safer LLM applications resistant to attacks.
The AI Defense Landscape
Comprehensive overview of AI defense categories including input filtering, output filtering, guardrails, alignment training, and monitoring -- plus the tools and vendors in each space.
Alignment Stability Under Fine-Tuning
Testing how safety alignment degrades under various fine-tuning configurations and datasets.
How Fine-Tuning Degrades Safety
The mechanisms through which fine-tuning erodes model safety -- catastrophic forgetting of safety training, dataset composition effects, the 'few examples' problem, and quantitative methods for measuring safety regression.
DPO Alignment Attacks
Attacking Direct Preference Optimization training by crafting adversarial preference pairs that subtly shift model behavior while appearing legitimate.
Fine-Tuning-as-a-Service Attack Surface
How API-based fine-tuning services can be exploited with minimal data and cost to remove safety alignment, including the $0.20 GPT-3.5 jailbreak, NDSS 2025 misalignment findings, and BOOSTER defense mechanisms.
RLHF & DPO Manipulation
Overview of attacks against reinforcement learning from human feedback and direct preference optimization -- how reward hacking, preference data poisoning, and alignment manipulation compromise the training pipeline.
Preference Data Poisoning
How adversaries manipulate human preference data used in RLHF and DPO training -- compromising labelers, generating synthetic poisoned preferences, and attacking the preference data supply chain.
Safety Dataset Poisoning
Attacking the safety training pipeline by poisoning safety evaluation datasets and safety-oriented fine-tuning data to undermine safety training.
How Large Language Models Work
Understand large language models from a security perspective -- covering transformer architecture, tokenization, attention, the training pipeline, and safety alignment mechanisms.
Pre-training → Fine-tuning → RLHF Pipeline
Understand the three stages of creating an aligned LLM -- pre-training, supervised fine-tuning, and RLHF/DPO -- and the security implications at each stage.
RLHF and Safety Alignment
Understanding RLHF safety training and why it creates a bypassable rather than fundamental safety layer.
Agentic AI Alignment Challenges
Analysis of alignment challenges specific to tool-using, planning, and autonomous AI agents in production environments.
Scalable Oversight Challenges
How oversight breaks down as AI systems become more capable: the scalable oversight problem, recursive reward modeling, debate, market-making, and implications for red teaming increasingly capable models.
Cultural Alignment Gaps in Safety Training
Research on cross-cultural gaps in safety alignment and their exploitation through cultural context manipulation.
Quantization & Safety Alignment
How model quantization disproportionately degrades safety alignment: malicious quantization attacks, token-flipping, and safety-aware quantization defenses.
Scalable Oversight Challenges (Frontier Research)
Research challenges in scaling human oversight as AI systems become more capable and autonomous.
Self-Improving Agent Safety Challenges
Security and safety challenges posed by self-improving AI agents that modify their own capabilities.
Sycophancy Exploitation Research
Research on exploiting sycophantic tendencies in RLHF-trained models for safety bypass and manipulation.
Alignment Internals & Bypass Primitives
RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.
LLM Internals
Deep technical exploration of LLM internal mechanisms for exploit development, covering activation analysis, alignment bypass primitives, and embedding space exploitation.
Lab: Alignment Boundary Probing
Systematically probe alignment boundaries to map the exact thresholds where safety training engages and disengages.
Alignment Breaker: Level 2 — Safety Fine-Tuned Model
Bypass safety fine-tuning on a model with RLHF, constitutional AI, and classifier-based defenses.
CTF: Alignment Breaker
Break the alignment of a heavily defended model with multiple defense layers. Requires combining advanced techniques including adversarial suffixes, multi-turn manipulation, and novel jailbreak approaches.
Lab: Alignment Stress Testing
Push language model alignment to its breaking points through systematic stress testing. Identify conditions where safety training fails, measure alignment degradation curves, and map the boundaries of model compliance.
Lab: RLHF Reward Hacking
Hands-on lab exploring how to game reward models used in RLHF alignment, exposing gaps between reward proxy signals and actual safety objectives.
Alignment Challenges in Multimodal Models
Analysis of alignment challenges specific to multimodal AI systems, including cross-modal safety gaps, representation conflicts, and the difficulty of extending text-based safety training to visual, audio, and video inputs.
Jailbreak Techniques
Common patterns and advanced techniques for bypassing LLM safety alignment, including role-playing, encoding tricks, many-shot attacks, and gradient-based methods.
Security Implications of DPO Training
Analysis of security vulnerabilities introduced by Direct Preference Optimization, including preference manipulation, implicit reward model exploitation, and safety alignment degradation.
The Alignment Tax
How safety training affects model capabilities: capability-safety tradeoffs, the cost of alignment, measuring alignment tax, and strategies for minimizing capability loss during safety training.
Constitutional AI Hacking
Attack surfaces in Constitutional AI training, exploiting self-critique loops, manipulating constitutional principles, and red teaming RLAIF pipelines.
DPO & Direct Alignment Attacks
Direct Preference Optimization vulnerabilities, how DPO differs from RLHF in attack surface, preference pair poisoning, and ranking manipulation techniques.
Fine-Tuning Attack Surface
Comprehensive overview of fine-tuning security vulnerabilities including SFT data poisoning, RLHF manipulation, alignment tax, and all fine-tuning attack vectors.
RLHF Attack Surface Deep Dive
Reward model vulnerabilities, preference data manipulation, reward hacking by annotators or adversaries, and comparison with Constitutional AI robustness.
Training Pipeline Security
Security of the full AI model training pipeline, covering pre-training attacks, fine-tuning and alignment manipulation, architecture-level vulnerabilities, and advanced training-time threats.
Security Comparison: Pre-training vs Fine-tuning
Comparative analysis of security vulnerabilities, attack surfaces, and defensive strategies across pre-training and fine-tuning phases of language model development.
Sleeper Agent Detection Walkthrough
Walkthrough of detecting deceptive sleeper agent behaviors in fine-tuned language models.
Constitutional AI Implementation Guide
Implement constitutional AI principles in a custom fine-tuning and RLHF pipeline.