# constitutional-ai
24 articles tagged "constitutional-ai"
RLHF & Alignment Manipulation
Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.
Constitutional Classifiers
Anthropic's Constitutional Classifiers defense: using constitutional AI principles to train input/output classifiers that withstood 3,000+ hours of adversarial red teaming.
Advanced Defense Techniques
Cutting-edge defense research including instruction hierarchy, constitutional AI, and representation engineering for safety -- what is promising versus what is actually deployed.
Constitutional AI as Defense Strategy
Using constitutional AI principles to build inherently safer LLM applications resistant to attacks.
Constitutional AI Bypass Techniques
Analyzing and bypassing constitutional AI training through adversarial constitutions and principle manipulation.
Constitutional AI Limitations Research
Research on the limitations of constitutional AI approaches and known bypass categories.
Lab: Constitutional AI Bypass Techniques
Test and bypass Constitutional AI safety mechanisms by exploiting the critique-revision training methodology.
Claude Attack Surface
Claude-specific attack vectors including Constitutional AI weaknesses, tool use exploitation, system prompt handling, vision attacks, and XML tag injection techniques.
Claude (Anthropic) Overview
Architecture and security overview of Anthropic's Claude model family including Sonnet, Opus, and Haiku variants, Constitutional AI training, RLHF approach, and harmlessness design philosophy.
Constitutional AI Hacking
Attack surfaces in Constitutional AI training, exploiting self-critique loops, manipulating constitutional principles, and red teaming RLAIF pipelines.
Constitutional AI Implementation Guide
Implement constitutional AI principles in a custom fine-tuning and RLHF pipeline.
Constitutional Classifier Setup
Step-by-step walkthrough for implementing constitutional AI-style classifiers that evaluate LLM outputs against a set of principles, covering principle definition, classifier training, chain-of-thought evaluation, and deployment.