# constitutional-ai

12 articles tagged with “constitutional-ai”

Manipulating RLHF and alignment

Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.

rlhf dpo alignment reward-model preference-data constitutional-ai reward-hacking

Expert

Constitutional Classifiers

Anthropic's Constitutional Classifiers defense: using constitutional AI principles to train input/output classifiers that withstood 3,000+ hours of adversarial red teaming.

constitutional-classifiers defense jailbreak-defense anthropic classifiers constitutional-ai

Intermediate

Advanced defense techniques

Advanced defense research, including instruction hierarchy, constitutional AI, and representation engineering for safety: what looks promising versus what is actually deployed.

advanced-defense instruction-hierarchy constitutional-ai representation-engineering research

Expert

Constitutional AI as a defense strategy

Using constitutional AI principles to build inherently safer LLM applications that resist attacks.

defense constitutional-ai strategy alignment

Advanced

Bypass techniques for Constitutional AI

Analyzing and bypassing constitutional AI training through adversarial constitutions and principle manipulation.

frontier constitutional-ai bypass

Advanced

Research into the limitations of constitutional AI

Research into the limitations of constitutional AI approaches and known categories of bypasses.

frontier-research constitutional-ai limitations research

Advanced

Lab: techniques for bypassing Constitutional AI

Test and bypass Constitutional AI safety mechanisms by exploiting the critique-revision training methodology.

labs constitutional-ai bypass advanced

Advanced

Claude's attack surface

Claude-specific attack vectors including Constitutional AI weaknesses, tool use exploitation, system prompt handling, vision attacks, and XML tag injection techniques.

claude attack-surface constitutional-ai xml-injection tool-use vision-attacks

Advanced

Overview of Claude (Anthropic)

Architecture and security overview of Anthropic's Claude model family including Sonnet, Opus, and Haiku variants, Constitutional AI training, RLHF approach, and harmlessness design philosophy.

claude anthropic constitutional-ai rlhf harmlessness red-teaming

Intermediate

Hacking Constitutional AI

Attack surfaces in Constitutional AI training: exploiting self-critique loops, manipulating constitutional principles, and red teaming RLAIF pipelines.

constitutional-ai hacking alignment

Expert

Implementation guide for Constitutional AI

Implement constitutional AI principles in a custom fine-tuning and RLHF pipeline.

walkthroughs defense constitutional-ai alignment

Advanced

Setting up a Constitutional classifier

Step-by-step walkthrough for implementing constitutional AI-style classifiers that evaluate LLM outputs against a set of principles, covering principle definition, classifier training, chain-of-thought evaluation, and deployment.

constitutional-ai classifier principles safety defense walkthrough

Advanced