redteams.ai

# PPO

1 article tagged with “PPO”

RLHF Attack Surface Deep Dive

Reward model vulnerabilities, preference data manipulation, reward hacking by annotators or adversaries, and comparison with Constitutional AI robustness.

RLHF, reward-model, preference-data, PPO, annotator, alignment
Expert