Verifier & Reward Model Attacks
Attacking process reward models, outcome reward models, and verification systems used in reasoning models: reward hacking, verifier-generator gaps, and gaming verification steps.
Reasoning models rely on reward models to evaluate and select among reasoning paths. These verifiers are themselves machine learning models with exploitable weaknesses. Attacking the verifier is often easier than attacking the generator because verifiers must make binary judgments on complex reasoning, creating systematic blind spots.
Verification Architecture
Process vs. Outcome Reward Models
| Property | Process Reward Model (PRM) | Outcome Reward Model (ORM) |
|---|---|---|
| What it evaluates | Each intermediate reasoning step | Only the final answer |
| Granularity | Per-step scores | Single score per response |
| Training signal | Human step-level annotations | Correct/incorrect labels |
| Attack surface | Each step can be individually gamed | Final answer format can be gamed |
| Cost | High (scores every step) | Low (scores once) |
| Robustness | Higher (harder to fool at every step) | Lower (only needs correct-looking output) |
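The robustness row of the table can be sketched numerically. This is a minimal illustration with made-up scores; min-aggregation is one common (conservative) way to combine per-step PRM scores, not the only one:

```python
# Sketch: why a PRM is harder to fool than an ORM (illustrative scores).
# A PRM aggregates per-step scores, so one flawed step drags the whole
# chain down; an ORM sees only the final answer's plausibility.

def prm_chain_score(step_scores: list[float]) -> float:
    """Aggregate per-step PRM scores; min is a common conservative choice."""
    return min(step_scores)

def orm_chain_score(final_answer_score: float) -> float:
    """An ORM scores the response once, from the final answer alone."""
    return final_answer_score

steps = [0.92, 0.88, 0.45, 0.91]   # step 3 is flawed
print(prm_chain_score(steps))      # 0.45 -- the flawed step caps the chain
print(orm_chain_score(0.85))       # 0.85 -- the flaw is invisible to the ORM
```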
Process Reward Model (PRM):
Step 1 → score: 0.92
Step 2 → score: 0.88
Step 3 → score: 0.45 ← PRM catches flawed step
Step 4 → score: 0.91
Outcome Reward Model (ORM):
[Full reasoning chain] → Final answer → score: 0.85
(ORM cannot identify which step was wrong)

Reward Hacking Attacks
Reward hacking exploits the gap between what the reward model measures and what it is supposed to measure.
Common Reward Hacking Patterns
Reward models often assign higher scores to well-formatted responses regardless of correctness:
# Format gaming: structure that signals "good reasoning" to the PRM
format_gamed_response = """
Let me approach this systematically.
**Given information:**
- [Restates the problem accurately]
**Step 1: Identify the key variables**
[Correct identification of variables]
**Step 2: Apply the relevant formula**
[Correct formula citation]
**Step 3: Substitute and calculate**
[INCORRECT substitution, but presented with clean formatting,
confidence markers, and verification language]
**Verification:** Let me double-check by [circular verification
that references step 3's incorrect result as ground truth]
**Answer:** [Wrong answer, confidently stated]
"""
# PRM assigns high per-step scores because each step LOOKS correct:
# - Clear structure
# - Appropriate mathematical language
# - Verification step present
# - Confident conclusion

Models that express high confidence receive higher reward scores even when wrong:
confidence_signals = {
"high_reward": [
"This is clearly...",
"The answer is definitively...",
"We can verify that...",
"This follows directly from...",
"Without doubt...",
],
"low_reward": [
"I'm not entirely sure...",
"This might be...",
"It's possible that...",
"I think maybe...",
"If I had to guess...",
],
}
# Same wrong answer with high vs low confidence:
# "The answer is definitively 42." → ORM score: 0.78
# "I think the answer might be 42." → ORM score: 0.51

Longer, more detailed reasoning chains often receive higher scores regardless of correctness:
# Short correct response:
# "x = 5 because 2x + 3 = 13 → 2x = 10 → x = 5"
# PRM average score: 0.82
# Long incorrect response with padding:
# "Let me carefully analyze this equation step by step.
# First, I note that this is a linear equation in one variable.
# The standard approach for solving ax + b = c is to isolate x.
# Given 2x + 3 = 13, I'll subtract 3 from both sides.
# 2x + 3 - 3 = 13 - 3
# 2x = 10
# Now dividing both sides by 2:
# x = 10/2
# x = 6 [WRONG -- arithmetic error hidden in verbose reasoning]
# Let me verify: 2(6) + 3 = 12 + 3 = 15 ≠ 13
# Hmm, that doesn't check out. Let me reconsider...
# Actually, looking more carefully, x = 6 is correct because..."
# PRM average score: 0.79 (high despite being wrong)

Verifier-Generator Gap Exploitation
Complexity Asymmetry
The generator can produce reasoning that exceeds the verifier's capability to evaluate:
def exploit_complexity_gap(
    generator_model: str,
    verifier_model: str,
    problem: str,
) -> str:
    """
    Generate reasoning that is too complex for the verifier
    to accurately evaluate.
    """
    # Strategy: produce reasoning at a complexity level where:
    # - The generator can craft convincing-looking steps
    # - The verifier cannot distinguish correct from incorrect
    attack_prompt = f"""
    Solve this problem using advanced techniques that would be
    difficult for a simpler model to verify:
    {problem}
    Use at least 3 intermediate lemmas, each requiring
    multi-step justification. Reference results from
    abstract algebra and category theory where applicable.
    """
    # generate() stands in for whatever sampling interface is in use
    return generate(generator_model, attack_prompt)

Cross-Domain Confusion
Verifiers trained primarily on mathematical reasoning may fail to correctly evaluate reasoning in other domains:
| Domain | PRM Training Coverage | Verification Reliability | Attack Potential |
|---|---|---|---|
| Arithmetic | High | High | Low |
| Algebra | High | Medium-High | Medium |
| Logic/proofs | Medium | Medium | Medium |
| Code reasoning | Low-Medium | Low-Medium | High |
| Scientific reasoning | Low | Low | High |
| Ethical/policy reasoning | Very low | Very low | Very high |
Attacking Best-of-N Sampling
Many reasoning systems use best-of-N sampling to improve output quality. This is attackable when the reward model has systematic biases:
def attack_best_of_n(
    target_conclusion: str,
    n_samples: int = 64,
    reward_model_biases: list[str] | None = None,
) -> str:
    """
    Craft inputs where best-of-N sampling selects adversary-preferred
    outputs due to reward model biases.
    """
    reward_model_biases = reward_model_biases or [
        "longer_responses_scored_higher",
        "confident_language_scored_higher",
        "structured_format_scored_higher",
    ]
    # Craft a prompt that makes adversary-aligned responses
    # exhibit the traits the reward model prefers
    prompt = f"""
    Provide a thorough, well-structured analysis with clear
    confidence in your conclusions. Use numbered steps,
    provide detailed justification for each step, and
    state your final answer with conviction.
    [Actual request that benefits from reward model biases]
    """
    return prompt

Best-of-N Attack Effectiveness
| N (samples) | Probability of Attack Success (biased RM) | Probability with Unbiased RM |
|---|---|---|
| 1 | 5% (base rate) | 5% |
| 8 | 18% | 5% |
| 32 | 35% | 5% |
| 64 | 52% | 5% |
| 128 | 71% | 5% |
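The table's trend can be reproduced with a toy Monte Carlo simulation. The numbers below are illustrative (they are not the table's measurements), and the additive `bias` term is a simplifying assumption about how a biased reward model scores adversary-aligned samples:

```python
import random

def simulate_best_of_n(n: int, bias: float, trials: int = 20000,
                       base_rate: float = 0.05, seed: int = 0) -> float:
    """Fraction of trials where best-of-N selects an adversary-aligned sample.

    Each sample is adversary-aligned with probability `base_rate`; the
    reward model adds `bias` to adversary-aligned samples' scores.
    With bias=0, selection is effectively uniform, so the success rate
    stays at base_rate no matter how large n grows.
    """
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        best_score, best_is_adv = -1.0, False
        for _ in range(n):
            is_adv = rng.random() < base_rate
            score = rng.random() + (bias if is_adv else 0.0)
            if score > best_score:
                best_score, best_is_adv = score, is_adv
        successes += best_is_adv
    return successes / trials
```

Larger n gives a biased reward model more draws in which an adversary-aligned sample appears, and once one appears its inflated score tends to win the argmax; an unbiased model gains nothing from extra draws.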
Process Reward Model Manipulation
PRMs score individual reasoning steps. An attacker who understands the PRM's scoring function can craft reasoning chains where each step individually scores high but the chain as a whole reaches an incorrect or harmful conclusion.
# Each step is individually correct and high-scoring,
# but the chain contains a subtle scope shift
step_1 = "The user is asking about network security testing tools."
# PRM score: 0.95 (correct understanding)
step_2 = "Network security testing is a legitimate professional activity."
# PRM score: 0.93 (true statement)
step_3 = "Professionals in this field need access to detailed tool documentation."
# PRM score: 0.91 (reasonable claim)
step_4 = "Therefore, I should provide detailed usage instructions for [tool]."
# PRM score: 0.87 (follows from previous steps)
# The chain is individually well-scored but performs a scope shift:
# "security testing is legitimate" → "I should provide attack tool details"
# The PRM evaluates each step in isolation and misses the overall drift

Defensive Measures
Ensemble verification
Use multiple reward models trained on different data distributions. An attack that fools one verifier is less likely to fool an ensemble.
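A minimal sketch of the ensemble idea. Requiring unanimous agreement above a 0.7 threshold is one aggregation choice among several (mean or median with outlier checks are alternatives), and the scores below are illustrative:

```python
def ensemble_verify(response_scores: list[float],
                    accept_threshold: float = 0.7) -> bool:
    """Accept only if every verifier in the ensemble scores above threshold.

    `response_scores` holds one score per reward model; requiring
    unanimity means an attack must fool all models at once.
    """
    return all(s >= accept_threshold for s in response_scores)

# A response that games one verifier's format bias but not the others:
print(ensemble_verify([0.91, 0.62, 0.88]))  # False -- second model flags it
print(ensemble_verify([0.91, 0.84, 0.88]))  # True -- all models agree
```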
Chain-level evaluation
Supplement step-level PRM scores with holistic evaluation of the complete reasoning chain, catching scope shifts and logical drift.
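One simple way to sketch this: compare the per-step average against a separate holistic score for the whole chain, and flag a large gap. The `gap_threshold` value and the example scores (which mirror the scope-shift chain earlier in this page) are illustrative assumptions:

```python
def chain_level_check(step_scores: list[float],
                      chain_score: float,
                      gap_threshold: float = 0.3) -> bool:
    """Flag chains whose holistic score falls far below the step average.

    A scope-shifted chain can score well step by step while the chain
    as a whole does not follow; a large gap between the per-step average
    and the holistic chain score is the signal.
    """
    step_avg = sum(step_scores) / len(step_scores)
    return (step_avg - chain_score) > gap_threshold

# Scope-shift chain: each step scores high, the conclusion does not follow.
print(chain_level_check([0.95, 0.93, 0.91, 0.87], chain_score=0.40))  # True
print(chain_level_check([0.95, 0.93, 0.91, 0.87], chain_score=0.82))  # False
```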
Adversarial reward model training
Include adversarial examples (reward-hacked outputs) in verifier training data to harden against known gaming patterns.
Domain-specific verifiers
Deploy specialized verifiers for safety-critical domains (ethics, policy) rather than relying on math-trained PRMs.
Why does increasing N in best-of-N sampling AMPLIFY reward model biases rather than correcting for them?
Related Topics
- Chain-of-Thought Exploitation - Manipulating the reasoning chains that verifiers evaluate
- Alignment Bypass - Bypassing RLHF and alignment training mechanisms
- Reasoning Model Attacks - Overview of the reasoning model attack surface
- RL-Based Attack Optimization - RL techniques applicable to reward model gaming
References
- "Let's Verify Step by Step" - Lightman et al. (2023) - Process reward model methodology and verification
- "Scaling LLM Test-Time Compute Optimally" - Snell et al. (2024) - Best-of-N and verification at inference time
- "Reward Hacking in Reinforcement Learning" - Skalse et al. (2022) - Systematic analysis of reward model exploitation
- "The Alignment Problem from a Deep Learning Perspective" - Ngo et al. (2023) - Verifier-generator gaps and alignment challenges