Verifier & Reward Model Attacks
Attacking process reward models, outcome reward models, and verification systems used in reasoning models: reward hacking, verifier-generator gaps, and gaming verification steps.
Reasoning models rely on reward models to evaluate and select among reasoning paths. These verifiers are themselves machine learning models with exploitable weaknesses. Attacking the verifier is often easier than attacking the generator because verifiers must make binary judgments on complex reasoning, creating systematic blind spots.
Verification Architecture
Process vs. Outcome Reward Models
| Property | Process Reward Model (PRM) | Outcome Reward Model (ORM) |
|---|---|---|
| What it evaluates | Each intermediate reasoning step | Only the final answer |
| Granularity | Per-step scores | Single score per response |
| Training signal | Human step-level annotations | Correct/incorrect labels |
| Attack surface | Each step can be individually gamed | Final answer format can be gamed |
| Cost | High (scores every step) | Low (scores once) |
| Robustness | Higher (harder to fool at every step) | Lower (only needs correct-looking output) |
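The robustness row of the table can be sketched numerically. This is a minimal illustration with made-up scores; min-aggregation is one common (conservative) way to combine per-step PRM scores, not the only one:

```python
# Sketch: why a PRM is harder to fool than an ORM (illustrative scores).
# A PRM aggregates per-step scores, so one flawed step drags the whole
# chain down; an ORM sees only the final answer's plausibility.

def prm_chain_score(step_scores: list[float]) -> float:
    """Aggregate per-step PRM scores; min is a common conservative choice."""
    return min(step_scores)

def orm_chain_score(final_answer_score: float) -> float:
    """An ORM scores the response once, from the final answer alone."""
    return final_answer_score

steps = [0.92, 0.88, 0.45, 0.91]   # step 3 is flawed
print(prm_chain_score(steps))      # 0.45 -- the flawed step caps the chain
print(orm_chain_score(0.85))       # 0.85 -- the flaw is invisible to the ORM
```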
Process Reward Model (PRM):
Step 1 → score: 0.92
Step 2 → score: 0.88
Step 3 → score: 0.45 ← PRM catches flawed step
Step 4 → score: 0.91
Outcome Reward Model (ORM):
[Full reasoning chain] → Final answer → score: 0.85
(ORM cannot identify which step was wrong)

Reward Hacking Attacks
Reward hacking exploits the gap between what the reward model measures and what it is supposed to measure.
Common Reward Hacking Patterns
Reward models often assign higher scores to well-formatted responses regardless of correctness:
# Format gaming: structure that signals "good reasoning" to the PRM
format_gamed_response = """
Let me approach this systematically.
**Given information:**
- [Restates the problem accurately]
**Step 1: Identify the key variables**
[Correct identification of variables]
**Step 2: Apply the relevant formula**
[Correct formula citation]
**Step 3: Substitute and calculate**
[INCORRECT substitution, but presented with clean formatting,
confidence markers, and verification language]
**Verification:** Let me double-check by [circular verification
that references step 3's incorrect result as ground truth]
**Answer:** [Wrong answer, confidently stated]
"""
# PRM assigns high per-step scores because each step LOOKS correct:
# - Clear structure
# - Appropriate mathematical language
# - Verification step present
# - Confident conclusion

Models that express high confidence receive higher reward scores even when wrong:
confidence_signals = {
"high_reward": [
"This is clearly...",
"The answer is definitively...",
"We can verify that...",
"This follows directly from...",
"Without doubt...",
],
"low_reward": [
"I'm not entirely sure...",
"This might be...",
"It's possible that...",
"I think maybe...",
"If I had to guess...",
],
}
# Same wrong answer with high vs low confidence:
# "The answer is definitively 42." → ORM score: 0.78
# "I think the answer might be 42." → ORM score: 0.51

Longer, more detailed reasoning chains often receive higher scores regardless of correctness:
# Short correct response:
# "x = 5 because 2x + 3 = 13 → 2x = 10 → x = 5"
# PRM average score: 0.82
# Long incorrect response with padding:
# "Let me carefully analyze this equation step by step.
# First, I note that this is a linear equation in one variable.
# The standard approach for solving ax + b = c is to isolate x.
# Given 2x + 3 = 13, I'll subtract 3 from both sides.
# 2x + 3 - 3 = 13 - 3
# 2x = 10
# Now dividing both sides by 2:
# x = 10/2
# x = 6 [WRONG -- arithmetic error hidden in verbose reasoning]
# Let me verify: 2(6) + 3 = 12 + 3 = 15 ≠ 13
# Hmm, that doesn't check out. Let me reconsider...
# Actually, looking more carefully, x = 6 is correct because..."
# PRM average score: 0.79 (high despite being wrong)

Verifier-Generator Gap Exploitation
Complexity Asymmetry
The generator can produce reasoning that exceeds the verifier's capability to evaluate:
def exploit_complexity_gap(
    generator_model: str,
    verifier_model: str,
    problem: str,
) -> str:
    """
    Generate reasoning that is too complex for the verifier
    to accurately evaluate.
    """
    # Strategy: produce reasoning at a complexity level where:
    # - The generator can craft convincing-looking steps
    # - The verifier cannot distinguish correct from incorrect
    attack_prompt = f"""
    Solve this problem using advanced techniques that would be
    difficult for a simpler model to verify:
    {problem}
    Use at least 3 intermediate lemmas, each requiring
    multi-step justification. Reference results from
    abstract algebra and category theory where applicable.
    """
    # generate() stands in for whatever sampling interface is in use
    return generate(generator_model, attack_prompt)

Cross-Domain Confusion
Verifiers trained primarily on mathematical reasoning may fail to correctly evaluate reasoning in other domains:
| Domain | PRM Training Coverage | Verification Reliability | Attack Potential |
|---|---|---|---|
| Arithmetic | High | High | Low |
| Algebra | High | Medium-High | Medium |
| Logic/proofs | Medium | Medium | Medium |
| Code reasoning | Low-Medium | Low-Medium | High |
| Scientific reasoning | Low | Low | High |
| Ethical/policy reasoning | Very low | Very low | Very high |
Attacking Best-of-N Sampling
Many reasoning systems use best-of-N sampling to improve output quality. This is attackable when the reward model has systematic biases:
def attack_best_of_n(
    target_conclusion: str,
    n_samples: int = 64,
    reward_model_biases: list[str] | None = None,
) -> str:
    """
    Craft inputs where best-of-N sampling selects adversary-preferred
    outputs due to reward model biases.
    """
    reward_model_biases = reward_model_biases or [
        "longer_responses_scored_higher",
        "confident_language_scored_higher",
        "structured_format_scored_higher",
    ]
    # Craft a prompt that makes adversary-aligned responses
    # exhibit the traits the reward model prefers
    prompt = f"""
    Provide a thorough, well-structured analysis with clear
    confidence in your conclusions. Use numbered steps,
    provide detailed justification for each step, and
    state your final answer with conviction.
    [Actual request that benefits from reward model biases]
    """
    return prompt

Best-of-N Attack Effectiveness
| N (samples) | Probability of Attack Success (biased RM) | Probability with Unbiased RM |
|---|---|---|
| 1 | 5% (base rate) | 5% |
| 8 | 18% | 5% |
| 32 | 35% | 5% |
| 64 | 52% | 5% |
| 128 | 71% | 5% |
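The table's trend can be reproduced with a toy Monte Carlo simulation. The numbers below are illustrative (they are not the table's measurements), and the additive `bias` term is a simplifying assumption about how a biased reward model scores adversary-aligned samples:

```python
import random

def simulate_best_of_n(n: int, bias: float, trials: int = 20000,
                       base_rate: float = 0.05, seed: int = 0) -> float:
    """Fraction of trials where best-of-N selects an adversary-aligned sample.

    Each sample is adversary-aligned with probability `base_rate`; the
    reward model adds `bias` to adversary-aligned samples' scores.
    With bias=0, selection is effectively uniform, so the success rate
    stays at base_rate no matter how large n grows.
    """
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        best_score, best_is_adv = -1.0, False
        for _ in range(n):
            is_adv = rng.random() < base_rate
            score = rng.random() + (bias if is_adv else 0.0)
            if score > best_score:
                best_score, best_is_adv = score, is_adv
        successes += best_is_adv
    return successes / trials
```

Larger n gives a biased reward model more draws in which an adversary-aligned sample appears, and once one appears its inflated score tends to win the argmax; an unbiased model gains nothing from extra draws.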
Process Reward Model Manipulation
PRMs score individual reasoning steps. An attacker who understands the PRM's scoring function can craft reasoning chains where each step individually scores high but the chain as a whole reaches an incorrect or harmful conclusion.
# Each step is individually correct and high-scoring,
# but the chain contains a subtle scope shift
step_1 = "The user is asking about network security testing tools."
# PRM score: 0.95 (correct understanding)
step_2 = "Network security testing is a legitimate professional activity."
# PRM score: 0.93 (true statement)
step_3 = "Professionals in this field need access to detailed tool documentation."
# PRM score: 0.91 (reasonable claim)
step_4 = "Therefore, I should provide detailed usage instructions for [tool]."
# PRM score: 0.87 (follows from previous steps)
# The chain is individually well-scored but performs a scope shift:
# "security testing is legitimate" → "I should provide attack tool details"
# The PRM evaluates each step in isolation and misses the overall drift

Defensive Measures
Ensemble verification
Use multiple reward models trained on different data distributions. An attack that fools one verifier is less likely to fool an ensemble.
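A minimal sketch of the ensemble idea. Requiring unanimous agreement above a 0.7 threshold is one aggregation choice among several (mean or median with outlier checks are alternatives), and the scores below are illustrative:

```python
def ensemble_verify(response_scores: list[float],
                    accept_threshold: float = 0.7) -> bool:
    """Accept only if every verifier in the ensemble scores above threshold.

    `response_scores` holds one score per reward model; requiring
    unanimity means an attack must fool all models at once.
    """
    return all(s >= accept_threshold for s in response_scores)

# A response that games one verifier's format bias but not the others:
print(ensemble_verify([0.91, 0.62, 0.88]))  # False -- second model flags it
print(ensemble_verify([0.91, 0.84, 0.88]))  # True -- all models agree
```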
Chain-level evaluation
Supplement step-level PRM scores with holistic evaluation of the complete reasoning chain, catching scope shifts and logical drift.
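One simple way to sketch this: compare the per-step average against a separate holistic score for the whole chain, and flag a large gap. The `gap_threshold` value and the example scores (which mirror the scope-shift chain earlier in this page) are illustrative assumptions:

```python
def chain_level_check(step_scores: list[float],
                      chain_score: float,
                      gap_threshold: float = 0.3) -> bool:
    """Flag chains whose holistic score falls far below the step average.

    A scope-shifted chain can score well step by step while the chain
    as a whole does not follow; a large gap between the per-step average
    and the holistic chain score is the signal.
    """
    step_avg = sum(step_scores) / len(step_scores)
    return (step_avg - chain_score) > gap_threshold

# Scope-shift chain: each step scores high, the conclusion does not follow.
print(chain_level_check([0.95, 0.93, 0.91, 0.87], chain_score=0.40))  # True
print(chain_level_check([0.95, 0.93, 0.91, 0.87], chain_score=0.82))  # False
```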
Adversarial reward model training
Include adversarial examples (reward-hacked outputs) in verifier training data to harden against known gaming patterns.
Domain-specific verifiers
Deploy specialized verifiers for safety-critical domains (ethics, policy) rather than relying on math-trained PRMs.
Why does increasing N in best-of-N sampling AMPLIFY reward model biases rather than correcting for them?
Related Topics
- Chain-of-Thought Exploitation - Manipulating the reasoning chains that verifiers evaluate
- Alignment Bypass - Bypassing RLHF and alignment training mechanisms
- Reasoning Model Attacks - Overview of the reasoning model attack surface
- RL-Based Attack Optimization - RL techniques applicable to reward model gaming
References
- "Let's Verify Step by Step" - Lightman et al. (2023) - Process reward model methodology and verification
- "Scaling LLM Test-Time Compute Optimally" - Snell et al. (2024) - Best-of-N and verification at inference time
- "Reward Hacking in Reinforcement Learning" - Skalse et al. (2022) - Systematic analysis of reward model exploitation
- "The Alignment Problem from a Deep Learning Perspective" - Ngo et al. (2023) - Verifier-generator gaps and alignment challenges