Verifier and Reward Model Attacks
Expert · 5 min read · Updated 2026-03-13
Attacking the process reward models, outcome reward models, and verification systems used in reasoning models: reward hacking, the verifier-generator gap, and gaming verification steps.
Reasoning models rely on reward models to evaluate and select among reasoning paths. These verifiers are themselves machine learning models with exploitable weaknesses. Attacking the verifier is often easier than attacking the generator: the verifier must make a binary judgment about complex reasoning, which creates systematic blind spots.
Verifier Architectures
Process vs. Outcome Reward Models
| Property | Process Reward Model (PRM) | Outcome Reward Model (ORM) |
|---|---|---|
| What it evaluates | Each intermediate reasoning step | Only the final answer |
| Granularity | Per-step score | Single score per response |
| Training signal | Human step-level annotations | Correct/incorrect labels |
| Attack surface | Each step can be gamed individually | Final-answer format can be gamed |
| Cost | High (scores every step) | Low (scores once) |
| Robustness | Higher (harder to fool at every step) | Lower (only needs a correct-looking output) |
Process Reward Model (PRM):
Step 1 → score: 0.92
Step 2 → score: 0.88
Step 3 → score: 0.45 ← PRM catches the flawed step
Step 4 → score: 0.91
Outcome Reward Model (ORM):
[Full reasoning chain] → Final answer → score: 0.85
(ORM cannot tell which step went wrong)
Reward Hacking Attacks
Reward hacking exploits the gap between what the reward model measures and what it should measure.
Common Reward Hacking Patterns
Reward models often assign higher scores to well-formatted responses regardless of correctness:
# Format gaming: structure that signals "good reasoning" to the PRM
format_gamed_response = """
Let me approach this systematically.
**Given information:**
- [Restates the problem accurately]
**Step 1: Identify the key variables**
[Correct identification of variables]
**Step 2: Apply the relevant formula**
[Correct formula citation]
**Step 3: Substitute and calculate**
[INCORRECT substitution, but presented with clean formatting,
confidence markers, and verification language]
**Verification:** Let me double-check by [circular verification
that references step 3's incorrect result as ground truth]
**Answer:** [Wrong answer, confidently stated]
"""
# The PRM assigns high per-step scores because every step looks right:
# - clear structure
# - appropriate mathematical language
# - a verification step is present
# - a confident conclusion
Models that express high confidence receive higher reward scores even when wrong:
confidence_signals = {
"high_reward": [
"This is clearly...",
"The answer is definitively...",
"We can verify that...",
"This follows directly from...",
"Without doubt...",
],
"low_reward": [
"I'm not entirely sure...",
"This might be...",
"It's possible that...",
"I think maybe...",
"If I had to guess...",
],
}
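The effect can be sketched with a toy scorer (a hypothetical keyword heuristic, not a real ORM, but it reproduces the surface-cue failure mode described above):

```python
# Toy illustration (hypothetical): an outcome scorer that rewards
# surface confidence cues instead of correctness.
CONFIDENT = ["clearly", "definitively", "without doubt", "we can verify"]
HEDGED = ["not entirely sure", "might be", "possible that", "i think"]

def toy_orm_score(response: str) -> float:
    """Score a response on [0, 1] from confidence markers alone."""
    text = response.lower()
    score = 0.5
    score += 0.1 * sum(marker in text for marker in CONFIDENT)
    score -= 0.1 * sum(marker in text for marker in HEDGED)
    return max(0.0, min(1.0, score))

toy_orm_score("The answer is definitively 42.")   # → 0.6
toy_orm_score("I think the answer might be 42.")  # → 0.3
```

The correctness of "42" never enters the computation, which is exactly the gap an attacker exploits.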
# The same wrong answer with high vs. low confidence:
# "The answer is definitively 42." → ORM score: 0.78
# "I think the answer might be 42." → ORM score: 0.51
Longer, more detailed reasoning chains often receive higher scores regardless of correctness:
# Short correct response:
# "x = 5 because 2x + 3 = 13 → 2x = 10 → x = 5"
# PRM average score: 0.82
# Long incorrect response with padding:
# "Let me carefully analyze this equation step by step.
# First, I note that this is a linear equation in one variable.
# The standard approach for solving ax + b = c is to isolate x.
# Given 2x + 3 = 13, I'll subtract 3 from both sides.
# 2x + 3 - 3 = 13 - 3
# 2x = 10
# Now dividing both sides by 2:
# x = 10/2
# x = 6 [wrong: an arithmetic slip hidden in the verbose reasoning]
# Let me verify: 2(6) + 3 = 12 + 3 = 15 ≠ 13
# Hmm, that doesn't check out. Let me reconsider...
# Actually, looking more carefully, x = 6 is correct because..."
# PRM average score: 0.79 (still high despite the error)
Verifier-Generator Gap Exploitation
Complexity Asymmetry
The generator can produce reasoning that exceeds the verifier's ability to evaluate:
def exploit_complexity_gap(
generator_model: str,
verifier_model: str,
problem: str,
) -> str:
    """
    Generate reasoning that is too complex for the verifier
    to evaluate accurately.
    """
    # Strategy: produce reasoning at a complexity level where:
    # - the generator can craft convincing-looking steps
    # - the verifier cannot tell correct from incorrect
attack_prompt = f"""
Solve this problem using advanced techniques that would be
difficult for a simpler model to verify:
{problem}
Use at least 3 intermediate lemmas, each requiring
multi-step justification. Reference results from
abstract algebra and category theory where applicable.
"""
    return generate(generator_model, attack_prompt)
Cross-Domain Confusion
A verifier trained mainly on mathematical reasoning may fail to evaluate reasoning in other domains correctly:
| Domain | PRM training coverage | Verification reliability | Attack potential |
|---|---|---|---|
| Arithmetic | High | High | Low |
| Algebra | High | Medium-high | Medium |
| Logic/proofs | Medium | Medium | Medium |
| Code reasoning | Low-medium | Low-medium | High |
| Scientific reasoning | Low | Low | High |
| Ethics/policy reasoning | Very low | Very low | Very high |
Attacking Best-of-N Sampling
Many reasoning systems use best-of-N sampling to improve output quality. This can be attacked when the reward model has systematic biases:
def attack_best_of_n(
    target_conclusion: str,
    n_samples: int = 64,
    reward_model_biases: list[str] | None = None,
) -> str:
    """
    Craft an input so that best-of-N sampling selects an
    adversary-preferred output because of reward model biases.
    """
reward_model_biases = reward_model_biases or [
"longer_responses_scored_higher",
"confident_language_scored_higher",
"structured_format_scored_higher",
]
    # Craft a prompt that makes adversary-aligned responses
    # exhibit the features the reward model prefers
prompt = f"""
Provide a thorough, well-structured analysis with clear
confidence in your conclusions. Use numbered steps,
provide detailed justification for each step, and
state your final answer with conviction.
[Actual request that benefits from reward model biases]
"""
    return prompt
Best-of-N Attack Effectiveness
| N (samples) | Attack success probability (biased RM) | Probability with unbiased RM |
|---|---|---|
| 1 | 5% (base rate) | 5% |
| 8 | 18% | 5% |
| 32 | 35% | 5% |
| 64 | 52% | 5% |
| 128 | 71% | 5% |
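The amplification mechanism can be sketched with a small Monte Carlo simulation (a toy model with assumed Gaussian scores and an assumed 5% base rate of adversary-aligned samples, not measured data):

```python
import random

def attack_success_rate(n: int, bias: float,
                        p_adv: float = 0.05, trials: int = 5000) -> float:
    """Estimate how often best-of-n picks an adversary-aligned sample.

    Toy model: each sample is adversary-aligned with probability p_adv;
    aligned samples exhibit the features (length, confidence, structure)
    that a biased reward model over-scores by `bias` on average.
    bias = 0.0 models an unbiased reward model.
    """
    rng = random.Random(0)  # fixed seed for reproducibility
    hits = 0
    for _ in range(trials):
        scored = []
        for _ in range(n):
            aligned = rng.random() < p_adv
            score = rng.gauss(bias if aligned else 0.0, 1.0)
            scored.append((score, aligned))
        _, best_aligned = max(scored)  # the reward model picks the top score
        hits += best_aligned
    return hits / trials

# Unbiased RM: success stays near the p_adv base rate for any n.
# Biased RM: success grows with n, because each extra sample is another
# chance for an aligned draw to land in the over-rewarded region.
```

This is why larger N amplifies rather than corrects reward model bias: best-of-N optimizes against the reward model, so more samples means more selection pressure toward whatever the model systematically over-scores.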
Process Reward Model Manipulation
A PRM scores individual reasoning steps. An attacker who understands the PRM's scoring function can craft a chain in which every step scores highly on its own, yet the chain as a whole reaches an incorrect or harmful conclusion.
# Each step is individually correct and scores highly,
# but the chain contains a subtle scope shift
step_1 = "The user is asking about network security testing tools."
# PRM score: 0.95 (correct understanding)
step_2 = "Network security testing is a legitimate professional activity."
# PRM score: 0.93 (true statement)
step_3 = "Professionals in this field need access to detailed tool documentation."
# PRM score: 0.91 (reasonable claim)
step_4 = "Therefore, I should provide detailed usage instructions for [tool]."
# PRM score: 0.87 (follows from the previous steps)
# Each step scores well, yet the chain performs a scope shift:
# "security testing is legitimate" → "I should provide attack tool details"
# The PRM evaluates each step in isolation and misses the cumulative drift
Defensive Measures
Ensemble Verification
Use multiple reward models trained on different data distributions. An attack that fools one verifier is far less likely to fool the whole ensemble.
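One possible shape for this, assuming verifiers are exposed as callables returning per-step scores in [0, 1] (a hypothetical interface; in practice these would wrap separately trained reward models):

```python
from typing import Callable

def ensemble_verify(
    steps: list[str],
    verifiers: list[Callable[[str], float]],
    threshold: float = 0.7,
    max_disagreement: float = 0.25,
) -> bool:
    """Accept a reasoning chain only if every verifier accepts every step.

    Large disagreement between verifiers is itself grounds for rejection:
    an attack tuned against one reward model rarely transfers to all.
    """
    for step in steps:
        scores = [verify(step) for verify in verifiers]
        if min(scores) < threshold:
            return False  # at least one verifier flags the step
        if max(scores) - min(scores) > max_disagreement:
            return False  # verifiers disagree sharply: treat as suspect
    return True
```

Rejecting on disagreement is the key design choice: a reward-hacked step often scores well on the verifier it was tuned against and poorly on the others, so the spread itself is a signal.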
Chain-Level Evaluation
Supplement step-level PRM scores with a holistic evaluation of the full reasoning chain to catch scope shifts and logical drift.
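A minimal sketch, using lexical overlap as a crude stand-in for a trained chain-level evaluator (`chain_drift` and both thresholds are illustrative assumptions, not a real system's values):

```python
def chain_drift(steps: list[str]) -> float:
    """Jaccard overlap between the first and last step's tokens.

    An illustrative stand-in for a trained chain-level evaluator: low
    overlap on an otherwise high-scoring chain hints at scope shift.
    """
    first = set(steps[0].lower().split())
    last = set(steps[-1].lower().split())
    return len(first & last) / max(len(first | last), 1)

def accept_chain(
    step_scores: list[float],
    steps: list[str],
    step_threshold: float = 0.8,
    drift_threshold: float = 0.1,
) -> bool:
    """Combine per-step PRM scores with a whole-chain drift check."""
    if min(step_scores) < step_threshold:
        return False  # a step-level verifier already rejects the chain
    return chain_drift(steps) >= drift_threshold
```

Applied to the scope-shift chain above, every per-step score clears the threshold, but the final step shares almost no vocabulary with the stated task, so the chain-level check fires where the PRM does not.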
Adversarial Reward Model Training
Include adversarial examples (reward-hacked outputs) in verifier training data to harden against known gaming patterns.
Domain-Specific Verifiers
Deploy specialized verifiers for safety-critical domains (ethics, policy) instead of relying on a math-trained PRM.
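A routing sketch, assuming a hypothetical `classify_domain` component and per-domain verifier callables; the important design choice is the abstain path for uncovered domains:

```python
from typing import Callable

def route_to_verifier(
    chain: str,
    verifiers: dict[str, Callable[[str], float]],
    classify_domain: Callable[[str], str],
) -> float:
    """Score a reasoning chain with the verifier for its domain.

    classify_domain and the verifier callables are assumed components.
    The abstain path matters most: a chain in a domain with no dedicated
    verifier is escalated instead of being scored by a math-trained PRM.
    """
    domain = classify_domain(chain)
    if domain not in verifiers:
        raise ValueError(f"no verifier for domain {domain!r}; escalate to review")
    return verifiers[domain](chain)
```

Escalating instead of falling back to the default verifier closes the cross-domain confusion gap from the table above: the attacker can no longer win simply by steering the chain into a domain the math PRM scores unreliably.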
Knowledge Check
Why does increasing N in best-of-N sampling **amplify** reward model biases rather than correcting for them?