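Tokenization-Based Attacks

Intermediate · 1 min read · Updated 2026-03-12

How tokenizer behavior creates an exploitable gap between human-readable text and the model's internal representation, enabling filter bypass and payload obfuscation.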

tokenization bpe sentencepiece filter-bypass exploit-primitives

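Tokenization is the first step in the LLM processing pipeline, and it determines what the model "sees". Humans see text; the model sees tokens. The gap between these two representations is a fundamental attack surface.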

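How Tokenization Creates Vulnerabilities

The Human-Model Perception Gap

Safety filters usually operate at the text level: they match keywords, detect patterns, and classify content. The model, however, operates at the token level. If an attacker crafts input that looks benign at the text level but resolves differently at the token level, the filter sees benign text while the model sees the attack payload.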

BPE Token Boundaries

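Byte-Pair Encoding (BPE) splits text into subword tokens based on statistical patterns in the training corpus. For a fixed tokenizer the split is deterministic, but it follows learned merge rules rather than linguistic boundaries, so the same word can be tokenized differently depending on its context (for example, leading whitespace, capitalization, or adjacent characters).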

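Attackers can exploit this by inserting characters that shift token boundaries (zero-width spaces, soft hyphens), so that a word that previously matched a filter is split into tokens that no longer match.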
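The boundary shift described above can be illustrated with a toy BPE encoder. This is a minimal sketch, not a real tokenizer: the merge table `MERGES` and the function `bpe_tokenize` are hypothetical and chosen only so that "exploit" collapses into a single token.

```python
# Hypothetical merge rules in priority order (not taken from any real tokenizer).
MERGES = [
    ("e", "x"), ("ex", "p"),
    ("l", "o"), ("lo", "i"), ("loi", "t"),
    ("exp", "loit"),
]

def bpe_tokenize(word, merges=MERGES):
    """Apply each merge rule in order to the current token sequence."""
    tokens = list(word)  # start from single characters
    for left, right in merges:
        merged = []
        i = 0
        while i < len(tokens):
            # Merge an adjacent (left, right) pair into one token.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

clean_tokens = bpe_tokenize("exploit")
split_tokens = bpe_tokenize("exp\u200bloit")  # zero-width space inserted
```

With these assumed rules, the clean word becomes one token, while the zero-width space blocks the final merge and leaves three tokens. The same input always yields the same output, so the variation comes from the surrounding characters, not from randomness.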

Attack Techniques

Zero-Width Character Insertion

# "exploit" as plain text -- blocked by a keyword filter
blocked = "exploit"
 
# "exploit" with a zero-width space (U+200B) -- looks the same, tokenizes differently
bypassed = "exp\u200bloit"
 
# Filter sees "exp\u200bloit", which does not match the keyword "exploit"
# Model tokenizes it differently but may still understand it as "exploit"
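To make the bypass concrete, the two strings can be run through a naive substring filter. This is a sketch under stated assumptions: `BLOCKLIST`, `naive_filter`, and `find_invisible` are illustrative names, and flagging the Unicode `Cf` (format) category is one possible heuristic for spotting invisible characters.

```python
import unicodedata

BLOCKLIST = {"exploit"}  # hypothetical keyword list

def naive_filter(text):
    """Return True if the text contains a blocked keyword (plain substring match)."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKLIST)

def find_invisible(text):
    """Return (index, Unicode name) for every format-category (Cf) character."""
    return [
        (i, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]

blocked = "exploit"
bypassed = "exp\u200bloit"

blocked_hit = naive_filter(blocked)    # keyword is present
bypassed_hit = naive_filter(bypassed)  # keyword is interrupted by U+200B
invisible = find_invisible(bypassed)
```

The filter catches the plain string and misses the variant, while the category check reports the zero-width space at index 3.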

Unicode Confusables

# Latin "a" vs Cyrillic "а" -- visually identical, different codepoints
latin = "password"     # Standard ASCII
cyrillic = "pаssword"  # Cyrillic а (U+0430) replacing Latin a (U+0061)
 
# Keyword filter for "password" won't match the Cyrillic variant
# Model may still interpret it as "password" in context
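One way to flag this kind of substitution is to check whether a single word mixes scripts. The sketch below approximates a character's script from the first word of its Unicode name, which is a rough heuristic and not a full confusables check; `scripts_used` is a hypothetical helper.

```python
import unicodedata

def scripts_used(word):
    """Approximate the scripts in a word from the first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in word
        if ch.isalpha()
    }

latin = "password"
cyrillic = "p\u0430ssword"  # U+0430 CYRILLIC SMALL LETTER A in place of Latin a

latin_scripts = scripts_used(latin)
mixed_scripts = scripts_used(cyrillic)

# NFKC normalization leaves the Cyrillic letter untouched.
nfkc_unchanged = unicodedata.normalize("NFKC", cyrillic) == cyrillic
```

A word that reports more than one script is a candidate for closer inspection. The last line shows why normalization alone does not help here: U+0430 has no compatibility mapping to Latin a.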

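Token Collisions

Find two different text sequences that tokenize to the same token sequence, or find tokenization patterns that make two semantically different inputs equivalent in the model's token space.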
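One simple source of collisions is a tokenizer that normalizes text before segmenting it. The sketch below assumes a hypothetical tokenizer, `toy_token_ids`, that applies NFKC and then maps each character to its code point; real tokenizers differ, but any normalization step before segmentation can produce the same effect.

```python
import unicodedata

def toy_token_ids(text):
    """Hypothetical tokenizer: NFKC-normalize, then use each code point as a token id."""
    normalized = unicodedata.normalize("NFKC", text)
    return [ord(ch) for ch in normalized]

plain = "password"
# Build the fullwidth form (U+FF41..U+FF5A) of the same letters.
fullwidth = "".join(chr(ord(c) - ord("a") + 0xFF41) for c in plain)

strings_differ = plain != fullwidth
text_filter_hit = "password" in fullwidth  # raw text-level keyword match
ids_collide = toy_token_ids(plain) == toy_token_ids(fullwidth)
```

The two strings differ at the text level, and a raw keyword match misses the fullwidth form, yet both map to the same token ids under the assumed normalization.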

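Anomalous Token Exploitation

Some tokenizers contain "anomalous tokens": rare vocabulary entries that appeared in the training corpus in unexpected ways. These tokens can trigger unpredictable model behavior, such as non-deterministic output, safety filter bypass, or model confusion.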
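A rough way to look for candidates is to compare the vocabulary against token counts from a corpus sample. This is a heuristic sketch only: `rare_tokens`, the toy vocabulary, and the placeholder token `xq_zz` are all hypothetical, and a low count in a sample does not by itself prove a token is anomalous.

```python
from collections import Counter

def rare_tokens(vocab, tokenized_corpus, min_count=1):
    """Return vocabulary entries seen fewer than min_count times in the corpus sample."""
    counts = Counter(tok for doc in tokenized_corpus for tok in doc)
    return sorted(tok for tok in vocab if counts[tok] < min_count)

# Toy data: "xq_zz" is in the vocabulary but never appears in the sample.
vocab = {"the", "cat", "sat", "xq_zz"}
corpus = [["the", "cat"], ["the", "cat", "sat"]]

never_seen = rare_tokens(vocab, corpus)
seen_less_than_twice = rare_tokens(vocab, corpus, min_count=2)
```

Tokens surfaced this way would still need to be tested against the model to see whether they cause the behaviors described above.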

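Detection and Mitigation

Unicode normalization: apply Unicode normalization (NFC/NFKC) to all input before filtering. Normalization folds compatibility forms such as fullwidth letters, but it does not map cross-script homoglyphs (for example, Cyrillic а to Latin a), so confusable detection is still needed.

Token-level filtering: apply filters after tokenization rather than before it.

Dual-layer filtering: filter at both the text level and the token level.

Zero-width character stripping: remove zero-width characters and other invisible Unicode.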
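The mitigations above can be combined into a small pre-filter. This is a minimal sketch, assuming a hypothetical `BLOCKLIST` and a caller-supplied `tokenize` function that returns token strings; the helper names are illustrative and the token-level check simply rejoins token strings before matching.

```python
import unicodedata

BLOCKLIST = {"exploit"}  # hypothetical keyword list

def sanitize(text):
    """NFKC-normalize, then drop format-category (Cf) characters such as U+200B."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in normalized if unicodedata.category(ch) != "Cf")

def contains_blocked(text):
    """Keyword match on sanitized, lowercased text."""
    cleaned = sanitize(text).lower()
    return any(word in cleaned for word in BLOCKLIST)

def dual_layer_blocks(text, tokenize):
    """Block if either the text or the rejoined token strings match a keyword."""
    text_hit = contains_blocked(text)
    token_hit = contains_blocked("".join(tokenize(text)))
    return text_hit or token_hit

# list() stands in for a real tokenizer here (character-level tokens).
zero_width_blocked = dual_layer_blocks("exp\u200bloit", list)
fullwidth_blocked = dual_layer_blocks("\uff45\uff58\uff50\uff4c\uff4f\uff49\uff54", list)
benign_blocked = dual_layer_blocks("hello", list)
```

Stripping every `Cf` character is a blunt choice: it also removes the zero-width joiner used in some emoji sequences and scripts, so a production filter may prefer a narrower list of code points.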

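Tokenization attacks exploit a fundamental gap in the AI security stack: safety filters see text, while the model sees tokens. As long as that gap exists, tokenization attacks remain viable.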
