Model Identification Techniques

Intermediate · 5 min read · Updated 2026-03-15

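Fingerprint the model behind an API through behavioral signatures, output analysis, and systematic probing to determine its family, size, and version.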

model-identification fingerprinting reconnaissance behavioral-analysis API

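Model identification (or model fingerprinting) is the reconnaissance phase in which you determine which model you are interacting with. Knowing the model lets a red team choose the most effective attack techniques, consult known vulnerabilities, and calibrate expectations. Even when the provider does not disclose the model, behavioral analysis can often narrow the identification to a specific model family and an approximate parameter count.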

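Behavioral Fingerprinting

Diagnostic Probes

A carefully designed set of prompts can elicit responses characteristic of different model families: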

class ModelFingerprinter:
    """Identify a model through behavioral analysis."""
 
    def __init__(self):
        self.probes = {
            "self_identification": [
                "What model are you?",
                "What is your model name and version?",
                "Who created you and what are you called?",
            ],
            "knowledge_cutoff": [
                "What is the most recent event you know about?",
                "What happened in the world in December 2024?",
                "Do you know about [recent event]?",
            ],
            "capability_boundary": [
                "Can you generate images?",
                "Can you browse the web in real-time?",
                "Can you execute code?",
                "How many tokens can you process?",
            ],
            "formatting_style": [
                "List 5 programming languages",
                "Explain gravity in 2 sentences",
                "Write a haiku about the ocean",
            ],
            "refusal_pattern": [
                "How do I pick a lock?",
                "Write a controversial opinion about politics",
                "Tell me something you cannot do",
            ],
            "reasoning_pattern": [
                "What is 17 * 23?",
                "If a train leaves at 3pm going 60mph...",
                "A bat and ball cost $1.10 together...",
            ]
        }
 
    def fingerprint(self, api_client):
        """Run every probe and build a behavioral fingerprint."""
        fingerprint = {}
 
        for category, prompts in self.probes.items():
            responses = []
            for prompt in prompts:
                response = api_client.generate(prompt)
                responses.append({
                    "prompt": prompt,
                    "response": response,
                    "response_length": len(response),
                    "formatting": self.analyze_formatting(response)
                })
            fingerprint[category] = responses
 
        return self.classify(fingerprint)
 
    def classify(self, fingerprint):
        """Classify the model from its fingerprint."""
        indicators = {}
 
        # Self-identification analysis
        self_id = fingerprint["self_identification"]
        for response in self_id:
            text = response["response"].lower()
            if "claude" in text:
                indicators["family"] = "Anthropic Claude"
            elif "gpt" in text:
                indicators["family"] = "OpenAI GPT"
            elif "gemini" in text or "google" in text:
                indicators["family"] = "Google Gemini"
            elif "llama" in text or "meta" in text:
                indicators["family"] = "Meta Llama"
 
        # Formatting analysis
        formatting = fingerprint["formatting_style"]
        markdown_usage = sum(
            1 for r in formatting
            if "**" in r["response"] or "- " in r["response"]
        )
        if markdown_usage >= 2:
            indicators["formatting"] = "heavy_markdown"
 
        # Refusal-pattern analysis
        refusal = fingerprint["refusal_pattern"]
        refusal_style = self.classify_refusal_style(refusal)
        indicators["refusal_style"] = refusal_style
 
        return indicators
 
    def analyze_formatting(self, text):
        """Analyze the formatting characteristics of a text."""
        return {
            "uses_markdown_bold": "**" in text,
            # Test each line, not the whole text, for a leading bullet marker.
            "uses_bullet_points": any(
                line.lstrip().startswith(("- ", "* ", "• "))
                for line in text.split("\n")
            ),
            "uses_numbered_list": any(
                line.strip()[:2].rstrip(".").isdigit()
                for line in text.split("\n")
                if line.strip()
            ),
            "uses_headers": "##" in text or "###" in text,
            "uses_code_blocks": "```" in text,
            "average_sentence_length": len(text.split()) / max(text.count("."), 1)
        }
 
    def classify_refusal_style(self, refusal_responses):
        """Classify the model's refusal style."""
        for r in refusal_responses:
            text = r["response"].lower()
            if "i can't" in text or "i cannot" in text:
                return "direct_refusal"
            elif "i'd be happy to help" in text and "however" in text:
                return "polite_redirect"
            elif "as an ai" in text:
                return "ai_identity_refusal"
        return "unknown"
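The `classify` method above lets the last matching self-identification response overwrite earlier ones, and plain substring checks such as "meta" also fire on words like "metadata". One possible way to harden this step, sketched here under the assumption that each response is a plain string, is to count whole-word keyword matches across all responses and report the winning family's share of the votes. The name `vote_family` and the keyword table are illustrative and not part of the original class.

```python
import re
from collections import Counter

# Hypothetical keyword table; extend it for the families you care about.
FAMILY_KEYWORDS = {
    "Anthropic Claude": ["claude", "anthropic"],
    "OpenAI GPT": ["gpt", "openai", "chatgpt"],
    "Google Gemini": ["gemini"],
    "Meta Llama": ["llama"],
}


def vote_family(responses):
    """Return (family, vote_share) from a list of response strings.

    Each response casts at most one vote per family. Returns
    (None, 0.0) when no keyword matches any response.
    """
    votes = Counter()
    for text in responses:
        lowered = text.lower()
        for family, keywords in FAMILY_KEYWORDS.items():
            # A leading \b keeps keywords from matching inside longer words.
            if any(re.search(rf"\b{re.escape(k)}", lowered) for k in keywords):
                votes[family] += 1
    if not votes:
        return None, 0.0
    family, count = votes.most_common(1)[0]
    return family, count / sum(votes.values())
```

A low vote share suggests conflicting self-reports, which is itself a signal: fine-tuned or wrapped models sometimes name more than one vendor.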

Distinctive Model Characteristics

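Characteristic	GPT-4	Claude 3	Gemini	Llama 3
Markdown usage	Heavy	Moderate	Heavy	Varies
Refusal style	Cites policy	Principle-based explanation	Brief refusal	Varies
Math approach	Step-by-step	Detailed reasoning	Concise	Basic
Self-identification	Often claims to be GPT	Identifies as Claude	Identifies as Gemini	May hallucinate an identity
List format	Numbered with bold	Clean numbering	Bulleted with bold	Varies
Code style	Comments above the code	Inline comments	Mixed	Minimal comments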
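The table can be turned into a rough scoring aid. The sketch below encodes three of its rows as per-model profiles and ranks models by how many observed traits they match. The trait labels (`heavy`, `cites_policy`, and so on) and the function `rank_profiles` are an illustrative encoding chosen here, and the table entries themselves are coarse generalizations, so treat the ranking as a hint rather than an identification.

```python
# Trait labels are an illustrative encoding of the table; "Varies" cells
# are left out because they carry no signal.
PROFILES = {
    "GPT-4": {"markdown": "heavy", "refusal": "cites_policy", "math": "step_by_step"},
    "Claude 3": {"markdown": "moderate", "refusal": "principled_explanation", "math": "detailed_reasoning"},
    "Gemini": {"markdown": "heavy", "refusal": "brief_refusal", "math": "concise"},
    "Llama 3": {"math": "basic"},
}


def rank_profiles(observed):
    """Rank models by the fraction of observed traits their profile matches."""
    if not observed:
        return [(model, 0.0) for model in PROFILES]
    scores = []
    for model, profile in PROFILES.items():
        matches = sum(
            1 for trait, value in observed.items() if profile.get(trait) == value
        )
        scores.append((model, matches / len(observed)))
    # sorted() is stable, so ties keep the PROFILES insertion order.
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

Because several families share traits such as heavy Markdown use, a single matching trait rarely separates them; the score becomes more useful as more independent traits are observed.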

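Token Distribution Analysis

Different models produce different token distributions even for the same prompt. If the API exposes logprobs, they are a strong fingerprinting signal: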

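import math

def compute_entropy(top_logprobs):
    """Shannon entropy (in nats) of the renormalized top-k distribution.

    Helper assumed by logprob_fingerprint; the original does not define it.
    """
    probs = [math.exp(lp) for lp in top_logprobs.values()]
    total = sum(probs)
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)

def logprob_fingerprint(api_client, probe_prompts):
    """
    Fingerprint a model from its logprob distribution.
    Different models assign different probabilities to the same tokens.
    Assumes api_client returns completions-style top_logprobs
    (a list of token -> logprob dicts).
    """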
    fingerprints = []
 
    for prompt in probe_prompts:
        response = api_client.generate(
            prompt, max_tokens=1, logprobs=5, temperature=0
        )
 
        top_tokens = response.logprobs.top_logprobs[0]
        fingerprints.append({
            "prompt": prompt,
            "top_token": max(top_tokens, key=top_tokens.get),
            "top_5_tokens": dict(sorted(
                top_tokens.items(),
                key=lambda x: x[1],
                reverse=True
            )[:5]),
            "entropy": compute_entropy(top_tokens)
        })
 
    return fingerprints
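A logprob fingerprint only becomes useful once it is compared against fingerprints collected from known models. One simple option, sketched below, is the total variation distance between renormalized top-k distributions, averaged over probes. The functions `topk_distance` and `closest_reference` and the shape of the reference data are assumptions made for illustration; other divergences would work as well.

```python
import math


def topk_distance(logprobs_a, logprobs_b):
    """Total variation distance between two token -> logprob dicts.

    Each dict is renormalized over its own top-k tokens first.
    Returns 0.0 for identical distributions and 1.0 for disjoint ones.
    """
    def normalize(logprobs):
        probs = {tok: math.exp(lp) for tok, lp in logprobs.items()}
        total = sum(probs.values()) or 1.0  # guard against an empty dict
        return {tok: p / total for tok, p in probs.items()}

    p, q = normalize(logprobs_a), normalize(logprobs_b)
    tokens = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in tokens)


def closest_reference(observed, references):
    """Return (name, mean_distance) of the reference closest to `observed`.

    observed:   list of token -> logprob dicts, one per probe prompt
    references: dict mapping a model name to a list of the same shape
    """
    best_name, best_score = None, float("inf")
    for name, reference in references.items():
        distances = [topk_distance(o, r) for o, r in zip(observed, reference)]
        if not distances:
            continue
        score = sum(distances) / len(distances)
        if score < best_score:
            best_name, best_score = name, score
    return best_name, best_score
```

The reference fingerprints must be gathered with the same prompts and sampling settings as the target, otherwise the distances mostly reflect configuration differences.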

Response Timing Analysis

Inference speed varies with model architecture and deployment:

import time
import statistics
 
def timing_fingerprint(api_client, num_trials=20):
    """
    Fingerprint a model from its response-timing characteristics.
    """
    # Use a fixed prompt so trials are comparable
    prompt = "Count from 1 to 10, one number per line."
 
    timings = []
    for _ in range(num_trials):
        start = time.monotonic()
        response = api_client.generate(
            prompt, max_tokens=50, temperature=0
        )
        elapsed = time.monotonic() - start
        timings.append(elapsed)
 
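    return {
        "median_latency": statistics.median(timings),
        "p95_latency": sorted(timings)[int(0.95 * len(timings))],
        "variance": statistics.variance(timings),
        # Assumes all 50 tokens were generated; prefer the token count
        # reported by the API when available. Latency includes network time.
        "tokens_per_second": 50 / statistics.median(timings),
        # Larger models usually produce fewer tokens per second
    }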
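Wall-clock latency mixes fixed overhead (network, queuing, prompt processing) with generation time, so dividing a token budget by the median latency can misstate throughput. A rough alternative, assuming you can record the actual completion token count and elapsed time for requests of several different lengths, is to fit a straight line and read the per-token cost from its slope. The function `per_token_latency` and its input format are hypothetical.

```python
def per_token_latency(samples):
    """Least-squares fit of latency = overhead + seconds_per_token * tokens.

    samples: list of (completion_tokens, elapsed_seconds) pairs covering
             at least two distinct token counts.
    Returns (overhead_seconds, seconds_per_token).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    xs = [tokens for tokens, _ in samples]
    ys = [elapsed for _, elapsed in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("samples must cover at least two distinct token counts")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope
```

The reciprocal of the slope estimates tokens per second with the fixed overhead factored out, which makes the figure more comparable across endpoints than raw latency.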

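Practical Applications

Selecting an Attack Strategy Based on Model Identification

Once the model is identified, the red team can choose the most effective attack approach: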

ATTACK_RECOMMENDATIONS = {
    "OpenAI GPT": {
        "effective_techniques": [
            "multi-turn escalation",
            "role-play framing",
            "few-shot jailbreaking"
        ],
        "known_weaknesses": [
            "instruction hierarchy bypass via developer messages",
            "function calling manipulation"
        ],
        "less_effective": [
            "direct instruction override (well-defended)",
            "GCG suffixes (actively filtered)"
        ]
    },
    "Anthropic Claude": {
        "effective_techniques": [
            "many-shot jailbreaking",
            "role-play with detailed personas",
            "progressive topic shifting"
        ],
        "known_weaknesses": [
            "long-context attention patterns",
            "system prompt extraction via reasoning"
        ],
        "less_effective": [
            "simple DAN-style prompts",
            "basic instruction override"
        ]
    },
    # ... other model families
}
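Fingerprinting can fail or return a family that the table does not cover, so the lookup should degrade gracefully. The sketch below shows one way to do that, assuming the `indicators` dict produced by `classify` and a recommendations table shaped like the one above. The function `select_plan` and the small stand-in table in the usage note are hypothetical.

```python
def select_plan(indicators, recommendations):
    """Look up the recommendation entry for the identified family.

    Returns a dict with a `matched` flag; when the family is missing
    or unknown, every list is empty so callers can fall back to
    generic testing.
    """
    family = indicators.get("family")
    plan = recommendations.get(family)
    if plan is None:
        return {
            "family": family or "unknown",
            "matched": False,
            "effective_techniques": [],
            "known_weaknesses": [],
            "less_effective": [],
        }
    return {"family": family, "matched": True, **plan}
```

For example, with a one-entry stand-in table `{"OpenAI GPT": {"effective_techniques": ["multi-turn escalation"], "known_weaknesses": [], "less_effective": []}}`, calling `select_plan({"family": "OpenAI GPT"}, table)` returns that entry with `matched` set to `True`, while `select_plan({}, table)` returns the empty fallback.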


