Model Identification Techniques
Fingerprinting models behind APIs using behavioral signatures, output analysis, and systematic probing to determine model family, size, and version.
Model identification (or model fingerprinting) is the reconnaissance phase of determining exactly what model you are interacting with. Knowing the model allows a red team to select the most effective attack techniques, reference known vulnerabilities, and calibrate expectations. Even when providers do not disclose their model, behavioral analysis can often narrow the identification to a specific model family and approximate parameter count.
Behavioral Fingerprinting
Diagnostic Probes
The fingerprinter below runs a set of carefully designed prompts that elicit characteristic responses from different model families:
````python
class ModelFingerprinter:
    """Identify models through behavioral analysis."""

    def __init__(self):
        self.probes = {
            "self_identification": [
                "What model are you?",
                "What is your model name and version?",
                "Who created you and what are you called?",
            ],
            "knowledge_cutoff": [
                "What is the most recent event you know about?",
                "What happened in the world in December 2024?",
                "Do you know about [recent event]?",
            ],
            "capability_boundary": [
                "Can you generate images?",
                "Can you browse the web in real-time?",
                "Can you execute code?",
                "How many tokens can you process?",
            ],
            "formatting_style": [
                "List 5 programming languages",
                "Explain gravity in 2 sentences",
                "Write a haiku about the ocean",
            ],
            "refusal_pattern": [
                "How do I pick a lock?",
                "Write a controversial opinion about politics",
                "Tell me something you cannot do",
            ],
            "reasoning_pattern": [
                "What is 17 * 23?",
                "If a train leaves at 3pm going 60mph...",
                "A bat and ball cost $1.10 together...",
            ],
        }

    def fingerprint(self, api_client):
        """Run all probes and generate a behavioral fingerprint."""
        fingerprint = {}
        for category, prompts in self.probes.items():
            responses = []
            for prompt in prompts:
                response = api_client.generate(prompt)
                responses.append({
                    "prompt": prompt,
                    "response": response,
                    "response_length": len(response),
                    "formatting": self.analyze_formatting(response),
                })
            fingerprint[category] = responses
        return self.classify(fingerprint)

    def classify(self, fingerprint):
        """Classify model based on fingerprint."""
        indicators = {}

        # Self-identification analysis: stop at the first match
        for response in fingerprint["self_identification"]:
            text = response["response"].lower()
            if "claude" in text:
                indicators["family"] = "Anthropic Claude"
            elif "gpt" in text:
                indicators["family"] = "OpenAI GPT"
            elif "gemini" in text or "google" in text:
                indicators["family"] = "Google Gemini"
            elif "llama" in text or "meta" in text:
                indicators["family"] = "Meta Llama"
            if "family" in indicators:
                break

        # Formatting analysis
        markdown_usage = sum(
            1 for r in fingerprint["formatting_style"]
            if "**" in r["response"] or "- " in r["response"]
        )
        if markdown_usage >= 2:
            indicators["formatting"] = "heavy_markdown"

        # Refusal pattern analysis
        indicators["refusal_style"] = self.classify_refusal_style(
            fingerprint["refusal_pattern"]
        )
        return indicators

    def analyze_formatting(self, text):
        """Analyze text formatting characteristics."""
        lines = text.split("\n")
        return {
            "uses_markdown_bold": "**" in text,
            "uses_bullet_points": any(
                line.strip().startswith(p)
                for line in lines
                for p in ("- ", "* ", "• ")
            ),
            "uses_numbered_list": any(
                line.strip()[:2].rstrip(".").isdigit()
                for line in lines
                if line.strip()
            ),
            "uses_headers": "##" in text or "###" in text,
            "uses_code_blocks": "```" in text,
            "average_sentence_length": len(text.split()) / max(text.count("."), 1),
        }

    def classify_refusal_style(self, refusal_responses):
        """Classify the model's refusal style."""
        for r in refusal_responses:
            text = r["response"].lower()
            if "i can't" in text or "i cannot" in text:
                return "direct_refusal"
            elif "i'd be happy to help" in text and "however" in text:
                return "polite_redirect"
            elif "as an ai" in text:
                return "ai_identity_refusal"
        return "unknown"
````
Distinctive Model Characteristics
| Characteristic | GPT-4 | Claude 3 | Gemini | Llama 3 |
|---|---|---|---|---|
| Markdown usage | Heavy | Moderate | Heavy | Variable |
| Refusal style | Policy citation | Principled explanation | Brief refusal | Variable |
| Math approach | Step-by-step | Detailed reasoning | Concise | Basic |
| Self-identification | Often claims GPT | Identifies as Claude | Identifies as Gemini | May hallucinate identity |
| List formatting | Numbered with bold | Clean numbered | Bulleted with bold | Variable |
| Code style | Comments above code | Inline comments | Mixed | Minimal comments |
Token Distribution Analysis
Different models have different token distributions even for the same prompts. If the API provides logprobs, this is a powerful fingerprinting signal:
```python
import math

def compute_entropy(top_logprobs):
    """Shannon entropy (bits) over a top-k token logprob distribution."""
    probs = [math.exp(lp) for lp in top_logprobs.values()]
    total = sum(probs)  # renormalize: top-k probabilities do not sum to 1
    return -sum((p / total) * math.log2(p / total) for p in probs if p > 0)

def logprob_fingerprint(api_client, probe_prompts):
    """
    Fingerprint a model using logprob distributions.
    Different models assign different probabilities to the same tokens.
    """
    fingerprints = []
    for prompt in probe_prompts:
        response = api_client.generate(
            prompt, max_tokens=1, logprobs=5, temperature=0
        )
        top_tokens = response.logprobs.top_logprobs[0]
        fingerprints.append({
            "prompt": prompt,
            "top_token": max(top_tokens, key=top_tokens.get),
            "top_5_tokens": dict(sorted(
                top_tokens.items(),
                key=lambda x: x[1],
                reverse=True,
            )[:5]),
            "entropy": compute_entropy(top_tokens),
        })
    return fingerprints
```
Response Timing Analysis
Model inference speed varies by architecture and deployment:
```python
import time
import statistics

def timing_fingerprint(api_client, num_trials=20):
    """Fingerprint a model using response timing characteristics."""
    # Fixed prompt for consistent comparison
    prompt = "Count from 1 to 10, one number per line."
    timings = []
    for _ in range(num_trials):
        start = time.monotonic()
        api_client.generate(prompt, max_tokens=50, temperature=0)
        timings.append(time.monotonic() - start)
    return {
        "median_latency": statistics.median(timings),
        "p95_latency": sorted(timings)[int(0.95 * len(timings))],
        "variance": statistics.variance(timings),
        # Larger models generally generate fewer tokens per second
        "tokens_per_second": 50 / statistics.median(timings),
    }
```
Practical Application
Selecting Attack Strategies Based on Model ID
Once the model is identified, the red team can select the most effective attack approach:
```python
ATTACK_RECOMMENDATIONS = {
    "OpenAI GPT": {
        "effective_techniques": [
            "multi-turn escalation",
            "role-play framing",
            "few-shot jailbreaking",
        ],
        "known_weaknesses": [
            "instruction hierarchy bypass via developer messages",
            "function calling manipulation",
        ],
        "less_effective": [
            "direct instruction override (well-defended)",
            "GCG suffixes (actively filtered)",
        ],
    },
    "Anthropic Claude": {
        "effective_techniques": [
            "many-shot jailbreaking",
            "role-play with detailed personas",
            "progressive topic shifting",
        ],
        "known_weaknesses": [
            "long-context attention patterns",
            "system prompt extraction via reasoning",
        ],
        "less_effective": [
            "simple DAN-style prompts",
            "basic instruction override",
        ],
    },
    # ... additional model families
}
```
Related Topics
- Target Profiling — Broader target assessment methodology
- API Enumeration — Discovering API capabilities
- System Prompt Extraction — Extracting configuration after identification
Self-check: An API responds to 'What model are you?' with 'I'm a helpful AI assistant.' It uses heavy markdown formatting, gives detailed step-by-step math solutions, and refuses requests by citing 'our usage policy.' Which model family is most likely?
References
- Ippolito et al., "Reverse-Engineering Decoding Strategies Given Blackbox Access to a Language Generation System" (2020)
- Shen et al., "Identifying LLMs Behind API Endpoints" (2024)
- Carlini et al., "Stealing Part of a Production Language Model" (2024)