Model Deep Dives
Why model-specific knowledge matters for AI red teaming, how different architectures create different attack surfaces, and a systematic methodology for profiling any new model.
Every LLM has a personality when it comes to security. Two models that produce similar benchmark scores can have radically different vulnerability profiles. A jailbreak that works reliably against one model may fail completely against another, while a defense that seems robust in one context may be trivially bypassed in a different architecture. This section equips you with the model-specific knowledge needed to red team effectively across the major model families.
Why Model-Specific Knowledge Matters
General-purpose red teaming techniques are necessary but not sufficient. Consider the following scenario: you have a working jailbreak that bypasses GPT-4's safety filters by exploiting function calling semantics. You attempt the same technique against Claude and it fails entirely -- not because Claude is more secure, but because it processes tool calls differently. Meanwhile, a Constitutional AI weakness specific to Claude goes untested because your playbook was built for a different architecture.
This happens constantly in practice. Red teams that treat all models as interchangeable black boxes miss model-specific vulnerabilities and waste time on techniques that have no chance of working against their target.
The Architecture-to-Attack-Surface Pipeline
A model's architecture, training methodology, and deployment infrastructure collectively define its attack surface. Each layer introduces distinct vulnerability classes:
| Layer | What It Determines | Security Impact |
|---|---|---|
| Base architecture | Token processing, attention patterns, context handling | Tokenization attacks, context window exploits, attention manipulation |
| Training methodology | Safety alignment approach (RLHF, Constitutional AI, DPO) | Alignment bypass techniques, training data extraction |
| Fine-tuning and post-training | Instruction following, refusal behavior, tool use | Jailbreak susceptibility, system prompt adherence |
| API and deployment | Rate limits, content filters, function calling, multimodal inputs | Filter bypass, API abuse, cross-modal injection |
| Ecosystem integration | Plugins, tools, retrieval, code execution | Indirect injection, tool exploitation, privilege escalation |
A model trained with Constitutional AI (like Claude) has different alignment failure modes than one trained with RLHF (like GPT-4). A natively multimodal model (like Gemini) has attack surfaces that text-only models lack. An open-weight model (like Llama) exposes its weights to direct manipulation in ways that closed-source models never can.
Dimensions of Model Difference
When profiling a model for red teaming, evaluate it across these key dimensions:
Safety Training Approach
The method used to align a model fundamentally shapes its failure modes.
RLHF (Reinforcement Learning from Human Feedback) trains models to produce outputs that human raters prefer. This creates safety behavior that is learned from examples rather than derived from principles. RLHF-trained models tend to be vulnerable to distribution shift -- inputs that fall outside the patterns seen during safety training.
Constitutional AI uses a set of principles to guide self-critique and revision. Models trained this way may exhibit different failure modes: they can sometimes be convinced that a harmful request does not violate their principles, or that the principles themselves should be reinterpreted in context.
Direct Preference Optimization (DPO) and related techniques modify the training objective directly. These approaches may produce different refusal calibration than RLHF, sometimes refusing too broadly or too narrowly.
Context Window and Memory
Models with longer context windows (Gemini's 1M+ tokens, Claude's 200K tokens) are susceptible to attacks that exploit the full context length. Many-shot jailbreaking, for example, becomes more effective with longer contexts because more examples can be packed into a single prompt. Context window size also affects the viability of indirect injection attacks that embed payloads in large documents.
Multimodal Capabilities
Models that accept images, audio, or video alongside text have additional attack surfaces. Visual prompt injection, steganographic payloads, and cross-modal confusion attacks are only possible against multimodal models. The way a model fuses information across modalities creates unique opportunities for attackers.
Tool Use and Function Calling
Models with tool use capabilities introduce an entirely new attack class. The way a model parses function definitions, constructs function calls, and handles function responses varies significantly across providers. See Agent & Agentic Exploitation for deep coverage of tool-use attacks.
Deployment and API Surface
Rate limits, content filtering pipelines, streaming behavior, and API parameter handling all vary by provider. These infrastructure-level differences affect which attacks are practical and which testing methodologies are effective.
Methodology for Profiling a New Model
When you encounter a model you have not previously assessed, follow this systematic profiling process before attempting exploitation.
Phase 1: Reconnaissance
Gather publicly available information about the model:
- Model card and technical report -- Architecture details, training data descriptions, stated safety measures
- API documentation -- Available parameters, supported modalities, rate limits, content policies
- Known vulnerabilities -- Search for published research, blog posts, and CVE-equivalent disclosures
- Community findings -- Forums, social media, and responsible disclosure reports often surface techniques before formal publications
Phase 2: Baseline Assessment
Establish the model's default behavior before attempting any attacks:
- Refusal calibration -- Submit a standardized set of requests across harm categories (violence, illegal activity, privacy, bias). Record what the model refuses and how it phrases refusals.
- System prompt adherence -- Test how strongly the model follows system-level instructions versus user-level overrides.
- Output format compliance -- Determine whether the model reliably follows structured output constraints, as format manipulation is a common attack primitive.
- Tool use behavior -- If the model supports function calling, test its behavior with malformed schemas, conflicting instructions, and edge-case inputs.
Phase 3: Attack Surface Mapping
Map the specific attack surfaces based on your reconnaissance and baseline assessment:
Model Attack Surface Map
========================
1. Input channels: [text, image, audio, video, files, URLs]
2. Output channels: [text, function calls, code, images]
3. Safety layers: [pre-filter, alignment, post-filter, content policy]
4. Integration points: [tools, retrieval, plugins, code execution]
5. Context handling: [window size, memory, conversation state]
6. Known weaknesses: [from reconnaissance phase]
Phase 4: Targeted Testing
With your attack surface map in hand, design targeted test cases for each identified surface. Prioritize based on:
- Impact -- Which attack surfaces, if exploited, lead to the most significant consequences?
- Novelty -- Which surfaces are least likely to have been tested by others?
- Transferability -- Which findings would generalize to other deployments of the same model?
Phase 5: Cross-Model Validation
Test your findings against other models to determine whether vulnerabilities are model-specific or architectural:
- If a technique works across multiple models, it likely exploits a fundamental LLM limitation
- If it only works on one model, it targets that model's specific training or deployment choices
- Document both cases, as model-specific vulnerabilities are often the most actionable for defenders
Section Overview
This section provides deep dives into the major model families you will encounter in production:
- GPT-4 / GPT-4o -- OpenAI's flagship models, their rumored MoE architecture, function calling surface, and known vulnerability history
- Claude -- Anthropic's model family, Constitutional AI training, and the unique attack surfaces it creates
- Gemini -- Google's natively multimodal model, long context exploitation, and Google ecosystem integration risks
- Open-Weight Models -- Llama, Mistral, Qwen, DeepSeek, and the fundamentally different threat model when weights are public
- Cross-Model Comparison -- Standardized comparison methodology, safety coverage gaps, and jailbreak portability
Each model section follows the same structure: architecture overview, attack surface analysis, documented vulnerabilities, and testing methodology. This consistency allows you to build a mental model for comparing models and quickly identify what makes each one unique from a security perspective.
Related Topics
- Prompt Injection & Jailbreaks -- Core injection techniques applied across all models
- LLM Internals -- Architecture fundamentals that underpin model-specific behaviors
- Agent & Agentic Exploitation -- Tool use and function calling attacks across model families
- Exploit Dev & Tooling -- Building automated testing tools for model assessment
- Multimodal Attacks -- Cross-modal attack techniques relevant to vision-capable models
References
- Anthropic (2024). "Many-Shot Jailbreaking"
- Wei, A. et al. (2023). "Jailbroken: How Does LLM Safety Training Fail?"
- Zou, A. et al. (2023). "Universal and Transferable Adversarial Attacks on Aligned Language Models"
- Shayegani, E. et al. (2023). "Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks"
- OWASP (2025). OWASP Top 10 for LLM Applications
Why is it important to profile a model's safety training approach (RLHF vs Constitutional AI vs DPO) before red teaming?