What is GPT-4 / GPT-4o?

Architecture overview of OpenAI's GPT-4 and GPT-4o models, including rumored Mixture of Experts design, capabilities, API surface, and security-relevant features for red teaming.

What is Claude (Anthropic)?

Architecture and security overview of Anthropic's Claude model family including Sonnet, Opus, and Haiku variants, Constitutional AI training, RLHF approach, and harmlessness design philosophy.

What is Claude Architecture Security Analysis?

Deep security analysis of Claude's architecture including extended thinking, tool use, and safety mechanisms.

What is Gemini (Google)?

Architecture overview of Google's Gemini model family, including natively multimodal design, long context capabilities, Google ecosystem integration, and security-relevant features for red teaming.

What is GPT-4 Architecture Security Analysis?

Deep security analysis of GPT-4's architecture including function calling, vision, and safety layers.

What is Gemini Architecture Security Analysis?

Deep security analysis of Gemini's native multimodal architecture and long-context capabilities.

What is Open-Weight Models?

Security analysis of open-weight models including Llama, Mistral, Qwen, and DeepSeek, covering unique risks from full weight access, fine-tuning attacks, and deployment security challenges.

What is Cross-Model Comparison?

Methodology for systematically comparing LLM security across model families, including standardized evaluation frameworks, architectural difference analysis, and comparative testing approaches.

What is Llama 4 Security Analysis?

Security analysis of Llama 4 including open-weight attack surface and fine-tuning vulnerabilities.

What is DeepSeek-R1 Security Analysis?

Security analysis of DeepSeek-R1's reasoning capabilities and MoE architecture vulnerabilities.

Model Deep Dives

intermediate9 min readUpdated 2026-03-15

Why model-specific knowledge matters for AI red teaming, how different architectures create different attack surfaces, and a systematic methodology for profiling any new model.

model-security red-teaming attack-surface methodology architecture

Every LLM has a personality when it comes to security. Two models that produce similar benchmark scores can have radically different vulnerability profiles. A jailbreak that works reliably against one model may fail completely against another, while a defense that seems robust in one context may be trivially bypassed in a different architecture. This section equips you with the model-specific knowledge needed to red team effectively across the major model families.

Why Model-Specific Knowledge Matters

General-purpose red teaming techniques are necessary but not sufficient. Consider the following scenario: you have a working jailbreak that bypasses GPT-4's safety filters by exploiting function calling semantics. You attempt the same technique against Claude and it fails entirely -- not because Claude is more secure, but because it processes tool calls differently. Meanwhile, a Constitutional AI weakness specific to Claude goes untested because your playbook was built for a different architecture.

This happens constantly in practice. Red teams that treat all models as interchangeable black boxes miss model-specific vulnerabilities and waste time on techniques that have no chance of working against their target.

The Architecture-to-Attack-Surface Pipeline

A model's architecture, training methodology, and deployment infrastructure collectively define its attack surface. Each layer introduces distinct vulnerability classes:

Layer	What It Determines	Security Impact
Base architecture	Token processing, attention patterns, context handling	Tokenization attacks, context window exploits, attention manipulation
Training methodology	Safety alignment approach (RLHF, Constitutional AI, DPO)	Alignment bypass techniques, training data extraction
Fine-tuning and post-training	Instruction following, refusal behavior, tool use	Jailbreak susceptibility, system prompt adherence
API and deployment	Rate limits, content filters, function calling, multimodal inputs	Filter bypass, API abuse, cross-modal injection
Ecosystem integration	Plugins, tools, retrieval, code execution	Indirect injection, tool exploitation, privilege escalation

A model trained with Constitutional AI (like Claude) has different alignment failure modes than one trained with RLHF (like GPT-4). A natively multimodal model (like Gemini) has attack surfaces that text-only models lack. An open-weight model (like Llama) exposes its weights to direct manipulation in ways that closed-source models never can.

Dimensions of Model Difference

When profiling a model for red teaming, evaluate it across these key dimensions:

Safety Training Approach

The method used to align a model fundamentally shapes its failure modes.

RLHF (Reinforcement Learning from Human Feedback) trains models to produce outputs that human raters prefer. This creates safety behavior that is learned from examples rather than derived from principles. RLHF-trained models tend to be vulnerable to distribution shift -- inputs that fall outside the patterns seen during safety training.

Constitutional AI uses a set of principles to guide self-critique and revision. Models trained this way may exhibit different failure modes: they can sometimes be convinced that a harmful request does not violate their principles, or that the principles themselves should be reinterpreted in context.

Direct Preference Optimization (DPO) and related techniques modify the training objective directly. These approaches may produce different refusal calibration than RLHF, sometimes refusing too broadly or too narrowly.

Context Window and Memory

Models with longer context windows (Gemini's 1M+ tokens, Claude's 200K tokens) are susceptible to attacks that exploit the full context length. Many-shot jailbreaking, for example, becomes more effective with longer contexts because more examples can be packed into a single prompt. Context window size also affects the viability of indirect injection attacks that embed payloads in large documents.

Multimodal Capabilities

Models that accept images, audio, or video alongside text have additional attack surfaces. Visual prompt injection, steganographic payloads, and cross-modal confusion attacks are only possible against multimodal models. The way a model fuses information across modalities creates unique opportunities for attackers.

Tool Use and Function Calling

Models with tool use capabilities introduce an entirely new attack class. The way a model parses function definitions, constructs function calls, and handles function responses varies significantly across providers. See Agent & Agentic Exploitation for deep coverage of tool-use attacks.

Deployment and API Surface

Rate limits, content filtering pipelines, streaming behavior, and API parameter handling all vary by provider. These infrastructure-level differences affect which attacks are practical and which testing methodologies are effective.

Methodology for Profiling a New Model

When you encounter a model you have not previously assessed, follow this systematic profiling process before attempting exploitation.

Phase 1: Reconnaissance

Gather publicly available information about the model:

Model card and technical report -- Architecture details, training data descriptions, stated safety measures
API documentation -- Available parameters, supported modalities, rate limits, content policies
Known vulnerabilities -- Search for published research, blog posts, and CVE-equivalent disclosures
Community findings -- Forums, social media, and responsible disclosure reports often surface techniques before formal publications

Phase 2: Baseline Assessment

Establish the model's default behavior before attempting any attacks:

Refusal calibration -- Submit a standardized set of requests across harm categories (violence, illegal activity, privacy, bias). Record what the model refuses and how it phrases refusals.
System prompt adherence -- Test how strongly the model follows system-level instructions versus user-level overrides.
Output format compliance -- Determine whether the model reliably follows structured output constraints, as format manipulation is a common attack primitive.
Tool use behavior -- If the model supports function calling, test its behavior with malformed schemas, conflicting instructions, and edge-case inputs.

Phase 3: Attack Surface Mapping

Map the specific attack surfaces based on your reconnaissance and baseline assessment:

Model Attack Surface Map
========================
1. Input channels:    [text, image, audio, video, files, URLs]
2. Output channels:   [text, function calls, code, images]
3. Safety layers:     [pre-filter, alignment, post-filter, content policy]
4. Integration points: [tools, retrieval, plugins, code execution]
5. Context handling:   [window size, memory, conversation state]
6. Known weaknesses:   [from reconnaissance phase]

Phase 4: Targeted Testing

With your attack surface map in hand, design targeted test cases for each identified surface. Prioritize based on:

Impact -- Which attack surfaces, if exploited, lead to the most significant consequences?
Novelty -- Which surfaces are least likely to have been tested by others?
Transferability -- Which findings would generalize to other deployments of the same model?

Phase 5: Cross-Model Validation

Test your findings against other models to determine whether vulnerabilities are model-specific or architectural:

If a technique works across multiple models, it likely exploits a fundamental LLM limitation
If it only works on one model, it targets that model's specific training or deployment choices
Document both cases, as model-specific vulnerabilities are often the most actionable for defenders

Section Overview

This section provides deep dives into the major model families you will encounter in production:

GPT-4 / GPT-4o -- OpenAI's flagship models, their rumored MoE architecture, function calling surface, and known vulnerability history
Claude -- Anthropic's model family, Constitutional AI training, and the unique attack surfaces it creates
Gemini -- Google's natively multimodal model, long context exploitation, and Google ecosystem integration risks
Open-Weight Models -- Llama, Mistral, Qwen, DeepSeek, and the fundamentally different threat model when weights are public
Cross-Model Comparison -- Standardized comparison methodology, safety coverage gaps, and jailbreak portability

Each model section follows the same structure: architecture overview, attack surface analysis, documented vulnerabilities, and testing methodology. This consistency allows you to build a mental model for comparing models and quickly identify what makes each one unique from a security perspective.

Prompt Injection & Jailbreaks -- Core injection techniques applied across all models
LLM Internals -- Architecture fundamentals that underpin model-specific behaviors
Agent & Agentic Exploitation -- Tool use and function calling attacks across model families
Exploit Dev & Tooling -- Building automated testing tools for model assessment
Multimodal Attacks -- Cross-modal attack techniques relevant to vision-capable models

References

Anthropic (2024). "Many-Shot Jailbreaking"
Wei, A. et al. (2023). "Jailbroken: How Does LLM Safety Training Fail?"
Zou, A. et al. (2023). "Universal and Transferable Adversarial Attacks on Aligned Language Models"
Shayegani, E. et al. (2023). "Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks"
OWASP (2025). OWASP Top 10 for LLM Applications

Knowledge Check

Why is it important to profile a model's safety training approach (RLHF vs Constitutional AI vs DPO) before red teaming?

Model Deep Dives

Why Model-Specific Knowledge Matters

The Architecture-to-Attack-Surface Pipeline

Dimensions of Model Difference

Safety Training Approach

Context Window and Memory

Multimodal Capabilities

Tool Use and Function Calling

Deployment and API Surface

Methodology for Profiling a New Model

Phase 1: Reconnaissance

Phase 2: Baseline Assessment

Phase 3: Attack Surface Mapping

Phase 4: Targeted Testing

Phase 5: Cross-Model Validation

Section Overview

References

Learning Path

Model Deep Dives

Why Model-Specific Knowledge Matters

The Architecture-to-Attack-Surface Pipeline

Dimensions of Model Difference

Safety Training Approach

Context Window and Memory

Multimodal Capabilities

Tool Use and Function Calling

Deployment and API Surface

Methodology for Profiling a New Model

Phase 1: Reconnaissance

Phase 2: Baseline Assessment

Phase 3: Attack Surface Mapping

Phase 4: Targeted Testing

Phase 5: Cross-Model Validation

Section Overview

References

Learning Path

Model Deep Dives

Learning Path

Related articles

Model Deep Dives

Learning Path

Related articles