What is Attack Surface?

Comprehensive analysis of GPT-4-specific attack vectors including function calling exploitation, vision input attacks, system message hierarchy abuse, structured output manipulation, and known jailbreak patterns.

What is Known Vulnerabilities?

Documented GPT-4 vulnerabilities including DAN jailbreaks, data extraction incidents, system prompt leaks, tool-use exploits, and fine-tuning safety removal.

What is Testing Methodology?

Systematic methodology for red teaming GPT-4, including API-based probing techniques, rate limit considerations, content policy mapping, and safety boundary discovery.

GPT-4 / GPT-4o Overview

intermediate8 min readUpdated 2026-03-15

Architecture overview of OpenAI's GPT-4 and GPT-4o models, including rumored Mixture of Experts design, capabilities, API surface, and security-relevant features for red teaming.

gpt-4 openai architecture moe red-teaming

GPT-4 is OpenAI's flagship large language model and one of the most widely deployed LLMs in production applications. For red teamers, it represents both the most commonly encountered target and one of the most heavily tested -- meaning easy wins are rare, but architectural understanding reveals attack surfaces that surface-level testing misses.

Architecture

Rumored Mixture of Experts (MoE)

Multiple credible sources, including leaked information and independent analysis, suggest GPT-4 uses a Mixture of Experts (MoE) architecture. The reported configuration involves approximately 1.8 trillion total parameters across multiple expert networks, with roughly 220 billion parameters activated per forward pass through a routing mechanism.

From a security perspective, MoE architecture has several implications:

Expert routing as an attack surface -- Different inputs may activate different expert networks. If safety behavior is concentrated in specific experts, routing manipulation could potentially bypass safety.
Inconsistent behavior across domains -- Different experts may have different safety calibrations, leading to inconsistent refusal behavior across topics.
Sparse activation effects -- The gating mechanism's decisions about which experts to activate may be influenceable through carefully crafted inputs.

Model Variants

Variant	Context Window	Key Differences	Red Team Relevance
GPT-4 (original)	8K / 32K	Dense attention, slower	Baseline for vulnerability comparison
GPT-4 Turbo	128K	Expanded context, faster, cheaper	Many-shot attacks viable, knowledge cutoff differences
GPT-4o	128K	Natively multimodal, faster	Vision attack surface, audio input attacks
GPT-4o-mini	128K	Smaller, cheaper, faster	Potentially weaker safety, cost-effective for automated testing

Each variant may have different safety tuning. GPT-4o-mini, being a smaller and cheaper model, has historically shown weaker safety guardrails in certain categories -- a pattern common across model families where smaller variants receive less safety investment.

Training and Safety Approach

OpenAI uses a multi-layered safety approach for GPT-4:

RLHF (Reinforcement Learning from Human Feedback)

GPT-4's primary alignment mechanism is RLHF, where human raters evaluate model outputs and the model is trained to maximize preference scores. This creates safety behavior that is learned from rated examples rather than derived from explicit principles.

Security implications of RLHF:

Safety behavior is strongest for patterns well-represented in training data
Novel phrasings or unusual contexts may fall outside the trained distribution
The model may exhibit sycophantic behavior -- agreeing with users even when it should refuse -- because RLHF rewards agreeability
Refusal calibration can shift between model updates without public documentation

Rule-Based Reward Models (RBRM)

OpenAI supplements RLHF with rule-based reward models that score outputs against specific policy violations. These provide more consistent enforcement than human feedback alone but can be reverse-engineered through systematic probing.

Content Policy and Moderation API

A separate moderation layer evaluates both inputs and outputs against OpenAI's content policy. This is distinct from the model's own safety training and can be tested independently. The Moderation API is publicly available and can be probed to map content policy boundaries before testing the model itself.

API Surface

GPT-4's API provides multiple interaction channels, each representing a distinct attack surface:

Chat Completions API

The primary interface uses a message array with role-based formatting:

{
  "model": "gpt-4o",
  "messages": [
    {"role": "system", "content": "System instructions..."},
    {"role": "user", "content": "User message..."},
    {"role": "assistant", "content": "Previous response..."}
  ]
}

The role hierarchy (system > user > assistant) is enforced through training but not structurally guaranteed. System message override attacks remain a core testing area (see Attack Surface).

Function Calling / Tool Use

GPT-4 supports structured function calling where the model generates JSON arguments for defined functions:

{
  "tools": [{
    "type": "function",
    "function": {
      "name": "search_database",
      "description": "Search the customer database",
      "parameters": {
        "type": "object",
        "properties": {
          "query": {"type": "string"}
        }
      }
    }
  }]
}

Function definitions are injected into the model's context and processed alongside other instructions. This creates opportunities for injection through function descriptions, parameter schemas, and function response content. See Agent & Agentic Exploitation for detailed tool-use attack patterns.

Structured Outputs

The response_format parameter constrains the model's output to valid JSON matching a provided schema. While designed for reliability, structured outputs interact with safety training in complex ways -- the model may produce content in structured format that it would refuse in free-text format, or safety refusals may break the required schema.

Vision Input (GPT-4o)

GPT-4o accepts images alongside text, creating cross-modal attack opportunities. Images can contain text that the model reads and follows, embedding indirect injection payloads in visual content.

Additional Parameters

Temperature and top_p -- Affect output randomness, which influences safety behavior consistency
Logprobs -- Returns token-level log probabilities, useful for understanding the model's confidence in safety-related decisions
Logit bias -- Directly biases token probabilities, potentially suppressing refusal tokens
Stop sequences -- Can be used to truncate safety disclaimers

Key Capabilities for Red Teaming

Code Interpretation

GPT-4 can generate and reason about code in most programming languages. This is relevant for:

Testing whether the model generates exploits or malware when properly prompted
Understanding how code generation interacts with safety filters
Evaluating tool use in coding assistant deployments

Web Browsing (ChatGPT)

In ChatGPT deployments, GPT-4 can browse the web, creating indirect injection vectors through attacker-controlled web pages. The model reads page content and may follow instructions embedded in it.

File Analysis

GPT-4 can process uploaded files (PDFs, spreadsheets, code files), each representing a potential injection vector. Malicious content embedded in documents can influence the model's behavior when it processes those files.

OpenAI-Specific Considerations

Model Updates and Versioning

OpenAI regularly updates models, sometimes changing safety behavior without public notice. Pin specific model versions (e.g., gpt-4-0613) when conducting reproducible testing. The gpt-4 alias may point to different underlying models over time.

Rate Limits and Usage Tiers

API rate limits vary by account tier and affect testing throughput. Automated red teaming campaigns must account for rate limiting to avoid disruption and ensure test coverage.

Custom GPTs and Assistants API

OpenAI's Custom GPTs and Assistants API allow third parties to build on GPT-4 with custom system prompts and tool configurations. These deployments are often less thoroughly secured than OpenAI's first-party products and represent high-value targets for red teaming.

GPT-4 Attack Surface -- Specific attack vectors for GPT-4
GPT-4 Known Vulnerabilities -- Documented exploits and incidents
GPT-4 Testing Methodology -- Systematic testing procedures
Prompt Injection & Jailbreaks -- Core injection techniques applicable to GPT-4
Cross-Model Comparison -- How GPT-4 compares to other model families

References

OpenAI (2023). "GPT-4 Technical Report"
OpenAI (2024). "GPT-4o System Card"
OpenAI (2025). API Documentation
Shazeer, N. et al. (2017). "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer"

Knowledge Check

Why does GPT-4's rumored Mixture of Experts architecture matter for red teaming?

GPT-4 / GPT-4o Overview

Learning Path

Related articles

GPT-4 / GPT-4o Overview

Learning Path

Related articles