GPT-4 / GPT-4o Overview
Architecture overview of OpenAI's GPT-4 and GPT-4o models, including rumored Mixture of Experts design, capabilities, API surface, and security-relevant features for red teaming.
GPT-4 is OpenAI's flagship large language model and one of the most widely deployed LLMs in production applications. For red teamers, it represents both the most commonly encountered target and one of the most heavily tested -- meaning easy wins are rare, but architectural understanding reveals attack surfaces that surface-level testing misses.
Architecture
Rumored Mixture of Experts (MoE)
Multiple credible sources, including leaked information and independent analysis, suggest GPT-4 uses a Mixture of Experts (MoE) architecture. The reported configuration involves approximately 1.8 trillion total parameters across multiple expert networks, with roughly 220 billion parameters activated per forward pass through a routing mechanism.
From a security perspective, MoE architecture has several implications:
- Expert routing as an attack surface -- Different inputs may activate different expert networks. If safety behavior is concentrated in specific experts, routing manipulation could potentially bypass safety.
- Inconsistent behavior across domains -- Different experts may have different safety calibrations, leading to inconsistent refusal behavior across topics.
- Sparse activation effects -- The gating mechanism's decisions about which experts to activate may be influenceable through carefully crafted inputs.
Model Variants
| Variant | Context Window | Key Differences | Red Team Relevance |
|---|---|---|---|
| GPT-4 (original) | 8K / 32K | Dense attention, slower | Baseline for vulnerability comparison |
| GPT-4 Turbo | 128K | Expanded context, faster, cheaper | Many-shot attacks viable, knowledge cutoff differences |
| GPT-4o | 128K | Natively multimodal, faster | Vision attack surface, audio input attacks |
| GPT-4o-mini | 128K | Smaller, cheaper, faster | Potentially weaker safety, cost-effective for automated testing |
Each variant may have different safety tuning. GPT-4o-mini, being a smaller and cheaper model, has historically shown weaker safety guardrails in certain categories -- a pattern common across model families where smaller variants receive less safety investment.
Training and Safety Approach
OpenAI uses a multi-layered safety approach for GPT-4:
RLHF (Reinforcement Learning from Human Feedback)
GPT-4's primary alignment mechanism is RLHF, where human raters evaluate model outputs and the model is trained to maximize preference scores. This creates safety behavior that is learned from rated examples rather than derived from explicit principles.
Security implications of RLHF:
- Safety behavior is strongest for patterns well-represented in training data
- Novel phrasings or unusual contexts may fall outside the trained distribution
- The model may exhibit sycophantic behavior -- agreeing with users even when it should refuse -- because RLHF rewards agreeability
- Refusal calibration can shift between model updates without public documentation
Rule-Based Reward Models (RBRM)
OpenAI supplements RLHF with rule-based reward models that score outputs against specific policy violations. These provide more consistent enforcement than human feedback alone but can be reverse-engineered through systematic probing.
Content Policy and Moderation API
A separate moderation layer evaluates both inputs and outputs against OpenAI's content policy. This is distinct from the model's own safety training and can be tested independently. The Moderation API is publicly available and can be probed to map content policy boundaries before testing the model itself.
API Surface
GPT-4's API provides multiple interaction channels, each representing a distinct attack surface:
Chat Completions API
The primary interface uses a message array with role-based formatting:
{
"model": "gpt-4o",
"messages": [
{"role": "system", "content": "System instructions..."},
{"role": "user", "content": "User message..."},
{"role": "assistant", "content": "Previous response..."}
]
}The role hierarchy (system > user > assistant) is enforced through training but not structurally guaranteed. System message override attacks remain a core testing area (see Attack Surface).
Function Calling / Tool Use
GPT-4 supports structured function calling where the model generates JSON arguments for defined functions:
{
"tools": [{
"type": "function",
"function": {
"name": "search_database",
"description": "Search the customer database",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string"}
}
}
}
}]
}Function definitions are injected into the model's context and processed alongside other instructions. This creates opportunities for injection through function descriptions, parameter schemas, and function response content. See Agent & Agentic Exploitation for detailed tool-use attack patterns.
Structured Outputs
The response_format parameter constrains the model's output to valid JSON matching a provided schema. While designed for reliability, structured outputs interact with safety training in complex ways -- the model may produce content in structured format that it would refuse in free-text format, or safety refusals may break the required schema.
Vision Input (GPT-4o)
GPT-4o accepts images alongside text, creating cross-modal attack opportunities. Images can contain text that the model reads and follows, embedding indirect injection payloads in visual content.
Additional Parameters
- Temperature and top_p -- Affect output randomness, which influences safety behavior consistency
- Logprobs -- Returns token-level log probabilities, useful for understanding the model's confidence in safety-related decisions
- Logit bias -- Directly biases token probabilities, potentially suppressing refusal tokens
- Stop sequences -- Can be used to truncate safety disclaimers
Key Capabilities for Red Teaming
Code Interpretation
GPT-4 can generate and reason about code in most programming languages. This is relevant for:
- Testing whether the model generates exploits or malware when properly prompted
- Understanding how code generation interacts with safety filters
- Evaluating tool use in coding assistant deployments
Web Browsing (ChatGPT)
In ChatGPT deployments, GPT-4 can browse the web, creating indirect injection vectors through attacker-controlled web pages. The model reads page content and may follow instructions embedded in it.
File Analysis
GPT-4 can process uploaded files (PDFs, spreadsheets, code files), each representing a potential injection vector. Malicious content embedded in documents can influence the model's behavior when it processes those files.
OpenAI-Specific Considerations
Model Updates and Versioning
OpenAI regularly updates models, sometimes changing safety behavior without public notice. Pin specific model versions (e.g., gpt-4-0613) when conducting reproducible testing. The gpt-4 alias may point to different underlying models over time.
Rate Limits and Usage Tiers
API rate limits vary by account tier and affect testing throughput. Automated red teaming campaigns must account for rate limiting to avoid disruption and ensure test coverage.
Custom GPTs and Assistants API
OpenAI's Custom GPTs and Assistants API allow third parties to build on GPT-4 with custom system prompts and tool configurations. These deployments are often less thoroughly secured than OpenAI's first-party products and represent high-value targets for red teaming.
Related Topics
- GPT-4 Attack Surface -- Specific attack vectors for GPT-4
- GPT-4 Known Vulnerabilities -- Documented exploits and incidents
- GPT-4 Testing Methodology -- Systematic testing procedures
- Prompt Injection & Jailbreaks -- Core injection techniques applicable to GPT-4
- Cross-Model Comparison -- How GPT-4 compares to other model families
References
- OpenAI (2023). "GPT-4 Technical Report"
- OpenAI (2024). "GPT-4o System Card"
- OpenAI (2025). API Documentation
- Shazeer, N. et al. (2017). "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer"
Why does GPT-4's rumored Mixture of Experts architecture matter for red teaming?