Model Types and Their Attack Surfaces
How text, vision, multimodal, embedding, and code generation models each present unique vulnerabilities and attack surfaces for red teamers.
Why Model Type Matters
Not all AI models are created equal from a security perspective. A text-only chatbot, a vision classifier, a code generation tool, and an embedding model all use neural networks, but their inputs, outputs, and failure modes differ dramatically. Understanding the model type tells you which attacks are possible, which are likely to succeed, and which are irrelevant.
Text Generation Models
Text generation models (GPT-4, Claude, Llama, Mistral) accept text input and produce text output. They are the most common target for AI red teaming.
Attack Surface
| Vector | Description | Techniques |
|---|---|---|
| Prompt injection | Adversarial instructions in user input or retrieved content | Direct injection, indirect injection, instruction hierarchy exploitation |
| Jailbreaking | Overriding safety training to produce restricted content | Role-play, encoding tricks, many-shot, crescendo attacks |
| System prompt extraction | Making the model reveal its system instructions | Instruction confusion, role reversal, meta-prompting |
| Training data extraction | Prompting the model to reproduce memorized training data | Verbatim elicitation, completion-based extraction |
| Output manipulation | Steering model outputs toward specific content | Prefix injection, logit manipulation (if API exposes logprobs) |
Unique Characteristics
Text models are trained on internet-scale text data, which means they have memorized a vast amount of information — including potentially sensitive, copyrighted, and personal data. Their natural language interface makes them accessible to attackers with no technical expertise, and their general-purpose nature means that guardrails must cover an enormous range of potential harmful outputs.
Vision Models
Vision models process images as input. This includes image classifiers (ResNet, ViT), object detectors (YOLO), and image-understanding models that generate text descriptions of images.
Attack Surface
| Vector | Description | Techniques |
|---|---|---|
| Adversarial perturbation | Imperceptible pixel changes that cause misclassification | FGSM, PGD, C&W attacks |
| Adversarial patches | Visible but seemingly innocuous image patches that trigger specific behavior | Physical-world stickers, QR-code-like adversarial patches |
| Typographic attacks | Text embedded in images that influences model behavior | Adding text to images that vision-language models process as instructions |
| Steganographic payloads | Hidden data in images that models detect but humans do not | Embedding adversarial content in image metadata or pixel patterns |
| Data poisoning | Corrupting training images to embed backdoors | Trigger patches that cause specific classification when present |
Unique Characteristics
Vision models operate in a continuous input space (pixel values) rather than a discrete one (tokens), which enables gradient-based adversarial perturbation attacks that have no direct analogue in text models. The physical-world applicability of vision attacks also differs — adversarial patches on stop signs or medical images can have safety-critical consequences.
Multimodal Models
Multimodal models (GPT-4V, Gemini, Claude with vision) accept multiple input types — text and images, or text, images, and audio. They represent the frontier of both capability and attack surface.
Attack Surface
Multimodal models inherit the attack surfaces of every modality they support, plus entirely new cross-modal attack vectors:
| Vector | Description | Techniques |
|---|---|---|
| Cross-modal injection | Adversarial instructions in one modality (e.g., image) that affect processing of another modality (e.g., text) | Text in images that overrides system prompts |
| Modality mismatch | Exploiting differences in how the model processes different modalities | Using images to convey content that would be blocked in text |
| Modal confusion | Causing the model to misinterpret the relationship between modalities | Pairing innocuous text with adversarial images |
| OCR exploitation | Exploiting the model's ability to read text in images | Embedding prompt injection in screenshots, documents, or photos |
| Audio injection | Hidden commands in audio that the model processes but humans cannot hear | Ultrasonic or masked audio commands |
Unique Characteristics
The cross-modal attack surface is the defining security challenge of multimodal models. Each modality has its own pre-processing pipeline, and the fusion layer where modalities are combined is a high-value target. Attacks that exploit the gap between how different modalities are filtered are particularly effective because guardrail systems often have inconsistent coverage across modalities.
Embedding Models
Embedding models (text-embedding-3, voyage-3, BGE) convert text (or other data) into dense numerical vectors. They do not generate text — they produce fixed-dimensional vectors used for search, classification, and RAG.
Attack Surface
| Vector | Description | Techniques |
|---|---|---|
| Embedding collision | Crafting adversarial text that produces similar embeddings to target text | Gradient-based optimization, genetic algorithms |
| Retrieval poisoning | Injecting documents that will be retrieved for specific queries | SEO-like techniques for vector databases |
| Membership inference | Determining whether specific text was in the training data | Statistical analysis of embedding distances |
| Embedding inversion | Recovering original text from embeddings | Trained inversion models that map vectors back to text |
| Clustering manipulation | Corrupting the organization of an embedding space | Inserting adversarial documents that distort cluster boundaries |
Unique Characteristics
Embedding models are foundational to RAG systems. If an attacker can manipulate what gets retrieved, they can influence the generative model's output without ever directly attacking it. Embedding attacks are indirect but highly effective because they exploit the trust that generative models place in retrieved content.
Code Generation Models
Code generation models (Codex, CodeLlama, StarCoder, Copilot) are trained to produce executable code. This creates a fundamentally different risk profile than text generation.
Attack Surface
| Vector | Description | Techniques |
|---|---|---|
| Malicious code generation | Tricking the model into producing code with vulnerabilities or backdoors | Prompt injection targeting code comments, specification manipulation |
| Dependency confusion | Generated code references malicious packages | Training data includes references to attacker-controlled packages |
| Code injection via context | Injecting malicious code through repository context or RAG | Poisoning code repositories that the model retrieves from |
| Execution environment escape | Generated code that escapes its sandbox | Shell commands, file system access, network calls in generated code |
| Supply chain attacks | Compromising the model's training data to embed persistent code vulnerabilities | Poisoning open-source repositories used in training data |
Unique Characteristics
The critical difference with code generation models is that their output is often executed. A harmful text output is concerning; harmful code that gets executed is an active exploit. Code generation models also face the unique challenge of package hallucination — generating import statements for packages that do not exist, which attackers can register and populate with malware.
Specialized and Domain-Specific Models
Several categories of specialized models present unique attack surfaces:
Speech and Audio Models
Models like Whisper (speech-to-text) and text-to-speech systems face adversarial audio attacks, speaker spoofing, and hidden command injection in audio streams.
Image Generation Models
Models like DALL-E, Stable Diffusion, and Midjourney can be manipulated to generate harmful imagery, bypass content filters through obfuscated prompts, or reproduce copyrighted material. Attacks focus on prompt obfuscation and classifier evasion.
Recommendation Models
Models that recommend content, products, or actions can be manipulated through interaction poisoning (strategic clicks or ratings), preference injection, and filter bubble exploitation.
Classification Models
Binary or multi-class classifiers used for content moderation, spam detection, or safety filtering are targets for evasion attacks. Techniques include adversarial perturbation, boundary exploitation, and training data manipulation.
Choosing Your Attack Strategy by Model Type
| Model Type | Primary Attack Strategy | Key Risk |
|---|---|---|
| Text generation | Prompt injection, jailbreaking | Harmful content, data leakage |
| Vision | Adversarial perturbation, typographic attacks | Misclassification, safety bypass |
| Multimodal | Cross-modal injection, modality mismatch | Widest attack surface, filter gaps |
| Embedding | Retrieval poisoning, embedding collision | Indirect influence on downstream models |
| Code generation | Malicious code generation, execution escape | Code execution, supply chain compromise |
Related Topics
- The AI Landscape — the broader ecosystem these models operate in
- Deployment Patterns — how deployment affects model-type-specific risks
- Adversarial ML: Core Concepts — the theoretical foundations of adversarial attacks
- How LLMs Work — deep dive into text generation model internals
References
- "Adversarial Examples in the Physical World" - Kurakin et al. (2017) - Seminal work demonstrating adversarial perturbations that transfer to physical-world vision systems
- "Visual Adversarial Examples Jailbreak Aligned Large Language Models" - Qi et al. (2023) - Research on cross-modal attacks against multimodal language models
- "Poisoning Language Models During Instruction Tuning" - Wan et al. (2023) - Techniques for embedding backdoors in fine-tuned language models
- "Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions" - Pearce et al. (2022) - Systematic evaluation of security vulnerabilities in AI-generated code
Why are multimodal models considered to have the widest attack surface?