Model Types and Their Attack Surfaces

beginner10 min readUpdated 2026-03-15

How text, vision, multimodal, embedding, and code generation models each present unique vulnerabilities and attack surfaces for red teamers.

model-types attack-surface multimodal code-generation beginner

Why Model Type Matters

Not all AI models are created equal from a security perspective. A text-only chatbot, a vision classifier, a code generation tool, and an embedding model all use neural networks, but their inputs, outputs, and failure modes differ dramatically. Understanding the model type tells you which attacks are possible, which are likely to succeed, and which are irrelevant.

Text Generation Models

Text generation models (GPT-4, Claude, Llama, Mistral) accept text input and produce text output. They are the most common target for AI red teaming.

Attack Surface

Vector	Description	Techniques
Prompt injection	Adversarial instructions in user input or retrieved content	Direct injection, indirect injection, instruction hierarchy exploitation
Jailbreaking	Overriding safety training to produce restricted content	Role-play, encoding tricks, many-shot, crescendo attacks
System prompt extraction	Making the model reveal its system instructions	Instruction confusion, role reversal, meta-prompting
Training data extraction	Prompting the model to reproduce memorized training data	Verbatim elicitation, completion-based extraction
Output manipulation	Steering model outputs toward specific content	Prefix injection, logit manipulation (if API exposes logprobs)

Unique Characteristics

Text models are trained on internet-scale text data, which means they have memorized a vast amount of information — including potentially sensitive, copyrighted, and personal data. Their natural language interface makes them accessible to attackers with no technical expertise, and their general-purpose nature means that guardrails must cover an enormous range of potential harmful outputs.

Vision Models

Vision models process images as input. This includes image classifiers (ResNet, ViT), object detectors (YOLO), and image-understanding models that generate text descriptions of images.

Attack Surface

Vector	Description	Techniques
Adversarial perturbation	Imperceptible pixel changes that cause misclassification	FGSM, PGD, C&W attacks
Adversarial patches	Visible but seemingly innocuous image patches that trigger specific behavior	Physical-world stickers, QR-code-like adversarial patches
Typographic attacks	Text embedded in images that influences model behavior	Adding text to images that vision-language models process as instructions
Steganographic payloads	Hidden data in images that models detect but humans do not	Embedding adversarial content in image metadata or pixel patterns
Data poisoning	Corrupting training images to embed backdoors	Trigger patches that cause specific classification when present

Unique Characteristics

Vision models operate in a continuous input space (pixel values) rather than a discrete one (tokens), which enables gradient-based adversarial perturbation attacks that have no direct analogue in text models. The physical-world applicability of vision attacks also differs — adversarial patches on stop signs or medical images can have safety-critical consequences.

Multimodal Models

Multimodal models (GPT-4V, Gemini, Claude with vision) accept multiple input types — text and images, or text, images, and audio. They represent the frontier of both capability and attack surface.

Attack Surface

Multimodal models inherit the attack surfaces of every modality they support, plus entirely new cross-modal attack vectors:

Vector	Description	Techniques
Cross-modal injection	Adversarial instructions in one modality (e.g., image) that affect processing of another modality (e.g., text)	Text in images that overrides system prompts
Modality mismatch	Exploiting differences in how the model processes different modalities	Using images to convey content that would be blocked in text
Modal confusion	Causing the model to misinterpret the relationship between modalities	Pairing innocuous text with adversarial images
OCR exploitation	Exploiting the model's ability to read text in images	Embedding prompt injection in screenshots, documents, or photos
Audio injection	Hidden commands in audio that the model processes but humans cannot hear	Ultrasonic or masked audio commands

Unique Characteristics

The cross-modal attack surface is the defining security challenge of multimodal models. Each modality has its own pre-processing pipeline, and the fusion layer where modalities are combined is a high-value target. Attacks that exploit the gap between how different modalities are filtered are particularly effective because guardrail systems often have inconsistent coverage across modalities.

Embedding Models

Embedding models (text-embedding-3, voyage-3, BGE) convert text (or other data) into dense numerical vectors. They do not generate text — they produce fixed-dimensional vectors used for search, classification, and RAG.

Attack Surface

Vector	Description	Techniques
Embedding collision	Crafting adversarial text that produces similar embeddings to target text	Gradient-based optimization, genetic algorithms
Retrieval poisoning	Injecting documents that will be retrieved for specific queries	SEO-like techniques for vector databases
Membership inference	Determining whether specific text was in the training data	Statistical analysis of embedding distances
Embedding inversion	Recovering original text from embeddings	Trained inversion models that map vectors back to text
Clustering manipulation	Corrupting the organization of an embedding space	Inserting adversarial documents that distort cluster boundaries

Unique Characteristics

Embedding models are foundational to RAG systems. If an attacker can manipulate what gets retrieved, they can influence the generative model's output without ever directly attacking it. Embedding attacks are indirect but highly effective because they exploit the trust that generative models place in retrieved content.

Code Generation Models

Code generation models (Codex, CodeLlama, StarCoder, Copilot) are trained to produce executable code. This creates a fundamentally different risk profile than text generation.

Attack Surface

Vector	Description	Techniques
Malicious code generation	Tricking the model into producing code with vulnerabilities or backdoors	Prompt injection targeting code comments, specification manipulation
Dependency confusion	Generated code references malicious packages	Training data includes references to attacker-controlled packages
Code injection via context	Injecting malicious code through repository context or RAG	Poisoning code repositories that the model retrieves from
Execution environment escape	Generated code that escapes its sandbox	Shell commands, file system access, network calls in generated code
Supply chain attacks	Compromising the model's training data to embed persistent code vulnerabilities	Poisoning open-source repositories used in training data

Unique Characteristics

The critical difference with code generation models is that their output is often executed. A harmful text output is concerning; harmful code that gets executed is an active exploit. Code generation models also face the unique challenge of package hallucination — generating import statements for packages that do not exist, which attackers can register and populate with malware.

Specialized and Domain-Specific Models

Several categories of specialized models present unique attack surfaces:

Speech and Audio Models

Models like Whisper (speech-to-text) and text-to-speech systems face adversarial audio attacks, speaker spoofing, and hidden command injection in audio streams.

Image Generation Models

Models like DALL-E, Stable Diffusion, and Midjourney can be manipulated to generate harmful imagery, bypass content filters through obfuscated prompts, or reproduce copyrighted material. Attacks focus on prompt obfuscation and classifier evasion.

Recommendation Models

Models that recommend content, products, or actions can be manipulated through interaction poisoning (strategic clicks or ratings), preference injection, and filter bubble exploitation.

Classification Models

Binary or multi-class classifiers used for content moderation, spam detection, or safety filtering are targets for evasion attacks. Techniques include adversarial perturbation, boundary exploitation, and training data manipulation.

Choosing Your Attack Strategy by Model Type

Model Type	Primary Attack Strategy	Key Risk
Text generation	Prompt injection, jailbreaking	Harmful content, data leakage
Vision	Adversarial perturbation, typographic attacks	Misclassification, safety bypass
Multimodal	Cross-modal injection, modality mismatch	Widest attack surface, filter gaps
Embedding	Retrieval poisoning, embedding collision	Indirect influence on downstream models
Code generation	Malicious code generation, execution escape	Code execution, supply chain compromise

The AI Landscape — the broader ecosystem these models operate in
Deployment Patterns — how deployment affects model-type-specific risks
Adversarial ML: Core Concepts — the theoretical foundations of adversarial attacks
How LLMs Work — deep dive into text generation model internals

References

"Adversarial Examples in the Physical World" - Kurakin et al. (2017) - Seminal work demonstrating adversarial perturbations that transfer to physical-world vision systems
"Visual Adversarial Examples Jailbreak Aligned Large Language Models" - Qi et al. (2023) - Research on cross-modal attacks against multimodal language models
"Poisoning Language Models During Instruction Tuning" - Wan et al. (2023) - Techniques for embedding backdoors in fine-tuned language models
"Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions" - Pearce et al. (2022) - Systematic evaluation of security vulnerabilities in AI-generated code

Knowledge Check

Why are multimodal models considered to have the widest attack surface?

Edit this page on GitHub

Model Types and Their Attack Surfaces

beginner10 min readUpdated 2026-03-15

How text, vision, multimodal, embedding, and code generation models each present unique vulnerabilities and attack surfaces for red teamers.

model-types attack-surface multimodal code-generation beginner

Why Model Type Matters

Text Generation Models

Text generation models (GPT-4, Claude, Llama, Mistral) accept text input and produce text output. They are the most common target for AI red teaming.

Attack Surface

Vector	Description	Techniques
Prompt injection	Adversarial instructions in user input or retrieved content	Direct injection, indirect injection, instruction hierarchy exploitation
Jailbreaking	Overriding safety training to produce restricted content	Role-play, encoding tricks, many-shot, crescendo attacks
System prompt extraction	Making the model reveal its system instructions	Instruction confusion, role reversal, meta-prompting
Training data extraction	Prompting the model to reproduce memorized training data	Verbatim elicitation, completion-based extraction
Output manipulation	Steering model outputs toward specific content	Prefix injection, logit manipulation (if API exposes logprobs)

Unique Characteristics

Vision Models

Vision models process images as input. This includes image classifiers (ResNet, ViT), object detectors (YOLO), and image-understanding models that generate text descriptions of images.

Attack Surface

Vector	Description	Techniques
Adversarial perturbation	Imperceptible pixel changes that cause misclassification	FGSM, PGD, C&W attacks
Adversarial patches	Visible but seemingly innocuous image patches that trigger specific behavior	Physical-world stickers, QR-code-like adversarial patches
Typographic attacks	Text embedded in images that influences model behavior	Adding text to images that vision-language models process as instructions
Steganographic payloads	Hidden data in images that models detect but humans do not	Embedding adversarial content in image metadata or pixel patterns
Data poisoning	Corrupting training images to embed backdoors	Trigger patches that cause specific classification when present

Unique Characteristics

Multimodal Models

Multimodal models (GPT-4V, Gemini, Claude with vision) accept multiple input types — text and images, or text, images, and audio. They represent the frontier of both capability and attack surface.

Attack Surface

Multimodal models inherit the attack surfaces of every modality they support, plus entirely new cross-modal attack vectors:

Vector	Description	Techniques
Cross-modal injection	Adversarial instructions in one modality (e.g., image) that affect processing of another modality (e.g., text)	Text in images that overrides system prompts
Modality mismatch	Exploiting differences in how the model processes different modalities	Using images to convey content that would be blocked in text
Modal confusion	Causing the model to misinterpret the relationship between modalities	Pairing innocuous text with adversarial images
OCR exploitation	Exploiting the model's ability to read text in images	Embedding prompt injection in screenshots, documents, or photos
Audio injection	Hidden commands in audio that the model processes but humans cannot hear	Ultrasonic or masked audio commands

Unique Characteristics

Embedding Models

Attack Surface

Vector	Description	Techniques
Embedding collision	Crafting adversarial text that produces similar embeddings to target text	Gradient-based optimization, genetic algorithms
Retrieval poisoning	Injecting documents that will be retrieved for specific queries	SEO-like techniques for vector databases
Membership inference	Determining whether specific text was in the training data	Statistical analysis of embedding distances
Embedding inversion	Recovering original text from embeddings	Trained inversion models that map vectors back to text
Clustering manipulation	Corrupting the organization of an embedding space	Inserting adversarial documents that distort cluster boundaries

Unique Characteristics

Code Generation Models

Code generation models (Codex, CodeLlama, StarCoder, Copilot) are trained to produce executable code. This creates a fundamentally different risk profile than text generation.

Attack Surface

Vector	Description	Techniques
Malicious code generation	Tricking the model into producing code with vulnerabilities or backdoors	Prompt injection targeting code comments, specification manipulation
Dependency confusion	Generated code references malicious packages	Training data includes references to attacker-controlled packages
Code injection via context	Injecting malicious code through repository context or RAG	Poisoning code repositories that the model retrieves from
Execution environment escape	Generated code that escapes its sandbox	Shell commands, file system access, network calls in generated code
Supply chain attacks	Compromising the model's training data to embed persistent code vulnerabilities	Poisoning open-source repositories used in training data

Model Type	Primary Attack Strategy	Key Risk
Text generation	Prompt injection, jailbreaking	Harmful content, data leakage
Vision	Adversarial perturbation, typographic attacks	Misclassification, safety bypass
Multimodal	Cross-modal injection, modality mismatch	Widest attack surface, filter gaps
Embedding	Retrieval poisoning, embedding collision	Indirect influence on downstream models
Code generation	Malicious code generation, execution escape	Code execution, supply chain compromise

The AI Landscape — the broader ecosystem these models operate in
Deployment Patterns — how deployment affects model-type-specific risks
Adversarial ML: Core Concepts — the theoretical foundations of adversarial attacks
How LLMs Work — deep dive into text generation model internals

References

"Adversarial Examples in the Physical World" - Kurakin et al. (2017) - Seminal work demonstrating adversarial perturbations that transfer to physical-world vision systems
"Visual Adversarial Examples Jailbreak Aligned Large Language Models" - Qi et al. (2023) - Research on cross-modal attacks against multimodal language models
"Poisoning Language Models During Instruction Tuning" - Wan et al. (2023) - Techniques for embedding backdoors in fine-tuned language models
"Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions" - Pearce et al. (2022) - Systematic evaluation of security vulnerabilities in AI-generated code

Knowledge Check

Why are multimodal models considered to have the widest attack surface?

Edit this page on GitHub

Model Types and Their Attack Surfaces

Related articles

Model Types and Their Attack Surfaces

Related articles