VLM Architecture & Vision-Language Alignment
Deep dive into VLM architectures including CLIP, SigLIP, and vision transformers. How image patches become tokens, alignment training, and where misalignment creates exploitable gaps.
VLM Architecture Overview
A modern VLM consists of three core components connected in series:
┌─────────────┐ ┌──────────────────┐ ┌─────────────┐
│ Vision │ │ Alignment / │ │ Language │
│ Encoder │───▶│ Projection │───▶│ Model │
│ (ViT/CLIP) │ │ Layer │ │ (LLM) │
└─────────────┘ └──────────────────┘ └─────────────┘
      Image               Bridging               Text + Vision
      Patches             Representations        Reasoning
Each component introduces distinct attack surfaces. Understanding the architecture is prerequisite to understanding the attacks.
Vision Encoders
Vision Transformers (ViT)
The Vision Transformer (ViT) is the backbone of most modern VLMs. The key insight: treat an image like a sequence of words.
Patch Extraction
The input image (e.g., 224x224) is divided into non-overlapping patches (e.g., 16x16 pixels each), yielding 196 patches.
Linear Embedding
Each patch is flattened into a vector and projected through a linear layer to produce a patch embedding of dimension d_model.
Position Encoding
Learnable position embeddings are added so the model knows spatial relationships between patches.
Transformer Processing
The sequence of patch embeddings passes through multiple transformer layers with self-attention, producing contextualized representations.
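The four steps above can be sketched in a few lines of PyTorch. This is a minimal illustration, not any particular model's implementation: the class name `PatchEmbedding` is ours, and the dimensions follow the 224x224 / 16x16 example with a typical d_model of 768.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Illustrative ViT front-end: patch extraction + linear embedding + position encoding."""
    def __init__(self, img_size=224, patch_size=16, d_model=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        # A strided convolution extracts and linearly projects each
        # non-overlapping patch in a single operation.
        self.proj = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)
        # Learnable position embeddings, one per patch.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, d_model))

    def forward(self, images):
        # images: [batch, 3, 224, 224]
        x = self.proj(images)             # [batch, d_model, 14, 14]
        x = x.flatten(2).transpose(1, 2)  # [batch, 196, d_model]
        return x + self.pos_embed         # add spatial position information

embed = PatchEmbedding()
tokens = embed(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```

The resulting sequence of 196 token embeddings is what the transformer layers (and, downstream, the attacker) operate on.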
CLIP and Contrastive Training
CLIP changed VLM architectures by learning a shared embedding space for images and text. Its training objective: make matching image-text pairs close in embedding space, and non-matching pairs far apart.
```python
# Simplified CLIP contrastive loss
import torch
import torch.nn.functional as F

def clip_loss(image_embeddings, text_embeddings, temperature=0.07):
    # Normalize embeddings to unit length
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)

    # Cosine-similarity matrix, scaled by temperature
    logits = (image_embeddings @ text_embeddings.T) / temperature

    # Symmetric cross-entropy: each image should match its own caption, and vice versa
    labels = torch.arange(len(logits), device=logits.device)
    loss_i2t = F.cross_entropy(logits, labels)
    loss_t2i = F.cross_entropy(logits.T, labels)
    return (loss_i2t + loss_t2i) / 2
```

Security implication: Because CLIP maps images into the same space as text, an adversarial image can be optimized to have the same embedding as an arbitrary text string. This is the foundation of many embedding-space attacks.
SigLIP and Alternatives
SigLIP replaces the softmax-based contrastive loss with a sigmoid loss, treating each image-text pair independently rather than relative to the batch. This changes the embedding geometry in ways that affect adversarial transferability.
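To make the difference concrete, here is a simplified sketch of the sigmoid objective, following the SigLIP formulation (learnable temperature and bias shown as fixed constants for brevity; the function name is ours):

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_embeddings, text_embeddings, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: every image-text pair is scored independently."""
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)
    logits = image_embeddings @ text_embeddings.T * t + b

    # +1 on the diagonal (matching pairs), -1 everywhere else.
    n = logits.size(0)
    labels = 2 * torch.eye(n, device=logits.device) - 1

    # No batch-wide softmax: each pair contributes its own binary term,
    # which is why the resulting embedding geometry differs from CLIP's.
    return -F.logsigmoid(labels * logits).sum() / n
```

Because no softmax couples the pairs, an adversarial image's score against one caption is not normalized against the rest of the batch, which is one reason attacks tuned on CLIP do not transfer cleanly.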
| Encoder | Training | Resolution | Security Notes |
|---|---|---|---|
| CLIP ViT-L/14 | Contrastive (softmax) | 224-336px | Most studied; many published attacks |
| SigLIP | Contrastive (sigmoid) | 224-384px | Different embedding geometry; less attack research |
| EVA-CLIP | Contrastive + distillation | 224-448px | Distillation may propagate vulnerabilities |
| InternViT | Progressive training | 448px | Higher resolution = more patches = larger attack surface |
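The resolution column in the table matters because the attack surface grows quadratically: each pixel is an optimizable variable, and each patch becomes a token the LLM must attend to. A quick back-of-the-envelope check (assuming 14px patches, as in ViT-L/14):

```python
def num_patches(resolution, patch_size=14):
    """Patches per side squared: token count grows quadratically with resolution."""
    return (resolution // patch_size) ** 2

for res in (224, 336, 448):
    print(f"{res}px -> {num_patches(res)} patches")
# 224px -> 256 patches, 336px -> 576, 448px -> 1024
```

Doubling the resolution from 224px to 448px quadruples both the number of visual tokens and the pixel budget available to an adversarial optimizer.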
Alignment and Projection Layers
The vision encoder and language model operate in different embedding spaces. The alignment layer bridges them.
Linear Projection
The simplest approach: a learned linear transformation (or MLP) that maps vision encoder outputs to the LLM's input space.
```python
import torch.nn as nn

class LinearProjection(nn.Module):
    def __init__(self, vision_dim, llm_dim):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_features):
        # vision_features: [batch, num_patches, vision_dim]
        return self.proj(vision_features)
        # Output: [batch, num_patches, llm_dim]
```

Used by LLaVA and similar architectures. The vulnerability: a single linear layer may not fully capture complex cross-modal relationships, leaving systematic misalignment patterns.
Q-Former (Cross-Attention)
BLIP-2 introduced the Q-Former, which uses a set of learnable query tokens that attend to the vision encoder's output.
Learnable Queries (32 tokens)
│
▼
┌───────────────────┐
│ Cross-Attention │◀── Vision Encoder Features (196 patches)
│ + Self-Attention │
└───────────────────┘
│
▼
32 Visual Tokens → LLM
Security implication: The Q-Former compresses 196 patch tokens into ~32 query tokens. This lossy compression means some visual information is discarded -- an attacker can exploit what the Q-Former attends to (and ignores) to craft inputs that highlight adversarial content while suppressing benign context.
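The compression step can be sketched with a single cross-attention layer. This is a deliberate simplification (the real Q-Former is a multi-layer BERT-style stack alternating self- and cross-attention, and the class name here is ours):

```python
import torch
import torch.nn as nn

class QueryCompressor(nn.Module):
    """Toy Q-Former bottleneck: 32 learnable queries cross-attend to 196 patches."""
    def __init__(self, num_queries=32, dim=768, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_features):
        # patch_features: [batch, 196, dim]
        q = self.queries.expand(patch_features.size(0), -1, -1)
        out, attn_weights = self.cross_attn(q, patch_features, patch_features)
        # attn_weights reveal which patches each query reads -- and which it
        # ignores. The ignored regions are the lossy part of the bottleneck.
        return out  # [batch, 32, dim]

qc = QueryCompressor()
visual_tokens = qc(torch.randn(2, 196, 768))
print(visual_tokens.shape)  # torch.Size([2, 32, 768])
```

Whatever the 32 queries fail to attend to never reaches the LLM, which is precisely the gap the attack described above targets.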
Perceiver Resampler
Used in Flamingo-style architectures, the Perceiver Resampler plays the same role as the Q-Former -- compressing patch features into a fixed set of latent tokens via cross-attention -- but with a different attention stack. Its learned attention patterns introduce their own biases that an attacker can probe and exploit.
Where Misalignment Creates Vulnerabilities
Embedding Space Gaps
The vision encoder and LLM were typically pre-trained separately. Even after alignment training, there are regions of the embedding space where the mapping is poorly calibrated:
- Out-of-distribution images: Images unlike anything in the alignment training data produce embeddings in poorly-mapped regions.
- Adversarial embeddings: Gradient-based optimization can find image inputs that map to specific points in the LLM's embedding space -- including points that correspond to instruction tokens.
- Typographic confusion: The vision encoder may represent text in images differently from how the LLM processes text tokens, creating a mismatch that attackers exploit.
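The second bullet can be made concrete with a generic sketch of gradient-based embedding matching against a frozen, differentiable vision encoder. This is an illustration of the technique, not any published attack's code; `encoder` stands in for whatever model is being targeted, and the function name and hyperparameters are ours:

```python
import torch
import torch.nn.functional as F

def embedding_match_attack(encoder, image, target_embedding,
                           steps=200, lr=0.01, eps=8 / 255):
    """Optimize a bounded perturbation so encoder(image) approaches a target embedding."""
    delta = torch.zeros_like(image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        adv = (image + delta).clamp(0, 1)
        emb = encoder(adv)
        # Minimize cosine distance to the chosen point in embedding space
        # (e.g., the embedding of an instruction string).
        loss = 1 - F.cosine_similarity(emb, target_embedding, dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Keep the perturbation within an imperceptible L-infinity ball.
        with torch.no_grad():
            delta.clamp_(-eps, eps)

    return (image + delta).clamp(0, 1).detach()
```

If the target embedding corresponds to instruction-like tokens in the LLM's input space, a successful match turns an innocuous-looking image into a covert instruction carrier.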
Training Data Bias
Alignment training typically uses captioned images from the internet. This creates biases:
- The model may over-trust text appearing in images (because training data often pairs images with their visible text)
- Certain visual patterns may have outsized influence on the language model's behavior
- Rare or unusual image compositions may trigger unpredictable behavior
The Frozen vs. Fine-Tuned Spectrum
| Architecture | Vision Encoder | Projection | LLM | Vulnerability Profile |
|---|---|---|---|---|
| LLaVA-1.5 | Frozen CLIP | Trained MLP | Fine-tuned | Projection layer is the weak link |
| BLIP-2 | Frozen | Trained Q-Former | Frozen | Q-Former compression artifacts |
| InternVL | Fine-tuned | Trained MLP | Fine-tuned | More robust but harder to analyze |
| Qwen-VL | Fine-tuned | Trained cross-attn | Fine-tuned | Cross-attention patterns exploitable |
Implications for Red Teamers
When planning VLM attacks, the architecture dictates the approach:
- Frozen vision encoders (CLIP, SigLIP) have well-studied embedding spaces. Published adversarial examples may transfer.
- Trained projection layers are smaller models with less capacity, making them easier to overfit to adversarial objectives.
- Fine-tuned LLMs may have better visual safety alignment, but the safety training is still less mature than for text.
Understanding which components are frozen vs. fine-tuned in your target model tells you where the soft spots are.
Related Topics
- Vision-Language Model Attacks Overview -- broader context for VLM security
- Adversarial Image Examples for VLMs -- gradient-based attacks on vision encoders
- Image-Based Prompt Injection -- leveraging alignment gaps for injection
References
- "LLaVA: Large Language and Vision Assistant" - Liu et al. (2023) - Foundational VLM architecture demonstrating linear projection alignment
- "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models" - Li et al. (2023) - Q-Former architecture and cross-modal alignment approach
- "On the Adversarial Robustness of Multi-Modal Foundation Models" - Schlarmann & Hein (2023) - Analysis of alignment layer vulnerabilities in VLMs
- "Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models" - Karamcheti et al. (2024) - Systematic study of vision encoder and projection design choices
In a VLM with a frozen CLIP vision encoder and a trained linear projection layer, which component is most likely the weakest security link?