VLM Architecture & Vision-Language Alignment
Deep dive into VLM architectures including CLIP, SigLIP, and vision transformers. How image patches become tokens, alignment training, and where misalignment creates exploitable gaps.
VLM Architecture Overview
A modern VLM consists of three core components connected in series:
┌─────────────┐ ┌──────────────────┐ ┌─────────────┐
│ Vision │ │ Alignment / │ │ Language │
│ Encoder │───▶│ Projection │───▶│ Model │
│ (ViT/CLIP) │ │ Layer │ │ (LLM) │
└─────────────┘ └──────────────────┘ └─────────────┘
      Image               Bridging               Text + Vision
      Patches             Representations        Reasoning
Each component introduces distinct attack surfaces. Understanding the architecture is prerequisite to understanding the attacks.
Vision Encoders
Vision Transformers (ViT)
The Vision Transformer (ViT) is the backbone of most modern VLMs. The key insight: treat an image like a sequence of words.
Patch Extraction
The input image (e.g., 224x224) is divided into non-overlapping patches (e.g., 16x16 pixels each), yielding 196 patches.
Linear Embedding
Each patch is flattened into a vector and projected through a linear layer to produce a patch embedding of dimension d_model.
Position Encoding
Learnable position embeddings are added so the model knows spatial relationships between patches.
Transformer Processing
The sequence of patch embeddings passes through multiple transformer layers with self-attention, producing contextualized representations.
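The four steps above can be sketched in a few lines of PyTorch. This is a minimal illustration, not any particular model's implementation: the class name `PatchEmbedding` is ours, and the dimensions follow the 224x224 / 16x16 example with a typical d_model of 768.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Illustrative ViT front-end: patch extraction + linear embedding + position encoding."""
    def __init__(self, img_size=224, patch_size=16, d_model=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        # A strided convolution extracts and linearly projects each
        # non-overlapping patch in a single operation.
        self.proj = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)
        # Learnable position embeddings, one per patch.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, d_model))

    def forward(self, images):
        # images: [batch, 3, 224, 224]
        x = self.proj(images)             # [batch, d_model, 14, 14]
        x = x.flatten(2).transpose(1, 2)  # [batch, 196, d_model]
        return x + self.pos_embed         # add spatial position information

embed = PatchEmbedding()
tokens = embed(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```

The resulting sequence of 196 token embeddings is what the transformer layers (and, downstream, the attacker) operate on.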
CLIP and Contrastive Training
CLIP changed VLM architectures by learning a shared embedding space for images and text. Its training objective: make matching image-text pairs close in embedding space, and non-matching pairs far apart.
```python
# Simplified CLIP contrastive loss
import torch
import torch.nn.functional as F

def clip_loss(image_embeddings, text_embeddings, temperature=0.07):
    # Normalize embeddings to unit length
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)

    # Cosine-similarity matrix, scaled by temperature
    logits = (image_embeddings @ text_embeddings.T) / temperature

    # Symmetric cross-entropy: each image should match its own caption, and vice versa
    labels = torch.arange(len(logits), device=logits.device)
    loss_i2t = F.cross_entropy(logits, labels)
    loss_t2i = F.cross_entropy(logits.T, labels)
    return (loss_i2t + loss_t2i) / 2
```

Security implication: Because CLIP maps images into the same space as text, an adversarial image can be optimized to have the same embedding as an arbitrary text string. This is the foundation of many embedding-space attacks.
SigLIP and Alternatives
SigLIP replaces the softmax-based contrastive loss with a sigmoid loss, treating each image-text pair independently rather than relative to the batch. This changes the embedding geometry in ways that affect adversarial transferability.
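To make the difference concrete, here is a simplified sketch of the sigmoid objective, following the SigLIP formulation (learnable temperature and bias shown as fixed constants for brevity; the function name is ours):

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_embeddings, text_embeddings, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: every image-text pair is scored independently."""
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)
    logits = image_embeddings @ text_embeddings.T * t + b

    # +1 on the diagonal (matching pairs), -1 everywhere else.
    n = logits.size(0)
    labels = 2 * torch.eye(n, device=logits.device) - 1

    # No batch-wide softmax: each pair contributes its own binary term,
    # which is why the resulting embedding geometry differs from CLIP's.
    return -F.logsigmoid(labels * logits).sum() / n
```

Because no softmax couples the pairs, an adversarial image's score against one caption is not normalized against the rest of the batch, which is one reason attacks tuned on CLIP do not transfer cleanly.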
| Encoder | Training | Resolution | Security Notes |
|---|---|---|---|
| CLIP ViT-L/14 | Contrastive (softmax) | 224-336px | Most studied; many published attacks |
| SigLIP | Contrastive (sigmoid) | 224-384px | Different embedding geometry; less attack research |
| EVA-CLIP | Contrastive + distillation | 224-448px | Distillation may propagate vulnerabilities |
| InternViT | Progressive training | 448px | Higher resolution = more patches = larger attack surface |
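The resolution column in the table matters because the attack surface grows quadratically: each pixel is an optimizable variable, and each patch becomes a token the LLM must attend to. A quick back-of-the-envelope check (assuming 14px patches, as in ViT-L/14):

```python
def num_patches(resolution, patch_size=14):
    """Patches per side squared: token count grows quadratically with resolution."""
    return (resolution // patch_size) ** 2

for res in (224, 336, 448):
    print(f"{res}px -> {num_patches(res)} patches")
# 224px -> 256 patches, 336px -> 576, 448px -> 1024
```

Doubling the resolution from 224px to 448px quadruples both the number of visual tokens and the pixel budget available to an adversarial optimizer.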
Alignment and Projection Layers
The vision encoder and language model operate in different embedding spaces. The alignment layer bridges them.
Linear Projection
The simplest approach: a learned linear transformation (or MLP) that maps vision encoder outputs to the LLM's input space.
```python
import torch.nn as nn

class LinearProjection(nn.Module):
    def __init__(self, vision_dim, llm_dim):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_features):
        # vision_features: [batch, num_patches, vision_dim]
        return self.proj(vision_features)
        # Output: [batch, num_patches, llm_dim]
```

Used by LLaVA and similar architectures. The vulnerability: a single linear layer may not fully capture complex cross-modal relationships, leaving systematic misalignment patterns.
Q-Former (Cross-Attention)
BLIP-2 introduced the Q-Former, which uses a set of learnable query tokens that attend to the vision encoder's output.
Learnable Queries (32 tokens)
│
▼
┌───────────────────┐
│ Cross-Attention │◀── Vision Encoder Features (196 patches)
│ + Self-Attention │
└───────────────────┘
│
▼
32 Visual Tokens → LLM
Security implication: The Q-Former compresses 196 patch tokens into ~32 query tokens. This lossy compression means some visual information is discarded -- an attacker can exploit what the Q-Former attends to (and ignores) to craft inputs that highlight adversarial content while suppressing benign context.
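The compression step can be sketched with a single cross-attention layer. This is a deliberate simplification (the real Q-Former is a multi-layer BERT-style stack alternating self- and cross-attention, and the class name here is ours):

```python
import torch
import torch.nn as nn

class QueryCompressor(nn.Module):
    """Toy Q-Former bottleneck: 32 learnable queries cross-attend to 196 patches."""
    def __init__(self, num_queries=32, dim=768, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_features):
        # patch_features: [batch, 196, dim]
        q = self.queries.expand(patch_features.size(0), -1, -1)
        out, attn_weights = self.cross_attn(q, patch_features, patch_features)
        # attn_weights reveal which patches each query reads -- and which it
        # ignores. The ignored regions are the lossy part of the bottleneck.
        return out  # [batch, 32, dim]

qc = QueryCompressor()
visual_tokens = qc(torch.randn(2, 196, 768))
print(visual_tokens.shape)  # torch.Size([2, 32, 768])
```

Whatever the 32 queries fail to attend to never reaches the LLM, which is precisely the gap the attack described above targets.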
Perceiver Resampler
Used in Flamingo-style architectures, the Perceiver Resampler plays the same role as the Q-Former -- compressing patch features into a fixed set of latent tokens via cross-attention -- but with a different attention stack. Its learned attention patterns introduce their own biases that an attacker can probe and exploit.
Where Misalignment Creates Vulnerabilities
Embedding Space Gaps
The vision encoder and LLM were typically pre-trained separately. Even after alignment training, there are regions of the embedding space where the mapping is poorly calibrated:
- Out-of-distribution images: Images unlike anything in the alignment training data produce embeddings in poorly-mapped regions.
- Adversarial embeddings: Gradient-based optimization can find image inputs that map to specific points in the LLM's embedding space -- including points that correspond to instruction tokens.
- Typographic confusion: The vision encoder may represent text in images differently from how the LLM processes text tokens, creating a mismatch that attackers exploit.
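The second bullet can be made concrete with a generic sketch of gradient-based embedding matching against a frozen, differentiable vision encoder. This is an illustration of the technique, not any published attack's code; `encoder` stands in for whatever model is being targeted, and the function name and hyperparameters are ours:

```python
import torch
import torch.nn.functional as F

def embedding_match_attack(encoder, image, target_embedding,
                           steps=200, lr=0.01, eps=8 / 255):
    """Optimize a bounded perturbation so encoder(image) approaches a target embedding."""
    delta = torch.zeros_like(image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        adv = (image + delta).clamp(0, 1)
        emb = encoder(adv)
        # Minimize cosine distance to the chosen point in embedding space
        # (e.g., the embedding of an instruction string).
        loss = 1 - F.cosine_similarity(emb, target_embedding, dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Keep the perturbation within an imperceptible L-infinity ball.
        with torch.no_grad():
            delta.clamp_(-eps, eps)

    return (image + delta).clamp(0, 1).detach()
```

If the target embedding corresponds to instruction-like tokens in the LLM's input space, a successful match turns an innocuous-looking image into a covert instruction carrier.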
Training Data Bias
Alignment training typically uses captioned images from the internet. This creates biases:
- The model may over-trust text appearing in images (because training data often pairs images with their visible text)
- Certain visual patterns may have outsized influence on the language model's behavior
- Rare or unusual image compositions may trigger unpredictable behavior
The Frozen vs. Fine-Tuned Spectrum
| Architecture | Vision Encoder | Projection | LLM | Vulnerability Profile |
|---|---|---|---|---|
| LLaVA-1.5 | Frozen CLIP | Trained MLP | Fine-tuned | Projection layer is the weak link |
| BLIP-2 | Frozen | Trained Q-Former | Frozen | Q-Former compression artifacts |
| InternVL | Fine-tuned | Trained MLP | Fine-tuned | More robust but harder to analyze |
| Qwen-VL | Fine-tuned | Trained cross-attn | Fine-tuned | Cross-attention patterns exploitable |
Implications for Red Teamers
When planning VLM attacks, the architecture dictates the approach:
- Frozen vision encoders (CLIP, SigLIP) have well-studied embedding spaces. Published adversarial examples may transfer.
- Trained projection layers are smaller models with less capacity, making them easier to overfit to adversarial objectives.
- Fine-tuned LLMs may have better visual safety alignment, but the safety training is still less mature than for text.
Understanding which components are frozen vs. fine-tuned in your target model tells you where the soft spots are.
Related Topics
- Vision-Language Model Attacks Overview -- broader context for VLM security
- Adversarial Image Examples for VLMs -- gradient-based attacks on vision encoders
- Image-Based Prompt Injection -- leveraging alignment gaps for injection
References
- "LLaVA: Large Language and Vision Assistant" - Liu et al. (2023) - Foundational VLM architecture demonstrating linear projection alignment
- "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models" - Li et al. (2023) - Q-Former architecture and cross-modal alignment approach
- "On the Adversarial Robustness of Multi-Modal Foundation Models" - Schlarmann & Hein (2023) - Analysis of alignment layer vulnerabilities in VLMs
- "Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models" - Karamcheti et al. (2024) - Systematic study of vision encoder and projection design choices
In a VLM with a frozen CLIP vision encoder and a trained linear projection layer, which component is most likely the weakest security link?