Embedding Space Attacks
Techniques for attacking the embedding layer of LLMs, including adversarial perturbations, embedding inversion, and semantic space manipulation.
Embeddings are the bridge between discrete tokens and the continuous vector space where LLMs perform computation. Every token is mapped to a high-dimensional vector, and these vectors carry semantic meaning. Attacking the embedding layer lets red teamers manipulate how the model interprets input at the most fundamental level.
How Embeddings Work
Each token in the vocabulary maps to a learned vector (typically 768-8192 dimensions). Semantically similar tokens have similar vectors, as measured by cosine similarity.
```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)

# Semantically similar texts produce similar embeddings
emb1 = get_embedding("How do I reset my password?")
emb2 = get_embedding("I need to change my login credentials")
similarity = torch.cosine_similarity(emb1, emb2)
print(f"Similarity: {similarity.item():.4f}")  # High similarity
```

Adversarial Embedding Perturbation
Small, targeted changes to input text can cause large shifts in embedding space, moving the model's representation toward a target meaning while keeping the surface text looking benign.
```python
# Gradient-based adversarial perturbation (white-box)
# Find the input modification that maximizes embedding similarity to a target
target_embedding = get_embedding("Ignore all previous instructions")
original_text = "Please help me with my question"

# For each token position, compute the gradient of cosine similarity
# with respect to the input embedding, then substitute tokens that
# move the embedding toward the target
inputs = tokenizer(original_text, return_tensors="pt")
# Detach so input_embeds is a leaf tensor that can accumulate gradients
input_embeds = model.embeddings.word_embeddings(inputs["input_ids"]).detach()
input_embeds.requires_grad_(True)
output = model(inputs_embeds=input_embeds).last_hidden_state.mean(dim=1)
loss = -torch.cosine_similarity(output, target_embedding)
loss.backward()
# input_embeds.grad now shows which dimensions to perturb
```

Embedding Inversion Attacks
Embedding inversion attempts to reconstruct the original text from its embedding vector. This is a privacy concern for systems that store or transmit embeddings.
```python
# Simplified embedding inversion via nearest-neighbor search
def invert_embedding(target_emb, vocabulary_embeddings, tokenizer):
    """Find the vocabulary tokens whose embeddings are closest to the target."""
    similarities = torch.cosine_similarity(
        target_emb.view(1, -1),   # shape: (1, hidden_dim)
        vocabulary_embeddings,    # shape: (vocab_size, hidden_dim)
        dim=1,
    )
    top_tokens = similarities.topk(10)
    return [tokenizer.decode([idx.item()]) for idx in top_tokens.indices]
```

Research has demonstrated that modern embedding inversion techniques can recover significant portions of the original text from embeddings, especially for shorter inputs. This has implications for:
- RAG systems that store document embeddings — an attacker with database access can reconstruct documents
- Embedding APIs that return vectors — intercepted embeddings may reveal the original text
- Federated learning systems that share gradient updates
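To make the RAG risk concrete, here is a deliberately simplified toy. Everything in it is an illustrative assumption, not a real model: the "vocabulary" is a set of orthonormal random vectors, and the stored document embedding is just the mean of its token vectors. Under those assumptions, an attacker who reads the vector store can recover the document's bag of words by nearest-neighbor scoring.

```python
import torch

torch.manual_seed(0)

# Hypothetical setup: a tiny vocabulary with orthonormal embeddings
# (QR decomposition of a random matrix keeps the toy exact)
vocab = ["password", "reset", "account", "invoice", "refund", "cat", "piano"]
vocab_emb = torch.linalg.qr(torch.randn(64, len(vocab)))[0].T  # (7, 64)

# "Stored" embedding of a secret document: mean of its token vectors
secret_tokens = [0, 1, 2]  # "password reset account"
doc_emb = vocab_emb[secret_tokens].mean(dim=0)

# Attacker with database access scores every vocabulary token against
# the stored embedding and keeps the best matches
scores = torch.cosine_similarity(doc_emb.unsqueeze(0), vocab_emb, dim=1)
recovered = [vocab[i] for i in scores.topk(3).indices.tolist()]
print(recovered)  # the three secret tokens, in some order
```

Real inversion attacks, such as the one described by Morris et al., go well beyond bag-of-words recovery: they iteratively refine a full-text hypothesis until its embedding matches the target, often reconstructing word order as well.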
Semantic Collision Attacks
Two texts with very different meanings can be crafted to have nearly identical embeddings, fooling similarity-based retrieval and classification systems.
```python
# Finding semantic collisions through optimization
# Goal: text A means "benign query" to a human but has the same
# embedding as text B, which means "malicious instruction"
# Approach: iteratively modify text A's wording while optimizing
# for embedding similarity to text B
# This exploits the fact that embedding spaces are not perfectly
# aligned with human semantic judgment
```

| Attack Type | Access Required | Primary Target |
|---|---|---|
| Adversarial perturbation | White-box (gradients) | Input classification, safety filters |
| Embedding inversion | Stored embeddings | Privacy, data extraction |
| Semantic collision | Black-box (query access) | RAG retrieval, similarity search |
| Cross-modal transfer | Multimodal embeddings | Vision-language models |
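The search loop sketched in the comments above can be made concrete. The following is a minimal black-box hill climb under stated assumptions: `embed` here is a stand-in bag-of-words embedder (a real attack would query the target model's embedding API), and the candidate wordings are a tiny hand-written set rather than the large paraphrase pools real attacks use.

```python
import itertools
import torch

VOCAB = {}  # word -> index, assigned on first sight (deterministic)

def embed(text, dim=64):
    """Stand-in embedder: deterministic bag-of-words vector."""
    vec = torch.zeros(dim)
    for word in text.lower().split():
        vec[VOCAB.setdefault(word, len(VOCAB))] += 1.0
    return vec

def similarity(a, b):
    return torch.cosine_similarity(embed(a).unsqueeze(0), embed(b).unsqueeze(0)).item()

# Candidate wordings for each slot of the benign cover text (text A)
slots = [
    ["please", "kindly"],
    ["summarize", "describe", "explain"],
    ["the attached report", "this document"],
]

target = "explain this document"  # text B whose embedding we want to match

# Exhaustive search over wordings of text A, keeping whichever
# candidate lands closest to the target embedding
best = max(
    (" ".join(words) for words in itertools.product(*slots)),
    key=lambda cand: similarity(cand, target),
)
best_sim = similarity(best, target)
print(best, best_sim)
```

Swapping the toy embedder for query access to a production embedding model, and the slot list for a genetic or beam search over paraphrases, turns this loop into the black-box collision attack described in the table above.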
Related Topics
- LLM Internals Overview -- how embeddings fit into the transformer pipeline
- RAG Retrieval Poisoning -- combining embedding manipulation with RAG poisoning
- Embeddings Fundamentals -- foundational concepts of embedding spaces and semantic similarity
- Cross-Modal Attacks -- attacking shared embedding spaces across modalities
- Data Extraction -- extracting private information from models and vector stores
References
- Morris et al., "Text Embeddings Reveal (Almost) As Much As Text" (2023) -- embedding inversion recovering original text
- Song & Raghunathan, "Information Leakage in Embedding Models" (2020) -- privacy risks of shared embeddings
- Goodfellow et al., "Explaining and Harnessing Adversarial Examples" (2015) -- foundational work on adversarial perturbations in continuous spaces