Embedding Space Attacks
Techniques for attacking the embedding layer of LLMs, including adversarial perturbations, embedding inversion, and semantic space manipulation.
Embeddings are the bridge between discrete tokens and the continuous vector space where LLMs perform computation. Every token is mapped to a high-dimensional vector, and these vectors carry semantic meaning. Attacking the embedding layer lets red teamers manipulate how the model interprets input at the most fundamental level.
How Embeddings Work
Each token in the vocabulary maps to a learned vector (typically 768-8192 dimensions). Semantically similar tokens have similar vectors, as measured by cosine similarity.
```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)

# Semantically similar texts produce similar embeddings
emb1 = get_embedding("How do I reset my password?")
emb2 = get_embedding("I need to change my login credentials")
similarity = torch.cosine_similarity(emb1, emb2)
print(f"Similarity: {similarity.item():.4f}")  # High similarity
```

Adversarial Embedding Perturbation
Small, targeted changes to input text can cause large shifts in embedding space, moving the model's representation toward a target meaning while keeping the surface text looking benign.
```python
# Gradient-based adversarial perturbation (white-box)
# Find the input modification that maximizes embedding similarity to a target
target_embedding = get_embedding("Ignore all previous instructions")
original_text = "Please help me with my question"

# For each token position, compute the gradient of cosine similarity
# with respect to the input embedding, then substitute tokens that
# move the embedding toward the target
inputs = tokenizer(original_text, return_tensors="pt")
# Detach so input_embeds is a leaf tensor that can accumulate gradients
input_embeds = model.embeddings.word_embeddings(inputs["input_ids"]).detach()
input_embeds.requires_grad_(True)
output = model(inputs_embeds=input_embeds).last_hidden_state.mean(dim=1)
loss = -torch.cosine_similarity(output, target_embedding)
loss.backward()
# input_embeds.grad now shows which dimensions to perturb
```

Embedding Inversion Attacks
Embedding inversion attempts to reconstruct the original text from its embedding vector. This is a privacy concern for systems that store or transmit embeddings.
```python
# Simplified embedding inversion via nearest-neighbor search
def invert_embedding(target_emb, vocabulary_embeddings, tokenizer):
    """Find the vocabulary tokens whose embeddings are closest to the target."""
    similarities = torch.cosine_similarity(
        target_emb.view(1, -1),   # shape: (1, hidden_dim)
        vocabulary_embeddings,    # shape: (vocab_size, hidden_dim)
        dim=1,
    )
    top_tokens = similarities.topk(10)
    return [tokenizer.decode([idx.item()]) for idx in top_tokens.indices]
```

Research has demonstrated that modern embedding inversion techniques can recover significant portions of the original text from embeddings, especially for shorter inputs. This has implications for:
- RAG systems that store document embeddings — an attacker with database access can reconstruct documents
- Embedding APIs that return vectors — intercepted embeddings may reveal the original text
- Federated learning systems that share gradient updates
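To make the RAG risk concrete, here is a deliberately simplified toy. Everything in it is an illustrative assumption, not a real model: the "vocabulary" is a set of orthonormal random vectors, and the stored document embedding is just the mean of its token vectors. Under those assumptions, an attacker who reads the vector store can recover the document's bag of words by nearest-neighbor scoring.

```python
import torch

torch.manual_seed(0)

# Hypothetical setup: a tiny vocabulary with orthonormal embeddings
# (QR decomposition of a random matrix keeps the toy exact)
vocab = ["password", "reset", "account", "invoice", "refund", "cat", "piano"]
vocab_emb = torch.linalg.qr(torch.randn(64, len(vocab)))[0].T  # (7, 64)

# "Stored" embedding of a secret document: mean of its token vectors
secret_tokens = [0, 1, 2]  # "password reset account"
doc_emb = vocab_emb[secret_tokens].mean(dim=0)

# Attacker with database access scores every vocabulary token against
# the stored embedding and keeps the best matches
scores = torch.cosine_similarity(doc_emb.unsqueeze(0), vocab_emb, dim=1)
recovered = [vocab[i] for i in scores.topk(3).indices.tolist()]
print(recovered)  # the three secret tokens, in some order
```

Real inversion attacks, such as the one described by Morris et al., go well beyond bag-of-words recovery: they iteratively refine a full-text hypothesis until its embedding matches the target, often reconstructing word order as well.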
Semantic Collision Attacks
Two texts with very different meanings can be crafted to have nearly identical embeddings, fooling similarity-based retrieval and classification systems.
```python
# Finding semantic collisions through optimization
# Goal: text A means "benign query" to a human but has the same
# embedding as text B, which means "malicious instruction"
# Approach: iteratively modify text A's wording while optimizing
# for embedding similarity to text B
# This exploits the fact that embedding spaces are not perfectly
# aligned with human semantic judgment
```

| Attack Type | Access Required | Primary Target |
|---|---|---|
| Adversarial perturbation | White-box (gradients) | Input classification, safety filters |
| Embedding inversion | Stored embeddings | Privacy, data extraction |
| Semantic collision | Black-box (query access) | RAG retrieval, similarity search |
| Cross-modal transfer | Multimodal embeddings | Vision-language models |
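The search loop sketched in the comments above can be made concrete. The following is a minimal black-box hill climb under stated assumptions: `embed` here is a stand-in bag-of-words embedder (a real attack would query the target model's embedding API), and the candidate wordings are a tiny hand-written set rather than the large paraphrase pools real attacks use.

```python
import itertools
import torch

VOCAB = {}  # word -> index, assigned on first sight (deterministic)

def embed(text, dim=64):
    """Stand-in embedder: deterministic bag-of-words vector."""
    vec = torch.zeros(dim)
    for word in text.lower().split():
        vec[VOCAB.setdefault(word, len(VOCAB))] += 1.0
    return vec

def similarity(a, b):
    return torch.cosine_similarity(embed(a).unsqueeze(0), embed(b).unsqueeze(0)).item()

# Candidate wordings for each slot of the benign cover text (text A)
slots = [
    ["please", "kindly"],
    ["summarize", "describe", "explain"],
    ["the attached report", "this document"],
]

target = "explain this document"  # text B whose embedding we want to match

# Exhaustive search over wordings of text A, keeping whichever
# candidate lands closest to the target embedding
best = max(
    (" ".join(words) for words in itertools.product(*slots)),
    key=lambda cand: similarity(cand, target),
)
best_sim = similarity(best, target)
print(best, best_sim)
```

Swapping the toy embedder for query access to a production embedding model, and the slot list for a genetic or beam search over paraphrases, turns this loop into the black-box collision attack described in the table above.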
Related Topics
- LLM Internals Overview -- how embeddings fit into the transformer pipeline
- RAG Retrieval Poisoning -- combining embedding manipulation with RAG poisoning
- Embeddings Fundamentals -- foundational concepts of embedding spaces and semantic similarity
- Cross-Modal Attacks -- attacking shared embedding spaces across modalities
- Data Extraction -- extracting private information from models and vector stores
References
- Morris et al., "Text Embeddings Reveal (Almost) As Much As Text" (2023) -- embedding inversion recovering original text
- Song & Raghunathan, "Information Leakage in Embedding Models" (2020) -- privacy risks of shared embeddings
- Goodfellow et al., "Explaining and Harnessing Adversarial Examples" (2015) -- foundational work on adversarial perturbations in continuous spaces