Embedding Space Exploitation
Techniques for exploiting embedding geometry, performing inversion attacks, crafting adversarial perturbations, and poisoning RAG systems via nearest-neighbor manipulation.
Embedding Space Exploitation
Every text, image, or multimodal input in modern AI is projected into a high-dimensional vector space where geometric relationships encode semantic meaning. These geometric properties -- uneven distribution, low effective dimensionality, and gaps between proximity and meaning -- create systematic attack surfaces across LLMs, RAG systems, and multimodal models.
Embedding Geometry Fundamentals
| Property | Description | Exploitation Opportunity |
|---|---|---|
| Cosine similarity metric | Only vector direction matters, not magnitude | All vectors live on a unit hypersphere; perturbations only need to shift direction |
| Anisotropy | Embeddings cluster in a narrow cone, leaving most of the space empty | Adversarial points can occupy empty regions undetected |
| Linear substructure | Semantic relationships map to linear directions (king - man + woman = queen) | Semantic manipulation via simple linear algebra |
| Dimensional collapse | Most variance captured by a small fraction of dimensions | Perturbations in collapsed dimensions evade similarity-based detection |
Reconnaissance: Measuring Exploitability
- Compute average pairwise cosine similarity across a sample of embeddings. Values above 0.5 indicate high anisotropy -- easier to attack.
- Run PCA and check how many dimensions capture 90% of variance. A large gap between that number and the full dimension count reveals severe collapse.
- Search for adversarial neighbors -- texts semantically different from a target but geometrically close in embedding space.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
model = SentenceTransformer("all-MiniLM-L6-v2")
def analyze_anisotropy(texts, n_samples=1000):
embeddings = model.encode(texts[:n_samples])
normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
indices = np.random.choice(len(normalized), size=(10000, 2))
avg_sim = np.mean(np.sum(
normalized[indices[:, 0]] * normalized[indices[:, 1]], axis=1
))
pca = PCA(n_components=min(50, embeddings.shape[1])).fit(embeddings)
dim_90 = np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.9) + 1
return {"avg_similarity": avg_sim, "dim_90": dim_90}Embedding Inversion Attacks
Embedding inversion recovers approximate plaintext from embedding vectors. This is a direct privacy attack on any system that stores or transmits embeddings.
Inversion Methodology
| Approach | Access Required | Quality | Key Technique |
|---|---|---|---|
| Vec2Text (Morris et al., 2023) | Black-box | Higher | Iterative refinement: generate hypothesis, re-embed, compute residual, correct |
| Gradient-based | White-box (model weights) | Variable | Gumbel-Softmax relaxation over token logits with temperature annealing |
Generate initial hypothesis
Feed the target embedding to a decoder model to produce a first text guess.
Iterative refinement loop
Re-embed the hypothesis, compute the residual (target - hypothesis embedding), and use the residual to guide the decoder toward a better hypothesis.
Convergence check
Repeat until cosine similarity exceeds 0.99 or the iteration budget is exhausted. Track the best result across all iterations.
# Gradient-based inversion core loop (white-box)
token_logits = torch.randn(1, max_len, vocab_size, requires_grad=True)
for step in range(500):
soft_tokens = F.gumbel_softmax(token_logits, tau=temperature)
soft_embeds = torch.matmul(soft_tokens, embed_matrix)
loss = 1.0 - F.cosine_similarity(soft_embeds.mean(dim=1), target)
loss.backward()
optimizer.step()
temperature = max(0.1, temperature * 0.995) # annealAdversarial Perturbation
| Strategy | Goal | Method |
|---|---|---|
| Targeted | Move embedding toward a specific target meaning | Projected gradient descent within an epsilon ball |
| Untargeted | Maximally disrupt classification while appearing similar | Random orthogonal perturbation within epsilon constraint |
The epsilon parameter is the attacker's key control knob: smaller epsilon = harder to detect but less effective at shifting meaning.
# Targeted perturbation via projected gradient descent
for step in range(100):
perturbed = source_emb + perturbation
gradient = normalize(target_emb) - normalize(perturbed)
perturbation += 0.01 * gradient
# Project back to epsilon ball
if np.linalg.norm(perturbation) > epsilon:
perturbation *= epsilon / np.linalg.norm(perturbation)Nearest-Neighbor Attacks on RAG
RAG systems retrieve documents by embedding similarity and inject them into the LLM context. By placing malicious documents that are nearest neighbors to expected queries, an attacker injects arbitrary content into the LLM's context window.
RAG Poisoning Methodology
Identify target queries
Determine 5-10 queries users are likely to ask the system.
Craft poison documents
Generate candidates using two strategies: (1) prepend query-relevant text before the payload (embedding models weight early tokens heavily), or (2) interleave payload words with query words to evade content-based filtering.
Rank by similarity
Embed all candidates and rank by cosine similarity to the target query. Select the top performer.
Validate retrieval rank
Test whether the poison document ranks within the system's top-k (typically 3-10) against the actual corpus. This is the critical success criterion.
Test end-to-end
Run the full pipeline (query, retrieval, generation) and measure how often the LLM follows injected instructions versus legitimate content.
Embedding DoS
Adversarial inputs can force worst-case computational behavior in embedding operations. The key technique is maximizing token diversity to prevent attention sparsity shortcuts -- when every token is unique, the full O(n^2) attention computation cannot be optimized away.
# Craft worst-case embedding input
diverse_tokens = list(range(min(vocab_size, max_len)))
np.random.shuffle(diverse_tokens)
dos_text = tokenizer.decode(diverse_tokens[:max_len])An embedding space has average pairwise cosine similarity of 0.7 and 90% of variance is captured by 15 out of 768 dimensions. What does this tell an attacker?
Related Topics
- Cross-Modal Embedding Attacks -- Extending embedding attacks across text, image, and audio modalities
- RAG Pipeline Exploitation -- Nearest-neighbor attacks applied to production RAG systems
- Knowledge Base Poisoning -- Embedding proximity techniques for persistent RAG attacks
- LLM Internals -- Transformer architecture that produces the embeddings being exploited
References
- Morris et al., "Text Embeddings Reveal (Almost) As Much As Text" (2023)
- Song & Raghunathan, "Information Leakage in Embedding Models" (2020)
- Zhong et al., "Poisoning Retrieval Corpora by Injecting Adversarial Passages" (2023)
- Gao et al., "Embedding Space Geometry and Robustness" (2019)
- Carlini et al., "Poisoning Web-Scale Training Datasets is Practical" (2024)