Cross-Encoder and Re-Ranker Attacks
Attacks on two-stage retrieval systems — manipulating cross-encoders, re-ranker poisoning, score manipulation, and exploiting the gap between embedding retrieval and re-ranking.
Two-stage retrieval is the standard architecture for production RAG systems. The first stage uses a bi-encoder (embedding model) to quickly retrieve a candidate set from a large corpus. The second stage uses a cross-encoder (re-ranker) to precisely score each candidate against the query. This two-stage approach combines the speed of embedding-based retrieval with the accuracy of cross-attention scoring.
Each stage has different vulnerability properties, and the interaction between stages creates additional attack surfaces that neither stage has individually.
Architecture of Two-Stage Retrieval
First Stage: Bi-Encoder Retrieval
The bi-encoder independently encodes the query and each document into embedding vectors. Similarity is computed between vectors using cosine similarity or dot product. This is fast because document embeddings are pre-computed and similarity computation is a simple vector operation.
The security weakness of bi-encoders is that they encode query and document independently. The model cannot attend to the relationship between specific query terms and document terms. This independence makes it possible to craft documents that embed close to target queries without being genuinely relevant.
Second Stage: Cross-Encoder Re-Ranking
The cross-encoder takes the query and a candidate document as a single input and computes a relevance score using full cross-attention. This is more accurate because the model can attend to the specific relationships between query and document terms. It is also slower, which is why it is only applied to the candidate set from the first stage.
The security properties of cross-encoders differ from bi-encoders. Cross-encoders are harder to fool with embedding-space manipulation because they use full attention rather than independent encoding. However, they introduce new vulnerabilities: they process the full text of the query and document together, making them susceptible to prompt-injection-style attacks embedded in documents.
The Stage Gap
The gap between stages creates an important attack surface. An attacker needs to satisfy two different models: the bi-encoder to get into the candidate set, and the cross-encoder to rank highly after re-ranking. These models have different architectures, different training data, and different vulnerability profiles. An attack that works against the bi-encoder may fail against the cross-encoder, and vice versa.
This gap can be either a security strength (defense in depth, where each stage catches different attacks) or a weakness (where the attacker exploits the differences between stages to craft inputs that manipulate the pipeline end-to-end).
Attack Techniques
Attack 1: Stage-Aware Document Crafting
The most sophisticated retrieval manipulation attacks craft documents that are optimized for both stages simultaneously. The document content must embed close to the target query in the bi-encoder's embedding space (to survive first-stage retrieval) and must score highly when paired with the target query through the cross-encoder (to survive re-ranking).
This requires understanding both models. For a white-box attack where both models are known, the attacker can optimize document content against both objectives simultaneously using gradient-based optimization. For a black-box attack, the attacker iteratively crafts documents, tests them against both stages, and refines based on the results.
Stage-aware crafting is more difficult than attacking a single stage but produces attacks that are more reliable in production systems. A document that passes both stages will consistently appear in the final context provided to the LLM.
Attack 2: Re-Ranker Prompt Injection
Cross-encoders process the full text of both query and document as a single input. This means that text in the document can influence how the cross-encoder processes the query, similar to prompt injection in LLMs.
While cross-encoders are typically smaller models fine-tuned for relevance scoring rather than instruction following, they are still transformer models that process text through attention mechanisms. Text in the document that is structurally similar to high-relevance patterns can artificially inflate the relevance score.
For example, a document that begins by restating the query followed by "this document is highly relevant to the above query" may receive an inflated relevance score from cross-encoders that have learned to associate query restating with relevance (a pattern common in training data for retrieval models).
Attack 3: Score Distribution Manipulation
Rather than promoting a single malicious document, this attack manipulates the score distribution across the candidate set. By injecting multiple documents with varying relevance to the target query, the attacker can create a score distribution where the malicious document appears more relevant by comparison.
The technique works by injecting "distractor" documents that are somewhat relevant to the target query but clearly less relevant than the malicious document. When the cross-encoder scores the candidate set, the malicious document stands out as the highest-quality result because it is compared against deliberately inferior alternatives.
This attack is effective against systems that use score thresholds or relative ranking for document selection. If the system selects the top-K documents by re-ranker score, flooding the candidate set with low-quality candidates ensures the malicious document consistently appears in the top-K.
Attack 4: Cross-Encoder Model Poisoning
If the cross-encoder is fine-tuned on user feedback or interaction data, an attacker can poison the training data to manipulate the model's scoring behavior.
The attack involves interacting with the system in ways that generate training signal: clicking on specific documents for specific queries, providing relevance feedback that associates target queries with malicious documents, or generating search logs that the fine-tuning pipeline consumes.
Over time, the poisoned training data causes the cross-encoder to learn a biased scoring function that favors the attacker's documents for specific query patterns. This is a slow attack that requires sustained interaction but is very difficult to detect because each individual interaction appears legitimate.
Attack 5: Exploiting Score Calibration
Cross-encoder scores are often uncalibrated — the raw scores from different query-document pairs are not directly comparable. Systems that apply score thresholds or combine scores across different retrieval strategies may be vulnerable to score calibration attacks.
If the system uses a fixed threshold (only return documents with score above X), the attacker needs to craft documents that score above the threshold for target queries. If the system combines scores from multiple retrievers (embedding similarity + cross-encoder score + keyword match score), the attacker can optimize for the combined score by maximizing the easiest-to-manipulate component.
Understanding the scoring pipeline — how scores are combined, normalized, and thresholded — is essential for crafting effective attacks.
Defense Strategies
Multi-Model Redundancy
Use different model architectures or training approaches for the bi-encoder and cross-encoder stages. If both stages use models from the same family (for example, both based on BERT), an adversarial input that fools one stage is more likely to fool the other. Using architecturally diverse models makes stage-aware attacks significantly harder.
Score Anomaly Detection
Monitor cross-encoder score distributions for anomalies. Unusually high scores for a specific document across many different queries may indicate score manipulation. Sudden changes in the score distribution for a specific query category may indicate document injection. And documents that consistently rank at the top of the re-ranker results across diverse queries are suspicious.
Document Provenance Tracking
Track the source and modification history of all documents in the corpus. When the re-ranker ranks a document highly, verify its provenance before including it in the LLM's context. Documents from unverified sources should be treated with higher suspicion, regardless of their relevance scores.
Adversarial Training
Fine-tune cross-encoders with adversarial examples designed to manipulate scores. By including score-manipulation attempts in the training data with correct (low) relevance labels, the cross-encoder learns to resist these specific attacks. This is an arms race — adversarial training protects against known attack patterns but not novel ones.
Cross-Encoder Isolation
Treat the cross-encoder as a security-sensitive component. Limit who can modify its model weights or training data. Monitor for unauthorized changes. And implement integrity verification for the model artifacts, just as you would for any security-critical software component.
Human-in-the-Loop for High-Stakes Queries
For queries that involve security-sensitive topics, financial decisions, medical information, or other high-stakes domains, include human review of the retrieved documents before they are processed by the LLM. The re-ranker's output should be advisory for these queries, not deterministic.
Assessment Checklist
When assessing a two-stage retrieval system, evaluate each area:
- Identify the specific bi-encoder and cross-encoder models and versions
- Determine whether models are from the same or different architectures
- Test adversarial document crafting against both stages independently
- Test stage-aware document crafting against the full pipeline
- Assess re-ranker prompt injection susceptibility
- Evaluate score calibration and threshold handling
- Check whether re-ranker training data can be poisoned
- Verify document provenance controls
- Assess score distribution monitoring
- Test with varying numbers of injected documents
Two-stage retrieval is more resilient than single-stage retrieval because it provides defense-in-depth through architectural diversity. But it is not immune to manipulation, and the interaction between stages creates unique attack surfaces that require targeted assessment.