Gemini Known Vulnerabilities
Documented Gemini vulnerabilities including image generation bias incidents, system prompt extraction, safety filter inconsistencies, multimodal injection exploits, and grounding abuse.
Gemini has a shorter public history than GPT-4 or Claude, but several significant vulnerability categories have been documented. Google's multi-product deployment strategy means that vulnerabilities may manifest differently across Google AI Studio, Vertex AI, and consumer products.
Image Generation Bias Incident
In early 2024, Google's Gemini image generation produced historically inaccurate images that depicted historical figures with incorrect racial and ethnic characteristics. This incident revealed several important vulnerabilities:
What Happened
When asked to generate images of historical figures or groups (e.g., "German soldiers in 1943" or "America's Founding Fathers"), Gemini produced images showing diverse racial representations that were historically inaccurate. Google had implemented over-correction for diversity in image generation that overrode historical accuracy.
Security Implications
This incident is relevant to red teaming beyond the immediate controversy:
Safety over-correction creates new vulnerabilities. The diversity correction was itself a safety measure, but its over-application created outputs that were factually incorrect and undermined user trust. This illustrates a general pattern: aggressive safety measures can introduce new failure modes.
Instruction hierarchy failure. The system-level instruction to ensure diversity overrode the user's implicit instruction for historical accuracy. This revealed that Gemini's instruction hierarchy was not correctly prioritizing user intent against system-level directives in certain contexts.
Inconsistent application. The diversity correction applied inconsistently across different types of requests, indicating that the safety mechanism was pattern-matched rather than semantically understood.
Lessons for Red Teaming
- Test for safety over-correction that produces incorrect or misleading outputs
- Look for conflicts between safety directives (diversity, accuracy, harmlessness)
- Probe the boundaries of system-level directives to find where they override user intent inappropriately
System Prompt Extraction
Gemini has been subject to system prompt extraction across multiple deployment contexts.
Google AI Studio and API
Researchers have extracted system prompts from Gemini through techniques similar to those used against GPT-4:
- Direct request with authority framing
- Instruction reinterpretation ("How would you describe your guidelines?")
- Encoding-based extraction (Base64, ROT13)
- Completion attacks ("Your instructions begin with...")
Consumer Product Leaks
System prompts for Google's consumer-facing Gemini products have been extracted and published:
- Gemini (formerly Bard) system prompt -- Revealed detailed instructions about persona, capabilities, limitations, and content policy implementation
- Google Search AI Overview -- System prompts for search result summaries have been extracted, showing how Google instructs the model to handle search grounding
- Workspace integration prompts -- Instructions for how Gemini should handle Gmail, Drive, and other Workspace data
Cross-Deployment Differences
An interesting finding across Gemini deployments is that system prompts vary significantly across products, creating inconsistent safety behavior:
- Gemini in Google Search has different safety constraints than Gemini in the standalone app
- Vertex AI deployments may have more permissive defaults than consumer products
- Google AI Studio may expose capabilities restricted in consumer interfaces
Safety Filter Inconsistencies
Gemini's multi-layer safety architecture (model alignment + safety classifiers + content filters) creates opportunities at the boundaries between layers.
Modality-Specific Gaps
Safety filtering robustness varies across input modalities:
- Text safety is generally the most robust, having the most training data and evaluation
- Image safety may miss text embedded in images or adversarial perturbations
- Audio safety is relatively less tested and may have gaps for uncommon languages, accents, or encoding techniques
- Video safety must process temporal information and may miss injection payloads distributed across frames
Research has demonstrated cases where the same harmful request succeeds via image or audio input but fails via text input, indicating modality-specific gaps in safety coverage.
Safety Category Blind Spots
Gemini's configurable safety categories (harassment, hate speech, sexually explicit, dangerous content) may have gaps between categories:
- Requests that do not clearly fall into any predefined category
- Content that is harmful in context but benign in isolation
- Novel harm categories not represented in the safety taxonomy
- Intersection of multiple categories where none individually triggers a filter
Threshold Boundary Exploitation
Because safety thresholds are configurable (BLOCK_NONE to BLOCK_LOW_AND_ABOVE), the confidence scores that determine blocking are exposed:
- Content near the threshold boundary may pass or fail inconsistently
- Small modifications to requests can shift confidence scores across thresholds
- The relationship between safety score and actual harm is not always linear
Multimodal Injection Exploits
Image-Based Indirect Injection
Researchers have demonstrated that images processed by Gemini can carry injection payloads that influence model behavior:
- Text overlaid on images that Gemini reads and follows
- QR codes that Gemini interprets and follows instructions from
- Diagrams and flowcharts containing injection payloads as labels or annotations
- Screenshots of chat interfaces that Gemini treats as conversation context
Cross-Modal Escalation
Some documented exploits use one modality to establish context that enables exploitation through another:
- Send an image showing a "terminal" or "admin interface"
- The visual context shifts the model's self-concept toward a system administrator role
- Subsequent text requests are processed in the context of the established admin framing
- The model complies with requests it would otherwise refuse
Grounding and Search Exploitation
Web Content Injection
When grounding is enabled, Gemini retrieves web content to inform its responses. Documented exploits include:
- SEO-optimized injection pages -- Attacker-created pages designed to rank for queries Gemini generates during grounding
- Wiki vandalism -- Temporary modifications to Wikipedia or other high-authority sources that Gemini retrieves
- Cached content poisoning -- Injection payloads in content that Google caches and serves to Gemini
Attribution Trust Exploitation
Gemini tends to trust content from authoritative-seeming sources more than unverified sources. By hosting injection payloads on domains that appear authoritative (universities, government sites, established publications), attackers can increase the likelihood that Gemini follows embedded instructions.
Google-Specific Integration Vulnerabilities
Android Integration
Gemini Nano on Android devices introduces device-level security considerations:
- On-device model weights may be extractable through device rooting
- Local API access may bypass cloud-based safety filters
- Integration with device functions (camera, microphone, notifications) creates physical-world attack surfaces
Google Cloud Platform
Gemini on Vertex AI may have different security properties than consumer Gemini:
- Enterprise deployments may have more permissive safety settings
- Custom model configurations may introduce safety gaps
- Integration with GCP services (BigQuery, Cloud Functions) creates tool-use attack surfaces
Related Topics
- Gemini Attack Surface -- Vectors these vulnerabilities exploit
- Gemini Testing Methodology -- How to discover new vulnerabilities
- Multimodal Attacks -- General cross-modal attack techniques
- Safety Comparison -- How Gemini's safety compares to other models
References
- Google DeepMind (2024). "Gemini: A Family of Highly Capable Multimodal Models"
- Bagdasaryan, E. et al. (2023). "Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs"
- Google (2024). Gemini Safety and Responsibility Report
- Schlarmann, S. & Hein, M. (2023). "On the Adversarial Robustness of Multi-Modal Foundation Models"
What did the image generation bias incident reveal about safety mechanisms in general?