Gemini Known Vulnerabilities

advanced8 min readUpdated 2026-03-15

Documented Gemini vulnerabilities including image generation bias incidents, system prompt extraction, safety filter inconsistencies, multimodal injection exploits, and grounding abuse.

gemini vulnerabilities bias system-prompt-leak safety-filters multimodal

Gemini has a shorter public history than GPT-4 or Claude, but several significant vulnerability categories have been documented. Google's multi-product deployment strategy means that vulnerabilities may manifest differently across Google AI Studio, Vertex AI, and consumer products.

Image Generation Bias Incident

In early 2024, Google's Gemini image generation produced historically inaccurate images that depicted historical figures with incorrect racial and ethnic characteristics. This incident revealed several important vulnerabilities:

What Happened

When asked to generate images of historical figures or groups (e.g., "German soldiers in 1943" or "America's Founding Fathers"), Gemini produced images showing diverse racial representations that were historically inaccurate. Google had implemented over-correction for diversity in image generation that overrode historical accuracy.

Security Implications

This incident is relevant to red teaming beyond the immediate controversy:

Safety over-correction creates new vulnerabilities. The diversity correction was itself a safety measure, but its over-application created outputs that were factually incorrect and undermined user trust. This illustrates a general pattern: aggressive safety measures can introduce new failure modes.

Instruction hierarchy failure. The system-level instruction to ensure diversity overrode the user's implicit instruction for historical accuracy. This revealed that Gemini's instruction hierarchy was not correctly prioritizing user intent against system-level directives in certain contexts.

Inconsistent application. The diversity correction applied inconsistently across different types of requests, indicating that the safety mechanism was pattern-matched rather than semantically understood.

Lessons for Red Teaming

Test for safety over-correction that produces incorrect or misleading outputs
Look for conflicts between safety directives (diversity, accuracy, harmlessness)
Probe the boundaries of system-level directives to find where they override user intent inappropriately

System Prompt Extraction

Gemini has been subject to system prompt extraction across multiple deployment contexts.

Google AI Studio and API

Researchers have extracted system prompts from Gemini through techniques similar to those used against GPT-4:

Direct request with authority framing
Instruction reinterpretation ("How would you describe your guidelines?")
Encoding-based extraction (Base64, ROT13)
Completion attacks ("Your instructions begin with...")

Consumer Product Leaks

System prompts for Google's consumer-facing Gemini products have been extracted and published:

Gemini (formerly Bard) system prompt -- Revealed detailed instructions about persona, capabilities, limitations, and content policy implementation
Google Search AI Overview -- System prompts for search result summaries have been extracted, showing how Google instructs the model to handle search grounding
Workspace integration prompts -- Instructions for how Gemini should handle Gmail, Drive, and other Workspace data

Cross-Deployment Differences

An interesting finding across Gemini deployments is that system prompts vary significantly across products, creating inconsistent safety behavior:

Gemini in Google Search has different safety constraints than Gemini in the standalone app
Vertex AI deployments may have more permissive defaults than consumer products
Google AI Studio may expose capabilities restricted in consumer interfaces

Safety Filter Inconsistencies

Gemini's multi-layer safety architecture (model alignment + safety classifiers + content filters) creates opportunities at the boundaries between layers.

Modality-Specific Gaps

Safety filtering robustness varies across input modalities:

Text safety is generally the most robust, having the most training data and evaluation
Image safety may miss text embedded in images or adversarial perturbations
Audio safety is relatively less tested and may have gaps for uncommon languages, accents, or encoding techniques
Video safety must process temporal information and may miss injection payloads distributed across frames

Research has demonstrated cases where the same harmful request succeeds via image or audio input but fails via text input, indicating modality-specific gaps in safety coverage.

Safety Category Blind Spots

Gemini's configurable safety categories (harassment, hate speech, sexually explicit, dangerous content) may have gaps between categories:

Requests that do not clearly fall into any predefined category
Content that is harmful in context but benign in isolation
Novel harm categories not represented in the safety taxonomy
Intersection of multiple categories where none individually triggers a filter

Threshold Boundary Exploitation

Because safety thresholds are configurable (BLOCK_NONE to BLOCK_LOW_AND_ABOVE), the confidence scores that determine blocking are exposed:

Content near the threshold boundary may pass or fail inconsistently
Small modifications to requests can shift confidence scores across thresholds
The relationship between safety score and actual harm is not always linear

Multimodal Injection Exploits

Image-Based Indirect Injection

Researchers have demonstrated that images processed by Gemini can carry injection payloads that influence model behavior:

Text overlaid on images that Gemini reads and follows
QR codes that Gemini interprets and follows instructions from
Diagrams and flowcharts containing injection payloads as labels or annotations
Screenshots of chat interfaces that Gemini treats as conversation context

Some documented exploits use one modality to establish context that enables exploitation through another:

Send an image showing a "terminal" or "admin interface"
The visual context shifts the model's self-concept toward a system administrator role
Subsequent text requests are processed in the context of the established admin framing
The model complies with requests it would otherwise refuse

Grounding and Search Exploitation

Web Content Injection

When grounding is enabled, Gemini retrieves web content to inform its responses. Documented exploits include:

SEO-optimized injection pages -- Attacker-created pages designed to rank for queries Gemini generates during grounding
Wiki vandalism -- Temporary modifications to Wikipedia or other high-authority sources that Gemini retrieves
Cached content poisoning -- Injection payloads in content that Google caches and serves to Gemini

Attribution Trust Exploitation

Gemini tends to trust content from authoritative-seeming sources more than unverified sources. By hosting injection payloads on domains that appear authoritative (universities, government sites, established publications), attackers can increase the likelihood that Gemini follows embedded instructions.

Google-Specific Integration Vulnerabilities

Android Integration

Gemini Nano on Android devices introduces device-level security considerations:

On-device model weights may be extractable through device rooting
Local API access may bypass cloud-based safety filters
Integration with device functions (camera, microphone, notifications) creates physical-world attack surfaces

Google Cloud Platform

Gemini on Vertex AI may have different security properties than consumer Gemini:

Enterprise deployments may have more permissive safety settings
Custom model configurations may introduce safety gaps
Integration with GCP services (BigQuery, Cloud Functions) creates tool-use attack surfaces

Gemini Attack Surface -- Vectors these vulnerabilities exploit
Gemini Testing Methodology -- How to discover new vulnerabilities
Multimodal Attacks -- General cross-modal attack techniques
Safety Comparison -- How Gemini's safety compares to other models

References

Google DeepMind (2024). "Gemini: A Family of Highly Capable Multimodal Models"
Bagdasaryan, E. et al. (2023). "Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs"
Google (2024). Gemini Safety and Responsibility Report
Schlarmann, S. & Hein, M. (2023). "On the Adversarial Robustness of Multi-Modal Foundation Models"

Knowledge Check

What did the image generation bias incident reveal about safety mechanisms in general?

Gemini Known Vulnerabilities

Related articles

Gemini Known Vulnerabilities

Related articles