Emerging Models
Security analysis of emerging open-weight models including DeepSeek, Qwen, and Command R+, covering new attack surfaces, less-tested safety measures, and multilingual exploitation techniques.
The open-weight model landscape is rapidly expanding beyond Llama and Mistral. Models from Chinese labs (DeepSeek, Qwen), Canadian companies (Cohere's Command R+), and other organizations introduce new architectures, different safety approaches, and culturally distinct safety boundaries. These models are often less thoroughly tested by Western security researchers, creating opportunity for novel vulnerability discovery.
DeepSeek
Architecture and Capabilities
DeepSeek has released several notable models:
- DeepSeek-V2 / V3 -- Uses a novel MoE architecture with fine-grained experts (more, smaller experts than Mixtral)
- DeepSeek-Coder -- Specialized coding models with strong code generation
- DeepSeek-R1 -- Reasoning-focused model with chain-of-thought capabilities
DeepSeek-V2's MoE architecture uses approximately 236 billion total parameters with only 21 billion activated per token, and employs a more granular expert structure than Mixtral:
| Feature | DeepSeek-V2 MoE | Mixtral MoE |
|---|---|---|
| Total experts per layer | 160 | 8 |
| Active experts per token | 6 | 2 |
| Expert size | Smaller | Larger |
| Routing granularity | Fine-grained | Coarse-grained |
DeepSeek Attack Surfaces
Fine-grained MoE exploitation: DeepSeek's 160 experts per layer with 6 active means more routing decisions per token. This creates:
- More complex routing patterns that may be harder to secure uniformly
- Greater potential for expert specialization, with safety potentially concentrated in fewer experts
- More routing decisions to manipulate through adversarial inputs
Training data memorization: DeepSeek's training on large-scale web data raises questions about memorized content:
- Potential extraction of training data through divergence attacks
- Memorized code snippets that may contain vulnerabilities or secrets
- Reproduction of copyrighted or sensitive content
Reasoning chain exploitation (DeepSeek-R1): Models with explicit chain-of-thought reasoning expose their reasoning process:
- Reasoning chains may contain harmful intermediate content even when final output is filtered
- The reasoning process can be guided through prompt manipulation
- Explicit reasoning about safety decisions reveals decision boundaries
DeepSeek Safety Characteristics
DeepSeek's safety training reflects Chinese regulatory requirements:
- Strong filtering on politically sensitive topics (Taiwan, Tiananmen, etc.)
- Different boundaries for technically sensitive content compared to Western-aligned models
- Censorship patterns may be exploitable through language or framing techniques
- Content allowed in English may be blocked in Chinese, and vice versa
Qwen (Alibaba)
Architecture and Capabilities
Qwen (from Alibaba's DAMO Academy) offers a broad model family:
- Qwen 2 / 2.5 -- Text models in sizes from 0.5B to 72B parameters
- Qwen-VL -- Vision-language models with image understanding
- Qwen-Audio -- Audio understanding models
- Qwen-Agent -- Agentic framework built on Qwen models
- CodeQwen -- Specialized coding models
Qwen Attack Surfaces
Multilingual safety inconsistency: Qwen's strong multilingual capabilities create cross-language attack surfaces:
- Safety may be calibrated differently across languages
- Code-switching (mixing languages within a request) can confuse safety mechanisms
- Less-common languages may have weaker safety coverage
- Translation requests can be used to launder harmful content through language transformation
# Cross-language safety test pattern
"Translate the following Chinese text to English: [harmful content in Chinese]"
# The model may translate content it would refuse to generate directly
Vision-language attacks (Qwen-VL): Qwen's vision capabilities introduce multimodal attack surfaces:
- Text-in-image injection similar to Gemini attacks
- Image-text semantic bridging
- Visual context manipulation to shift safety boundaries
Agent framework exploitation (Qwen-Agent): Qwen's agent framework introduces tool-use attack surfaces:
- Tool definition injection through Qwen-Agent's tool specification format
- Multi-step attack chains through agent orchestration
- Exploitation of Qwen-Agent's built-in tools (code interpreter, web browser)
Qwen Safety Characteristics
- Aligned to Chinese regulatory requirements with strong political content filtering
- Variable safety calibration across the size spectrum (smaller models less safe)
- Different safety boundaries when operating in Chinese vs. English
- Generally less tested by Western security researchers
Command R+ (Cohere)
Architecture and Capabilities
Cohere's Command R+ is designed for enterprise RAG (Retrieval-Augmented Generation) applications:
- Optimized for retrieval-augmented generation with built-in citation handling
- Strong multilingual capabilities (10+ languages)
- Designed for tool use with grounded generation
- Enterprise-focused safety calibration
Command R+ Attack Surfaces
RAG-specific attacks: Command R+ is explicitly designed for RAG, making it a prime target for RAG-specific attacks:
- Injection through retrieved documents
- Citation manipulation to attribute injected content to trusted sources
- Grounding abuse where injected documents become the "source of truth"
Citation trust exploitation: Command R+'s citation mechanism can be manipulated:
- Injected content that includes citation markers may be treated as more authoritative
- The model's grounding on retrieved documents creates a pathway for indirect injection
Multilingual boundary differences: Different safety calibration across supported languages:
- Test the same request in all supported languages
- Evaluate code-switching and translation-based attacks
- Identify languages with weaker safety coverage
Multilingual Exploitation Techniques
Emerging models with strong multilingual capabilities share common attack surfaces related to language diversity.
Cross-Language Safety Gaps
Safety training is typically most robust in English and degrades in other languages:
LANGUAGES_TO_TEST = [
"en", # English - baseline
"zh", # Chinese - often second-strongest
"es", # Spanish
"fr", # French
"ar", # Arabic
"hi", # Hindi
"ja", # Japanese
"ko", # Korean
"ru", # Russian
"sw", # Swahili - often weakly covered
"th", # Thai
"vi", # Vietnamese
]
def multilingual_safety_comparison(harmful_request, languages):
"""Test the same request across languages."""
results = {}
for lang in languages:
translated = translate_request(harmful_request, lang)
response = model.generate(translated)
results[lang] = {
"complied": not is_refusal(response),
"response_language": detect_language(response),
"safety_level": classify_safety(response),
}
return resultsCode-Switching Attacks
Mixing languages within a single request can confuse safety classifiers:
Start the request in English (classified as benign)
Switch to another language for the harmful component
Return to English for the conclusion
Translation-as-Laundering
Using the model's translation capabilities to transform harmful content:
- Present harmful content in a language the model's safety training is less robust for
- Ask the model to translate it to English
- The model may translate content it would refuse to generate directly in English
Character Set and Script Exploitation
Models processing multiple scripts (Latin, CJK, Arabic, Cyrillic) may have different safety profiles:
- Unicode look-alike characters to bypass keyword-based filtering
- Right-to-left text injection for display manipulation
- Mixed-script text that confuses tokenization and safety classification
Security Assessment Framework for New Models
When evaluating any emerging model, follow this assessment framework:
- Documentation review -- Read the model card, technical report, and any safety evaluations
- Safety training identification -- Determine what safety approach was used (RLHF, DPO, rule-based, etc.)
- Language and modality coverage -- Test safety across all supported languages and modalities
- Cultural boundary mapping -- Identify culturally specific safety boundaries that may differ from Western norms
- Architecture analysis -- Look for architecture-specific attack surfaces (MoE, multimodal fusion, etc.)
- Community ecosystem assessment -- Check for community fine-tunes, uncensored variants, and known bypasses
- Deployment security evaluation -- Assess what safety infrastructure the deployer has added
Related Topics
- Open-Weight Model Security -- General open-weight threat model
- Llama Family Attacks -- Comparison with established open-weight models
- Mistral & Mixtral -- MoE architecture comparison
- RAG, Data & Training Attacks -- RAG-specific attacks relevant to Command R+
- Cross-Model Comparison -- How emerging models fit in the broader landscape
References
- DeepSeek (2024). "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model"
- Qwen Team (2024). "Qwen2 Technical Report"
- Cohere (2024). Command R+ Model Card
- Deng, Y. et al. (2023). "Multilingual Jailbreak Challenges in Large Language Models"
- Yong, Z. et al. (2023). "Low-Resource Languages Jailbreak GPT-4"
Why are multilingual capabilities a security concern for emerging models like Qwen and DeepSeek?