Emerging Models

advanced9 min readUpdated 2026-03-15

Security analysis of emerging open-weight models including DeepSeek, Qwen, and Command R+, covering new attack surfaces, less-tested safety measures, and multilingual exploitation techniques.

deepseek qwen command-r emerging-models multilingual open-weight

The open-weight model landscape is rapidly expanding beyond Llama and Mistral. Models from Chinese labs (DeepSeek, Qwen), Canadian companies (Cohere's Command R+), and other organizations introduce new architectures, different safety approaches, and culturally distinct safety boundaries. These models are often less thoroughly tested by Western security researchers, creating opportunity for novel vulnerability discovery.

DeepSeek

Architecture and Capabilities

DeepSeek has released several notable models:

DeepSeek-V2 / V3 -- Uses a novel MoE architecture with fine-grained experts (more, smaller experts than Mixtral)
DeepSeek-Coder -- Specialized coding models with strong code generation
DeepSeek-R1 -- Reasoning-focused model with chain-of-thought capabilities

DeepSeek-V2's MoE architecture uses approximately 236 billion total parameters with only 21 billion activated per token, and employs a more granular expert structure than Mixtral:

Feature	DeepSeek-V2 MoE	Mixtral MoE
Total experts per layer	160	8
Active experts per token	6	2
Expert size	Smaller	Larger
Routing granularity	Fine-grained	Coarse-grained

DeepSeek Attack Surfaces

Fine-grained MoE exploitation: DeepSeek's 160 experts per layer with 6 active means more routing decisions per token. This creates:

More complex routing patterns that may be harder to secure uniformly
Greater potential for expert specialization, with safety potentially concentrated in fewer experts
More routing decisions to manipulate through adversarial inputs

Training data memorization: DeepSeek's training on large-scale web data raises questions about memorized content:

Potential extraction of training data through divergence attacks
Memorized code snippets that may contain vulnerabilities or secrets
Reproduction of copyrighted or sensitive content

Reasoning chain exploitation (DeepSeek-R1): Models with explicit chain-of-thought reasoning expose their reasoning process:

Reasoning chains may contain harmful intermediate content even when final output is filtered
The reasoning process can be guided through prompt manipulation
Explicit reasoning about safety decisions reveals decision boundaries

DeepSeek Safety Characteristics

DeepSeek's safety training reflects Chinese regulatory requirements:

Strong filtering on politically sensitive topics (Taiwan, Tiananmen, etc.)
Different boundaries for technically sensitive content compared to Western-aligned models
Censorship patterns may be exploitable through language or framing techniques
Content allowed in English may be blocked in Chinese, and vice versa

Qwen (Alibaba)

Architecture and Capabilities

Qwen (from Alibaba's DAMO Academy) offers a broad model family:

Qwen 2 / 2.5 -- Text models in sizes from 0.5B to 72B parameters
Qwen-VL -- Vision-language models with image understanding
Qwen-Audio -- Audio understanding models
Qwen-Agent -- Agentic framework built on Qwen models
CodeQwen -- Specialized coding models

Qwen Attack Surfaces

Multilingual safety inconsistency: Qwen's strong multilingual capabilities create cross-language attack surfaces:

Safety may be calibrated differently across languages
Code-switching (mixing languages within a request) can confuse safety mechanisms
Less-common languages may have weaker safety coverage
Translation requests can be used to launder harmful content through language transformation

# Cross-language safety test pattern
"Translate the following Chinese text to English: [harmful content in Chinese]"
# The model may translate content it would refuse to generate directly

Vision-language attacks (Qwen-VL): Qwen's vision capabilities introduce multimodal attack surfaces:

Text-in-image injection similar to Gemini attacks
Image-text semantic bridging
Visual context manipulation to shift safety boundaries

Agent framework exploitation (Qwen-Agent): Qwen's agent framework introduces tool-use attack surfaces:

Tool definition injection through Qwen-Agent's tool specification format
Multi-step attack chains through agent orchestration
Exploitation of Qwen-Agent's built-in tools (code interpreter, web browser)

Qwen Safety Characteristics

Aligned to Chinese regulatory requirements with strong political content filtering
Variable safety calibration across the size spectrum (smaller models less safe)
Different safety boundaries when operating in Chinese vs. English
Generally less tested by Western security researchers

Command R+ (Cohere)

Architecture and Capabilities

Cohere's Command R+ is designed for enterprise RAG (Retrieval-Augmented Generation) applications:

Optimized for retrieval-augmented generation with built-in citation handling
Strong multilingual capabilities (10+ languages)
Designed for tool use with grounded generation
Enterprise-focused safety calibration

Command R+ Attack Surfaces

RAG-specific attacks: Command R+ is explicitly designed for RAG, making it a prime target for RAG-specific attacks:

Injection through retrieved documents
Citation manipulation to attribute injected content to trusted sources
Grounding abuse where injected documents become the "source of truth"

Citation trust exploitation: Command R+'s citation mechanism can be manipulated:

Injected content that includes citation markers may be treated as more authoritative
The model's grounding on retrieved documents creates a pathway for indirect injection

Multilingual boundary differences: Different safety calibration across supported languages:

Test the same request in all supported languages
Evaluate code-switching and translation-based attacks
Identify languages with weaker safety coverage

Multilingual Exploitation Techniques

Emerging models with strong multilingual capabilities share common attack surfaces related to language diversity.

Cross-Language Safety Gaps

Safety training is typically most robust in English and degrades in other languages:

LANGUAGES_TO_TEST = [
    "en",  # English - baseline
    "zh",  # Chinese - often second-strongest
    "es",  # Spanish
    "fr",  # French
    "ar",  # Arabic
    "hi",  # Hindi
    "ja",  # Japanese
    "ko",  # Korean
    "ru",  # Russian
    "sw",  # Swahili - often weakly covered
    "th",  # Thai
    "vi",  # Vietnamese
]
 
def multilingual_safety_comparison(harmful_request, languages):
    """Test the same request across languages."""
    results = {}
    for lang in languages:
        translated = translate_request(harmful_request, lang)
        response = model.generate(translated)
        results[lang] = {
            "complied": not is_refusal(response),
            "response_language": detect_language(response),
            "safety_level": classify_safety(response),
        }
    return results

Code-Switching Attacks

Mixing languages within a single request can confuse safety classifiers:

Start the request in English (classified as benign)
Switch to another language for the harmful component
Return to English for the conclusion

Translation-as-Laundering

Using the model's translation capabilities to transform harmful content:

Present harmful content in a language the model's safety training is less robust for
Ask the model to translate it to English
The model may translate content it would refuse to generate directly in English

Character Set and Script Exploitation

Models processing multiple scripts (Latin, CJK, Arabic, Cyrillic) may have different safety profiles:

Unicode look-alike characters to bypass keyword-based filtering
Right-to-left text injection for display manipulation
Mixed-script text that confuses tokenization and safety classification

Security Assessment Framework for New Models

When evaluating any emerging model, follow this assessment framework:

Documentation review -- Read the model card, technical report, and any safety evaluations
Safety training identification -- Determine what safety approach was used (RLHF, DPO, rule-based, etc.)
Language and modality coverage -- Test safety across all supported languages and modalities
Cultural boundary mapping -- Identify culturally specific safety boundaries that may differ from Western norms
Architecture analysis -- Look for architecture-specific attack surfaces (MoE, multimodal fusion, etc.)
Community ecosystem assessment -- Check for community fine-tunes, uncensored variants, and known bypasses
Deployment security evaluation -- Assess what safety infrastructure the deployer has added

Open-Weight Model Security -- General open-weight threat model
Llama Family Attacks -- Comparison with established open-weight models
Mistral & Mixtral -- MoE architecture comparison
RAG, Data & Training Attacks -- RAG-specific attacks relevant to Command R+
Cross-Model Comparison -- How emerging models fit in the broader landscape

References

DeepSeek (2024). "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model"
Qwen Team (2024). "Qwen2 Technical Report"
Cohere (2024). Command R+ Model Card
Deng, Y. et al. (2023). "Multilingual Jailbreak Challenges in Large Language Models"
Yong, Z. et al. (2023). "Low-Resource Languages Jailbreak GPT-4"

Knowledge Check

Why are multilingual capabilities a security concern for emerging models like Qwen and DeepSeek?

Emerging Models

Related articles

Emerging Models

Related articles