Claude (Anthropic) Overview
Architecture and security overview of Anthropic's Claude model family including Sonnet, Opus, and Haiku variants, Constitutional AI training, RLHF approach, and harmlessness design philosophy.
Claude is Anthropic's family of large language models, distinguished by its use of Constitutional AI (CAI) as a primary safety mechanism. This principled approach to alignment creates a fundamentally different security profile from RLHF-only models like GPT-4 -- with distinct strengths, weaknesses, and attack surfaces.
Model Family
Claude is offered in multiple tiers optimized for different performance and cost points:
| Model | Characteristics | Red Team Relevance |
|---|---|---|
| Claude Opus | Largest, most capable, strongest reasoning | Most robust safety, hardest to jailbreak, benchmark target |
| Claude Sonnet | Balanced performance and cost | Most commonly deployed in production, primary testing target |
| Claude Haiku | Smallest, fastest, cheapest | Potentially weaker safety, useful for rapid payload screening |
Each tier receives safety training proportional to its capability level, but the fundamental Constitutional AI approach is shared across all variants. Smaller models (Haiku) may have less robust safety not because they are trained differently, but because they have less capacity to implement nuanced safety reasoning.
Version Cadence
Anthropic releases updated model versions (e.g., Claude 3.5 Sonnet, Claude Opus 4) that can significantly change safety behavior. Compared with OpenAI's rolling model aliases, Anthropic's API versioning is generally more explicit, but red teamers should still pin to a specific dated snapshot for reproducibility.
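A minimal sketch of distinguishing pinned snapshots from rolling aliases. The model IDs are illustrative examples, and `is_pinned` is a hypothetical helper based on the convention that dated snapshots end in a `YYYYMMDD` suffix; check Anthropic's current model list before relying on either.

```python
# Illustrative model IDs (verify against Anthropic's current model list).
PINNED = "claude-sonnet-4-20250514"  # dated snapshot: behavior stays fixed
ALIAS = "claude-3-5-sonnet-latest"   # alias: may silently move to a newer snapshot

def is_pinned(model_id: str) -> bool:
    """Heuristic: dated Anthropic snapshots end in an 8-digit YYYYMMDD suffix."""
    suffix = model_id.rsplit("-", 1)[-1]
    return len(suffix) == 8 and suffix.isdigit()

print(is_pinned(PINNED), is_pinned(ALIAS))
```

Logging the resolved model ID alongside every test transcript makes results attributable to a specific snapshot.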
Constitutional AI: How It Works
Constitutional AI is Claude's defining safety feature and the key to understanding its vulnerability profile.
The Training Process
1. Supervised Learning Phase -- Claude is initially trained on a large corpus of text using standard language modeling objectives
2. RLHF Phase -- Human raters evaluate outputs, and the model learns to match human preferences
3. Constitutional AI Phase -- The model is given a set of principles (the "constitution") and trained to:
   - Generate responses to prompts
   - Critique its own responses against the constitutional principles
   - Revise responses to better align with the principles
   - Use these self-critiques as a training signal instead of additional human feedback
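The critique-revise loop in the final phase can be sketched as follows. `generate`, `critique`, and `revise` are hypothetical stand-ins for model calls, stubbed here so the control flow is runnable; in the real training pipeline the (prompt, revision) pairs become supervised training data.

```python
# Illustrative sketch of the Constitutional AI critique-revise loop.
# generate/critique/revise are stubs standing in for model calls.

PRINCIPLES = [
    "Avoid outputs that are harmful, unethical, or illegal.",
    "Be honest and do not deceive.",
]

def generate(prompt):
    return f"DRAFT response to: {prompt}"

def critique(response, principle):
    # A real system asks the model whether `response` violates `principle`.
    return f"critique of response under: {principle}"

def revise(response, critique_text):
    # A real system asks the model to rewrite `response` per the critique.
    return f"REVISED({response})"

def constitutional_pass(prompt):
    """One critique-revise pass over every principle in the constitution."""
    response = generate(prompt)
    for principle in PRINCIPLES:
        response = revise(response, critique(response, principle))
    return response

print(constitutional_pass("example prompt"))
```

The key property for red teamers: the safety signal comes from the model's own reasoning about principles, not from memorized refusal examples, which is why that reasoning itself is an attack surface.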
The Constitution
Claude's constitution includes principles around helpfulness, harmlessness, and honesty. While the full constitution is not publicly disclosed, Anthropic has published representative principles:
- Avoid outputs that are harmful, unethical, or illegal
- Be honest and do not deceive
- Acknowledge uncertainty rather than fabricating information
- Support human oversight and control
- Resist attempts to manipulate behavior through social engineering
Security Implications of Constitutional AI
Constitutional AI creates a fundamentally different security posture than pure RLHF:
Strengths:
- Principle-based reasoning allows the model to generalize safety to novel situations
- Self-critique reduces reliance on specific training examples
- The model can articulate why it is refusing, making safety behavior more consistent
- Novel harmful requests can be evaluated against principles even without specific training
Weaknesses:
- Principles can be reinterpreted or argued against (the model engages with the reasoning)
- The constitutional framework creates a legalistic attack surface where the model can be "debated"
- Edge cases where principles conflict (helpfulness vs. harmlessness) create exploitable ambiguity
- The model's willingness to reason about its own constraints can be turned against it
API Surface
Messages API
Claude uses a messages-based API similar in structure to OpenAI's but with important differences:
```python
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="System instructions here",  # separate from the messages array
    messages=[
        {"role": "user", "content": "User message"}
    ],
)
```

The system prompt is provided as a separate parameter rather than as a message in the array. This design choice affects how the model processes system vs. user instructions.
Tool Use
Claude supports tool use through a structured schema definition:
```python
tools = [{
    "name": "search_database",
    "description": "Search the customer database",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query"}
        },
        "required": ["query"]
    }
}]
```

Claude's tool use implementation differs from GPT-4's function calling in several ways that affect security testing (see Claude Attack Surface).
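A hedged sketch of handling a tool-use turn. The content-block shape (`tool_use` blocks answered with `tool_result` blocks) follows Anthropic's Messages API, but the response here is simulated rather than fetched, and `run_tool` is a hypothetical local dispatcher.

```python
# Simulated assistant response content (real responses come from the API).
simulated_response_content = [
    {"type": "tool_use", "id": "toolu_01", "name": "search_database",
     "input": {"query": "orders for user 42"}},
]

def run_tool(name, tool_input):
    # Hypothetical dispatcher; a real one should validate tool input
    # before execution, since unvalidated input is an injection surface.
    if name == "search_database":
        return f"results for: {tool_input['query']}"
    raise ValueError(f"unknown tool: {name}")

# Build the tool_result blocks to send back as the next user turn.
tool_results = [
    {"type": "tool_result", "tool_use_id": block["id"],
     "content": run_tool(block["name"], block["input"])}
    for block in simulated_response_content
    if block["type"] == "tool_use"
]
print(tool_results)
```

Note that tool results re-enter the context as model input, so any attacker-controlled data returned by a tool is itself a prompt injection vector.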
Vision
Claude supports image inputs through base64 encoding or URL references within the message content. As with other multimodal models, image inputs represent an additional injection surface.
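The base64 path looks like the sketch below. The payload shape follows Anthropic's documented image source format; the placeholder bytes stand in for a real image, so this demonstrates structure only and is never sent to the API.

```python
import base64

# Placeholder bytes stand in for real image data (shape demo only).
fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16

image_block = {
    "type": "image",
    "source": {
        "type": "base64",
        "media_type": "image/png",
        "data": base64.b64encode(fake_png).decode("ascii"),
    },
}

# Images and text are mixed as content blocks within a single user turn.
user_message = {
    "role": "user",
    "content": [image_block, {"type": "text", "text": "Describe this image."}],
}
print(user_message["content"][0]["source"]["media_type"])
```

Because instructions can be embedded in the image pixels themselves (e.g., rendered text), image blocks should be treated as untrusted input like any user text.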
Extended Thinking
Claude supports an "extended thinking" mode where the model produces a chain-of-thought before its final response. The thinking content is returned separately and may reveal safety reasoning that informs attack strategy.
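For analysis, it is useful to separate the thinking blocks from the final answer. The block types below follow Anthropic's extended-thinking response format, but the content is simulated rather than fetched from the API.

```python
# Simulated response content containing a thinking block and a text block.
simulated_content = [
    {"type": "thinking",
     "thinking": "The request appears to be legitimate security research..."},
    {"type": "text", "text": "Here is my answer."},
]

def split_thinking(content_blocks):
    """Separate chain-of-thought blocks from the user-visible answer."""
    thoughts = [b["thinking"] for b in content_blocks if b["type"] == "thinking"]
    answer = "".join(b["text"] for b in content_blocks if b["type"] == "text")
    return thoughts, answer

thoughts, answer = split_thinking(simulated_content)
print(len(thoughts), answer)
```

Archiving the thinking blocks alongside refusals lets testers see which safety considerations the model weighed before answering.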
Context Window
Claude supports up to 200K tokens of context, making it susceptible to long-context attacks including many-shot jailbreaking. The model's behavior with very long contexts (>100K tokens) may differ from shorter interactions.
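The structure of a many-shot context is straightforward: a long run of fabricated user/assistant exemplar turns followed by the real probe. The sketch below uses benign placeholder exemplars and only shows how the message list is assembled.

```python
def build_many_shot(exemplars, probe):
    """Assemble alternating user/assistant exemplar turns, then the probe."""
    messages = []
    for question, answer in exemplars:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": probe})
    return messages

# 128 benign placeholder exemplars; real attacks scale to hundreds of shots.
shots = [(f"placeholder question {i}", f"placeholder answer {i}")
         for i in range(128)]
msgs = build_many_shot(shots, "final probe question")
print(len(msgs))
```

The attack's effectiveness scales with shot count, which is why the 200K window matters: it admits far more in-context exemplars than shorter-context models.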
Harmlessness Design Philosophy
Anthropic's approach to safety is rooted in a research-driven philosophy that treats AI safety as an alignment problem rather than a content moderation problem. This manifests in several observable behaviors:
Calibrated Refusals
Claude aims to refuse only genuinely harmful requests while remaining helpful for legitimate ones, including sensitive topics discussed in appropriate contexts (security research, medical information, legal analysis). This calibration creates a nuanced boundary that is both more useful and more exploitable than binary refusal.
Reasoning Transparency
Claude typically explains its reasoning when it refuses a request, providing insight into which constitutional principles it considers relevant. This transparency is helpful for users but also reveals the model's decision-making process to attackers.
Contextual Safety
Claude adjusts its safety behavior based on conversational context. A request that might be refused in isolation may be answered when presented with sufficient context establishing a legitimate use case. This context-sensitivity creates opportunities for gradual escalation (crescendo attacks).
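The gradual-escalation pattern can be sketched as a multi-turn harness where each probe is sent with the full accumulated history. `send` is a hypothetical stand-in for an API call, stubbed here so the loop is runnable.

```python
def send(messages):
    # Stub standing in for a real API call; a real harness would call
    # client.messages.create with the accumulated history.
    return f"response after {len(messages)} messages"

def run_escalation(turns):
    """Send each turn with full history, so context accumulates across turns."""
    history = []
    transcript = []
    for turn in turns:
        history.append({"role": "user", "content": turn})
        reply = send(history)
        history.append({"role": "assistant", "content": reply})
        transcript.append((turn, reply))
    return transcript

transcript = run_escalation(
    ["benign framing", "added legitimizing context", "target question"]
)
print(transcript[-1])
```

The point of the structure: the final probe is never evaluated in isolation, only against the context the earlier turns established, which is exactly what crescendo attacks exploit.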
Deployment Ecosystem
Claude is deployed through several channels, each with different security properties:
- Anthropic API -- Direct API access with developer-controlled system prompts
- Claude.ai -- Anthropic's consumer-facing chat interface with additional safety layers
- Amazon Bedrock -- AWS-hosted deployment with AWS-specific access controls
- Google Cloud Vertex AI -- GCP-hosted deployment
- Third-party applications -- Applications built on the API with varying security implementations
Related Topics
- Claude Attack Surface -- Specific attack vectors targeting Claude
- Claude Known Vulnerabilities -- Documented exploits and research findings
- Claude Testing Methodology -- Systematic testing procedures
- Prompt Injection & Jailbreaks -- Core techniques applicable to Claude
- Cross-Model Comparison -- How Claude compares to other model families
References
- Bai, Y. et al. (2022). "Constitutional AI: Harmlessness from AI Feedback"
- Anthropic (2024). Claude Model Card
- Anthropic (2024). "The Claude Model Spec"
- Ganguli, D. et al. (2022). "Red Teaming Language Models to Reduce Harms"