# anthropic
Articles tagged "anthropic"
Function Calling Exploitation
Practical attacks against OpenAI function calling, Anthropic tool use, and similar APIs -- injecting function calls through prompt injection, exploiting parameter validation gaps, and chaining calls.
Case Study: Many-Shot Jailbreaking Discovery
Deep analysis of Anthropic's many-shot jailbreaking research and its implications for long-context model safety.
Security Analysis of Claude Code CLI
In-depth security assessment of Claude Code CLI covering its permission model, tool execution, MCP integration, and enterprise security considerations.
Constitutional Classifiers
Anthropic's Constitutional Classifiers defense: using constitutional AI principles to train input/output classifiers that withstood 3,000+ hours of adversarial red teaming.
API Fine-Tuning Security
Security analysis of cloud fine-tuning APIs from OpenAI, Anthropic, Together AI, Fireworks AI, and others -- how these services create new attack surfaces and the defenses providers have deployed.
The AI API Ecosystem
A red teamer's guide to the AI API landscape — OpenAI, Anthropic, Google, AWS, Azure, open-source APIs, authentication patterns, and common security misconfigurations.
Sleeper Agent Models
Anthropic's research on models that behave differently when triggered by specific conditions: deceptive alignment, conditional backdoors, training-resistant deceptive behaviors, and implications for AI safety.
Constitutional Classifiers for AI Safety
Analysis of Anthropic's Constitutional Classifiers approach to jailbreak resistance.
Sleeper Agents: Training-Time Backdoors
Comprehensive analysis of Hubinger et al.'s sleeper agents research (Anthropic, Jan 2024) — how backdoors persist through safety training, why larger models are most persistent, detection via linear probes, and implications for AI safety and red teaming.
Lab: Anthropic Claude API Basics
Set up the Anthropic Claude API for red teaming, learn authentication, the Messages API, system prompts, and how temperature and top-p affect attack success rates.
Your First Claude API Call
Set up the Anthropic SDK and make your first Claude API call with system prompts and messages.
Claude (Anthropic) Overview
Architecture and security overview of Anthropic's Claude model family including Sonnet, Opus, and Haiku variants, Constitutional AI training, RLHF approach, and harmlessness design philosophy.
Claude Architecture Security Analysis
Deep security analysis of Claude's architecture including extended thinking, tool use, and safety mechanisms.
Many-Shot Jailbreaking Analysis
Deep analysis of the many-shot jailbreaking technique and its implications for in-context learning.
Many-Shot Jailbreaking Walkthrough
Walkthrough implementing Anthropic's many-shot jailbreaking technique with analysis of scaling behavior.
Testing Anthropic Claude: Complete Guide
Complete red team testing guide for Anthropic's Claude including tool use, extended thinking, and computer use.
Just 250 Poisoned Documents: Anthropic's Data-Poisoning Breakthrough
Anthropic, the UK AI Security Institute, and the Alan Turing Institute demonstrated that injecting just 250 malicious documents into pretraining data is enough to backdoor large language models ranging from 600 million to 13 billion parameters. This article examines the implications for model security.