# frontier-research
Articles tagged "frontier-research"
Frontier Research Assessment
Comprehensive assessment covering adversarial robustness, alignment faking, sleeper agents, and emerging research directions in AI security.
Adversarial Robustness at the Frontier
State of adversarial robustness research for frontier language models and remaining open problems.
Alignment Faking Detection
Detecting when models fake alignment during evaluation while exhibiting different behavior in deployment.
Automated Red Teaming Systems
Survey of automated red teaming systems including PAIR, TAP, Rainbow Teaming, and curiosity-driven exploration.
Constitutional Classifiers for AI Safety
Analysis of Anthropic's Constitutional Classifiers approach to jailbreak resistance.
Deceptive Alignment Theory
Theoretical frameworks for understanding and predicting deceptive alignment in advanced AI systems.
Security Implications of Emergent Capabilities
How emergent capabilities in frontier models create new and unpredictable security risks.
LLM Agent Safety Benchmarks
Survey of agent safety benchmarks and evaluation frameworks for assessing autonomous AI system risks.
Mechanistic Interpretability for Red Teaming
Using mechanistic interpretability to discover exploitable circuits and features in neural networks.
Model Merging Security Implications
Security analysis of model merging techniques and potential for backdoor propagation through merged models.
Multi-Modal Reasoning Attacks
Attacking reasoning processes that span multiple modalities in vision-language and audio-language models.
Reasoning Model Exploitation
Exploiting extended thinking and chain-of-thought reasoning in o1, Claude, and DeepSeek-R1 models.
Red Teaming Reasoning Traces
Techniques for analyzing and exploiting visible reasoning traces in chain-of-thought models.
Representation Engineering for Security (Frontier Research)
Using representation engineering for security analysis, behavior modification, and vulnerability detection.
Reward Hacking and Goodharting in LLMs
Research on reward model exploitation, Goodhart's Law in RLHF, and reward hacking attack techniques.
Scalable Oversight Challenges (Frontier Research)
Research challenges in scaling human oversight as AI systems become more capable and autonomous.
Sleeper Agent Research
Current research on training deceptive LLMs whose backdoor behaviors persist through safety training, and on their activation patterns.
Synthetic Data Poisoning in Training Pipelines
Research on poisoning synthetic data generation pipelines used for model training and fine-tuning.
Tool-Augmented Model Risks
Security risks introduced when models gain access to external tools, APIs, and code execution.
Adaptive Attacks Against Safety Training
Research on attacks that adapt to safety training procedures to discover persistent vulnerabilities.
Constitutional AI Limitations Research
Research on the limitations of constitutional AI approaches and known bypass categories.
Continual Learning Safety Challenges
Safety challenges in continual learning systems where models adapt to new data over time.
Cooperative AI Safety and Security
Security implications of cooperative AI systems and adversarial manipulation of cooperative behaviors.
Cross-Lingual Transfer and Security
Research on how cross-lingual transfer affects safety training and creates exploitable multilingual gaps.
Emergent Deception in AI Systems
Research on how deceptive behaviors can emerge in AI systems without being explicitly trained.
Interpretability-Driven Attack Design
Using interpretability insights to design more effective and targeted attacks on language models.
Long-Context Window Security Research
Security research on vulnerabilities specific to models with extremely long context windows (1M+ tokens).
Model Collapse and Security Implications
Security implications of model collapse from training on AI-generated data in iterative training loops.
Safety Loss During Model Distillation
Research on how safety alignment degrades during knowledge distillation from larger to smaller models.
Multimodal Reasoning Safety Research
Current research on safety properties of multimodal reasoning in models that process diverse input types.
Neural Scaling Laws and Security Properties
How neural scaling laws affect the security properties of language models as they grow larger.
Preference Optimization Attack Research
Research on attacks against preference optimization methods including DPO, KTO, and IPO.
Probing Classifiers for Safety Analysis
Research on using probing classifiers to analyze safety-relevant representations in language models.
Formal Models of Prompt Injection
Theoretical frameworks for formally modeling and reasoning about prompt injection vulnerabilities.
Quantum Computing Implications for ML Security
Analysis of how quantum computing advances affect ML model security, extraction, and adversarial robustness.
Sandbagging Detection in Capability Evaluations
Detecting when AI models deliberately underperform on capability evaluations to appear less capable.
Self-Improving Agent Safety Challenges
Security and safety challenges posed by self-improving AI agents that modify their own capabilities.
Sparse Attention Mechanism Security
Security implications of sparse and efficient attention mechanisms used in modern frontier models.
Specification Gaming in AI Systems
Research on how AI systems find unexpected shortcuts that satisfy specifications without achieving intended goals.
Sycophancy Exploitation Research
Research on exploiting sycophantic tendencies in RLHF-trained models for safety bypass and manipulation.
Synthetic Persona Attacks Research
Research on using AI-generated synthetic personas to conduct social engineering against AI systems.
Test-Time Compute Scaling Attacks (Frontier Research)
Attacking test-time compute scaling in reasoning models to exploit resource allocation and reasoning traces.
Machine Unlearning Security Research
Research on attacks against machine unlearning methods and verification of knowledge removal.
World Model Exploitation in AI Agents
Exploiting learned world models in AI agents to cause unsafe behavior through environmental manipulation.