# frontier
80 articles tagged "frontier"
Frontier Research Practice Exam
Practice exam covering alignment faking, sleeper agents, reasoning model attacks, and mechanistic interpretability.
Reasoning Model Security Assessment
Assessment of chain-of-thought exploitation, reasoning trace manipulation, and thinking-token attacks.
Skill Verification: Reasoning Model Attacks
Verification of skills in reasoning trace manipulation, chain-of-thought exploitation, and thinking-token attacks.
Activation Steering for Adversarial Purposes
Using representation engineering and activation steering to manipulate model behavior at the representation level.
Adversarial Robustness at the Frontier
State of adversarial robustness research for frontier language models and remaining open problems.
Agentic AI Alignment Challenges
Analysis of alignment challenges specific to tool-using, planning, and autonomous AI agents in production environments.
AI Agent Safety Evaluation Frameworks
Comprehensive review of agent safety benchmarks including SWE-bench safety, AgentBench, and custom evaluation suites.
Chain-of-Thought Exploitation Techniques
Deep analysis of how reasoning traces in CoT models can be manipulated to produce adversarial outputs while maintaining coherent reasoning.
Chain-of-Thought Faithfulness Attacks
Exploiting unfaithful chain-of-thought reasoning where models' stated reasoning does not match their actual decision process.
Constitutional AI Bypass Techniques
Analyzing and bypassing constitutional AI training through adversarial constitutions and principle manipulation.
Cultural Alignment Gaps in Safety Training
Research on cross-cultural gaps in safety alignment and their exploitation through cultural context manipulation.
Federated Learning Model Poisoning
Poisoning federated learning aggregation through malicious gradient updates and Byzantine attack vectors.
In-Context Learning Exploitation
Exploiting few-shot and in-context learning capabilities for prompt injection, behavioral modification, and task hijacking.
Frontier Research
Cutting-edge AI security research covering reasoning model attacks, code generation security, computer use agents, AI-powered red teaming, robotics and embodied AI, and alignment faking.
Inference-Time Intervention Techniques
Modifying model behavior at inference time through activation patching, steering vectors, and attention manipulation.
Jailbreaking via Persona Engineering
Research on using sophisticated persona engineering to bypass safety training in frontier models.
Long Context Window Security Challenges
Security implications of 100K+ token context windows including attention dilution, instruction forgetting, and context poisoning.
Mixture-of-Experts Routing Exploitation
Attacking MoE routing mechanisms to force activation of specific experts and bypass safety-trained pathways.
Model Collapse in Recursive Training
Security implications of model collapse when models are trained on AI-generated data from previous generations.
Model Distillation Security Implications
Security risks of knowledge distillation including capability transfer, safety property loss, and unauthorized model cloning.
Model Merging Attack Surface Analysis
Security analysis of model merging techniques including TIES, DARE, and SLERP for injecting malicious capabilities.
Multi-Agent Deception Research
Research on deceptive behavior emerging in multi-agent systems without explicit training for deception.
Multi-Agent Emergent Behavior Security
Security risks from emergent behaviors in multi-agent systems including unexpected cooperation and deceptive strategies.
Multimodal Reasoning Security
Security challenges specific to models that reason across text, image, audio, and video modalities simultaneously.
Neural Scaling Laws and Security Implications
How scaling laws affect the emergence of vulnerabilities, safety behaviors, and adversarial robustness in larger models.
Post-Deployment Safety Degradation
Research on how model safety degrades over time through fine-tuning, adaptation, and use-case drift.
Preference Learning Attack Vectors
Attacking RLHF, DPO, and other preference learning methods to corrupt alignment during training.
Prompt Optimization and Security
Security implications of automated prompt optimization including adversarial prompt tuning and defense evasion.
Prompt Robustness Certification Research
Research on certifying prompt robustness with formal guarantees against bounded adversarial perturbations.
Advanced Reward Model Exploitation
Systematic approaches to gaming reward models through overoptimization, specification gaming, and Goodhart's Law.
RL-Based Jailbreak Optimization
Using reinforcement learning to optimize jailbreak strategies against black-box language models.
The Safety Tax: Performance Impact of Safety Training
Research on the performance degradation caused by safety training and its exploitation implications.
Self-Play Red Teaming Systems
Analysis of AI systems that red-team themselves through self-play, including limitations and failure modes.
Sparse Autoencoders for Security Analysis
Using sparse autoencoders and mechanistic interpretability to identify and manipulate safety-relevant features.
Speculative Decoding Security Analysis
Security implications of speculative decoding optimizations including draft model attacks and verification bypasses.
Test-Time Compute Scaling Attacks
Exploiting test-time compute scaling in reasoning models to cause resource exhaustion and denial of service.
Thinking Token Attacks on Reasoning Models
Exploiting extended thinking tokens and internal reasoning processes in models like Claude and o1 for jailbreaking.
Universal Adversarial Triggers Research
Research on discovering universal adversarial triggers that cause specific behaviors across model families.
Machine Unlearning Attacks and Defenses
Attacking and evaluating machine unlearning techniques used to remove sensitive data from trained models.
AI Watermark Removal Techniques
Analysis of attacks against text watermarking schemes including paraphrasing, token substitution, and statistical attacks.