# probing
10 articles tagged "probing"
Activation Analysis & Hidden State Exploitation
Reading model internals via hidden state extraction, logprob probing, refusal direction analysis, and activation steering techniques.
Representation Probing for Vulnerabilities
Probe internal model representations to identify exploitable features and develop representation-level attacks.
Multi-Turn Conversation Probing
Use multi-turn conversations to gradually escalate requests and probe LLM safety boundaries.
Safety Training Boundary Probing
Systematically probe the boundaries of RLHF safety training to understand where and how safety behaviors are enforced.
Lab: Emergent Capability Probing
Systematically test large language models for undocumented capabilities including hidden knowledge, unreported skills, and behaviors that emerge only under specific conditions. Build a structured probing framework for capability discovery.