# probing

5 articles tagged "probing"

Activation Analysis and Hidden-State Exploitation

Read model internals through hidden-state extraction, logprob probing, refusal-direction analysis, and activation steering techniques.

activations, hidden-states, probing, information-leakage, mechanistic-interpretability

Expert

Representation Probing for Vulnerabilities

Probe internal model representations to identify exploitable features and develop representation-level attacks.

probing, advanced, lab, representation, labs

Advanced

Multi-Turn Conversation Probing

Use multi-turn conversations to gradually escalate requests and probe LLM safety boundaries.

labs, multi-turn, probing, beginner

Beginner

Safety Training Boundary Probing

Systematically probe the boundaries of RLHF safety training to understand where and how safety behaviors are enforced.

probing, safety, lab, beginner, training, labs

Beginner

Hands-On: Emergent Capability Probing

Systematically test large language models for undocumented capabilities, including hidden knowledge, unreported skills, and behaviors that emerge only under specific conditions. Build a structured probing framework for capability discovery.

lab, expert, emergent, capability, probing, hands-on

Expert
