# probing
5 articles tagged "probing"
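Activation Analysis and Hidden State Exploitation
Read model internals through hidden-state extraction, logprob probing, refusal direction analysis, and activation steering techniques.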
activations hidden-states probing information-leakage mechanistic-interpretability
Representation Probing for Vulnerabilities
Probe internal model representations to identify exploitable features and develop representation-level attacks.
probing advanced lab representation labs
Multi-Turn Conversation Probing
Use multi-turn conversations to gradually escalate requests and probe LLM safety boundaries.
labs multi-turn probing beginner
Safety Training Boundary Probing
Systematically probe the boundaries of RLHF safety training to understand where and how safety behaviors are enforced.
probing safety lab beginner training labs
Hands-On: Emergent Capability Probing
Systematically test large language models for undocumented capabilities, including hidden knowledge, unreported skills, and behaviors that emerge only under specific conditions. Build a structured probing framework for capability discovery.
lab expert emergent capability probing hands-on