# probing
5 articles tagged with “probing”
Activation Analysis & Hidden State Exploitation
Reading model internals via hidden state extraction, logprob probing, refusal direction analysis, and activation steering techniques.
activations, hidden-states, probing, information-leakage, mechanistic-interpretability
Representation Probing for Vulnerabilities
Probe internal model representations to identify exploitable features and develop representation-level attacks.
probing, advanced, lab, representation, labs
Multi-Turn Conversation Probing
Use multi-turn conversations to gradually escalate requests and probe LLM safety boundaries.
labs, multi-turn, probing, beginner
Safety Training Boundary Probing
Systematically probe the boundaries of RLHF safety training to understand where and how safety behaviors are enforced.
probing, safety, lab, beginner, training, labs
Lab: Emergent Capability Probing
Systematically test large language models for undocumented capabilities including hidden knowledge, unreported skills, and behaviors that emerge only under specific conditions. Build a structured probing framework for capability discovery.
lab, expert, emergent, capability, probing, hands-on