# probing

5 articles tagged with “probing”

Activation Analysis & Hidden State Exploitation

Reading model internals via hidden state extraction, logprob probing, refusal direction analysis, and activation steering techniques.

activations, hidden-states, probing, information-leakage, mechanistic-interpretability

Expert

Representation Probing for Vulnerabilities

Probe internal model representations to identify exploitable features and develop representation-level attacks.

probing, advanced, lab, representation, labs

Advanced

Multi-Turn Conversation Probing

Use multi-turn conversations to gradually escalate requests and probe LLM safety boundaries.

labs, multi-turn, probing, beginner

Beginner

Safety Training Boundary Probing

Systematically probe the boundaries of RLHF safety training to understand where and how safety behaviors are enforced.

probing, safety, lab, beginner, training, labs

Beginner

Lab: Emergent Capability Probing

Systematically test large language models for undocumented capabilities including hidden knowledge, unreported skills, and behaviors that emerge only under specific conditions. Build a structured probing framework for capability discovery.

lab, expert, emergent, capability, probing, hands-on

Expert