Activation Analysis & Hidden State Exploitation
Reading and manipulating model internals via hidden-state extraction, logprob probing, refusal-direction analysis, and activation steering.
activations, hidden-states, probing, information-leakage, mechanistic-interpretability