1 article tagged with “hidden-states”
Reading model internals via hidden state extraction, logprob probing, refusal direction analysis, and activation steering techniques.