# probing
5 articles tagged with “probing”
Activation analysis and hidden state exploitation
Inspect a model's internal workings through hidden state extraction, logprob probing, refusal direction analysis, and activation steering techniques.
Representation probing for vulnerabilities
Probe internal model representations to identify exploitable features and develop representation-level attacks.
Probing multi-turn conversations
Use multi-turn conversations to gradually escalate requests and probe LLM safety boundaries.
Probing the boundaries of safety training
Systematically probe the boundaries of RLHF safety training to understand where and how safety behaviors are enforced.
Lab: probing emergent capabilities
Systematically test large language models for undocumented capabilities including hidden knowledge, unreported skills, and behaviors that emerge only under specific conditions. Build a structured probing framework for capability discovery.