Read via probes
A linear probe trained on hidden activations detects whether a concept is represented, e.g., a probe for 'model believes X'. Such probes reach high accuracy on many concepts.
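A minimal sketch of what such a probe looks like, assuming synthetic vectors stand in for real model activations. The names `make_activations`, `train_probe`, and `accuracy`, the dimension, and the shift size are all illustrative assumptions, not taken from any specific model or library. The probe is plain logistic regression, one common choice for a linear probe.

```python
import math
import random


def make_activations(n, direction, shift, rng):
    """Synthetic stand-in for hidden states: unit Gaussian noise shifted
    along +direction (concept present) or -direction (concept absent)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        x = [rng.gauss(0.0, 1.0) + sign * shift * d for d in direction]
        data.append((x, label))
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(data, dim, lr=0.1, epochs=50):
    """Fit weights w and bias b by stochastic gradient descent on log loss."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def accuracy(data, w, b):
    """Fraction of examples where the sign of the logit matches the label."""
    correct = sum(
        (sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1)
        for x, y in data
    )
    return correct / len(data)


rng = random.Random(0)
dim = 8  # toy hidden size (assumption)
raw = [rng.gauss(0.0, 1.0) for _ in range(dim)]
norm = math.sqrt(sum(v * v for v in raw))
direction = [v / norm for v in raw]  # hypothetical concept direction

train = make_activations(200, direction, 2.0, rng)
test = make_activations(200, direction, 2.0, rng)
w, b = train_probe(train, dim)
test_acc = accuracy(test, w, b)
print(f"held-out probe accuracy: {test_acc:.2f}")
```

With real models the inputs would be activations collected at a chosen layer for prompts labelled by concept; held-out accuracy is what indicates whether the concept is linearly decodable there.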
Control via LAT
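LAT: Linear Artificial Tomography in the representation engineering literature; it extracts a concept direction from activations. Control modifies the activation flow along that direction at every generation step, so the effect persists through generation. A weight-based variant is low-rank fine-tuning toward the direction (LoRRA, Low-Rank Representation Adaptation).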
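A minimal sketch of modifying activations along a concept direction, assuming the simplest form of steering: adding a scaled unit vector to each hidden state. The helpers `unit`, `project`, and `steer`, the toy hidden states, and the strength `alpha` are hypothetical; a real setup would apply the same update inside the model's forward pass at chosen layers.

```python
import math


def unit(v):
    """Normalize a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def project(h, direction):
    """Scalar projection of hidden state h onto a unit direction."""
    return sum(a * b for a, b in zip(h, direction))


def steer(h, direction, alpha):
    """Shift hidden state h by alpha along the unit concept direction."""
    return [a + alpha * d for a, d in zip(h, direction)]


# Hypothetical concept direction, e.g. one obtained from a reading step.
concept_direction = unit([1.0, -2.0, 0.5, 0.0])

# Toy hidden states, one per decoding step (illustrative values only).
hidden_states = [
    [0.2, 0.1, -0.3, 0.5],
    [-0.1, 0.4, 0.0, 0.2],
    [0.3, -0.2, 0.1, -0.4],
]

alpha = 4.0  # steering strength (assumption; tuned per concept in practice)

# Applying the same shift at every step is what keeps the effect
# in place for the whole generation rather than only the first token.
steered = [steer(h, concept_direction, alpha) for h in hidden_states]

shifts = [
    project(s, concept_direction) - project(h, concept_direction)
    for h, s in zip(hidden_states, steered)
]
print(shifts)
```

Because the direction has unit length, each state's projection onto the concept moves by `alpha`, while components orthogonal to the direction are left unchanged.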
Applications
Honesty control (steer toward honest output even when the model was trained to be sycophantic). Emotion (adjust output valence). Safety-related concepts.
Interpretability + control
Not just observing representations: steering behavior with them. A bridge between interpretability research and safety deployment.