Read via probes

A linear probe trained on activations detects whether a concept is represented. E.g., a probe for 'model believes X'. Probes reach high accuracy on many concepts.
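
A minimal sketch of the probing idea, assuming synthetic vectors as a stand-in for real model activations. The helper names (`make_synthetic_activations`, `train_linear_probe`, `probe_accuracy`) and the planted concept direction are hypothetical and only illustrate the technique; the probe here is plain logistic regression fit by gradient descent.

```python
import numpy as np

def make_synthetic_activations(n=400, dim=32, seed=0):
    """Hypothetical stand-in for hidden states: two classes split along one direction."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    labels = rng.integers(0, 2, size=n)
    # Concept present (label 1) shifts activations by +3 along the direction, absent by -3.
    acts = rng.normal(size=(n, dim)) + 3.0 * (2 * labels[:, None] - 1) * direction
    return acts, labels, direction

def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Logistic-regression probe trained with plain gradient descent."""
    n, dim = acts.shape
    w = np.zeros(dim)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels            # gradient of the log loss w.r.t. the logits
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(w, b, acts, labels):
    """Fraction of examples where the probe's sign matches the label."""
    preds = (acts @ w + b > 0).astype(int)
    return float((preds == labels).mean())

acts, labels, direction = make_synthetic_activations()
w, b = train_linear_probe(acts[:300], labels[:300])      # fit on the first 300 examples
held_out = probe_accuracy(w, b, acts[300:], labels[300:])  # evaluate on the remaining 100
print("held-out accuracy:", held_out)
```

With real models the `acts` array would come from a chosen layer's hidden states, and the labels from prompts where the concept is known to be present or absent.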

Control via LAT

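LAT (Linear Artificial Tomography) extracts a concept direction from activations. Control shifts the activations along that direction at inference time. The shift is applied at every generation step, so the effect persists through generation.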
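
A minimal sketch of steering along a concept direction, again on synthetic vectors. The difference-of-means direction used here is one simple choice and is an assumption, not necessarily the extraction method the notes refer to. The names `concept_direction`, `steer`, and `alpha` are hypothetical.

```python
import numpy as np

def concept_direction(pos_acts, neg_acts):
    """Unit vector pointing from the concept-absent mean to the concept-present mean."""
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(hidden, direction, alpha):
    """Add alpha * direction to the hidden state at every position."""
    return hidden + alpha * direction

rng = np.random.default_rng(0)
dim = 16
true_dir = np.zeros(dim)
true_dir[0] = 1.0  # planted concept axis (assumption for the demo)

# Synthetic activations with the concept present (pos) and absent (neg).
pos = rng.normal(size=(200, dim)) + 2.0 * true_dir
neg = rng.normal(size=(200, dim)) - 2.0 * true_dir
d = concept_direction(pos, neg)

hidden = rng.normal(size=(5, dim))        # hidden states for 5 token positions
steered = steer(hidden, d, alpha=4.0)     # same shift applied at every position
shift = (steered - hidden) @ d            # projection of the change onto the direction
print("shift along concept direction:", shift)
```

In a real model the same addition would be applied inside the forward pass at a chosen layer on each decoding step, which is what keeps the effect from fading as generation proceeds. A negative `alpha` pushes away from the concept instead.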

Applications

Honesty control (push toward honest output even when the model was trained to be sycophantic). Emotion (adjust output valence). Safety concepts.

Interpretability + control

Not just observing: steering behavior too. A bridge between interpretability research and safety deployment.