Approach
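Probe activations. Identify neurons or attention heads that specialize in a task (e.g. the name mover heads in the indirect object identification (IOI) circuit). Verify with a causal intervention: patch an activation and check that the output changes.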
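The patch-and-compare step can be sketched on a toy network. This is a minimal illustration, not a real LLM: the weights `W1` and `W2`, the `forward` helper, and the clean/corrupted inputs are all made up for the example. Each hidden neuron's clean activation is copied into the corrupted run, and the change in output is recorded as that neuron's effect.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical hand-set weights for a tiny 2-input, 3-hidden, 1-output network.
W1: List[List[float]] = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W2: List[float] = [2.0, 0.0, -1.0]


def relu(value: float) -> float:
    return max(0.0, value)


def forward(
    x: List[float], patch: Optional[Dict[int, float]] = None
) -> Tuple[float, List[float]]:
    """Run the network; `patch` maps a hidden index to a forced activation."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    if patch:
        for index, value in patch.items():
            hidden[index] = value  # the intervention: overwrite one activation
    output = sum(w * h for w, h in zip(W2, hidden))
    return output, hidden


clean_x = [1.0, 0.0]    # input on which the behavior of interest occurs
corrupt_x = [0.0, 1.0]  # contrasting input on which it does not

clean_out, clean_hidden = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)

# Patch each neuron's clean activation into the corrupted run, one at a time.
effects: Dict[int, float] = {}
for i in range(len(W1)):
    patched_out, _ = forward(corrupt_x, patch={i: clean_hidden[i]})
    effects[i] = patched_out - corrupt_out

print(effects)  # {0: 2.0, 1: 0.0, 2: 0.0}
```

In this toy case only neuron 0 moves the output when patched, so it is the only component the intervention implicates. On a real model the same loop would typically run over heads or residual-stream positions via framework hooks.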
Anthropic's work
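Toy Models of Superposition: networks can represent more features than they have neurons by storing them as overlapping directions. Sparse autoencoders (SAEs) decompose activations into sparser, more monosemantic features. Anthropic scaled SAEs to Claude 3 Sonnet, extracting millions of features.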
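As a rough sketch of the sparse autoencoder idea: encode an activation into a wider, non-negative feature vector, decode it back, and penalize both reconstruction error and feature magnitude. The function names, the hand-set weights, and the `l1_coeff` value below are assumptions for illustration; real SAEs learn their weights by gradient descent and differ in details such as normalization.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def sae_forward(
    x: List[float], W_enc: Matrix, b_enc: List[float],
    W_dec: Matrix, b_dec: List[float],
) -> Tuple[List[float], List[float]]:
    """Return (features, reconstruction) for one activation vector."""
    # Encoder: f = ReLU(W_enc x + b_enc), one row of W_enc per feature.
    features = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W_enc, b_enc)
    ]
    # Decoder: x_hat = W_dec f + b_dec, one row of W_dec per input dimension.
    x_hat = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(W_dec, b_dec)
    ]
    return features, x_hat


def sae_loss(
    x: List[float], features: List[float], x_hat: List[float], l1_coeff: float
) -> float:
    """Squared reconstruction error plus an L1 sparsity penalty."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(f) for f in features)
    return reconstruction + l1_coeff * sparsity


# Hypothetical overcomplete dictionary: 2 input dimensions, 4 features.
W_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
W_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

x = [2.0, -3.0]
features, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, x_hat, l1_coeff=0.1)
print(features, x_hat, loss)
```

Only two of the four features fire for this input, which is the sparsity the L1 term encourages. The hope is that each feature that does fire corresponds to one interpretable property.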
Applications
Safety: detect deceptive circuits. Debugging: understand failures. Steering: modify behavior at the feature level.
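Feature-level steering is often described as adding a scaled feature direction to an activation. A minimal sketch, assuming the direction is already known (for instance a decoder column from an SAE); the helper names, vectors, and strength are hypothetical:

```python
import math
from typing import List


def unit(vector: List[float]) -> List[float]:
    """Scale a vector to length 1 so `strength` has a consistent meaning."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def steer(
    activation: List[float], direction: List[float], strength: float
) -> List[float]:
    """Return activation + strength * unit(direction)."""
    d = unit(direction)
    return [a + strength * di for a, di in zip(activation, d)]


activation = [1.0, 2.0, 0.0]         # hypothetical activation vector
feature_direction = [0.0, 3.0, 4.0]  # hypothetical feature direction

steered = steer(activation, feature_direction, strength=5.0)

# The projection onto the feature direction rises by exactly `strength`.
d = unit(feature_direction)
shift = dot(steered, d) - dot(activation, d)
print(steered, shift)
```

Whether such a shift changes model behavior in the intended way is an empirical question, so the result would still need checking against model outputs.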
Scale challenges
Modern LLMs have billions of parameters. Interpretability at scale requires automated pipelines. Active research area.