Approach

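Probe activations to find where information is represented. Identify neurons or attention heads that specialize, e.g. the name mover heads in the indirect object identification (IOI) circuit. Verify with intervention: patch an activation and check that the output changes as predicted.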
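
The intervention step can be sketched as activation patching on a toy network. This is only an illustration under stated assumptions: the two-layer ReLU model, its weights, and the names `forward`, `clean`, and `corrupted` are all hypothetical and not taken from any real model or library. The idea is to cache hidden activations from a "clean" input, splice them one at a time into a run on a "corrupted" input, and see how far each one moves the output.

```python
def forward(x, W1, W2, patch=None):
    """Run a tiny 2-layer ReLU network (hypothetical toy model).

    patch: optional (index, value) that overrides one hidden activation.
    Returns the hidden activations and the scalar output.
    """
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    if patch is not None:
        idx, value = patch
        hidden[idx] = value  # the intervention: overwrite one neuron
    output = sum(w * h for w, h in zip(W2, hidden))
    return hidden, output

# Hand-picked weights, chosen only to make the arithmetic easy to follow.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 hidden neurons, 2 inputs
W2 = [2.0, -1.0, 0.5]

clean = [1.0, 0.0]
corrupted = [0.0, 1.0]

clean_hidden, clean_out = forward(clean, W1, W2)
_, corrupted_out = forward(corrupted, W1, W2)

# Patch each clean activation into the corrupted run, one neuron at a time.
effects = []
for i in range(len(W1)):
    _, patched_out = forward(corrupted, W1, W2, patch=(i, clean_hidden[i]))
    effects.append(patched_out - corrupted_out)

print(effects)  # [2.0, 1.0, 0.0]
```

Neuron 2 has the same activation on both inputs, so patching it changes nothing; neurons 0 and 1 account for the whole output difference between the two runs. In a real transformer the same loop would run over heads or residual-stream positions rather than three scalar neurons.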

Anthropic's work

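Toy models of superposition showed how a network can represent more features than it has neurons. Sparse autoencoders (SAEs) decompose activations into more monosemantic features. Anthropic scaled SAEs to Claude 3 Sonnet, extracting millions of features.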
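
A minimal sketch of a sparse autoencoder's forward pass and loss, assuming the common formulation of a ReLU encoder, a linear decoder, and an L1 sparsity penalty. The function name `sae_forward`, the hand-set weights, and the omission of a decoder bias are simplifications for illustration, not Anthropic's implementation.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def sae_forward(x, W_enc, b_enc, W_dec, l1_coeff):
    """Encode x into sparse features, decode, and compute the loss.

    W_enc: n_features x d_model, W_dec: d_model x n_features.
    Loss = squared reconstruction error + l1_coeff * L1 norm of features.
    """
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    x_hat = matvec(W_dec, features)
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(f) for f in features)
    return features, x_hat, recon + l1_coeff * sparsity

# Overcomplete dictionary: 3 features for a 2-dimensional activation.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]

features, x_hat, loss = sae_forward([2.0, 0.0], W_enc, b_enc, W_dec, l1_coeff=0.5)
print(features, x_hat, loss)  # [2.0, 0.0, 0.0] [2.0, 0.0] 1.0
```

Only one of the three features fires and the input is reconstructed exactly, so the entire loss comes from the sparsity term. In training, the weights would be learned by gradient descent on this loss over many activations rather than set by hand.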

Applications

Safety: detect circuits associated with deceptive behavior. Debugging: trace failures to the components responsible. Steering: modify behavior by amplifying or suppressing individual features.
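
Feature-level steering can be sketched as adding a scaled feature direction to an activation vector. This assumes the feature is represented by a direction in activation space (for example, a decoder column of a sparse autoencoder); the function name `steer` and the example vectors are hypothetical.

```python
def steer(activation, direction, strength):
    """Shift an activation along a unit-normalized feature direction.

    Positive strength amplifies the feature, negative strength suppresses it.
    """
    norm = sum(d * d for d in direction) ** 0.5
    unit = [d / norm for d in direction]
    return [a + strength * u for a, u in zip(activation, unit)]

activation = [1.0, 1.0]
feature_direction = [0.0, 2.0]  # hypothetical feature direction

amplified = steer(activation, feature_direction, strength=3.0)
suppressed = steer(activation, feature_direction, strength=-1.0)
print(amplified, suppressed)  # [1.0, 4.0] [1.0, 0.0]
```

Normalizing the direction first makes `strength` comparable across features whose directions have different magnitudes.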

Scale challenges

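Modern LLMs have billions of parameters, so manual circuit analysis does not scale. Interpretability at this scale requires automated pipelines for finding and labeling features and circuits. This remains an active research area.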