Idea

A model's neurons are polysemantic: a single neuron responds to multiple unrelated concepts. A sparse autoencoder (SAE) decomposes the model's activations into features, each with a cleaner semantic meaning.

Training

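An encoder maps activations to features; a decoder reconstructs the activations from those features. Loss: reconstruction error plus an L1 sparsity penalty on the feature activations. Overcomplete: more features than neurons.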
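
The objective above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular lab's implementation: the ReLU encoder, the bias terms, the weight scale, the dimensions, and the names `sae_forward`, `sae_loss`, and `l1_coeff` are all assumptions chosen for the example.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    # Encoder: the ReLU (an assumed choice) keeps feature activations non-negative.
    f = np.maximum(0.0, x @ W_enc + b_enc)
    # Decoder: reconstruct the original activations from the features.
    x_hat = f @ W_dec + b_dec
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff):
    # Reconstruction term: squared error, averaged over the batch.
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    # Sparsity term: L1 norm of the feature activations, averaged over the batch.
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity

rng = np.random.default_rng(0)
d_model, n_features = 8, 32  # overcomplete: more features than neurons (toy sizes)
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_dec = np.zeros(d_model)

x = rng.normal(size=(4, d_model))  # stand-in for a batch of model activations
f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat, l1_coeff=1e-3)
```

Raising `l1_coeff` trades reconstruction quality for sparser features; in practice the weights would be trained by gradient descent on this loss, which the sketch omits.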

Anthropic scale

2024: SAEs trained on Claude 3 Sonnet, the largest with about 34M features. Individual features include 'code-related bugs,' 'Golden Gate Bridge,' and 'sycophancy.'

Applications

Steering (activating or suppressing features). Interpretability. Safety (finding and monitoring concerning features).
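
Steering and monitoring can be illustrated with a simplified sketch. It assumes a trained decoder matrix `W_dec` whose rows are feature directions, and it steers by adding a scaled direction to the activations, which is one common simplification rather than the specific procedure used in any published work. The names `steer`, `monitor`, `strength`, and `threshold` are hypothetical.

```python
import numpy as np

def steer(x, W_dec, feature_idx, strength):
    # Push activations along one feature's (unit-normalized) decoder direction.
    # A positive strength activates the feature; a negative one suppresses it.
    direction = W_dec[feature_idx]
    direction = direction / np.linalg.norm(direction)
    return x + strength * direction

def monitor(f, feature_idx, threshold):
    # Flag inputs whose activation on a watched feature exceeds a threshold.
    return f[..., feature_idx] > threshold

rng = np.random.default_rng(0)
W_dec = rng.normal(size=(32, 8))  # random stand-in for a trained decoder
x = rng.normal(size=(4, 8))       # stand-in for a batch of model activations
steered = steer(x, W_dec, feature_idx=5, strength=3.0)

f = np.array([[0.0, 2.0], [0.0, 0.1]])  # toy feature activations
flags = monitor(f, feature_idx=1, threshold=1.0)
```

In a real setup the steered activations would be written back into the model's forward pass, and the threshold would be calibrated on known examples of the watched feature.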