Idea
A model's neurons are polysemantic (multiple concepts per neuron). A sparse autoencoder (SAE) decomposes the model's activations into features, each with a cleaner semantic meaning.
Training
Encoder + decoder. Loss = reconstruction error + L1 sparsity penalty on the feature activations. Overcomplete: more features than neurons in the layer being decomposed.
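The training setup above can be sketched in a few lines. This is a minimal illustration, not any lab's actual implementation: the ReLU encoder, squared-error reconstruction term, the `l1_coeff` value, and all function names (`init_sae`, `encode`, `decode`, `sae_loss`) are assumptions chosen for clarity. It computes only the forward pass and the loss; the gradient step is omitted.

```python
import math
import random


def init_sae(d_model, n_features, seed=0):
    """Random weights for a toy SAE. Overcomplete means n_features > d_model."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_model)
    return {
        # One encoder row per feature, one decoder row per activation dimension.
        "W_enc": [[rng.gauss(0, scale) for _ in range(d_model)]
                  for _ in range(n_features)],
        "b_enc": [0.0] * n_features,
        "W_dec": [[rng.gauss(0, scale) for _ in range(n_features)]
                  for _ in range(d_model)],
        "b_dec": [0.0] * d_model,
    }


def encode(params, x):
    """Feature activations: ReLU(W_enc @ x + b_enc). ReLU is an assumption."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(params["W_enc"], params["b_enc"])]


def decode(params, f):
    """Reconstruction of the original activation: W_dec @ f + b_dec."""
    return [sum(w * fi for w, fi in zip(row, f)) + b
            for row, b in zip(params["W_dec"], params["b_dec"])]


def sae_loss(params, x, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 penalty on the features."""
    f = encode(params, x)
    x_hat = decode(params, f)
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(f)  # equals the L1 norm because ReLU outputs are >= 0
    return reconstruction + l1_coeff * sparsity


# Toy sizes: a 4-dimensional activation decomposed into 16 features.
params = init_sae(d_model=4, n_features=16)
x = [0.5, -1.0, 0.25, 2.0]
features = encode(params, x)
loss = sae_loss(params, x)
```

Raising `l1_coeff` pushes training toward fewer active features per input at the cost of worse reconstruction; that trade-off is what the two loss terms encode.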
Anthropic scale
2024: SAEs trained on Claude 3 Sonnet, the largest with about 34M features. Example features: 'bugs in code,' 'Golden Gate Bridge,' 'sycophancy.'
Applications
Steering (activate/deactivate features). Interpretability. Safety (find + monitor concerning features).
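Steering and monitoring can be illustrated with a toy decoder. This is a hedged sketch under simple assumptions: the feature directions, the bias, and the helper names (`decode`, `clamp_feature`, `flagged`) are hypothetical and hand-picked, not taken from any real model. Steering here means clamping one feature's activation to a chosen value before decoding; monitoring means checking whether any watched feature exceeds a threshold.

```python
def decode(features, directions, bias):
    """Sum each feature's decoder direction, weighted by its activation."""
    out = list(bias)
    for f_i, direction in zip(features, directions):
        for j, d_j in enumerate(direction):
            out[j] += f_i * d_j
    return out


def clamp_feature(features, index, value):
    """Return a copy of the features with one activation forced to `value`."""
    clamped = list(features)
    clamped[index] = value
    return clamped


def flagged(features, watched, threshold):
    """Indices of watched features whose activation exceeds the threshold."""
    return [i for i in watched if features[i] > threshold]


# Hypothetical toy setup: 3 features over a 2-dimensional activation.
directions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0]
features = [0.5, 0.0, 2.0]

baseline = decode(features, directions, bias)
suppressed = decode(clamp_feature(features, 2, 0.0), directions, bias)  # deactivate feature 2
boosted = decode(clamp_feature(features, 1, 10.0), directions, bias)    # activate feature 1
alerts = flagged(features, watched=[1, 2], threshold=1.0)
```

In a real model the steered reconstruction would be written back into the forward pass in place of the original activation; that step depends on the model's internals and is left out here.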