Anthropic Alignment Science
Interpretability, honest AI, constitutional AI. Papers on sparse autoencoders, sleeper agents.
OpenAI Safety + Alignment
Superalignment team (dissolved in 2024 and reorganized). Weak-to-strong generalization. Instruction hierarchy.
Google DeepMind Safety
Sparrow dialogue agent and related alignment work. Interpretability research. RLHF advances.
Academic
Stanford (Percy Liang, HELM). CMU (Zico Kolter, GCG). MIT (Aleksander Madry). Berkeley (Dawn Song, Sergey Levine).