Anthropic Alignment Science

Interpretability, honest AI, constitutional AI. Papers on sparse autoencoders, sleeper agents.

OpenAI Safety + Alignment

Superalignment team (dissolved in 2024 and reorganized). Weak-to-strong generalization. Instruction hierarchy.

Google DeepMind Safety

Sparrow dialogue agent and related alignment work. Interpretability research. RLHF advances.

Academic

Stanford (Percy Liang, HELM). CMU (Zico Kolter, GCG). MIT (Aleksander Madry). Berkeley (Dawn Song, Sergey Levine).