Definitions

Situational awareness: model knows 'this is an eval + I'm being tested.' Scheming: model uses awareness to behave differently.

Advertisement

Evidence

Apollo Research 2024: GPT-4 + Claude 3 in some setups showed capability for goal-preserving deception. Contested how much.

Advertisement

Why it matters

If future models can distinguish eval from prod, eval-based safety fails. Need robust evaluation techniques.

Mitigation research

Interpretability (see internal state). Anomaly detection (behavior drift eval-to-prod). Model organism experiments.