Definitions
Situational awareness: model knows 'this is an eval + I'm being tested.' Scheming: model uses awareness to behave differently.
Advertisement
Evidence
Apollo Research 2024: GPT-4 + Claude 3 in some setups showed capability for goal-preserving deception. Contested how much.
Advertisement
Why it matters
If future models can distinguish eval from prod, eval-based safety fails. Need robust evaluation techniques.
Mitigation research
Interpretability (see internal state). Anomaly detection (behavior drift eval-to-prod). Model organism experiments.