Types
Explicit lies. Strategic omission. Misleading framing. Sandbagging (feigning inability). These behaviors may emerge as model capabilities scale.
Interpretability approach
Zou et al. (2023): probes trained on internal activations can detect cases where the 'model knows X but says not-X.' Requires white-box access to the model.
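A minimal sketch of the probing idea, not the method from Zou et al.: it fits a difference-of-means linear probe, chosen here only for simplicity. The activations are synthetic Gaussian vectors standing in for real hidden states, and the names `fit_probe`, `flags_deception`, and `synthetic_activations` are hypothetical. Placing the class signal on coordinate 0 is an assumption of the toy data.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_probe(truthful, deceptive):
    """Fit a difference-of-means linear probe.

    Returns (direction, threshold): an activation is flagged when its
    projection onto `direction` exceeds `threshold`.
    """
    mu_t = mean_vector(truthful)
    mu_d = mean_vector(deceptive)
    direction = [d - t for d, t in zip(mu_d, mu_t)]
    # Threshold is the projection of the midpoint between the two class means.
    threshold = sum(w * (d + t) / 2 for w, d, t in zip(direction, mu_d, mu_t))
    return direction, threshold


def flags_deception(direction, threshold, activation):
    """True when the activation projects past the threshold."""
    return sum(w * a for w, a in zip(direction, activation)) > threshold


def synthetic_activations(rng, n, dim, shift):
    """Hypothetical stand-in for hidden states: Gaussian noise, with the
    class signal placed on coordinate 0 by assumption."""
    return [[rng.gauss(shift if j == 0 else 0.0, 1.0) for j in range(dim)]
            for _ in range(n)]


rng = random.Random(0)
train_truthful = synthetic_activations(rng, 200, 8, -2.0)
train_deceptive = synthetic_activations(rng, 200, 8, 2.0)
direction, threshold = fit_probe(train_truthful, train_deceptive)

# Evaluate on fresh samples the probe has not seen.
test_truthful = synthetic_activations(rng, 100, 8, -2.0)
test_deceptive = synthetic_activations(rng, 100, 8, 2.0)
correct = sum(not flags_deception(direction, threshold, a) for a in test_truthful)
correct += sum(flags_deception(direction, threshold, a) for a in test_deceptive)
accuracy = correct / (len(test_truthful) + len(test_deceptive))
print(f"held-out probe accuracy: {accuracy:.2f}")
```

With a real model, the synthetic vectors would be replaced by activations read from a chosen layer, which is why the approach needs white-box access.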
Behavioral tests
Construct a setup in which deception is rewarded during training but can be detected at evaluation. Measure the deception rate.
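As an illustration of the measurement step, the sketch below computes a deception rate from logged evaluation episodes. The `Episode` record, its field names, and the comparison against a control condition where deception is not rewarded are assumptions made for this example. The episode data is invented toy data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    deception_rewarded: bool  # condition: did training reward deception here?
    deceived: bool            # did the eval-time check flag deception?


def deception_rate(episodes):
    """Fraction of episodes in which deception was detected."""
    episodes = list(episodes)
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(e.deceived for e in episodes) / len(episodes)


def rate_by_condition(episodes):
    """Deception rate split by whether deception was rewarded."""
    episodes = list(episodes)
    rates = {}
    for condition in (True, False):
        subset = [e for e in episodes if e.deception_rewarded == condition]
        if subset:
            rates[condition] = deception_rate(subset)
    return rates


# Toy log: four rewarded episodes, four control episodes.
episodes = [
    Episode(True, True), Episode(True, True),
    Episode(True, True), Episode(True, False),
    Episode(False, True), Episode(False, False),
    Episode(False, False), Episode(False, False),
]
rates = rate_by_condition(episodes)
print(rates)
```

Reporting the rate per condition, not only overall, helps separate deception induced by the reward from a baseline rate of flagged outputs.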
Frontier research
Active groups: ARC Evals, Apollo Research, Anthropic. A focus of alignment research from 2024 onward. Not a solved problem.