Types

Explicit lie. Strategic omission. Misleading framing. Sandbagging (pretend inability). Emergent as model capabilities scale.

Advertisement

Interpretability approach

Zou et al 2023: probes on internal activations detect 'model knows X but says not-X.' Requires white-box access.

Advertisement

Behavioral tests

Setup where deception rewarded per training but detected in eval. Measure deception rate.

Frontier research

ARC Evals. Apollo Research. Anthropic. Focus of alignment research 2024+. Not solved problem.