Deception Detection in LLM Outputs

Types

Explicit lies. Strategic omission. Misleading framing. Sandbagging (feigning inability). May emerge as model capabilities scale.

Zou et al. (2023): probes on internal activations can detect cases where the 'model knows X but says not-X.' Requires white-box access to the model.
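
A minimal sketch of the probing idea, not the method from Zou et al.: fit a logistic-regression probe on activation vectors labeled by whether the stated answer matched what the model "knew." Everything here is an assumption for illustration: the activations are synthetic Gaussians with a planted direction, and the names `train_linear_probe`, `probe_flags_mismatch`, and `make_synthetic` are hypothetical. With a real model the vectors would come from a chosen hidden layer, which is why white-box access is needed.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(acts, labels, lr=0.1, steps=200):
    """Fit logistic regression by batch gradient descent.

    acts:   list of activation vectors (lists of floats)
    labels: 1 = stated answer contradicts internal belief, 0 = consistent
    """
    dim = len(acts[0])
    n = len(acts)
    w = [0.0] * dim
    b = 0.0
    for _ in range(steps):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(acts, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_flags_mismatch(x, w, b):
    # True when the probe predicts "says not-X while representing X".
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0.0


def make_synthetic(n, dim, rng):
    # ASSUMPTION: a single planted direction separates the two classes.
    # Real activations need not be linearly separable like this.
    direction = [1.0 / math.sqrt(dim)] * dim
    acts, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        sign = 1.0 if y == 1 else -1.0
        acts.append([rng.gauss(0.0, 1.0) + 3.0 * sign * d for d in direction])
        labels.append(y)
    return acts, labels


rng = random.Random(0)
train_x, train_y = make_synthetic(300, 8, rng)
test_x, test_y = make_synthetic(100, 8, rng)

w, b = train_linear_probe(train_x, train_y)
accuracy = sum(
    probe_flags_mismatch(x, w, b) == bool(y) for x, y in zip(test_x, test_y)
) / len(test_x)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

High accuracy here reflects only the planted synthetic signal. On real activations, held-out accuracy is the quantity to check, and a probe that fits one dataset may not transfer to another.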

Setup in which deception is rewarded during training but detected during evaluation. Measure the deception rate.
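
One way the measurement step might look, as a sketch under stated assumptions: each evaluation episode has already been labeled deceptive or not by some detector, and the rate is the flagged fraction. The `Episode` record and its fields are hypothetical, not taken from any particular eval harness. A Wilson score interval is included because small episode counts make a bare rate hard to interpret.

```python
import math
from dataclasses import dataclass


@dataclass
class Episode:
    # Hypothetical record; field names are illustrative only.
    prompt_id: str
    flagged_deceptive: bool  # output of whatever detector the eval uses


def deception_rate(episodes):
    """Fraction of episodes flagged as deceptive."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(e.flagged_deceptive for e in episodes) / len(episodes)


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k flagged episodes out of n (z=1.96 ~ 95%)."""
    p = k / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Toy data: 3 of 10 episodes flagged.
episodes = [Episode(f"p{i}", i < 3) for i in range(10)]
rate = deception_rate(episodes)
low, high = wilson_interval(sum(e.flagged_deceptive for e in episodes), len(episodes))
print(f"deception rate: {rate:.2f} (95% interval {low:.2f} to {high:.2f})")
```

The rate is only as reliable as the detector producing `flagged_deceptive`; detector false positives and false negatives are not modeled here.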

Groups working on this: ARC Evals, Apollo Research, Anthropic. A focus of alignment research from 2024 onward. Not a solved problem.