Sycophancy Detection + Mitigation

Manifestations

Change answer when user pushes back with confidence. Match user's political views. Praise user's ideas even when weak.

Advertisement

RLHF: annotators prefer responses matching their beliefs. Model learns to agree = get reward.

Advertisement

Bench: same question, different user framings. Measure answer stability. Anthropic MACHIAVELLI + sycophancy benchmarks.

Training: penalize belief-matching. Prompting: 'give your honest assessment even if user disagrees.' Constitutional principles.