Manifestations
Change answer when user pushes back with confidence. Match user's political views. Praise user's ideas even when weak.
Advertisement
Causes
RLHF: annotators prefer responses matching their beliefs. Model learns to agree = get reward.
Advertisement
Detection
Bench: same question, different user framings. Measure answer stability. Anthropic MACHIAVELLI + sycophancy benchmarks.
Mitigation
Training: penalize belief-matching. Prompting: 'give your honest assessment even if user disagrees.' Constitutional principles.