Data pipeline
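Human labelers rank candidate responses to the same prompt, and refusals of harmful queries are ranked above compliant answers. A reward model is trained to reproduce these rankings. PPO then optimizes the policy against the reward model's scores.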
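As a rough illustration of how a reward model can learn from rankings, the sketch below computes a pairwise preference loss of the Bradley-Terry form, -log(sigmoid(r_chosen - r_rejected)). The notes do not say which loss is used, so this form is an assumption (it is a common choice), and the function name `pairwise_preference_loss` is hypothetical.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Assumed Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the labeler-preferred
    response (for a harmful query, the refusal) above the rejected one.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores for one harmful prompt: refusal preferred by labelers.
loss_when_correct = pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.0)
loss_when_wrong = pairwise_preference_loss(reward_chosen=0.0, reward_rejected=2.0)
```

With equal scores the loss is log 2, and it shrinks as the preferred response's score pulls ahead, which is the pressure that pushes refusals of harmful queries toward higher reward.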
Overrefusal
Strong refusal training comes with a trade-off: the model also starts refusing legitimate queries that are merely ambiguous. For example, 'How do I clean my computer?' gets refused as a request for 'malware advice.'
XSTest
Röttger et al. (2024) introduce XSTest, a benchmark for overrefusal. It measures the refusal rate on queries that are legitimate but superficially suspicious.
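A refusal rate of this kind can be sketched as the fraction of responses to legitimate prompts that are classified as refusals. The prefix-matching classifier below is a crude stand-in chosen for brevity, not the benchmark's own procedure; `REFUSAL_MARKERS`, `looks_like_refusal`, and `refusal_rate` are hypothetical names.

```python
# Assumed refusal phrasings; a real evaluation would need a more careful
# classifier (for example human annotation), since prefixes miss many cases.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable", "i won't")


def looks_like_refusal(response: str) -> bool:
    """Heuristic: treat a response as a refusal if it opens with a marker."""
    text = response.strip().lower()
    return any(text.startswith(marker) for marker in REFUSAL_MARKERS)


def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses flagged as refusals."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(looks_like_refusal(r) for r in responses) / len(responses)


# Hypothetical model outputs for two legitimate prompts.
sample_responses = [
    "I can't help with that.",
    "Sure. Power down the machine and blow the dust out with compressed air.",
]
sample_rate = refusal_rate(sample_responses)
```

Because every prompt in the evaluated set is legitimate, any nonzero rate counts as overrefusal, and a lower number is better.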
Balancing
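Include examples that look harmful but are legitimate in the RLHF data, and reward compliance where it is appropriate. This is trickier than training pure refusal, since the preference data has to distinguish surface wording from actual harm.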
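One way to picture the balancing step is as a rule for building preference pairs: the refusal is preferred for harmful prompts, and the helpful answer is preferred for prompts that only look harmful. This is a minimal sketch under that assumption; `PreferencePair`, `make_pair`, and the `is_harmful` label are hypothetical, and the label would in practice come from human annotators.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # response the reward model should score higher
    rejected: str  # response the reward model should score lower


def make_pair(prompt: str, helpful: str, refusal: str, is_harmful: bool) -> PreferencePair:
    """Prefer the refusal for harmful prompts, the helpful answer otherwise."""
    if is_harmful:
        return PreferencePair(prompt, chosen=refusal, rejected=helpful)
    return PreferencePair(prompt, chosen=helpful, rejected=refusal)


# A legitimate prompt that only looks suspicious: compliance is rewarded.
benign_pair = make_pair(
    prompt="How do I clean my computer?",
    helpful="Shut it down, then remove dust with compressed air.",
    refusal="I can't help with that.",
    is_harmful=False,
)
```

Mixing pairs like `benign_pair` into the training set gives the reward model explicit counterexamples to the pattern that refusals always rank higher.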