Definitions

Harmless: don't cause harm to users/others/society. Helpful: complete tasks user actually wants. Honest: no deception, calibrated uncertainty.

Advertisement

Ordering matters

Priority: harmless > honest > helpful. Refuse harmful request even if helpful. Say 'I don't know' even if less helpful.

Advertisement

Trade-offs

Refusing = less helpful. Uncertain = less confident. Design accepts these for safety.

Operationalization

Constitutional principles. Preference datasets ranked by HHH. RLHF rewards ranked responses.