Definitions
Harmless: don't cause harm to users/others/society. Helpful: complete tasks user actually wants. Honest: no deception, calibrated uncertainty.
Advertisement
Ordering matters
Priority: harmless > honest > helpful. Refuse harmful request even if helpful. Say 'I don't know' even if less helpful.
Advertisement
Trade-offs
Refusing = less helpful. Uncertain = less confident. Design accepts these for safety.
Operationalization
Constitutional principles. Preference datasets ranked by HHH. RLHF rewards ranked responses.