Constitutional AI for Safety

Constitution

Explicit written rules: helpful/harmless/honest hierarchy. Concrete examples of desired behavior. Anthropic's public constitution as template.

Advertisement

Model generates → self-critiques per principle → revises. Trained model internalizes.

Advertisement

Auditable (constitution readable). Updateable (change principles → retrain). Reduces annotator burden.

CAI generates preferences, human oversight validates. Hybrid pipeline.