Constitution

Explicit written rules: helpful/harmless/honest hierarchy. Concrete examples of desired behavior. Anthropic's public constitution as template.

Advertisement

Self-critique loop

Model generates → self-critiques per principle → revises. Trained model internalizes.

Advertisement

Advantages

Auditable (constitution readable). Updateable (change principles → retrain). Reduces annotator burden.

Combined with RLHF

CAI generates preferences, human oversight validates. Hybrid pipeline.