Constitution
Explicit written rules: helpful/harmless/honest hierarchy. Concrete examples of desired behavior. Anthropic's public constitution as template.
Advertisement
Self-critique loop
Model generates → self-critiques per principle → revises. Trained model internalizes.
Advertisement
Advantages
Auditable (constitution readable). Updateable (change principles → retrain). Reduces annotator burden.
Combined with RLHF
CAI generates preferences, human oversight validates. Hybrid pipeline.