Output Filtering — Toxicity + Harm Classifiers

Providers

Perspective API (Google). OpenAI Moderation (free). Azure Content Safety. Amazon Bedrock Guardrails. Each defines its own set of categories, so labels and scores are not directly comparable across providers.
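Because each provider has its own categories, a pipeline that uses more than one usually maps them onto a shared internal label set first. The sketch below is one possible way to do that. The internal labels, the `CATEGORY_MAP` table, and `normalize` are hypothetical names for illustration. The provider labels shown are only a subset, written from memory, and should be checked against each provider's current documentation. Scores are assumed to be already rescaled to the range 0 to 1.

```python
from typing import Dict

# Hypothetical mapping from provider labels to internal labels.
# Provider label names are a partial, illustrative subset; verify them
# against the provider documentation before relying on this table.
CATEGORY_MAP: Dict[str, Dict[str, str]] = {
    "perspective": {
        "TOXICITY": "toxicity",
        "INSULT": "harassment",      # judgment call, not an official equivalence
        "IDENTITY_ATTACK": "hate",
        "THREAT": "violence",
    },
    "openai": {
        "harassment": "harassment",
        "hate": "hate",
        "violence": "violence",
        "self-harm": "self_harm",
    },
    "azure": {
        "Hate": "hate",
        "Violence": "violence",
        "SelfHarm": "self_harm",
    },
}


def normalize(provider: str, scores: Dict[str, float]) -> Dict[str, float]:
    """Map provider-specific scores (assumed 0..1) onto internal labels.

    Unmapped provider labels are dropped. If several provider labels map
    to the same internal label, the highest score wins.
    """
    mapping = CATEGORY_MAP[provider]
    result: Dict[str, float] = {}
    for label, score in scores.items():
        internal = mapping.get(label)
        if internal is None:
            continue
        result[internal] = max(result.get(internal, 0.0), score)
    return result
```

Taking the maximum per internal label is a conservative choice; averaging would be an alternative if one provider label is known to be noisy.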

Self-hosted options: Detoxify, Llama Guard, ToxDECT. Deploy locally for lower latency and better privacy.

For streamed output, classify each chunk as it streams and cut the stream when a chunk is classified as toxic. This adds some latency: a partial sentence can't be classified reliably, so buffer the output and classify per sentence.
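The buffer-and-classify-per-sentence approach can be sketched as a generator. This is a minimal illustration, not a production filter: `filter_stream` and the `is_toxic` callback are hypothetical names, the callback stands in for whichever classifier is used, and the sentence splitter is a naive regex that will mis-split abbreviations such as "e.g.".

```python
import re
from typing import Callable, Iterable, Iterator

# Naive sentence boundary: whitespace that follows ., ! or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def filter_stream(
    chunks: Iterable[str],
    is_toxic: Callable[[str], bool],  # hypothetical classifier callback
) -> Iterator[str]:
    """Yield complete sentences from a chunk stream; stop at the first toxic one."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # The last part may be an unfinished sentence, so keep it buffered.
        buffer = parts.pop()
        for sentence in parts:
            if is_toxic(sentence):
                return  # cut the stream
            yield sentence
    # Flush whatever is left once the stream ends.
    if buffer.strip() and not is_toxic(buffer):
        yield buffer


if __name__ == "__main__":
    demo = ["Hello the", "re. You are a ", "jerk. More text."]
    for sentence in filter_stream(demo, lambda s: "jerk" in s):
        print(sentence)
```

The caller sees text one sentence at a time, so the added latency is roughly the time to finish the current sentence plus one classifier call. Whitespace between sentences is dropped by the splitter, so the consumer needs to rejoin sentences with a space.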

Filters tend to over-block medical and legal discussion, as well as language used by marginalized communities. Tune thresholds per domain, and send edge cases to human review.
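One way to combine per-domain thresholds with human review is a two-threshold decision: scores above a block threshold are blocked, scores in a middle band are queued for review, and the rest pass. The sketch below assumes a classifier score between 0 and 1. The names `Thresholds`, `DOMAIN_THRESHOLDS`, and `decide` are hypothetical, and every numeric value is a placeholder that would need tuning on labeled data from the domain in question.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Thresholds:
    review: float  # scores at or above this go to human review
    block: float   # scores at or above this are blocked outright


# Placeholder values for illustration only; tune on real labeled data.
DEFAULT = Thresholds(review=0.5, block=0.8)
DOMAIN_THRESHOLDS: Dict[str, Thresholds] = {
    # Looser thresholds where clinical or legal vocabulary inflates scores.
    "medical": Thresholds(review=0.7, block=0.95),
    "legal": Thresholds(review=0.7, block=0.95),
}


def decide(score: float, domain: str) -> str:
    """Return 'block', 'review', or 'allow' for a score in the given domain."""
    t = DOMAIN_THRESHOLDS.get(domain, DEFAULT)
    if score >= t.block:
        return "block"
    if score >= t.review:
        return "review"
    return "allow"
```

With these placeholder numbers, a score of 0.85 is blocked in an unlisted domain but only queued for review in the medical domain.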