Training data
Positives: known jailbreak prompts, GCG adversarial suffixes, and PAIR-generated attacks. Negatives: a diverse set of legitimate queries. 10k-100k examples is typical.
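As a rough illustration of how such a labeled set might be assembled, the sketch below merges attack and benign prompts, drops near-verbatim duplicates, and shuffles. The names `Example`, `build_dataset`, and `split` are hypothetical, and the whitespace/case normalization used for deduplication is an assumption, not something the source specifies.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    text: str
    label: int  # 1 = attack (jailbreak, GCG suffix, PAIR output), 0 = benign


def build_dataset(positives, negatives, seed=0):
    """Merge attack and benign prompts into one shuffled, deduplicated list."""
    seen = set()
    examples = []
    # Positives are processed first, so a text present in both pools
    # keeps the attack label (an assumed tie-breaking rule).
    for label, texts in ((1, positives), (0, negatives)):
        for text in texts:
            # Assumed normalization: lowercase and collapse whitespace.
            key = " ".join(text.lower().split())
            if not key or key in seen:
                continue  # skip empty strings and duplicates
            seen.add(key)
            examples.append(Example(text=text, label=label))
    random.Random(seed).shuffle(examples)  # seeded for reproducibility
    return examples


def split(examples, val_fraction=0.2):
    """Hold out the first val_fraction of the shuffled list for validation."""
    n_val = int(len(examples) * val_fraction)
    return examples[n_val:], examples[:n_val]
```

In practice the two pools would come from attack corpora and production query logs; the held-out portion is what a decision threshold would later be tuned on.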
Model
A small fine-tuned classifier (DeBERTa, MiniLM, or a distilled Llama). Fast and accurate, with 10-50 ms latency.
Deployment
Runs inline, before the LLM. Rejects or flags any input whose score reaches a threshold. Favor high recall; the threshold tunes precision.
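A minimal sketch of the inline gate, assuming the classifier is exposed as a function returning an attack probability in [0, 1]. All names here (`pick_threshold`, `gate`, `guarded_call`, `score_fn`, `llm_fn`) are hypothetical, and the 0.95 default recall target is an illustrative assumption. The threshold is chosen on held-out data as the highest value that still meets the recall target, which is one simple way to get high recall with tunable precision.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence


def pick_threshold(scores: Sequence[float], labels: Sequence[int],
                   target_recall: float = 0.95) -> float:
    """Return the highest threshold whose attack recall meets target_recall.

    An input counts as flagged when score >= threshold.
    """
    attack_scores = sorted(
        (s for s, y in zip(scores, labels) if y == 1), reverse=True
    )
    if not attack_scores:
        raise ValueError("need at least one attack example")
    # To reach the target we must catch at least k attacks, so the
    # threshold is the k-th highest attack score.
    k = max(1, math.ceil(target_recall * len(attack_scores)))
    return attack_scores[k - 1]


@dataclass
class Verdict:
    allowed: bool
    score: float


def gate(prompt: str, score_fn: Callable[[str], float],
         threshold: float) -> Verdict:
    """Score the prompt and decide whether it may reach the LLM."""
    score = score_fn(prompt)
    return Verdict(allowed=score < threshold, score=score)


def guarded_call(prompt: str, score_fn: Callable[[str], float],
                 llm_fn: Callable[[str], str], threshold: float) -> str:
    """Run the classifier inline; only forward allowed prompts to the LLM."""
    verdict = gate(prompt, score_fn, threshold)
    if not verdict.allowed:
        return "Request blocked by input filter."
    return llm_fn(prompt)
```

Lowering `target_recall` raises the threshold and reduces false positives on benign traffic; a deployment that flags rather than rejects could log the `Verdict` and still forward the prompt.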
Continuous update
Retrain monthly on newly observed attacks; the threat landscape evolves rapidly.