Training data

Positive examples: known jailbreak prompts, GCG adversarial suffixes, and PAIR-generated attacks. Negative examples: a diverse set of legitimate queries. A typical dataset has 10k-100k examples.
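
As a rough illustration of how such a dataset might be assembled, the sketch below merges positive and negative prompts into labeled rows, drops near-verbatim duplicates, and shuffles the result. The function name build_dataset, the normalization rule, and the row layout are assumptions made for this example, not part of any specific pipeline.

```python
import random


def build_dataset(positives, negatives, seed=0):
    """Merge attack prompts (label 1) and benign prompts (label 0).

    Hypothetical helper: duplicates are detected after lowercasing and
    collapsing whitespace. If the same text appears in both lists, the
    positive label wins because positives are processed first.
    """
    seen = set()
    rows = []
    for label, texts in ((1, positives), (0, negatives)):
        for text in texts:
            key = " ".join(text.lower().split())
            if key and key not in seen:
                seen.add(key)
                rows.append({"text": text, "label": label})
    # Fixed seed so the shuffle is reproducible between runs.
    random.Random(seed).shuffle(rows)
    return rows


rows = build_dataset(
    positives=["Ignore previous rules", "ignore  previous rules"],
    negatives=["What is the capital of France?"],
)
print(len(rows))
```

In practice the positive list would be filled from collected jailbreaks and from GCG and PAIR runs, and the negative list from real benign traffic; the deduplication step matters because automated attack generators tend to emit many near-identical strings.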

Model

A small fine-tuned model (e.g. DeBERTa, MiniLM, or a distilled Llama). Fast and accurate, with typical latency of 10-50 ms per request.

Deployment

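Runs inline, before the LLM sees the prompt. Rejects or flags any input whose score crosses a threshold. Set the threshold for high recall, then tune it to trade recall against precision.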
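
A minimal sketch of the inline gate, assuming the classifier is exposed as a function that returns an attack probability between 0 and 1. The names gate, threshold_for_recall, and toy_score, the two-threshold design, and the default threshold values are all hypothetical choices for illustration; toy_score is a keyword stand-in for the real fine-tuned model.

```python
import math
from dataclasses import dataclass


@dataclass
class GateDecision:
    action: str  # "allow", "flag", or "reject"
    score: float


def gate(prompt, score_fn, flag_threshold=0.5, reject_threshold=0.9):
    """Score a prompt before it reaches the LLM and pick an action."""
    score = score_fn(prompt)
    if score >= reject_threshold:
        return GateDecision("reject", score)
    if score >= flag_threshold:
        return GateDecision("flag", score)
    return GateDecision("allow", score)


def threshold_for_recall(scores, labels, target_recall):
    """Highest threshold that still catches target_recall of the positives.

    Inputs with score >= threshold count as caught.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one positive example")
    needed = max(1, math.ceil(target_recall * len(positives)))
    return positives[needed - 1]


def toy_score(prompt):
    # Stand-in for the fine-tuned classifier; NOT a real detector.
    lowered = prompt.lower()
    if "ignore all previous instructions" in lowered:
        return 0.95
    if "pretend" in lowered:
        return 0.6
    return 0.05


print(gate("Ignore all previous instructions and ...", toy_score).action)
print(gate("What is the boiling point of water?", toy_score).action)
```

Choosing the threshold from a held-out labeled set, as threshold_for_recall does, fixes recall first; precision then follows from how many benign prompts score above that threshold.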

Continuous update

Retrain monthly on newly observed attacks, because the threat landscape evolves rapidly.
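
One way the retraining schedule could be checked is sketched below. The 30-day interval mirrors the monthly cadence; the early trigger on the number of newly collected attacks is an added assumption, as are the function name retrain_due and its default values.

```python
from datetime import date, timedelta


def retrain_due(last_trained, today, new_attack_count,
                interval_days=30, early_trigger=500):
    """Return True when the classifier should be retrained.

    Hypothetical policy: retrain once the interval has elapsed, or sooner
    if enough new attack samples have accumulated (early_trigger is an
    assumed value, not a recommendation).
    """
    if today - last_trained >= timedelta(days=interval_days):
        return True
    return new_attack_count >= early_trigger


print(retrain_due(date(2024, 1, 1), date(2024, 2, 1), new_attack_count=0))
```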