Features
Useful signals include instruction-following language (for example 'ignore' or 'do not'), suspicious tokens such as base64 strings or unusual characters, and imperative sentences addressed to the AI.
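As a rough illustration of these signals, the sketch below turns a piece of text into a small feature dictionary. The phrase list, the 24-character base64 cutoff, the AI-targeting phrases, and the function name extract_features are all assumptions made for the example, not part of any particular detector.

```python
import re

# Hypothetical phrase list; a real system would curate or learn these.
INSTRUCTION_PATTERNS = [r"\bignore\b", r"\bdo not\b", r"\bdisregard\b"]

# Runs of 24+ base64-alphabet characters, optionally padded with '='.
# This is a heuristic: very long ordinary tokens would also match.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")

# Hypothetical phrases that address the model directly.
AI_TARGET = re.compile(r"\b(you must|you are now|as an ai)\b", re.IGNORECASE)


def extract_features(text: str) -> dict:
    """Return simple heuristic features for one input string."""
    lowered = text.lower()
    instruction_hits = sum(
        len(re.findall(pattern, lowered)) for pattern in INSTRUCTION_PATTERNS
    )
    # Treat non-ASCII characters as "unusual" for this sketch.
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    return {
        "instruction_hits": instruction_hits,
        "has_base64_run": bool(BASE64_RUN.search(text)),
        "unusual_char_ratio": non_ascii / len(text) if text else 0.0,
        "targets_ai": bool(AI_TARGET.search(text)),
    }
```

Features like these could feed a lightweight model or serve as a cheap pre-filter ahead of a learned classifier.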
Datasets
Public options include the deepset prompt injection dataset and the Lakera Prompt Injection Dataset (5000+ examples), supplemented by collections of synthetic and real jailbreak prompts.
Models
Use a fine-tuned DeBERTa or Llama classifier. Inference is fast (10-30 ms), so the classifier can be deployed inline with each request.
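A minimal sketch of what running a classifier inline might look like: the input is scored before it is forwarded, and the call is timed so latency can be checked against a budget. The names score_inline and dummy_classifier are hypothetical, and the dummy function is only a stand-in for a real fine-tuned model.

```python
import time
from typing import Callable, Tuple


def score_inline(classify: Callable[[str], float], text: str) -> Tuple[float, float]:
    """Return (injection score, elapsed milliseconds) for one request."""
    start = time.perf_counter()
    score = classify(text)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return score, elapsed_ms


def dummy_classifier(text: str) -> float:
    """Stand-in for illustration only; a real deployment would call the model."""
    return 0.9 if "ignore" in text.lower() else 0.1


score, elapsed_ms = score_inline(dummy_classifier, "Ignore all previous rules")
print(f"score={score:.2f}, latency={elapsed_ms:.3f} ms")
```

Passing the classifier in as a callable keeps the timing wrapper independent of which model sits behind it.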
False positive management
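Legitimate users sometimes phrase their requests as instructions, so some false positives are unavoidable. Tune the classifier for high recall and accept medium precision, then escalate flagged inputs to human review rather than blocking them outright.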
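The escalation policy can be sketched as a simple routing function. The threshold value 0.3 and the names Action and route are assumptions for illustration; the only point is that a low threshold favors recall and that flagged inputs go to review instead of being blocked.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    REVIEW = "review"  # escalate to a human; never block outright


# Hypothetical threshold, set low to favor recall over precision.
REVIEW_THRESHOLD = 0.3


def route(score: float, threshold: float = REVIEW_THRESHOLD) -> Action:
    """Map a classifier score in [0, 1] to an action."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    return Action.REVIEW if score >= threshold else Action.ALLOW
```

Raising the threshold reduces the review queue at the cost of missing more injections, so in practice it would be tuned against labeled traffic.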