Input Classifier — Detect Prompt Injection

Features

Instruction-style language ('ignore', 'do not'), suspicious tokens (base64 strings, unusual characters), imperative sentences addressed to the AI.
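
A minimal sketch of how these signals could be turned into numeric features, using only the Python standard library. The phrase lists, the 24-character base64 cutoff, and the function names are illustrative assumptions, not part of any dataset or library named here; a fine-tuned classifier would normally learn such cues from raw text rather than rely on hand-written rules.

```python
import base64
import re

# Hypothetical cue lists; the notes only name 'ignore' and 'do not'.
OVERRIDE_PHRASES = ("ignore", "do not", "disregard")
IMPERATIVE_OPENERS = {"ignore", "disregard", "forget", "reveal", "pretend"}
AI_TARGETS = ("you are", "assistant", "system prompt", "your instructions")

# Runs of 24+ base64-alphabet characters (the cutoff is an assumption).
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")


def looks_like_base64(token: str) -> bool:
    """Return True if the token decodes cleanly as base64."""
    try:
        base64.b64decode(token, validate=True)
        return True
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False


def extract_features(text: str) -> dict:
    """Compute simple injection-related features for one input string."""
    lowered = text.lower()
    # Crude sentence split; the first word of each sentence approximates
    # an imperative check without a part-of-speech tagger.
    sentences = [s.split() for s in re.split(r"[.!?\n]+", lowered)]
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    return {
        # Substring counts, so 'ignored' also counts as 'ignore'.
        "override_phrases": sum(lowered.count(p) for p in OVERRIDE_PHRASES),
        "imperative_sentences": sum(
            1 for words in sentences if words and words[0] in IMPERATIVE_OPENERS
        ),
        "base64_tokens": sum(
            1 for m in BASE64_RUN.findall(text) if looks_like_base64(m)
        ),
        "non_ascii_ratio": non_ascii / max(len(text), 1),
        "targets_ai": any(t in lowered for t in AI_TARGETS),
    }
```

Calling extract_features on each incoming message yields a small dictionary that could feed a baseline model or be logged alongside the classifier score for later error analysis.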

Datasets: deepset prompt injection dataset. Lakera Prompt Injection Dataset (5000+ examples). Synthetic and real jailbreak collections.

Fine-tuned DeBERTa or Llama classifier. Fast (10-30 ms per input). Deploy inline.

Legitimate users sometimes phrase requests as instructions, so false positives are expected. Tune for high recall and accept medium precision. Escalate flagged inputs to human review rather than blocking outright.
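
The policy above can be sketched as a threshold on the classifier score. The threshold value, the route names, and the toy scores below are made-up illustrations, not measurements; in practice the threshold would be chosen on a held-out labeled set.

```python
# Hypothetical threshold: set low on purpose to favor recall over precision.
REVIEW_THRESHOLD = 0.3


def route(score: float, threshold: float = REVIEW_THRESHOLD) -> str:
    """Send suspicious inputs to human review instead of blocking them."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability in [0, 1]")
    return "human_review" if score >= threshold else "allow"


def precision_recall(scores, labels, threshold):
    """Precision and recall of the 'flagged' decision at a given threshold.

    labels use 1 for injection and 0 for benign input.
    """
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Toy values for illustration only.
toy_scores = [0.9, 0.6, 0.4, 0.35, 0.1]
toy_labels = [1, 1, 1, 0, 0]
low = precision_recall(toy_scores, toy_labels, 0.3)
high = precision_recall(toy_scores, toy_labels, 0.5)
```

On these toy values, lowering the threshold from 0.5 to 0.3 catches every injection at the cost of one benign input being flagged. Because flagged inputs go to review rather than being rejected, that extra false positive costs reviewer time instead of blocking a legitimate user.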