Trigger inversion
Optimize an input perturbation that makes the model output a target class with high confidence. If an unusually small perturbation suffices for some class, a backdoor is likely. This is the approach used by Neural Cleanse.
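A minimal sketch of the idea, not Neural Cleanse's actual implementation. It assumes a toy linear softmax classifier (logits = X @ W + b) so the gradients can be written by hand; a real network would need an autograd framework. The names invert_trigger, per_class_mask_norms, anomaly_index and the parameters lam, lr, steps are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def invert_trigger(W, b, X, target, lam=0.01, lr=0.1, steps=300):
    """Search for a mask m and pattern p so that (1 - m) * x + m * p
    is classified as `target`, while penalizing the L1 size of m.
    Assumption: the model is linear, logits = X @ W + b."""
    n, d = X.shape
    m_raw = np.full(d, -2.0)          # mask = sigmoid(m_raw), starts small
    p = np.zeros(d)
    onehot = np.zeros(W.shape[1])
    onehot[target] = 1.0
    for _ in range(steps):
        m = 1.0 / (1.0 + np.exp(-m_raw))
        Xt = (1.0 - m) * X + m * p                 # apply candidate trigger
        probs = softmax(Xt @ W + b)
        g_logits = (probs - onehot) / n            # grad of mean cross-entropy
        g_Xt = g_logits @ W.T
        g_m = (g_Xt * (p - X)).sum(axis=0) + lam   # + grad of lam * sum(m)
        g_p = (g_Xt * m).sum(axis=0)
        m_raw -= lr * g_m * m * (1.0 - m)          # chain rule through sigmoid
        p -= lr * g_p
    return 1.0 / (1.0 + np.exp(-m_raw)), p

def per_class_mask_norms(W, b, X, **kw):
    """L1 norm of the inverted mask for every candidate target class."""
    return np.array([invert_trigger(W, b, X, t, **kw)[0].sum()
                     for t in range(W.shape[1])])

def anomaly_index(l1_norms):
    """Median-absolute-deviation score; positive means 'smaller than typical'."""
    norms = np.asarray(l1_norms, dtype=float)
    med = np.median(norms)
    mad = 1.4826 * np.median(np.abs(norms - med))
    return (med - norms) / max(mad, 1e-12)
```

A class whose mask norm is a strong low outlier (for example an anomaly index above roughly 2, a threshold that should be treated as a tunable assumption) is the candidate backdoor target.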
Activation clustering
Compare activations on clean vs. suspicious data. A backdoor causes distinctive activation patterns, so poisoned samples tend to form their own cluster. Chen et al. 2019.
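A rough sketch of the clustering step, assuming the hidden-layer activations for one class have already been extracted into an array. It uses a plain 2-means in NumPy and skips the dimensionality reduction that the published method applies first. The function names and the max_fraction cutoff are hypothetical choices, not values from the paper.

```python
import numpy as np

def two_means(acts, iters=20):
    """Deterministic 2-means over activation vectors (rows of `acts`)."""
    acts = np.asarray(acts, dtype=float)
    # Seed with the point farthest from the mean, then the point farthest from it.
    c0 = acts[np.argmax(np.linalg.norm(acts - acts.mean(axis=0), axis=1))]
    c1 = acts[np.argmax(np.linalg.norm(acts - c0, axis=1))]
    centers = np.stack([c0, c1])
    labels = np.zeros(len(acts), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(acts[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = acts[labels == k].mean(axis=0)
    return labels

def minority_cluster(acts):
    """Return (fraction, indices) of the smaller of the two clusters."""
    labels = two_means(acts)
    counts = np.bincount(labels, minlength=2)
    small = int(counts.argmin())
    return counts[small] / counts.sum(), np.flatnonzero(labels == small)

def looks_poisoned(acts, max_fraction=0.35):
    """Heuristic flag: a small, separate cluster suggests poisoned samples.
    max_fraction is an assumed cutoff and needs tuning per dataset."""
    fraction, _ = minority_cluster(acts)
    return 0 < fraction < max_fraction
```

The indices returned by minority_cluster are the samples worth inspecting by hand; cluster size alone is only a weak signal.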
Behavioral tests
Test the model on rare trigger candidates: random tokens, foreign characters, emojis. Anomalous responses are grounds for suspicion.
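One way such a scan might look, assuming the model is exposed as a callable that maps a text to a label. The name scan_triggers, the flip_threshold value, and the choice to append the candidate to the end of each prompt are all illustrative assumptions.

```python
def scan_triggers(model, prompts, candidates, flip_threshold=0.8):
    """Append each candidate to every prompt and measure how often the
    model's output changes relative to the unmodified prompt.
    Returns {candidate: flip_rate} for candidates at or above the threshold."""
    baseline = [model(p) for p in prompts]
    flagged = {}
    for trig in candidates:
        outputs = [model(f"{p} {trig}") for p in prompts]
        flips = sum(o != b for o, b in zip(outputs, baseline))
        rate = flips / len(prompts)
        if rate >= flip_threshold:
            flagged[trig] = rate
    return flagged
```

A candidate that flips nearly every prompt to the same output is far more suspicious than one that causes scattered changes, so in practice the flipped outputs are worth inspecting as well as the rate.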
Autoencoder-based detection (Meta)
Meta 2024: an autoencoder trained over activations flags out-of-distribution behavior, which may indicate that a backdoor is being triggered.
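To illustrate the general idea only: the sketch below swaps in a linear autoencoder (equivalent to PCA, fitted with an SVD) and scores activations by reconstruction error. It is not the method described above, whose architecture and training details are not given here. The class name, the fit_threshold and flag helpers, the bottleneck size k, and the quantile q are assumptions.

```python
import numpy as np

class LinearAutoencoder:
    """Linear encoder/decoder with tied weights, fitted on clean activations."""

    def fit(self, clean_acts, k):
        X = np.asarray(clean_acts, dtype=float)
        self.mean = X.mean(axis=0)
        # Rows of vt are principal directions; keep the top k as the bottleneck.
        _, _, vt = np.linalg.svd(X - self.mean, full_matrices=False)
        self.components = vt[:k]
        return self

    def error(self, acts):
        """Per-sample reconstruction error (L2 norm of the residual)."""
        Z = np.asarray(acts, dtype=float) - self.mean
        recon = Z @ self.components.T @ self.components
        return np.linalg.norm(Z - recon, axis=1)

def fit_threshold(ae, clean_acts, q=0.99):
    """Use a high quantile of clean reconstruction error as the cutoff."""
    return np.quantile(ae.error(clean_acts), q)

def flag(ae, acts, threshold):
    """Boolean array: True where activations look out-of-distribution."""
    return ae.error(acts) > threshold
```

Activations the autoencoder reconstructs poorly lie off the clean-data manifold. That is consistent with a triggered backdoor but also with ordinary distribution shift, so a flag is a prompt for inspection rather than proof.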