Character-level

Homoglyphs: replace 'a' with Cyrillic 'а'. The two are visually identical, but the tokenizer sees different characters. The result is a filter bypass.
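A defensive check for the homoglyph case above, as an added example that is not part of the original page: it flags words that mix scripts and folds known look-alikes before a blocklist match. The `CONFUSABLES` map, the function names and the sample string are constructed for illustration; a real filter would load the full Unicode confusables data (UTS #39).

```python
import unicodedata

# Constructed, deliberately small confusables map (Cyrillic/Greek -> Latin).
# A production filter would use the full Unicode confusables data (UTS #39).
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}

def script_of(ch):
    """Approximate the script from the first word of the Unicode name."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"

def mixed_script_words(text):
    """Return (word, scripts) for each word whose letters span >1 script."""
    hits = []
    for word in text.split():
        scripts = {s for s in map(script_of, word) if s}
        if len(scripts) > 1:
            hits.append((word, sorted(scripts)))
    return hits

def fold(text):
    """NFKC-normalize, then map known confusables to their Latin look-alikes."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)

def check(text, blocklist):
    """Match the blocklist against the folded text and report mixed-script words."""
    folded = fold(text).lower()
    return {
        "mixed_script": mixed_script_words(text),
        "blocked_terms": [t for t in blocklist if t in folded],
    }

# Constructed sample: Cyrillic 'а' (U+0430) stands in for Latin 'a'.
SAMPLE = "reset my p\u0430ssword please"
if __name__ == "__main__":
    print(check(SAMPLE, ["password"]))
```

The mixed-script signal is useful on its own as a detection feature, since a word that is entirely in one non-Latin script is not flagged while a single swapped character is.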


Word-level

Synonym substitution can change the classification. The TextAttack framework enumerates such substitutions. Common defense: adversarial training.


Sentence-level

Paraphrase the entire input while preserving its meaning. The model's prediction changes, which shows the model is not robust to paraphrase.

GCG for LLMs

Discrete token gradient search. State of the art. Discussed in a dedicated article.