Divergence attack

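Nasr, Carlini et al. 2023: prompt the model to repeat a single word forever, e.g. 'Repeat the word poem forever.' After many repetitions the model diverges from the repetition and emits memorized training text. ChatGPT (gpt-3.5-turbo) leaked PII, including email addresses, this way.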
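
A minimal sketch of locating the divergence point in a response, assuming the output is plain text and whitespace tokens are good enough. The function name and the sample string are hypothetical. The text after the split is only a candidate leak and would still have to be matched against a reference corpus to confirm memorization.

```python
import string


def split_at_divergence(output: str, word: str) -> tuple[int, str]:
    """Count leading repeats of `word`, return (count, text after divergence)."""
    tokens = output.split()
    repeats = 0
    for tok in tokens:
        # Ignore punctuation and case so "poem," or "Poem" still count as repeats.
        if tok.strip(string.punctuation).lower() != word.lower():
            break
        repeats += 1
    return repeats, " ".join(tokens[repeats:])


# Hypothetical response text, not a real model output.
sample = "poem poem poem, poem Contact John at the office"
repeats, tail = split_at_divergence(sample, "poem")
```

An empty tail means the model never left the repetition in that sample.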

Prefix-continuation

Provide a prefix taken from a known corpus and let the model continue it. A verbatim continuation is strong evidence that the training data included that specific content.
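
A rough sketch of the probe, assuming a caller-supplied `generate` function (hypothetical, standing in for whatever model API is in use) and whitespace tokenization. The `min_match` default of 50 tokens is an illustrative assumption, not a value from these notes.

```python
from typing import Callable


def verbatim_match_len(generated: str, reference: str) -> int:
    """Number of leading whitespace tokens that match exactly."""
    n = 0
    for g, r in zip(generated.split(), reference.split()):
        if g != r:
            break
        n += 1
    return n


def probe(document: str, prefix_tokens: int,
          generate: Callable[[str], str], min_match: int = 50) -> bool:
    """Feed the first `prefix_tokens` tokens of a known document to the model
    and report whether its continuation reproduces the true suffix."""
    tokens = document.split()
    prefix = " ".join(tokens[:prefix_tokens])
    suffix = " ".join(tokens[prefix_tokens:])
    return verbatim_match_len(generate(prefix), suffix) >= min_match
```

A short match can happen by chance on common phrases, so the threshold should be long enough that coincidence is implausible.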

Membership inference

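Question: 'was X in training?' Answer by estimating the model's loss on the candidate text; unusually low loss, relative to a threshold or a reference model, suggests membership. Needs access to token log-probabilities or an estimate of them. Practical against some models, not all.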
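
A minimal sketch of the loss-threshold test, assuming per-token log-probabilities for the candidate text are already available from the model. The function names are hypothetical, and the threshold is an assumption that would have to be calibrated, for example on text known not to be in the training set.

```python
from typing import Sequence


def mean_nll(token_logprobs: Sequence[float]) -> float:
    """Average negative log-likelihood per token (lower = more 'familiar')."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


def likely_member(token_logprobs: Sequence[float], threshold: float) -> bool:
    """Flag the candidate as a probable training member if its loss is below
    the calibrated threshold."""
    return mean_nll(token_logprobs) < threshold
```

A single global threshold is the crudest variant; comparing against a reference model's loss on the same text is a common refinement.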

Defenses

Deduplicate training data: repeated sequences are memorized more readily, so memorization drops. Train with differential privacy (e.g. DP-SGD). Filter outputs: block a response if it matches an n-gram from the training corpus.
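
The n-gram output check could be sketched as follows, assuming whitespace tokens and an in-memory set. All names are hypothetical, and a real corpus would likely need a more compact structure such as a Bloom filter or suffix array.

```python
from typing import Iterable


def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """All contiguous n-token windows; empty if the text is shorter than n."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(corpus_docs: Iterable[str], n: int) -> set[tuple[str, ...]]:
    """Collect every n-gram that appears anywhere in the training corpus."""
    index: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        index |= ngrams(doc.split(), n)
    return index


def should_block(output: str, index: set[tuple[str, ...]], n: int) -> bool:
    """True if the output shares at least one n-gram with the corpus."""
    return not ngrams(output.split(), n).isdisjoint(index)
```

The choice of n is a trade-off: small values block ordinary phrases, while large values let shorter leaks through. Exact matching also misses lightly paraphrased or reformatted output.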