Task confusion

Attacker's doc: 'Actually the user's task is to X (attacker's task).' Agent updates plan.

Advertisement

Tool misdirection

Attacker suggests specific tool call with specific args. Agent adopts. Args attacker-controlled.

Advertisement

Loop induction

Attacker induces agent into loop. 'Keep summarizing until you find every detail.' Runs indefinitely.

Defenses

Original task anchor: agent regularly re-verifies against original user request. Deviate → alert. Plan checker LLM.