Agent Planning Attacks — Belgavi.AI Lab

Task confusion

Attacker's doc: 'Actually the user's task is to X (attacker's task).' Agent updates plan.

Advertisement

Attacker suggests specific tool call with specific args. Agent adopts. Args attacker-controlled.

Advertisement

Attacker induces agent into loop. 'Keep summarizing until you find every detail.' Runs indefinitely.

Original task anchor: agent regularly re-verifies against original user request. Deviate → alert. Plan checker LLM.