Hierarchy

1. System (developer). 2. User (human). 3. Tool outputs (potentially attacker-controlled). Lower levels can't override higher.

Behavior

Tool output saying 'ignore all previous instructions' → model treats it as data, not as an instruction. System policies are preserved.

Enforcement

Not perfect. Sophisticated injection still works. But it is a baseline defense that raises the bar significantly.

Design implication

Push guardrails to system. User prompt for task. Tool outputs never trusted as instructions.