Hierarchy
1. System (developer). 2. User (human). 3. Tool outputs (potentially attacker-controlled). Lower levels can't override higher.
Advertisement
Behavior
Tool output saying 'ignore all previous instructions' → model treats as data, not instruction. System policies preserved.
Advertisement
Enforcement
Not perfect. Sophisticated injection still works. But baseline defense that lifts the bar significantly.
Design implication
Push guardrails to system. User prompt for task. Tool outputs never trusted as instructions.