Guardrails — checks on agent input or output — protect users from harmful content and protect the agent from prompt injection. Implementing them naively makes the agent feel restrictive and slow; doing it right means most checks fire on edge cases without affecting the common path.
Layer 1: input validation
Reject obvious bad inputs: known prompt-injection patterns and off-topic requests. Check the token count against limits. Strip control characters. Cheap and fast.
Layer 2: input classification
Run a small classifier on the input: is this in scope? Is this a sensitive topic requiring care? Does it contain PII that needs redaction? Candidates are Llama Guard, Granite Guardian, or a custom fine-tune. Adds ~50ms; worth it for sensitive domains.
Layer 3: output classification
Run the output through a safety classifier before returning it. It catches hallucinated facts that contradict known truths, leakage of the system prompt, and content policy violations. Run it concurrently with other post-processing where possible to avoid the latency hit; if the classifier flags the output, reject it and re-generate.
Layer 4: action verification
Before executing an irreversible tool call (send email, charge card, delete file), verify it with a rules engine: does the action fit the user's authorized scope? Is the value reasonable? This is the last line of defense before damage is done.
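An added example of such a gate (not code from the original page): the verification runs only for irreversible tools, checks the caller's scope rather than anything the model asserts, and rejects implausible values before the tool executes. Tool names, scopes, the recipient allow-list and the charge ceiling are constructed for illustration.

```python
"""Layer 4 action verification: gate irreversible tool calls on the user's
authorized scope and on per-action value limits. Example code; the policy
tables below are constructed, not taken from any real deployment."""
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Any

# Constructed policy: tools that need verification, and for each tool with a
# monetary argument, the argument name and its ceiling (in cents).
IRREVERSIBLE_TOOLS = {"send_email", "charge_card", "delete_file"}
VALUE_LIMITS = {"charge_card": ("amount_cents", 50_000)}  # $500 ceiling


@dataclass
class User:
    user_id: str
    scopes: set[str]                               # tools this user may invoke
    allowed_recipients: set[str] = field(default_factory=set)


@dataclass
class Verdict:
    allowed: bool
    reason: str  # every denial is a guardrail fire worth logging and aggregating


def verify_action(user: User, tool: str, args: dict[str, Any]) -> Verdict:
    # Reversible tools skip the gate, so the common path pays nothing here.
    if tool not in IRREVERSIBLE_TOOLS:
        return Verdict(True, "reversible tool, no verification needed")
    # Scope check: authorization comes from the user's identity, never from
    # the model's output, so injected instructions cannot widen it.
    if tool not in user.scopes:
        return Verdict(False, f"{tool} is outside the user's authorized scope")
    # Value check: the argument must be a positive integer below the ceiling.
    # bool is excluded explicitly because isinstance(True, int) is True.
    if tool in VALUE_LIMITS:
        arg_name, ceiling = VALUE_LIMITS[tool]
        value = args.get(arg_name)
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            return Verdict(False, f"{arg_name} must be a positive integer")
        if value > ceiling:
            return Verdict(False, f"{arg_name}={value} exceeds ceiling {ceiling}")
    # Target check: mail may only go to recipients already tied to this user.
    if tool == "send_email":
        rcpt = args.get("to", "")
        if rcpt not in user.allowed_recipients:
            return Verdict(False, f"recipient {rcpt!r} is not on the allow-list")
    return Verdict(True, "passed action verification")
```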
Logging and improvement
Every guardrail fire is a signal. Aggregate the fires: which guardrails fire most? Are they catching real issues or producing false positives? Tune the thresholds accordingly, and feed actual incidents back as new classifier training data.