Guardrails (checks on agent input or output) protect users from harmful content and protect the agent from prompt injection. Implemented naively, they make the agent feel restrictive and slow; done well, most checks fire only on edge cases and leave the common path untouched.

Layer 1: input validation

Reject obviously unwanted input: known prompt-injection patterns and clearly off-topic requests. Check the token count against limits. Strip control characters. These checks are cheap and fast.
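
The checks in this layer can be sketched in a few lines of Python. This is a minimal illustration, not a complete filter: the `MAX_TOKENS` limit, the two regex patterns, and the four-characters-per-token estimate are all assumptions chosen for the example, and off-topic detection is left to the classifier layer.

```python
import re
import unicodedata

MAX_TOKENS = 4000  # hypothetical limit; set it from your model's context budget

# Hypothetical deny-list for illustration; a real one needs ongoing maintenance.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your |the )?system prompt", re.IGNORECASE),
]


def strip_control_chars(text: str) -> str:
    """Drop Unicode control characters (category Cc) except newline and tab."""
    return "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )


def estimate_tokens(text: str) -> int:
    """Rough estimate (about 4 characters per token); swap in a real tokenizer."""
    return len(text) // 4 + 1


def validate_input(text: str) -> tuple[bool, str]:
    """Return (True, cleaned_text) on success or (False, reason) on rejection."""
    cleaned = strip_control_chars(text)
    if estimate_tokens(cleaned) > MAX_TOKENS:
        return False, "input too long"
    for pattern in INJECTION_PATTERNS:
        if pattern.search(cleaned):
            return False, "matched injection pattern"
    return True, cleaned
```

Pattern matching like this only catches the crudest attempts, which is why it sits in front of the classifier rather than replacing it.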

Layer 2: input classification

Run a small classifier on the input: is it in scope? Is it a sensitive topic requiring care? Does it contain PII that needs redaction? Options include Llama Guard, Granite Guardian, or a custom fine-tune. This adds ~50ms, which is worth it for sensitive domains.
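
One way to wire the classifier's answers into a routing decision is sketched below. The `InputLabels` fields, the `Route` values, and the `Classifier` interface are hypothetical names invented for this example; `keyword_stub` is a stand-in for a real model call and does not reflect how Llama Guard or Granite Guardian are actually invoked.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Route(Enum):
    ALLOW = "allow"
    HANDLE_WITH_CARE = "handle_with_care"
    REFUSE = "refuse"


@dataclass
class InputLabels:
    in_scope: bool
    sensitive: bool
    contains_pii: bool


# Hypothetical interface: any function that maps text to InputLabels.
Classifier = Callable[[str], InputLabels]


def route_input(text: str, classify: Classifier) -> tuple[Route, bool]:
    """Return (route, needs_redaction) based on the classifier's labels."""
    labels = classify(text)
    if not labels.in_scope:
        return Route.REFUSE, False
    route = Route.HANDLE_WITH_CARE if labels.sensitive else Route.ALLOW
    return route, labels.contains_pii


def keyword_stub(text: str) -> InputLabels:
    """Stand-in for a real classifier; keyword matching for illustration only."""
    lowered = text.lower()
    return InputLabels(
        in_scope="weather" not in lowered,
        sensitive="diagnosis" in lowered,
        contains_pii="@" in text,
    )
```

Keeping the classifier behind a plain callable makes it easy to swap the stub for a hosted model or a fine-tune without touching the routing logic.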

Layer 3: output classification

Run the output through a safety classifier before returning it. It catches hallucinated facts that contradict known truths, leakage of the system prompt, and content policy violations. Run the check asynchronously where possible to avoid a latency hit; if the classifier flags the output, reject it and regenerate.
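
The reject-and-regenerate loop might look like the sketch below. `generate` and `is_flagged` are hypothetical callables standing in for the model call and the safety classifier, and the retry budget of three is an arbitrary choice for illustration.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # hypothetical retry budget


def generate_checked(
    prompt: str,
    generate: Callable[[str], str],
    is_flagged: Callable[[str], bool],
    max_attempts: int = MAX_ATTEMPTS,
) -> Optional[str]:
    """Return the first draft the classifier accepts, or None if all are flagged."""
    for _ in range(max_attempts):
        draft = generate(prompt)
        if not is_flagged(draft):
            return draft
    return None
```

When the function returns `None`, the caller would typically fall back to a fixed refusal message instead of retrying indefinitely.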

Layer 4: action verification

Before executing an irreversible tool call (send email, charge card, delete file), verify it with a rules engine: does it fit the user's authorized scope? Is the value reasonable? This is the last line of defense before damage.
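
A rules check of this kind can be very small. The sketch below assumes a hypothetical `UserScope` record and made-up tool names (`charge_card`, `send_email`); a real rules engine would load its policies from configuration rather than hard-code them.

```python
from dataclasses import dataclass, field


@dataclass
class UserScope:
    """Hypothetical record of what this user has authorized the agent to do."""
    allowed_tools: set[str]
    max_charge_cents: int = 0
    allowed_recipients: set[str] = field(default_factory=set)


def verify_action(tool: str, args: dict, scope: UserScope) -> tuple[bool, str]:
    """Return (True, "ok") if the call is within scope, else (False, reason)."""
    if tool not in scope.allowed_tools:
        return False, f"tool '{tool}' not authorized"
    if tool == "charge_card":
        amount = args.get("amount_cents", 0)
        # Reasonableness check: positive and within the user's limit.
        if not 0 < amount <= scope.max_charge_cents:
            return False, "charge amount outside allowed range"
    if tool == "send_email":
        if args.get("to") not in scope.allowed_recipients:
            return False, "recipient not in allowed list"
    return True, "ok"
```

The check is deliberately deterministic: unlike the classifier layers, it should give the same answer every time for the same call.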

Logging and improvement

Every guardrail fire is a signal. Aggregate: which guardrails fire most? Are they catching real issues or false positives? Tune thresholds. Feed actual incidents back as new classifier training data.
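
Aggregating fire data can start as a simple count per guardrail plus a false-positive rate. The sketch below assumes a hypothetical `GuardrailEvent` record whose `true_positive` field is filled in by a review step that is not shown here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class GuardrailEvent:
    guardrail: str       # e.g. "input_validation"
    true_positive: bool  # assumed to be set after human review


def summarize(events: list[GuardrailEvent]) -> dict[str, dict[str, float]]:
    """Per guardrail: fire count and false-positive rate, most frequent first."""
    fires = Counter(e.guardrail for e in events)
    false_pos = Counter(e.guardrail for e in events if not e.true_positive)
    return {
        name: {"fires": count, "false_positive_rate": false_pos[name] / count}
        for name, count in fires.most_common()
    }
```

A guardrail with many fires and a high false-positive rate is a candidate for a looser threshold; one with few fires but real incidents is a source of new training examples.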

Four layers: input validation, input classification, output classification, action verification. Keep each one cheap and fast. Tune based on fire data.