Certificate rotation is the most likely source of mTLS outages. With manual rotation, renewals get forgotten; automated rotation has its own failure modes (issuer down, clock skew, race conditions). The patterns that work are simple but specific.

Short-lived is the goal

Aim for certificate lifetimes of hours to days, not years. A short lifetime limits the blast radius if a cert is leaked. It also makes automation mandatory: rotation can't be a quarterly calendar event.

SPIFFE/SPIRE pattern

The identity provider (SPIRE Server) issues short-lived SVIDs (1h is typical). SPIRE Agents fetch SVIDs on behalf of workloads, and workloads receive them over the Workload API socket. Re-fetch before expiry is automatic, and a failure is isolated to one workload, not the whole mesh.
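
The "re-fetch before expiry" decision can be sketched as a small timing function. This is an illustration only: the names next_refresh and should_refresh are hypothetical, and the half-lifetime default is an assumption for the example, not a statement about how SPIRE schedules rotation internally.

```python
from datetime import datetime, timedelta, timezone


def next_refresh(not_before: datetime, not_after: datetime,
                 fraction: float = 0.5) -> datetime:
    """Return the time to re-fetch: a fixed fraction into the cert lifetime."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    if not_after <= not_before:
        raise ValueError("not_after must be later than not_before")
    return not_before + (not_after - not_before) * fraction


def should_refresh(now: datetime, not_before: datetime, not_after: datetime,
                   fraction: float = 0.5) -> bool:
    """True once the refresh point has been reached."""
    return now >= next_refresh(not_before, not_after, fraction)


if __name__ == "__main__":
    issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    expires = issued + timedelta(hours=1)  # 1h lifetime, as in the text
    print(next_refresh(issued, expires).isoformat())
```

Refreshing well before expiry leaves the remaining lifetime as a retry budget if the issuer is briefly unavailable.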

cert-manager + Issuer

On Kubernetes, cert-manager watches Certificate resources and renews them through the configured Issuer (CA, Vault, ACME). Certs land in Kubernetes Secrets. To pick up a rotated cert, either do a rolling restart on rotation or have the app reload the cert from disk when a file watch fires.
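
A minimal sketch of the reload-from-disk option, using polling instead of a platform-specific file-watch API. The class name CertReloader and its reload callback are hypothetical; the paths are whatever your Secret is mounted at.

```python
import os
from typing import Callable, Tuple


class CertReloader:
    """Calls reload(cert_path, key_path) when either file changes on disk."""

    def __init__(self, cert_path: str, key_path: str,
                 reload: Callable[[str, str], None]) -> None:
        self.cert_path = cert_path
        self.key_path = key_path
        self.reload = reload
        self._seen = self._fingerprint()

    def _fingerprint(self) -> Tuple[int, int, int, int]:
        # os.stat follows symlinks, so a swapped symlink target is detected too.
        c = os.stat(self.cert_path)
        k = os.stat(self.key_path)
        return (c.st_mtime_ns, c.st_size, k.st_mtime_ns, k.st_size)

    def poll(self) -> bool:
        """Check once; reload and return True if the files changed."""
        current = self._fingerprint()
        if current == self._seen:
            return False
        self.reload(self.cert_path, self.key_path)
        # Record the new state only after reload succeeded, so a failed
        # reload (for example, a half-written key) is retried on the next poll.
        self._seen = current
        return True
```

Call poll() from a timer or background thread. In a Python server, reload could be the load_cert_chain method of an existing ssl.SSLContext; confirm in your own runtime that connections opened after the call present the new certificate.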

Clock skew is the silent killer

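A cert whose valid_from (notBefore in X.509) is in the future on the verifier's clock is rejected until that clock catches up. A cert whose valid_to (notAfter) is in the past on the verifier's clock is rejected as expired, even if the issuer's clock says it is still good. Budget 1-5 minutes of clock skew tolerance (issuers commonly backdate the start of validity for this reason) and make sure NTP is healthy across all nodes.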
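
One way to buy that tolerance is on the issuer side, by starting the validity window slightly in the past. The sketch below assumes a 5-minute backdate; validity_window and accepts are hypothetical names, and accepts only models the standard window check a verifier performs against its own clock.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def validity_window(issued_at: datetime, ttl: timedelta,
                    backdate: timedelta = timedelta(minutes=5)
                    ) -> Tuple[datetime, datetime]:
    """Issuer side: start validity in the past to absorb slow verifier clocks."""
    return issued_at - backdate, issued_at + ttl


def accepts(verifier_now: datetime, not_before: datetime,
            not_after: datetime) -> bool:
    """Verifier side: the plain window check, using the verifier's own clock."""
    return not_before <= verifier_now <= not_after


if __name__ == "__main__":
    issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    slow_clock = issued - timedelta(minutes=2)  # verifier runs 2 minutes behind
    nb, na = validity_window(issued, timedelta(hours=1))
    print(accepts(slow_clock, nb, na))
```

Backdating only helps verifiers whose clocks run slow. A verifier whose clock runs fast sees the cert expire early, which is one more reason to renew well before valid_to.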

Verification at deploy time

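Pre-deploy check: 'all my services have a cert valid for at least 24h'. Alarm when less than 24h of validity remains, page at 12h. For certs that live only hours, scale these thresholds to a fraction of the lifetime. Automate the check; if you wait for the failure, it's already too late.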
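
The check can be sketched as a pure function over expiry times. The 24h and 12h thresholds come from the text; classify, predeploy_check, and the service names are hypothetical, and collecting each service's expiry (from the Secret, the Workload API, or a TLS handshake) is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict


def classify(not_after: datetime, now: datetime,
             alarm: timedelta = timedelta(hours=24),
             page: timedelta = timedelta(hours=12)) -> str:
    """Map remaining validity to 'ok', 'alarm', or 'page'."""
    remaining = not_after - now
    if remaining < page:
        return "page"   # also covers certs that have already expired
    if remaining < alarm:
        return "alarm"
    return "ok"


def predeploy_check(expiries: Dict[str, datetime],
                    now: datetime) -> Dict[str, str]:
    """Return only the services that fail the check, with their severity."""
    levels = {name: classify(exp, now) for name, exp in expiries.items()}
    return {name: level for name, level in levels.items() if level != "ok"}


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    failing = predeploy_check({
        "payments": now + timedelta(hours=72),  # hypothetical services
        "orders": now + timedelta(hours=6),
    }, now)
    print(failing)
    raise SystemExit(1 if failing else 0)  # non-zero exit blocks the deploy
```

Running this as a pipeline step with a non-zero exit on any failure turns the rule into a gate rather than a dashboard.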

In short: short-lived certs, SPIRE or cert-manager, clock skew tolerance, and a pre-deploy validity check. Don't rely on calendar reminders.