LLM Security & Guardrails

Deep technical articles on this topic.

154 Articles
154 Topics covered
Articles in this category

All 154 articles, sorted alphabetically

ARTICLE · 01

Activation Engineering

Add vectors to model activations. Change behavior without retraining.

Read article
ARTICLE · 02

Adversarial Examples for LLMs

Text perturbations that change the model's answer. Text version of image adversarials.

Read article
ARTICLE · 03

Adversarial ML

Foundational papers + concepts. Understand history to understand present.

Read article
ARTICLE · 04

ML-Based Adversarial Prompt Detection

Beyond signatures: neural classifiers for injection detection.

Read article
ARTICLE · 05

Adversarial Robustness Certification

Formal proofs of model robustness. Emerging in safety-critical domains.

Read article
ARTICLE · 06

Adversarial Training for LLM Safety

Train model on known attacks. Reduces vulnerability. Anthropic + OpenAI standard practice.

Read article
ARTICLE · 07

Age Verification for LLM Products

Regulator focus. Kids protection. UK OSA, US COPPA.

Read article
ARTICLE · 08

Confused Deputy

Agent has legitimate access. Attacker tricks agent into using it for attacker's benefit.

Read article
ARTICLE · 09

Agent Browser Security

Agent browsing web. Sandbox browser. Inject-safe DOM interaction.

Read article
ARTICLE · 10

Capability Tokens

Not all-or-nothing. Each capability separately gated. Least privilege for agents.

Read article
ARTICLE · 11

Agent Code Execution Security

Agent generates + runs code. Isolate. Never trust output.

Read article
ARTICLE · 12

Agent DoS

Attacker tricks agent into an expensive loop. Racks up API bills.

Read article
ARTICLE · 13

Agent Kill Switch

Ability to stop all agents immediately. Business continuity requirement.

Read article
ARTICLE · 14

Agent Memory Security

Persistent agent memory attackable. Poisoning + exfil + privacy.

Read article
ARTICLE · 15

Agent Observability

Trace every step: LLM calls, tool calls, retrievals. Debug + audit.

Read article
ARTICLE · 16

Agent Permission Prompt Patterns

How agents request permissions at runtime. Design + UX.

Read article
ARTICLE · 17

Agent Planning Attacks

Attacker manipulates agent's plan. Attacks against ReAct + planner architectures.

Read article
ARTICLE · 18

Reversibility by Design

Design agents so actions can be undone. Reduces harm from mistakes + attacks.

Read article
ARTICLE · 19

Agent Sandboxing

Isolate agent runtime. Contain tool exploits. Docker + gVisor + Firecracker options.

Read article
ARTICLE · 20

Agent SSRF

Agent tool fetches URL from LLM output. Attacker points to internal endpoints.

Read article
ARTICLE · 21

Tool Bombs

Attacker triggers cascading expensive tool calls. Cost DoS.

Read article
ARTICLE · 22

Agents + Authentication Protocols

OAuth, DPoP, JWKS for AI agents. Emerging patterns.

Read article
ARTICLE · 23

AIBOM

Track components in AI systems: models, datasets, dependencies. Emerging standard.

Read article
ARTICLE · 24

AI Governance Program Structure

Organizational governance for AI systems. Roles + processes.

Read article
ARTICLE · 25

AI Security Research Organizations

Where cutting-edge LLM security research happens. Follow these.

Read article
ARTICLE · 26

Anomaly Detection for LLM Usage

Detect abuse patterns. Cost + behavior + content anomalies.

Read article
ARTICLE · 27

API Key Management for LLM Services

Provider keys as prime target. Rotation + limits + monitoring.

Read article
ARTICLE · 28

Audit Logging for LLM Apps

Log every request/response + metadata. Foundation of incident response + compliance.

Read article
ARTICLE · 29

Automated Red Team for LLMs

LLM-driven attack discovery. Continuous adversarial pressure.

Read article
ARTICLE · 30

AWS Bedrock Guardrails

Managed guardrails on Bedrock LLMs. Enterprise pattern.

Read article
ARTICLE · 31

Azure AI Content Safety

Microsoft's managed guardrails. Deep integration with Azure OpenAI.

Read article
ARTICLE · 32

Backdoor Detection in Models

Detect if model was poisoned. Meta's technique + academic work.

Read article
ARTICLE · 33

Bias Detection in LLMs

Demographic parity, representational bias. Standardized measurement.

Read article
ARTICLE · 34

Chinese AI Regulations

Generative AI rules (2023). Watermarking. Content requirements.

Read article
ARTICLE · 35

Citation Verification

Verify each citation tag references real source. Reject fabricated citations.

Read article
ARTICLE · 36

The AI Safety Race

How competitive dynamics affect safety practices. Deployment pressure.

Read article
ARTICLE · 37

Confidential Computing for LLM Inference

Encrypted enclave inference. Nitro Enclaves + H100 confidential.

Read article
ARTICLE · 38

Consent Flows for AI Training + Data

User consent for data use in training. GDPR + emerging norms.

Read article
ARTICLE · 39

Constitutional AI for Safety

Explicit principles guide self-critique + revision. Anthropic's alignment method.

Read article
ARTICLE · 40

Continuous Red Team Pipeline

Automated attacks in CI/CD. Every prompt/model change tested.

Read article
ARTICLE · 41

Copyright + AI Training

NYT vs OpenAI, Getty vs Stability. Emerging legal landscape.

Read article
ARTICLE · 42

Cost-Based Abuse Detection

Cost signal as security signal. Runaway usage = attack or bug.

Read article
ARTICLE · 43

Crescendo Attack

Microsoft's finding: 5-8 turns of gradual escalation bypass safety in most models.

Read article
ARTICLE · 44

Data Exfiltration via LLM Tools

Injection + browsing tool = attacker steals user data. Real-world attack pattern.

Read article
ARTICLE · 45

Data Governance for AI

Data lineage, classification, retention. Foundation of AI compliance.

Read article
ARTICLE · 46

Data Poisoning

Attackers plant data during pretraining. Backdoor behavior triggered at inference.

Read article
ARTICLE · 47

Web-Scale Data Poisoning Defenses

Detect + filter poisoned content in training data.

Read article
ARTICLE · 48

Datasheets for Datasets

Structured docs for ML datasets. Gebru et al framework. Foundation of ML data ethics.

Read article
ARTICLE · 49

Deception Detection in LLM Outputs

Model may deceive strategically. Emerging safety research.

Read article
ARTICLE · 50

Differential Privacy in LLM Training

Provable privacy guarantees for training data. DP-SGD + trade-offs.

Read article
ARTICLE · 51

Safe Prompting via DSPy Signatures

Compiler-generated prompts more resilient to injection.

Read article
ARTICLE · 52

Egress Control for Agents

Restrict outbound network from agent runtime. Prevent exfiltration + SSRF.

Read article
ARTICLE · 53

Emergency Disclosure Procedures

AI incident notifications. Users, customers, regulators, public.

Read article
ARTICLE · 54

Employee AI Usage Policies

Rules for staff using AI tools. Data protection + IP + compliance.

Read article
ARTICLE · 55

Ethics Engineering

From principles to code. Process + patterns.

Read article
ARTICLE · 56

EU AI Act

Risk-based regulation; obligations phase in from 2025 to 2027. What to do now.

Read article
ARTICLE · 57

Fairness Auditing for AI Systems

Structured audit for disparate impact. NYC AEDT template.

Read article
ARTICLE · 58

Federated Fine-Tuning

Fine-tune without central data collection. Enterprise pattern.

Read article
ARTICLE · 59

Federated Prompt Engineering

Cross-org sharing of prompt patterns. Emerging community norms.

Read article
ARTICLE · 60

Future of LLM Security

Where the field is heading. Regulation, technique, threats.

Read article
ARTICLE · 61

Garak

Nikto for LLMs. Automated probing for jailbreaks + leakage.

Read article
ARTICLE · 62

GCG

Automated jailbreak via gradient-guided search. Transferable across models.

Read article
ARTICLE · 63

GDPR Right to Erasure Applied to LLMs

How to delete personal data from models. Emerging enforcement.

Read article
ARTICLE · 64

Guardrails AI

Open-source framework. Validators + auto-reask on failure. LLM-agnostic.

Read article
ARTICLE · 65

Guardrails Architecture

3-layer defense: input filter → LLM → output filter. Each layer has different jobs.

Read article
ARTICLE · 66

Hallucination Attacks

Force LLM to confidently produce specific wrong information. Poisoning + specific triggers.

Read article
ARTICLE · 67

Hallucination Detection Techniques

SelfCheckGPT, chain-of-verification, factuality classifiers. Compare + combine.

Read article
ARTICLE · 68

HHH

Anthropic's alignment target. Trade-offs + ordering.

Read article
ARTICLE · 69

AI Feature Review Board

Cross-functional approval for high-risk AI features. Governance pattern.

Read article
ARTICLE · 70

Human-in-the-Loop for High-Risk Actions

Route sensitive actions for approval. Async workflow with agent pause.

Read article
ARTICLE · 71

AI Impact Assessment

Systematic assessment of AI system risks. Regulator template.

Read article
ARTICLE · 72

Incident Response for LLM Systems

Playbook when injection, exfil, or harm reported. Standard IR extended.

Read article
ARTICLE · 73

Indirect Injection

Fingerprint known injection payloads. Fast filter before LLM.

Read article
ARTICLE · 74

Injection via LLM-Generated Summaries

LLM-generated summary contains injection. Fed into other LLM. Chains attacks.

Read article
ARTICLE · 75

Input Classifier

ML classifier on user input. Flags injection attempts before the LLM sees them.

Read article
ARTICLE · 76

Insurance for AI Systems

Cyber + E&O + emerging AI-specific policies. Coverage gaps.

Read article
ARTICLE · 77

Jailbreaks

Craft prompt that bypasses safety training. Classic jailbreaks + why they work.

Read article
ARTICLE · 78

LangChain Security Considerations

Common LangChain-specific security pitfalls + patterns.

Read article
ARTICLE · 79

Llama Guard

Fine-tuned Llama for input + output moderation. Open-source, deployable.

Read article
ARTICLE · 80

LLM Deployment Safety Checklist

Pre-launch safety review checklist. Comprehensive.

Read article
ARTICLE · 81

LLM Deployment Hardening

Container security, network, secrets, monitoring. Standard cloud hardening + LLM.

Read article
ARTICLE · 82

LLM Observability Platforms

LangSmith, Arize, Datadog LLM Obs. Choose per stack.

Read article
ARTICLE · 83

LLM Security Certifications

SOC 2 + ISO 27001 applied to AI. Emerging AI-specific: ISO 42001.

Read article
ARTICLE · 84

LLM Security Engineer Role

New discipline. Skills, responsibilities, career path.

Read article
ARTICLE · 85

LLM Supply Chain Hardening

End-to-end: base model + fine-tune + framework + deploy. Layered defense.

Read article
ARTICLE · 86

Machine Unlearning in LLMs

Selectively forget training data. GDPR right-to-erasure applied to models.

Read article
ARTICLE · 87

MCP Prompt Injection Defenses

Specific defenses for MCP tool outputs. Delimit + filter + sanitize.

Read article
ARTICLE · 88

MCP Security

Anthropic's MCP. Server auth, tool auth, prompt injection surface.

Read article
ARTICLE · 89

Mechanistic Interpretability

Reverse-engineer computation in neural networks. Anthropic + independent researchers.

Read article
ARTICLE · 90

Membership Inference

Determine if specific record was in training data. Privacy risk.

Read article
ARTICLE · 91

MITRE ATLAS

Adversarial ML threat taxonomy analogous to ATT&CK. Standard for AI security ops.

Read article
ARTICLE · 92

Model Cards

Standardized disclosure for AI models. Mitchell et al framework + regulatory adoption.

Read article
ARTICLE · 93

Model Inversion

Given a model, reconstruct approximate training examples. Originally a face-recognition attack.

Read article
ARTICLE · 94

Model Signing + Provenance

Cryptographic signatures on model weights. Sigstore + emerging standards.

Read article
ARTICLE · 95

Model Stealing

Query API repeatedly. Fit surrogate model. Carlini et al: even final layer extractable.

Read article
ARTICLE · 96

Multi-Tenant LLM Isolation

Prevent cross-tenant data leakage. Architectural patterns.

Read article
ARTICLE · 97

Multi-Turn Jailbreaks

Set up context over N turns. Each turn benign. Final turn exploits accumulated context.

Read article
ARTICLE · 98

Multimodal Injection

Voice assistant hears malicious instruction embedded in audio. Ultrasonic + adversarial.

Read article
ARTICLE · 99

Multimodal Prompt Injection

Text in image = instruction. Attacker uploads image with hidden text. Model reads + complies.

Read article
ARTICLE · 100

National AI Safety Institutes

UK + US + Japan + Singapore + others. Emerging oversight.

Read article
ARTICLE · 101

NVIDIA NeMo Guardrails Framework

Programmable dialog + Colang rails. Enterprise LLM safety framework.

Read article
ARTICLE · 102

Network Isolation

Isolate agent workloads at network layer. Prevent east-west movement on compromise.

Read article
ARTICLE · 103

NIST AI Risk Management Framework

US government's AI risk framework. Structured governance for AI systems.

Read article
ARTICLE · 104

Open Communities for AI Safety

OWASP AI Exchange, MLSec, DEF CON AI Village. Community + resources.

Read article
ARTICLE · 105

Fine-Tuning Erodes Safety Training

Small fine-tune undoes RLHF safety. Growing concern.

Read article
ARTICLE · 106

Open-Source Model Security Considerations

Trust boundary shifts. Weight tampering, prompt extraction don't apply.

Read article
ARTICLE · 107

OpenAI Moderation API

Free classifier for content policy. Standard entry-level guardrail.

Read article
ARTICLE · 108

OpenAI Swarm / Agents Security Model

Multi-agent handoffs. Security concerns + patterns.

Read article
ARTICLE · 109

Output Provenance + C2PA

Track + attest LLM-generated content. Standards + adoption.

Read article
ARTICLE · 110

Output Filtering

Real-time classifier on LLM output. Block or rewrite offensive content.

Read article
ARTICLE · 111

Output Grounding Verification

For each claim, check evidence in provided context. Post-hoc filter for RAG.

Read article
ARTICLE · 112

OWASP LLM Top 10

Standard threat taxonomy for LLM applications. Every AI eng should know this.

Read article
ARTICLE · 113

Payload Smuggling

Hide malicious content inside base64, translation, ROT13. Model decodes + executes.

Read article
ARTICLE · 114

Perspective API

Long-running conversational toxicity classifier. Standard baseline.

Read article
ARTICLE · 115

PII Detection + Redaction Pipeline

Presidio + custom regex + LLM verification. Standard pattern.

Read article
ARTICLE · 116

Defensive Prompt Engineering

Write prompts resistant to injection. Structural + phrasing patterns.

Read article
ARTICLE · 117

Direct Prompt Injection

User types 'Ignore previous instructions and…' Model complies. Foundational attack.

Read article
ARTICLE · 118

Prompt Injection Evaluation

Benchmarks + automated red team methods. Measure model robustness.

Read article
ARTICLE · 119

Prompt Injection Forensics

Investigate compromised agent. Which prompt triggered, what was done, blast radius.

Read article
ARTICLE · 120

Indirect Prompt Injection

Attacker plants payload in data (email, PDF, webpage). LLM reads it as instruction.

Read article
ARTICLE · 121

Prompt Injection via RAG Retrieval

Attacker plants injection in RAG corpus. Retrieved on relevant query. Persistent attack.

Read article
ARTICLE · 122

Prompt Injection WAFs

Web Application Firewalls for LLM. Emerging category.

Read article
ARTICLE · 123

Protecting Prompt IP

System prompts as trade secret. Practical protection.

Read article
ARTICLE · 124

Prompt Licensing + Marketplaces

PromptBase and emerging prompt IP economy.

Read article
ARTICLE · 125

System Prompt Stealing

Attackers exfiltrate proprietary system prompts. Business logic leak.

Read article
ARTICLE · 126

Transparency UX

Design patterns for LLM transparency. Trust + informed use.

Read article
ARTICLE · 127

Prompt Visibility UX Patterns

Show users what shapes AI responses. Design library.

Read article
ARTICLE · 128

PyRIT

Microsoft's red team framework. Automated adversarial testing.

Read article
ARTICLE · 129

RAG Document Curation

Prevent poisoned documents entering KB. Curation + moderation patterns.

Read article
ARTICLE · 130

RAG Provenance Tracking

Track every retrieval + generation for audit + debugging.

Read article
ARTICLE · 131

Rate Limiting for LLM Endpoints

Per-user + per-token + per-cost. Standard patterns.

Read article
ARTICLE · 132

Red Team Process for LLM Products

How to run structured red team. Scope, methodology, remediation.

Read article
ARTICLE · 133

Refusal Training via RLHF

Model learns when to refuse via reward model. Foundation of aligned models.

Read article
ARTICLE · 134

Regulatory Reporting for AI Incidents

When + how to report AI incidents. EU AI Act + sector regs.

Read article
ARTICLE · 135

Representation Engineering

Systematic method for reading + controlling internal representations.

Read article
ARTICLE · 136

Responsible AI Toolbox

Microsoft's ML fairness + interpretability library. Broader than LLMs.

Read article
ARTICLE · 137

Responsible Disclosure for LLM Vulnerabilities

Coordinated disclosure with LLM providers. Bounty programs.

Read article
ARTICLE · 138

Safe Completion via Planner Layer

Add planner between user + LLM. Planner enforces policy. Reduces raw model risk.

Read article
ARTICLE · 139

Safety Evaluation Benchmarks

HELM Safety, BigBench Hard, WMDP, JailbreakBench. Standardized safety measurement.

Read article
ARTICLE · 140

Scheming + Situational Awareness in LLMs

Frontier safety concerns: LLMs recognizing being evaluated + adjusting behavior.

Read article
ARTICLE · 141

Secret Management for Agents

Never put API keys in prompts. Vault + short-lived tokens.

Read article
ARTICLE · 142

Secret Scanning in AI Outputs

Detect + redact leaked secrets before the user or downstream systems see them.

Read article
ARTICLE · 143

Shadow Prompts

Full prompt often invisible to user. Transparency + trust patterns.

Read article
ARTICLE · 144

Sparse Autoencoders

Decompose activations into sparse combinations of interpretable features.

Read article
ARTICLE · 145

Streaming Moderation

Classify chunks as they stream. Terminate on toxic. UX + latency trade-offs.

Read article
ARTICLE · 146

Supply Chain

Poisoned models on HuggingFace. Malicious pip packages. Real 2024 incidents.

Read article
ARTICLE · 147

Sycophancy Detection + Mitigation

LLMs agree with user beliefs even when wrong. Measure + counter.

Read article
ARTICLE · 148

Threat Modeling for LLM Applications

STRIDE + LLM-specific. Structured pre-launch security review.

Read article
ARTICLE · 149

Glitch Tokens

Certain rare tokens produce bizarre/predictable outputs. Fingerprinting + exploit.

Read article
ARTICLE · 150

Tool-Use Authorization Design

OAuth per tool. Scoped tokens. User consent flow.

Read article
ARTICLE · 151

Training Data Documentation

EU AIA Article 53. Summary of training content. Regulator + rightsholder access.

Read article
ARTICLE · 152

Training Data Extraction

Query LLM in ways that leak training data verbatim. Copyright + privacy risk.

Read article
ARTICLE · 153

Vendor AI Risk Assessment

Evaluate third-party AI vendors. Security + compliance + business risk.

Read article
ARTICLE · 154

Watermarking LLM Outputs

Statistical bias in generation. Detect AI-generated text. Provenance tool.

Read article