LLM Security & Guardrails — Belgavi.AI Lab

ARTICLE · 01

Activation Engineering

Add vectors to model activations. Change behavior without retraining.

Read article →

ARTICLE · 02

Adversarial Examples for LLMs

Text perturbations that change the model's answer. The text analogue of adversarial images.

Read article →

ARTICLE · 03

Adversarial ML

Foundational papers + concepts. Understand history to understand present.

Read article →

ARTICLE · 04

ML-Based Adversarial Prompt Detection

Beyond signatures: neural classifiers for injection detection.

Read article →

ARTICLE · 05

Adversarial Robustness Certification

Formal proofs of model robustness. Emerging in safety-critical domains.

Read article →

ARTICLE · 06

Adversarial Training for LLM Safety

Train model on known attacks. Reduces vulnerability. Anthropic + OpenAI standard practice.

Read article →

ARTICLE · 07

Age Verification for LLM Products

Regulator focus. Kids protection. UK OSA, US COPPA.

Read article →

ARTICLE · 08

Confused Deputy

Agent has legitimate access. Attacker tricks agent into using it for attacker's benefit.

Read article →

ARTICLE · 09

Agent Browser Security

Agent browsing web. Sandbox browser. Inject-safe DOM interaction.

Read article →

ARTICLE · 10

Capability Tokens

Not all-or-nothing. Each capability separately gated. Least privilege for agents.

Read article →

ARTICLE · 11

Agent Code Execution Security

Agent generates + runs code. Isolate. Never trust output.

Read article →

ARTICLE · 12

Agent DoS

Attacker triggers agent into expensive loop. Racks up API bills.

Read article →

ARTICLE · 13

Agent Kill Switch

Ability to stop all agents immediately. Business continuity requirement.

Read article →

ARTICLE · 14

Agent Memory Security

Persistent agent memory attackable. Poisoning + exfil + privacy.

Read article →

ARTICLE · 15

Agent Observability

Trace every step: LLM calls, tool calls, retrievals. Debug + audit.

Read article →

ARTICLE · 16

Agent Permission Prompt Patterns

How agents request permissions at runtime. Design + UX.

Read article →

ARTICLE · 17

Agent Planning Attacks

Attacker manipulates agent's plan. Attacks against ReAct + planner architectures.

Read article →

ARTICLE · 18

Reversibility by Design

Design agents so actions can be undone. Reduces harm from mistakes + attacks.

Read article →

ARTICLE · 19

Agent Sandboxing

Isolate agent runtime. Contain tool exploits. Docker + gVisor + Firecracker options.

Read article →

ARTICLE · 20

Agent SSRF

Agent tool fetches URL from LLM output. Attacker points to internal endpoints.

Read article →

ARTICLE · 21

Tool Bombs

Attacker triggers cascading expensive tool calls. Cost DoS.

Read article →

ARTICLE · 22

Agents + Authentication Protocols

OAuth, DPoP, JWKS for AI agents. Emerging patterns.

Read article →

ARTICLE · 23

AIBOM

Track components in AI systems: models, datasets, dependencies. Emerging standard.

Read article →

ARTICLE · 24

AI Governance Program Structure

Organizational governance for AI systems. Roles + processes.

Read article →

ARTICLE · 25

AI Security Research Organizations

Where cutting-edge LLM security research happens. Follow these.

Read article →

ARTICLE · 26

Anomaly Detection for LLM Usage

Detect abuse patterns. Cost + behavior + content anomalies.

Read article →

ARTICLE · 27

API Key Management for LLM Services

Provider keys as prime target. Rotation + limits + monitoring.

Read article →

ARTICLE · 28

Audit Logging for LLM Apps

Log every request/response + metadata. Foundation of incident response + compliance.

Read article →

ARTICLE · 29

Automated Red Team for LLMs

LLM-driven attack discovery. Continuous adversarial pressure.

Read article →

ARTICLE · 30

AWS Bedrock Guardrails

Managed guardrails on Bedrock LLMs. Enterprise pattern.

Read article →

ARTICLE · 31

Azure AI Content Safety

Microsoft's managed guardrails. Deep integration with Azure OpenAI.

Read article →

ARTICLE · 32

Backdoor Detection in Models

Detect if model was poisoned. Meta's technique + academic work.

Read article →

ARTICLE · 33

Bias Detection in LLMs

Demographic parity, representational bias. Standardized measurement.

Read article →

ARTICLE · 34

Chinese AI Regulations

Generative AI rules (2023). Watermarking. Content requirements.

Read article →

ARTICLE · 35

Citation Verification

Verify each citation tag references real source. Reject fabricated citations.

Read article →

ARTICLE · 36

The AI Safety Race

How competitive dynamics affect safety practices. Deployment pressure.

Read article →

ARTICLE · 37

Confidential Computing for LLM Inference

Encrypted enclave inference. Nitro Enclaves + H100 confidential.

Read article →

ARTICLE · 38

Consent Flows for AI Training + Data

User consent for data use in training. GDPR + emerging norms.

Read article →

ARTICLE · 39

Constitutional AI for Safety

Explicit principles guide self-critique + revision. Anthropic's alignment method.

Read article →

ARTICLE · 40

Continuous Red Team Pipeline

Automated attacks in CI/CD. Every prompt/model change tested.

Read article →

ARTICLE · 41

Copyright + AI Training

NYT vs OpenAI, Getty vs Stability. Emerging legal landscape.

Read article →

ARTICLE · 42

Cost-Based Abuse Detection

Cost signal as security signal. Runaway usage = attack or bug.

Read article →

ARTICLE · 43

Crescendo Attack

Microsoft's finding: 5-8 turns of gradual escalation bypasses safety in most models.

Read article →

ARTICLE · 44

Data Exfiltration via LLM Tools

Injection + browsing tool = attacker steals user data. Real-world attack pattern.

Read article →

ARTICLE · 45

Data Governance for AI

Data lineage, classification, retention. Foundation of AI compliance.

Read article →

ARTICLE · 46

Data Poisoning

Attackers plant data during pretraining. Backdoor behavior triggered at inference.

Read article →

ARTICLE · 47

Web-Scale Data Poisoning Defenses

Detect + filter poisoned content in training data.

Read article →

ARTICLE · 48

Datasheets for Datasets

Structured docs for ML datasets. Gebru et al framework. Foundation of ML data ethics.

Read article →

ARTICLE · 49

Deception Detection in LLM Outputs

Model may deceive strategically. Emerging safety research.

Read article →

ARTICLE · 50

Differential Privacy in LLM Training

Provable privacy guarantees for training data. DP-SGD + trade-offs.

Read article →

ARTICLE · 51

Safe Prompting via DSPy Signatures

Compiler-generated prompts more resilient to injection.

Read article →

ARTICLE · 52

Egress Control for Agents

Restrict outbound network from agent runtime. Prevent exfiltration + SSRF.

Read article →

ARTICLE · 53

Emergency Disclosure Procedures

AI incident notifications. Users, customers, regulators, public.

Read article →

ARTICLE · 54

Employee AI Usage Policies

Rules for staff using AI tools. Data protection + IP + compliance.

Read article →

ARTICLE · 55

Ethics Engineering

From principles to code. Process + patterns.

Read article →

ARTICLE · 56

EU AI Act

Risk-based regulation; obligations phase in 2025-2027. What to do now.

Read article →

ARTICLE · 57

Fairness Auditing for AI Systems

Structured audit for disparate impact. NYC AEDT template.

Read article →

ARTICLE · 58

Federated Fine-Tuning

Fine-tune without central data collection. Enterprise pattern.

Read article →

ARTICLE · 59

Federated Prompt Engineering

Cross-org sharing of prompt patterns. Emerging community norms.

Read article →

ARTICLE · 60

Future of LLM Security

Where the field is heading. Regulation, technique, threats.

Read article →

ARTICLE · 61

Garak

Nikto for LLMs. Automated probing for jailbreaks + leakage.

Read article →

ARTICLE · 62

GCG

Automated jailbreak via gradient-guided search. Transferable across models.

Read article →

ARTICLE · 63

GDPR Right to Erasure Applied to LLMs

How to delete personal data from models. Emerging enforcement.

Read article →

ARTICLE · 64

Guardrails AI

Open-source framework. Validators + auto-reask on failure. LLM-agnostic.

Read article →

ARTICLE · 65

Guardrails Architecture

3-layer defense: input filter → LLM → output filter. Each layer has different jobs.

Read article →

ARTICLE · 66

Hallucination Attacks

Force LLM to confidently produce specific wrong information. Poisoning + specific triggers.

Read article →

ARTICLE · 67

Hallucination Detection Techniques

SelfCheckGPT, chain-of-verification, factuality classifiers. Compare + combine.

Read article →

ARTICLE · 68

HHH

Anthropic's alignment target. Trade-offs + ordering.

Read article →

ARTICLE · 69

AI Feature Review Board

Cross-functional approval for high-risk AI features. Governance pattern.

Read article →

ARTICLE · 70

Human-in-the-Loop for High-Risk Actions

Route sensitive actions for approval. Async workflow with agent pause.

Read article →

ARTICLE · 71

AI Impact Assessment

Systematic assessment of AI system risks. Regulator template.

Read article →

ARTICLE · 72

Incident Response for LLM Systems

Playbook when injection, exfil, or harm reported. Standard IR extended.

Read article →

ARTICLE · 73

Indirect Injection

Fingerprint known injection payloads. Fast filter before LLM.

Read article →

ARTICLE · 74

Injection via LLM-Generated Summaries

LLM-generated summary contains injection. Fed into other LLM. Chains attacks.

Read article →

ARTICLE · 75

Input Classifier

ML classifier on user input. Flags injection attempts before the LLM sees them.

Read article →

ARTICLE · 76

Insurance for AI Systems

Cyber + E&O + emerging AI-specific policies. Coverage gaps.

Read article →

ARTICLE · 77

Jailbreaks

Craft prompt that bypasses safety training. Classic jailbreaks + why they work.

Read article →

ARTICLE · 78

LangChain Security Considerations

Common LangChain-specific security pitfalls + patterns.

Read article →

ARTICLE · 79

Llama Guard

Fine-tuned Llama for input + output moderation. Open-source, deployable.

Read article →

ARTICLE · 80

LLM Deployment Safety Checklist

Pre-launch safety review checklist. Comprehensive.

Read article →

ARTICLE · 81

LLM Deployment Hardening

Container security, network, secrets, monitoring. Standard cloud hardening + LLM.

Read article →

ARTICLE · 82

LLM Observability Platforms

LangSmith, Arize, Datadog LLM Obs. Choose per stack.

Read article →

ARTICLE · 83

LLM Security Certifications

SOC 2 + ISO/IEC 27001 applied to AI. Emerging AI-specific: ISO/IEC 42001.

Read article →

ARTICLE · 84

LLM Security Engineer Role

New discipline. Skills, responsibilities, career path.

Read article →

ARTICLE · 85

LLM Supply Chain Hardening

End-to-end: base model + fine-tune + framework + deploy. Layered defense.

Read article →

ARTICLE · 86

Machine Unlearning in LLMs

Selectively forget training data. GDPR right-to-erasure applied to models.

Read article →

ARTICLE · 87

MCP Prompt Injection Defenses

Specific defenses for MCP tool outputs. Delimit + filter + sanitize.

Read article →

ARTICLE · 88

MCP Security

Anthropic's MCP. Server auth, tool auth, prompt injection surface.

Read article →

ARTICLE · 89

Mechanistic Interpretability

Reverse-engineer computation in neural networks. Anthropic + independent researchers.

Read article →

ARTICLE · 90

Membership Inference

Determine if specific record was in training data. Privacy risk.

Read article →

ARTICLE · 91

MITRE ATLAS

Adversarial ML threat taxonomy analogous to ATT&CK. Standard for AI security ops.

Read article →

ARTICLE · 92

Model Cards

Standardized disclosure for AI models. Mitchell et al. framework + regulatory adoption.

Read article →

ARTICLE · 93

Model Inversion

Given model, reconstruct approximate training examples. Face recognition attack originally.

Read article →

ARTICLE · 94

Model Signing + Provenance

Cryptographic signatures on model weights. Sigstore + emerging standards.

Read article →

ARTICLE · 95

Model Stealing

Query API repeatedly. Fit surrogate model. Carlini et al.: even final layer extractable.

Read article →

ARTICLE · 96

Multi-Tenant LLM Isolation

Prevent cross-tenant data leakage. Architectural patterns.

Read article →

ARTICLE · 97

Multi-Turn Jailbreaks

Set up context over N turns. Each turn benign. Final turn exploits accumulated context.

Read article →

ARTICLE · 98

Multimodal Injection

Voice assistant hears malicious instruction embedded in audio. Ultrasonic + adversarial.

Read article →

ARTICLE · 99

Multimodal Prompt Injection

Text in image = instruction. Attacker uploads image with hidden text. Model reads + complies.

Read article →

ARTICLE · 100

National AI Safety Institutes

UK + US + Japan + Singapore + others. Emerging oversight.

Read article →

ARTICLE · 101

NVIDIA NeMo Guardrails Framework

Programmable dialog + Colang rails. Enterprise LLM safety framework.

Read article →

ARTICLE · 102

Network Isolation

Isolate agent workloads at network layer. Prevent east-west movement on compromise.

Read article →

ARTICLE · 103

NIST AI Risk Management Framework

US government's AI risk framework. Structured governance for AI systems.

Read article →

ARTICLE · 104

Open Communities for AI Safety

OWASP AI Exchange, MLSec, DEF CON AI Village. Community + resources.

Read article →

ARTICLE · 105

Fine-Tuning Erodes Safety Training

Small fine-tune undoes RLHF safety. Growing concern.

Read article →

ARTICLE · 106

Open-Source Model Security Considerations

Trust boundary shifts. Weight tampering, prompt extraction don't apply.

Read article →

ARTICLE · 107

OpenAI Moderation API

Free classifier for content policy. Standard entry-level guardrail.

Read article →

ARTICLE · 108

OpenAI Swarm / Agents Security Model

Multi-agent handoffs. Security concerns + patterns.

Read article →

ARTICLE · 109

Output Provenance + C2PA

Track + attest LLM-generated content. Standards + adoption.

Read article →

ARTICLE · 110

Output Filtering

Real-time classifier on LLM output. Block or rewrite offensive content.

Read article →

ARTICLE · 111

Output Grounding Verification

For each claim, check evidence in provided context. Post-hoc filter for RAG.

Read article →

ARTICLE · 112

OWASP LLM Top 10

Standard threat taxonomy for LLM applications. Every AI eng should know this.

Read article →

ARTICLE · 113

Payload Smuggling

Hide malicious content inside base64, translation, ROT13. Model decodes + executes.

Read article →

ARTICLE · 114

Perspective API

Long-running conversational toxicity classifier. Standard baseline.

Read article →

ARTICLE · 115

PII Detection + Redaction Pipeline

Presidio + custom regex + LLM verification. Standard pattern.

Read article →

ARTICLE · 116

Defensive Prompt Engineering

Write prompts resistant to injection. Structural + phrasing patterns.

Read article →

ARTICLE · 117

Direct Prompt Injection

User types 'Ignore previous instructions and…' Model complies. Foundational attack.

Read article →

ARTICLE · 118

Prompt Injection Evaluation

Benchmarks + automated red team methods. Measure model robustness.

Read article →

ARTICLE · 119

Prompt Injection Forensics

Investigate compromised agent. Which prompt triggered, what was done, blast radius.

Read article →

ARTICLE · 120

Indirect Prompt Injection

Attacker plants payload in data (email, PDF, webpage). LLM reads it as instruction.

Read article →

ARTICLE · 121

Prompt Injection via RAG Retrieval

Attacker plants injection in RAG corpus. Retrieved on relevant query. Persistent attack.

Read article →

ARTICLE · 122

Prompt Injection WAFs

Web Application Firewalls for LLM. Emerging category.

Read article →

ARTICLE · 123

Protecting Prompt IP

System prompts as trade secret. Practical protection.

Read article →

ARTICLE · 124

Prompt Licensing + Marketplaces

PromptBase and emerging prompt IP economy.

Read article →

ARTICLE · 125

System Prompt Stealing

Attackers exfiltrate proprietary system prompts. Business logic leak.

Read article →

ARTICLE · 126

Transparency UX

Design patterns for LLM transparency. Trust + informed use.

Read article →

ARTICLE · 127

Prompt Visibility UX Patterns

Show users what shapes AI responses. Design library.

Read article →

ARTICLE · 128

PyRIT

Microsoft's red team framework. Automated adversarial testing.

Read article →

ARTICLE · 129

RAG Document Curation

Prevent poisoned documents entering KB. Curation + moderation patterns.

Read article →

ARTICLE · 130

RAG Provenance Tracking

Track every retrieval + generation for audit + debugging.

Read article →

ARTICLE · 131

Rate Limiting for LLM Endpoints

Per-user + per-token + per-cost. Standard patterns.

Read article →

ARTICLE · 132

Red Team Process for LLM Products

How to run structured red team. Scope, methodology, remediation.

Read article →

ARTICLE · 133

Refusal Training via RLHF

Model learns when to refuse via reward model. Foundation of aligned models.

Read article →

ARTICLE · 134

Regulatory Reporting for AI Incidents

When + how to report AI incidents. EU AI Act + sector regs.

Read article →

ARTICLE · 135

Representation Engineering

Systematic method for reading + controlling internal representations.

Read article →

ARTICLE · 136

Responsible AI Toolbox

Microsoft's ML fairness + interpretability library. Broader than LLMs.

Read article →

ARTICLE · 137

Responsible Disclosure for LLM Vulnerabilities

Coordinated disclosure with LLM providers. Bounty programs.

Read article →

ARTICLE · 138

Safe Completion via Planner Layer

Add planner between user + LLM. Planner enforces policy. Reduces raw model risk.

Read article →

ARTICLE · 139

Safety Evaluation Benchmarks

HELM Safety, BIG-Bench Hard, WMDP, JailbreakBench. Standardized safety measurement.

Read article →

ARTICLE · 140

Scheming + Situational Awareness in LLMs

Frontier safety concerns: LLMs recognizing being evaluated + adjusting behavior.

Read article →

ARTICLE · 141

Secret Management for Agents

Never put API keys in prompts. Vault + short-lived tokens.

Read article →

ARTICLE · 142

Secret Scanning in AI Outputs

Detect + redact leaked secrets before users or downstream systems see them.

Read article →

ARTICLE · 143

Shadow Prompts

Full prompt often invisible to user. Transparency + trust patterns.

Read article →

ARTICLE · 144

Sparse Autoencoders

Decompose activations into sparse combinations of interpretable features.

Read article →

ARTICLE · 145

Streaming Moderation

Classify chunks as they stream. Terminate on toxic. UX + latency trade-offs.

Read article →

ARTICLE · 146

Supply Chain

Poisoned models on HuggingFace. Malicious pip packages. Real 2024 incidents.

Read article →

ARTICLE · 147

Sycophancy Detection + Mitigation

LLMs agree with user beliefs even when wrong. Measure + counter.

Read article →

ARTICLE · 148

Threat Modeling for LLM Applications

STRIDE + LLM-specific. Structured pre-launch security review.

Read article →

ARTICLE · 149

Glitch Tokens

Certain rare tokens produce bizarre/predictable outputs. Fingerprinting + exploit.

Read article →

ARTICLE · 150

Tool-Use Authorization Design

OAuth per tool. Scoped tokens. User consent flow.

Read article →

ARTICLE · 151

Training Data Documentation

EU AIA Article 53. Summary of training content. Regulator + rightsholder access.

Read article →

ARTICLE · 152

Training Data Extraction

Query LLM in ways that leak training data verbatim. Copyright + privacy risk.

Read article →

ARTICLE · 153

Vendor AI Risk Assessment

Evaluate third-party AI vendors. Security + compliance + business risk.

Read article →

ARTICLE · 154

Watermarking LLM Outputs

Statistical bias in generation. Detect AI-generated text. Provenance tool.

Read article →

All 154 articles, sorted alphabetically

Activation Engineering

Adversarial Examples for LLMs

Adversarial ML

ML-Based Adversarial Prompt Detection

Adversarial Robustness Certification

Adversarial Training for LLM Safety

Age Verification for LLM Products

Confused Deputy

Agent Browser Security

Capability Tokens

Agent Code Execution Security

Agent DoS

Agent Kill Switch

Agent Memory Security

Agent Observability

Agent Permission Prompt Patterns

Agent Planning Attacks

Reversibility by Design

Agent Sandboxing

Agent SSRF

Tool Bombs

Agents + Authentication Protocols

AIBOM

AI Governance Program Structure

AI Security Research Organizations

Anomaly Detection for LLM Usage

API Key Management for LLM Services

Audit Logging for LLM Apps

Automated Red Team for LLMs

AWS Bedrock Guardrails

Azure AI Content Safety

Backdoor Detection in Models

Bias Detection in LLMs

Chinese AI Regulations

Citation Verification

The AI Safety Race

Confidential Computing for LLM Inference

Consent Flows for AI Training + Data

Constitutional AI for Safety

Continuous Red Team Pipeline

Copyright + AI Training

Cost-Based Abuse Detection

Crescendo Attack

Data Exfiltration via LLM Tools

Data Governance for AI

Data Poisoning

Web-Scale Data Poisoning Defenses

Datasheets for Datasets

Deception Detection in LLM Outputs

Differential Privacy in LLM Training

Safe Prompting via DSPy Signatures

Egress Control for Agents

Emergency Disclosure Procedures

Employee AI Usage Policies

Ethics Engineering

EU AI Act

Fairness Auditing for AI Systems

Federated Fine-Tuning

Federated Prompt Engineering

Future of LLM Security

Garak

GCG

GDPR Right to Erasure Applied to LLMs

Guardrails AI

Guardrails Architecture

Hallucination Attacks

Hallucination Detection Techniques

HHH

AI Feature Review Board

Human-in-the-Loop for High-Risk Actions

AI Impact Assessment

Incident Response for LLM Systems

Indirect Injection

Injection via LLM-Generated Summaries

Input Classifier

Insurance for AI Systems

Jailbreaks

LangChain Security Considerations

Llama Guard