All 154 articles, sorted alphabetically
Activation Engineering
Add vectors to model activations. Change behavior without retraining.
Adversarial Examples for LLMs
Perturbations to text that change model's answer. Text version of image adversarials.
Adversarial ML
Foundational papers + concepts. Understand history to understand present.
ML-Based Adversarial Prompt Detection
Beyond signatures: neural classifiers for injection detection.
Adversarial Robustness Certification
Formal proofs of model robustness. Emerging in safety-critical domains.
Adversarial Training for LLM Safety
Train model on known attacks. Reduces vulnerability. Anthropic + OpenAI standard practice.
Age Verification for LLM Products
Regulator focus. Child protection. UK OSA, US COPPA.
Confused Deputy
Agent has legitimate access. Attacker tricks agent into using it for attacker's benefit.
Agent Browser Security
Agent browsing the web. Sandbox the browser. Injection-safe DOM interaction.
Capability Tokens
Not all-or-nothing. Each capability separately gated. Least privilege for agents.
Agent Code Execution Security
Agent generates + runs code. Isolate. Never trust output.
Agent DoS
Attacker triggers agent into expensive loop. Racks up API bills.
Agent Kill Switch
Ability to stop all agents immediately. Business continuity requirement.
Agent Memory Security
Persistent agent memory attackable. Poisoning + exfil + privacy.
Agent Observability
Trace every step: LLM calls, tool calls, retrievals. Debug + audit.
Agent Permission Prompt Patterns
How agents request permissions at runtime. Design + UX.
Agent Planning Attacks
Attacker manipulates agent's plan. Attacks against ReAct + planner architectures.
Reversibility by Design
Design agents so actions can be undone. Reduces harm from mistakes + attacks.
Agent Sandboxing
Isolate agent runtime. Contain tool exploits. Docker + gVisor + Firecracker options.
Agent SSRF
Agent tool fetches URL from LLM output. Attacker points to internal endpoints.
Tool Bombs
Attacker triggers cascading expensive tool calls. Cost DoS.
Agents + Authentication Protocols
OAuth, DPoP, JWKS for AI agents. Emerging patterns.
AIBOM
Track components in AI systems: models, datasets, dependencies. Emerging standard.
AI Governance Program Structure
Organizational governance for AI systems. Roles + processes.
AI Security Research Organizations
Where cutting-edge LLM security research happens. Follow these.
Anomaly Detection for LLM Usage
Detect abuse patterns. Cost + behavior + content anomalies.
API Key Management for LLM Services
Provider keys as prime target. Rotation + limits + monitoring.
Audit Logging for LLM Apps
Log every request/response + metadata. Foundation of incident response + compliance.
Automated Red Team for LLMs
LLM-driven attack discovery. Continuous adversarial pressure.
AWS Bedrock Guardrails
Managed guardrails on Bedrock LLMs. Enterprise pattern.
Azure AI Content Safety
Microsoft's managed guardrails. Deep integration with Azure OpenAI.
Backdoor Detection in Models
Detect if model was poisoned. Meta's technique + academic work.
Bias Detection in LLMs
Demographic parity, representational bias. Standardized measurement.
Chinese AI Regulations
Generative AI rules (2023). Watermarking. Content requirements.
Citation Verification
Verify each citation tag references real source. Reject fabricated citations.
The AI Safety Race
How competitive dynamics affect safety practices. Deployment pressure.
Confidential Computing for LLM Inference
Encrypted enclave inference. Nitro Enclaves + H100 confidential computing.
Consent Flows for AI Training + Data
User consent for data use in training. GDPR + emerging norms.
Constitutional AI for Safety
Explicit principles guide self-critique + revision. Anthropic's alignment method.
Continuous Red Team Pipeline
Automated attacks in CI/CD. Every prompt/model change tested.
Copyright + AI Training
NYT vs OpenAI, Getty vs Stability. Emerging legal landscape.
Cost-Based Abuse Detection
Cost signal as security signal. Runaway usage = attack or bug.
Crescendo Attack
Microsoft's finding: 5-8 turns of gradual escalation bypasses safety in most models.
Data Exfiltration via LLM Tools
Injection + browsing tool = attacker steals user data. Real-world attack pattern.
Data Governance for AI
Data lineage, classification, retention. Foundation of AI compliance.
Data Poisoning
Attackers plant data during pretraining. Backdoor behavior triggered at inference.
Web-Scale Data Poisoning Defenses
Detect + filter poisoned content in training data.
Datasheets for Datasets
Structured docs for ML datasets. Gebru et al. framework. Foundation of ML data ethics.
Deception Detection in LLM Outputs
Model may deceive strategically. Emerging safety research.
Differential Privacy in LLM Training
Provable privacy guarantees for training data. DP-SGD + trade-offs.
Safe Prompting via DSPy Signatures
Compiler-generated prompts more resilient to injection.
Egress Control for Agents
Restrict outbound network from agent runtime. Prevent exfiltration + SSRF.
Emergency Disclosure Procedures
AI incident notifications. Users, customers, regulators, public.
Employee AI Usage Policies
Rules for staff using AI tools. Data protection + IP + compliance.
Ethics Engineering
From principles to code. Process + patterns.
EU AI Act
Risk-based regulation, effective 2025-2027. What to do now.
Fairness Auditing for AI Systems
Structured audit for disparate impact. NYC AEDT template.
Federated Fine-Tuning
Fine-tune without central data collection. Enterprise pattern.
Federated Prompt Engineering
Cross-org sharing of prompt patterns. Emerging community norms.
Future of LLM Security
Where the field is heading. Regulation, technique, threats.
Garak
Nikto for LLMs. Automated probing for jailbreaks + leakage.
GCG
Automated jailbreak via gradient-guided search. Transferable across models.
GDPR Right to Erasure Applied to LLMs
How to delete personal data from models. Emerging enforcement.
Guardrails AI
Open-source framework. Validators + auto-reask on failure. LLM-agnostic.
Guardrails Architecture
3-layer defense: input filter → LLM → output filter. Each layer has different jobs.
Hallucination Attacks
Force LLM to confidently produce specific wrong information. Poisoning + specific triggers.
Hallucination Detection Techniques
SelfCheckGPT, chain-of-verification, factuality classifiers. Compare + combine.
HHH
Helpful, honest, harmless: Anthropic's alignment target. Trade-offs + ordering.
AI Feature Review Board
Cross-functional approval for high-risk AI features. Governance pattern.
Human-in-the-Loop for High-Risk Actions
Route sensitive actions for approval. Async workflow with agent pause.
AI Impact Assessment
Systematic assessment of AI system risks. Regulator template.
Incident Response for LLM Systems
Playbook when injection, exfil, or harm reported. Standard IR extended.
Indirect Injection
Fingerprint known injection payloads. Fast filter before LLM.
Injection via LLM-Generated Summaries
LLM-generated summary contains injection. Fed into another LLM. Chains attacks.
Input Classifier
ML classifier on user input. Flags injection attempts before the LLM sees them.
Insurance for AI Systems
Cyber + E&O + emerging AI-specific policies. Coverage gaps.
Jailbreaks
Craft prompt that bypasses safety training. Classic jailbreaks + why they work.
LangChain Security Considerations
Common LangChain-specific security pitfalls + patterns.
Llama Guard
Fine-tuned Llama for input + output moderation. Open-source, deployable.
LLM Deployment Safety Checklist
Pre-launch safety review checklist. Comprehensive.
LLM Deployment Hardening
Container security, network, secrets, monitoring. Standard cloud hardening + LLM.
LLM Observability Platforms
LangSmith, Arize, Datadog LLM Obs. Choose per stack.
LLM Security Certifications
SOC 2 + ISO 27001 applied to AI. Emerging AI-specific: ISO 42001.
LLM Security Engineer Role
New discipline. Skills, responsibilities, career path.
LLM Supply Chain Hardening
End-to-end: base model + fine-tune + framework + deploy. Layered defense.
Machine Unlearning in LLMs
Selectively forget training data. GDPR right-to-erasure applied to models.
MCP Prompt Injection Defenses
Specific defenses for MCP tool outputs. Delimit + filter + sanitize.
MCP Security
Anthropic's MCP. Server auth, tool auth, prompt injection surface.
Mechanistic Interpretability
Reverse-engineer computation in neural networks. Anthropic + independent researchers.
Membership Inference
Determine if specific record was in training data. Privacy risk.
MITRE ATLAS
Adversarial ML threat taxonomy analogous to ATT&CK. Standard for AI security ops.
Model Cards
Standardized disclosure for AI models. Mitchell et al. framework + regulatory adoption.
Model Inversion
Given model, reconstruct approximate training examples. Face recognition attack originally.
Model Signing + Provenance
Cryptographic signatures on model weights. Sigstore + emerging standards.
Model Stealing
Query API repeatedly. Fit surrogate model. Carlini et al.: even the final layer is extractable.
Multi-Tenant LLM Isolation
Prevent cross-tenant data leakage. Architectural patterns.
Multi-Turn Jailbreaks
Set up context over N turns. Each turn benign. Final turn exploits accumulated context.
Multimodal Injection
Voice assistant hears malicious instruction embedded in audio. Ultrasonic + adversarial.
Multimodal Prompt Injection
Text in image = instruction. Attacker uploads image with hidden text. Model reads + complies.
National AI Safety Institutes
UK + US + Japan + Singapore + others. Emerging oversight.
NVIDIA NeMo Guardrails Framework
Programmable dialog + Colang rails. Enterprise LLM safety framework.
Network Isolation
Isolate agent workloads at network layer. Prevent east-west movement on compromise.
NIST AI Risk Management Framework
US government's AI risk framework. Structured governance for AI systems.
Open Communities for AI Safety
OWASP AI Exchange, MLSec, DEF CON AI Village. Community + resources.
Fine-Tuning Erodes Safety Training
Small fine-tune undoes RLHF safety. Growing concern.
Open-Source Model Security Considerations
Trust boundary shifts. Weight tampering, prompt extraction don't apply.
OpenAI Moderation API
Free classifier for content policy. Standard entry-level guardrail.
OpenAI Swarm / Agents Security Model
Multi-agent handoffs. Security concerns + patterns.
Output Provenance + C2PA
Track + attest LLM-generated content. Standards + adoption.
Output Filtering
Real-time classifier on LLM output. Block or rewrite offensive content.
Output Grounding Verification
For each claim, check evidence in provided context. Post-hoc filter for RAG.
OWASP LLM Top 10
Standard threat taxonomy for LLM applications. Every AI eng should know this.
Payload Smuggling
Hide malicious content inside base64, translation, ROT13. Model decodes + executes.
Perspective API
Long-running conversational toxicity classifier. Standard baseline.
PII Detection + Redaction Pipeline
Presidio + custom regex + LLM verification. Standard pattern.
Defensive Prompt Engineering
Write prompts resistant to injection. Structural + phrasing patterns.
Direct Prompt Injection
User types 'Ignore previous instructions and…' Model complies. Foundational attack.
Prompt Injection Evaluation
Benchmarks + automated red team methods. Measure model robustness.
Prompt Injection Forensics
Investigate compromised agent. Which prompt triggered, what was done, blast radius.
Indirect Prompt Injection
Attacker plants payload in data (email, PDF, webpage). LLM reads it as instruction.
Prompt Injection via RAG Retrieval
Attacker plants injection in RAG corpus. Retrieved on relevant query. Persistent attack.
Prompt Injection WAFs
Web Application Firewalls for LLM. Emerging category.
Protecting Prompt IP
System prompts as trade secret. Practical protection.
Prompt Licensing + Marketplaces
PromptBase and emerging prompt IP economy.
System Prompt Stealing
Attackers exfiltrate proprietary system prompts. Business logic leak.
Transparency UX
Design patterns for LLM transparency. Trust + informed use.
Prompt Visibility UX Patterns
Show users what shapes AI responses. Design library.
PyRIT
Microsoft's red team framework. Automated adversarial testing.
RAG Document Curation
Prevent poisoned documents entering KB. Curation + moderation patterns.
RAG Provenance Tracking
Track every retrieval + generation for audit + debugging.
Rate Limiting for LLM Endpoints
Per-user + per-token + per-cost. Standard patterns.
Red Team Process for LLM Products
How to run structured red team. Scope, methodology, remediation.
Refusal Training via RLHF
Model learns when to refuse via reward model. Foundation of aligned models.
Regulatory Reporting for AI Incidents
When + how to report AI incidents. EU AI Act + sector regs.
Representation Engineering
Systematic method for reading + controlling internal representations.
Responsible AI Toolbox
Microsoft's ML fairness + interpretability library. Broader than LLMs.
Responsible Disclosure for LLM Vulnerabilities
Coordinated disclosure with LLM providers. Bounty programs.
Safe Completion via Planner Layer
Add planner between user + LLM. Planner enforces policy. Reduces raw model risk.
Safety Evaluation Benchmarks
HELM Safety, BIG-Bench Hard, WMDP, JailbreakBench. Standardized safety measurement.
Scheming + Situational Awareness in LLMs
Frontier safety concerns: LLMs recognizing being evaluated + adjusting behavior.
Secret Management for Agents
Never put API keys in prompts. Vault + short-lived tokens.
Secret Scanning in AI Outputs
Detect + redact leaked secrets before the user or downstream systems see them.
Shadow Prompts
Full prompt often invisible to user. Transparency + trust patterns.
Sparse Autoencoders
Decompose activations into sparse combinations of interpretable features.
Streaming Moderation
Classify chunks as they stream. Terminate on toxic content. UX + latency trade-offs.
Supply Chain
Poisoned models on HuggingFace. Malicious pip packages. Real 2024 incidents.
Sycophancy Detection + Mitigation
LLMs agree with user beliefs even when wrong. Measure + counter.
Threat Modeling for LLM Applications
STRIDE + LLM-specific. Structured pre-launch security review.
Glitch Tokens
Certain rare tokens produce bizarre/predictable outputs. Fingerprinting + exploit.
Tool-Use Authorization Design
OAuth per tool. Scoped tokens. User consent flow.
Training Data Documentation
EU AI Act Article 53. Summary of training content. Regulator + rightsholder access.
Training Data Extraction
Query LLM in ways that leak training data verbatim. Copyright + privacy risk.
Vendor AI Risk Assessment
Evaluate third-party AI vendors. Security + compliance + business risk.
Watermarking LLM Outputs
Statistical bias in generation. Detect AI-generated text. Provenance tool.