'LLM cache' refers to three different things at three different layers. Each saves different costs. Conflating them produces bad architecture choices.

KV cache (per-request)

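The transformer caches past key/value tensors so attention over already-seen tokens is not recomputed at every decoding step. Automatic inside the model. Memory ≈ 2 × layers × seq_len × hidden_dim × batch × bytes per element, where the 2 covers keys and values. Models with grouped-query attention store less, since hidden_dim is replaced by kv_heads × head_dim. Often the dominant memory cost after the weights, especially at long contexts and large batches.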
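The estimate above can be turned into a quick calculator. This is a minimal sketch, assuming plain multi-head attention and 2 bytes per element (fp16); multiplying by bytes per element converts the element count into bytes. The function name and the example model dimensions are hypothetical, not taken from any specific model.

```python
def kv_cache_bytes(layers, seq_len, hidden_dim, batch, bytes_per_elem=2):
    """Rough KV cache size in bytes (assumes plain multi-head attention)."""
    # The leading 2 accounts for one key tensor and one value tensor per layer.
    return 2 * layers * seq_len * hidden_dim * batch * bytes_per_elem


# Hypothetical model: 32 layers, hidden_dim 4096, 4096-token context, batch 1, fp16.
size_gib = kv_cache_bytes(layers=32, seq_len=4096, hidden_dim=4096, batch=1) / 2**30
print(f"{size_gib:.1f} GiB")  # prints "2.0 GiB"
```

The size scales linearly with both seq_len and batch, which is why long contexts and large batches are where it starts to bite.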

Prefix cache (cross-request)

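Many requests share the same system prompt or context. Cache the KV tensors for that shared prefix and reuse them across requests. vLLM, SGLang, and Anthropic's prompt caching all do this. 5-10x cost reduction for cacheable workloads, with the saving applying to the cached input (prefill) tokens, not to generated output.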
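The idea can be illustrated with a toy model. This is a sketch only, not how vLLM, SGLang, or any provider implements it: `ToyPrefixCache` is a hypothetical class, the "KV state" is a plain list standing in for tensors, and there is no eviction or block-level matching. It counts how many tokens go through the expensive prefill step.

```python
class ToyPrefixCache:
    """Toy prefix cache: reuses the 'KV state' of a shared prefix across requests."""

    def __init__(self):
        self.store = {}           # prefix tokens (as a tuple) -> stand-in KV state
        self.tokens_computed = 0  # how many tokens went through prefill

    def _prefill(self, tokens):
        # Stand-in for the expensive forward pass that builds KV tensors.
        self.tokens_computed += len(tokens)
        return list(tokens)

    def run(self, prefix, suffix):
        key = tuple(prefix)
        if key not in self.store:
            # First request with this prefix pays for it once.
            self.store[key] = self._prefill(prefix)
        # Every request still pays for its own unique suffix.
        return self.store[key] + self._prefill(suffix)


cache = ToyPrefixCache()
system_prompt = ["You", "are", "a", "helpful", "assistant"]
cache.run(system_prompt, ["Hi"])
cache.run(system_prompt, ["Summarize", "this"])
print(cache.tokens_computed)  # prints 8 (5 + 1 + 2); without the cache it would be 13
```

The saving grows with the ratio of shared prefix to unique suffix, which is why long system prompts and repeated documents benefit most.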

Semantic cache (cross-query)

Embed the query, search for a semantically similar past query, and return the cached answer on a hit. Saves the model call entirely. Tools: GPTCache. Useful for FAQ-like workloads where the same question should always get the same answer; useless when queries are unique.
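The lookup flow can be sketched in a few lines. This is not GPTCache's API: `ToySemanticCache`, `embed`, and the 0.8 threshold are all hypothetical, and the bag-of-words vector is a stand-in for a real embedding model, chosen only so the example runs with the standard library.

```python
import math
from collections import Counter


def embed(text):
    # Toy bag-of-words vector; a real system would call an embedding model here.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToySemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed cutoff; needs tuning in practice
        self.entries = []           # list of (query embedding, cached answer)

    def put(self, query, answer):
        self.entries.append((embed(query), answer))

    def get(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]  # hit: the model call is skipped entirely
        return None         # miss: caller falls through to the model


cache = ToySemanticCache()
cache.put("how do I reset my password", "Use the reset link on the login page.")
hit = cache.get("how do I reset my password please")
miss = cache.get("what is the refund policy")
```

Here `hit` returns the stored answer and `miss` returns `None`. The threshold is the main design risk: too low and users get answers to questions they did not ask, too high and the cache never hits.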

KV cache: automatic. Prefix cache: huge ROI for shared prompts. Semantic cache: niche. Don't conflate.