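The KV cache is one of the most important inference optimizations for autoregressive transformers. It cuts the attention cost of each decode step from O(T²) to O(T) in the sequence length T, because only the new token is projected and attended from while past keys and values are reused. Knowing its math and memory cost helps you reason about long-context inference.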

Why cache K and V (not Q)?

# At decode step t, attention (per head) is:
q_t = x_t · W_Q                  # 1 vector (just the new token)
K = stack(k_0, k_1, ..., k_t)     # all keys so far
V = stack(v_0, v_1, ..., v_t)     # all values so far

out_t = softmax(q_t · Kᵀ / sqrt(d_head)) · V

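The query for a past position is never needed again: it only produced that position's output, which is already computed. The K and V rows for past positions are read at every future step. So cache K and V, and at each step compute only the new token's q, k, and v.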
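As a sanity check on the argument above, here is a minimal single-head NumPy sketch. The function names, shapes, and random weights are illustrative assumptions, not taken from any library. It compares step-by-step decoding with a growing K/V cache against a full causal-attention recompute.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_causal_attention(x, W_Q, W_K, W_V):
    """Recompute Q, K, V for every position and apply a causal mask."""
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ V

def cached_decode_step(x_t, K_cache, V_cache, W_Q, W_K, W_V):
    """Project only the new token; reuse cached K and V for the past."""
    q_t = x_t @ W_Q
    K = np.vstack([K_cache, x_t @ W_K])  # cache grows by one row
    V = np.vstack([V_cache, x_t @ W_V])
    out_t = softmax(q_t @ K.T / np.sqrt(q_t.shape[-1])) @ V
    return out_t, K, V

rng = np.random.default_rng(0)
T, d_model, d_head = 6, 16, 8
x = rng.standard_normal((T, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_head)) for _ in range(3))

K_cache = np.empty((0, d_head))
V_cache = np.empty((0, d_head))
steps = []
for t in range(T):
    out_t, K_cache, V_cache = cached_decode_step(
        x[t], K_cache, V_cache, W_Q, W_K, W_V
    )
    steps.append(out_t)

cached_out = np.stack(steps)
full_out = full_causal_attention(x, W_Q, W_K, W_V)
print(np.allclose(cached_out, full_out))
```

The two paths agree because row t of causal attention depends only on q_t and on the keys and values at positions 0..t, which is exactly what the cache holds.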

Cache append, not recompute

# After processing new token t:
k_t = x_t · W_K            # 1 vector
v_t = x_t · W_V            # 1 vector
k_cache = concat(k_cache, k_t)   # grows by 1
v_cache = concat(v_cache, v_t)

Each step adds one row to k_cache and v_cache; the rest of the cache is reused unchanged. A literal concat reallocates and copies the whole cache every step, so real implementations pre-allocate a fixed buffer of max_seq rows and write row t in place.
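The pre-allocated variant might look like the sketch below. `KVCache` is a hypothetical helper written for this illustration, covering one layer and one sequence with no batching or paging. Production engines use more elaborate layouts.

```python
import numpy as np

class KVCache:
    """Fixed-size K/V buffers for one layer (hypothetical illustration)."""

    def __init__(self, max_seq, n_kv_heads, d_head, dtype=np.float32):
        self.k = np.zeros((max_seq, n_kv_heads, d_head), dtype=dtype)
        self.v = np.zeros((max_seq, n_kv_heads, d_head), dtype=dtype)
        self.length = 0  # number of positions filled so far

    def append(self, k_t, v_t):
        """Write the new token's k and v in place at position `length`."""
        if self.length >= self.k.shape[0]:
            raise ValueError("KV cache is full")
        self.k[self.length] = k_t
        self.v[self.length] = v_t
        self.length += 1

    def view(self):
        """Return the filled prefix as views, without copying."""
        return self.k[: self.length], self.v[: self.length]

cache = KVCache(max_seq=4, n_kv_heads=2, d_head=8)
rng = np.random.default_rng(0)
for _ in range(3):
    cache.append(rng.standard_normal((2, 8)), rng.standard_normal((2, 8)))

k_used, v_used = cache.view()
print(k_used.shape, np.shares_memory(k_used, cache.k))
```

Because `view` returns slices of the underlying buffers, attention can read the filled prefix without any per-step copy.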

Memory cost

# Per layer (h = number of KV heads, bytes = bytes per element):
K_size = h * d_head * seq * batch * bytes
V_size = K_size

# Total across L layers (the factor 2 covers K and V):
total = 2 * L * h * d_head * seq * batch * bytes

For Llama 3 8B (L=32, h=8 KV heads under GQA, d_head=128, BF16 = 2 bytes), the cache costs 2 · 32 · 8 · 128 · 2 = 131,072 bytes = 128 KiB per token. At seq=8192, batch=1 that is 1 GiB; at seq=32768 it is 4 GiB.
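The arithmetic above can be reproduced with a small helper. `kv_cache_bytes` and its parameter names are made up for this sketch; the Llama 3 8B figures are the ones quoted in the text.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per_elem=2):
    """Total KV cache size in bytes; the leading 2 accounts for K and V."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem

# Llama 3 8B figures as quoted in the text (BF16 = 2 bytes per element).
llama3_8b = dict(n_layers=32, n_kv_heads=8, d_head=128, bytes_per_elem=2)

per_token = kv_cache_bytes(seq_len=1, **llama3_8b)
at_8k = kv_cache_bytes(seq_len=8192, **llama3_8b)
at_32k = kv_cache_bytes(seq_len=32768, **llama3_8b)

print(per_token / 2**10, "KiB per token")   # 128.0
print(at_8k / 2**30, "GiB at seq=8192")     # 1.0
print(at_32k / 2**30, "GiB at seq=32768")   # 4.0
```

The cost is linear in both `seq_len` and `batch`, so a batch of 8 sequences at 8192 tokens needs the same memory as 2 sequences at 32768 tokens.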

GQA reduces it

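Grouped Query Attention (GQA) shares each K, V head across a group of query heads. Llama 3 8B has 32 query heads but only 8 KV heads, so its cache is 4× smaller than it would be with full multi-head attention. Multi-head Latent Attention (MLA, DeepSeek) goes further by caching a compressed latent in place of full K and V, for roughly another 4× reduction. Without techniques like these, long-context inference would need far more memory.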
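To make the sharing concrete, here is a rough NumPy sketch of one GQA decode step. The function name, tensor layout, and random inputs are assumptions for illustration, not any model's actual implementation. Each group of 4 query heads reads the same cached KV head, so only 8 heads are stored.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gqa_decode_step(q, K, V):
    """q: (n_q_heads, d_head); K, V: (T, n_kv_heads, d_head)."""
    n_q_heads, d_head = q.shape
    n_kv_heads = K.shape[1]
    group = n_q_heads // n_kv_heads
    # Query head h uses KV head h // group.
    q_g = q.reshape(n_kv_heads, group, d_head)
    scores = np.einsum("kgd,tkd->kgt", q_g, K) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    out = np.einsum("kgt,tkd->kgd", weights, V)
    return out.reshape(n_q_heads, d_head)

rng = np.random.default_rng(0)
T, n_q_heads, n_kv_heads, d_head = 10, 32, 8, 128
q = rng.standard_normal((n_q_heads, d_head))
K = rng.standard_normal((T, n_kv_heads, d_head))
V = rng.standard_normal((T, n_kv_heads, d_head))

out = gqa_decode_step(q, K, V)

gqa_bytes = K.nbytes + V.nbytes
mha_bytes = 2 * T * n_q_heads * d_head * K.itemsize  # if every query head had its own K, V
print(out.shape, mha_bytes // gqa_bytes)
```

The output still has one vector per query head; only the cached tensors shrink, by the ratio of query heads to KV heads.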

Quantizing the KV cache

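# vLLM and TensorRT-LLM: FP8 KV cache
# - 2x memory reduction vs BF16
# - typically <1% quality drop (model and task dependent)

# llama.cpp: 4-bit KV cache (e.g. q4_0, with block-wise scales rather than per-tensor)
# - roughly 4x reduction vs BF16 (slightly less once the scales are counted)
# - typically 1-2% quality drop (model and task dependent)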

KV cache quantization is one of the highest-impact inference optimizations. It lets you serve longer contexts or bigger batches within the same memory budget, and it is supported by the major production inference engines.
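To illustrate the trade-off, here is a toy NumPy sketch. It uses symmetric int8 quantization with one scale per cached vector, as a stand-in for FP8, which NumPy does not provide. It is not the scheme used by any of the engines above, and the helper names are invented for this example.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric absmax quantization along the last axis (one scale per vector)."""
    scale = (np.abs(x).max(axis=-1, keepdims=True) / 127.0).astype(np.float16)
    scale = np.where(scale == 0, np.float16(1.0), scale)  # guard all-zero vectors
    q = np.clip(np.round(x / scale.astype(np.float32)), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale.astype(np.float32)

rng = np.random.default_rng(0)
# Toy cached keys: (seq, n_kv_heads, d_head), stored in 16-bit floats.
K = rng.standard_normal((1024, 8, 128)).astype(np.float16)
K32 = K.astype(np.float32)

q, scale = quantize_int8(K32)
K_hat = dequantize_int8(q, scale)

orig_bytes = K.nbytes
quant_bytes = q.nbytes + scale.nbytes  # int8 payload plus fp16 scales
max_abs_err = np.abs(K_hat - K32).max()

print(f"compression: {orig_bytes / quant_bytes:.2f}x")
print(f"max abs error: {max_abs_err:.4f}")
```

The scales add a small overhead, so the saving lands just under 2×. Finer-grained scales reduce error at the cost of more overhead, which is the same trade-off block-wise formats make.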

In short: the KV cache holds the K and V tensors of past tokens so they are reused instead of recomputed. It is often the biggest memory cost at long context, so quantize it for serving.