KV Cache Memory Growth — Belgavi.AI Lab

Controls: Layers, KV heads, d_head, Context (8192), Batch, Precision

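KV cache bytes = 2 * L * kv_heads * d_head * ctx * batch * bytes_per_val, where L is the number of layers and the factor 2 counts one key tensor and one value tensor per layer.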
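A minimal sketch of the formula in Python. The function name and the example configuration (32 layers, 8 KV heads, d_head 128, FP16) are assumptions chosen for illustration, not values taken from this page apart from the 8192 context.

```python
def kv_cache_bytes(layers, kv_heads, d_head, ctx, batch, bytes_per_val):
    """Total KV cache size in bytes.

    The leading 2 accounts for storing both keys and values in every layer.
    """
    return 2 * layers * kv_heads * d_head * ctx * batch * bytes_per_val


# Hypothetical configuration: 32 layers, 8 KV heads, d_head 128,
# context 8192, batch 1, FP16 (2 bytes per value).
size = kv_cache_bytes(layers=32, kv_heads=8, d_head=128,
                      ctx=8192, batch=1, bytes_per_val=2)
print(f"{size / 2**30:.2f} GiB")  # prints "1.00 GiB"
```

Doubling either ctx or batch doubles the result, since every factor enters the product linearly.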

At long context, the KV cache can take more memory than the model weights. Quantizing it to FP8 or INT4 reduces that footprint for serving.

★ KEY TAKEAWAY

KV cache scales linearly with context length. At long context it often exceeds the model weights, which makes it the biggest memory cost in serving.
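To make the comparison with weights concrete, the sketch below estimates how many cached tokens (context times batch) it takes for the cache to match the weights. The model shape is hypothetical: 7 billion parameters in FP16, 32 layers, 32 KV heads (no grouped-query sharing), d_head 128. Real models differ, and fewer KV heads push the break-even point further out.

```python
import math


def kv_bytes_per_token(layers, kv_heads, d_head, bytes_per_val):
    # Keys and values (factor 2) for one token across all layers.
    return 2 * layers * kv_heads * d_head * bytes_per_val


def breakeven_tokens(weight_bytes, per_token_bytes):
    # Smallest number of cached tokens at which cache size >= weight size.
    return math.ceil(weight_bytes / per_token_bytes)


# Hypothetical model: 7e9 parameters stored in FP16 (2 bytes each).
weight_bytes = 7_000_000_000 * 2
per_token = kv_bytes_per_token(layers=32, kv_heads=32, d_head=128,
                               bytes_per_val=2)

print(per_token)                                  # 524288 bytes (0.5 MiB) per token
print(breakeven_tokens(weight_bytes, per_token))  # 26703 tokens
```

Under these assumptions a single 32K-token sequence, or a batch of eight 4K-token sequences, already holds more cache than weights.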

▶ WHAT TO TRY

Drag Context from 1K to 128K and watch the curve grow linearly.
Switch Precision to FP8 or INT4 for an immediate 2× or 4× memory reduction relative to FP16.
This is why production engines such as vLLM and TensorRT-LLM offer an FP8 KV cache option.
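The same experiment can be approximated offline. This is a rough sketch under assumed values: the model shape (32 layers, 8 KV heads, d_head 128, batch 1) is hypothetical, and the INT4 figure counts 0.5 bytes per value while ignoring the scale and zero-point metadata that real quantized caches also store.

```python
# Bytes per stored value for each precision (INT4 packs two values per byte).
BYTES_PER_VAL = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}


def kv_cache_gib(ctx, precision, layers=32, kv_heads=8, d_head=128, batch=1):
    """KV cache size in GiB for a hypothetical model shape."""
    total = 2 * layers * kv_heads * d_head * ctx * batch * BYTES_PER_VAL[precision]
    return total / 2**30


# Sweep context from 1K to 128K tokens, doubling each step.
contexts = [1024 * 2**i for i in range(8)]
print(f"{'ctx':>8} " + " ".join(f"{p:>8}" for p in BYTES_PER_VAL))
for ctx in contexts:
    row = " ".join(f"{kv_cache_gib(ctx, p):8.3f}" for p in BYTES_PER_VAL)
    print(f"{ctx:>8} {row}")
```

With these assumptions the FP16 column runs from 0.125 GiB at 1K to 16 GiB at 128K, and the FP8 and INT4 columns are exactly one half and one quarter of it at every context length.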