Advertisement
KV cache = 2 * L * kv_heads * d_head * ctx * batch * bytes_per_val.
What you're seeing
Long-context KV cache often exceeds model weights. Quantize to FP8 or INT4 for serving.
★ KEY TAKEAWAY
KV cache scales linearly with context length. At long context, it often exceeds model weights — biggest memory cost in serving.
▶ WHAT TO TRY
- Drag Context from 1K to 128K — see the curve grow.
- Switch Precision to FP8 or INT4 — instant 2× / 4× memory reduction.
- This is why production engines (vLLM, TensorRT-LLM) ship FP8 KV cache by default.