At long context and moderate batch size, the KV cache can rival or exceed the model weights in memory. Quantizing it is one of the highest-leverage inference optimizations. The trade-off is quality vs memory vs throughput. Modern engines support FP8 and INT4 cache.


Why it's huge

# Llama 3 70B at seq=32K, batch=1, FP16:
# Per layer: 2 (K+V) * 8 (KV heads) * 128 (d_head) * 32K (tokens) * 2 bytes = 128 MiB
# 80 layers: 10 GiB just for KV cache
# At batch=8: 80 GiB

For comparison, the 70B weights take about 140 GB in FP16. Already at moderate batch and context, the KV cache rivals or exceeds the model weights. Quantizing it relaxes the memory constraint for serving.
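The per-layer arithmetic above can be wrapped in a small helper to try other shapes and cache widths. This is a minimal sketch: the function name and parameters are made up for illustration, and the INT4 figure ignores the extra bytes needed for scales.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem):
    """Bytes needed to hold K and V for every layer, token, and sequence."""
    # Factor 2 covers the K tensor and the V tensor.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem
    return per_token_per_layer * n_layers * seq_len * batch_size


GIB = 1024 ** 3

# Llama 3 70B shape used in the text: 80 layers, 8 KV heads, head_dim 128.
for batch in (1, 8):
    for name, width in (("FP16", 2), ("FP8", 1), ("INT4", 0.5)):
        size = kv_cache_bytes(80, 8, 128, 32 * 1024, batch, width)
        print(f"batch={batch} {name}: {size / GIB:.1f} GiB")
```

Because the size is linear in every factor, halving the bytes per element buys exactly as much as halving the context length or the batch size.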

FP8 cache (vLLM, TensorRT-LLM)

# Store K and V as FP8 (E4M3 or E5M2)
# 2x memory reduction vs FP16
# Quality drop: typically <0.5% on benchmarks

Standard production quantization for KV cache in 2026. Native FP8 hardware support exists on H100-class and newer GPUs; CPU support (AMX) depends on the processor generation. In vLLM it is a one-flag change: --kv-cache-dtype fp8.
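To get a feel for the rounding error FP8 introduces, the E4M3 format (4 exponent bits, 3 mantissa bits, largest finite value 448) can be emulated in NumPy. This is only a sketch under stated assumptions: the helper names are invented here, the input is synthetic Gaussian data, a single per-tensor scale is used, and no engine's actual kernel is reproduced.

```python
import numpy as np

E4M3_MAX = 448.0   # largest finite E4M3 value
E4M3_MIN_EXP = -5  # np.frexp exponent of the smallest normal value, 2**-6


def round_to_e4m3(x):
    """Round values to the nearest number representable in FP8 E4M3."""
    x = np.clip(np.asarray(x, dtype=np.float32), -E4M3_MAX, E4M3_MAX)
    _, exp = np.frexp(x)                  # x = m * 2**exp with 0.5 <= |m| < 1
    exp = np.maximum(exp, E4M3_MIN_EXP)   # subnormals share the smallest exponent
    # 4 significant bits (1 implicit + 3 stored) give a spacing of 2**(exp - 4).
    step = np.ldexp(np.float32(1.0), exp - 4)
    return np.round(x / step) * step


def fake_quant_fp8(x):
    """Per-tensor scaled round trip: scale into range, round, scale back."""
    scale = float(np.abs(x).max()) / E4M3_MAX
    return round_to_e4m3(x / scale) * scale


rng = np.random.default_rng(0)
k = rng.standard_normal((1024, 128)).astype(np.float32)  # toy K block: tokens x head_dim
k_hat = fake_quant_fp8(k)
rel_rmse = float(np.sqrt(np.mean((k - k_hat) ** 2) / np.mean(k ** 2)))
print(f"relative RMS error after FP8 round trip: {rel_rmse:.4f}")
```

Per-element error on toy data is not the same thing as benchmark quality, so treat the printed number as intuition only and measure on the real model.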


INT4 cache (llama.cpp)

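# Store K and V as INT4 with per-block scaling
# ~4x memory reduction vs FP16 (slightly less once the per-block scales are stored)
# Quality drop: 1-3% on benchmarks, sometimes more

More aggressive. Used in llama.cpp for memory-constrained CPU inference, where the cache types are set with --cache-type-k and --cache-type-v (for example q4_0). Quality varies by task; long-context coherence sometimes degrades. Verify on your workload.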
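Per-block INT4 can be sketched in a few lines of NumPy. The block size of 32, the FP16 scale, and the function names below are assumptions chosen for illustration, and the codes are kept in int8 rather than packed two per byte. This is not llama.cpp's actual storage layout.

```python
import numpy as np

BLOCK = 32  # values per block (assumed for illustration)


def quantize_int4_blocks(x, block=BLOCK):
    """Symmetric INT4: each block of `block` values shares one FP16 scale."""
    blocks = np.asarray(x, dtype=np.float32).reshape(-1, block)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = (absmax / 7.0).astype(np.float16)
    scale[scale == 0] = 1.0  # all-zero (or underflowing) blocks: avoid dividing by zero
    q = np.clip(np.round(blocks / scale.astype(np.float32)), -8, 7).astype(np.int8)
    return q, scale


def dequantize_int4_blocks(q, scale, shape):
    """Rebuild an approximate float tensor from codes and per-block scales."""
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(shape)


rng = np.random.default_rng(0)
v = rng.standard_normal((1024, 128)).astype(np.float32)  # toy V block: tokens x head_dim
q, scale = quantize_int4_blocks(v)
v_hat = dequantize_int4_blocks(q, scale, v.shape)

rel_rmse = float(np.sqrt(np.mean((v - v_hat) ** 2) / np.mean(v ** 2)))
bits_per_value = 4 + 16 / BLOCK  # 4-bit code plus its share of the FP16 scale
print(f"relative RMS error: {rel_rmse:.3f}")
print(f"compression vs FP16: {16 / bits_per_value:.2f}x")
```

With these assumed sizes each value costs 4.5 bits, so the saving over FP16 is about 3.6x rather than a full 4x. Smaller blocks track local magnitudes better but spend more on scales.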

Per-channel vs per-tensor

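Per-tensor: one scale for the whole cache tensor. Cheap to apply, but inaccurate when a few channels have much larger magnitudes than the rest. Per-channel (one scale per head_dim index): more accurate, slightly more storage for scales. INT4 caches generally need this finer granularity (per-channel or per-block scales).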
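The gap between the two granularities shows up clearly once a few channels are much larger than the rest. The sketch below uses synthetic data with four inflated channels (an assumption made purely for illustration) and a hypothetical `fake_quant_int4` helper that quantizes and immediately dequantizes.

```python
import numpy as np


def fake_quant_int4(x, axis=None):
    """Symmetric INT4 round trip.

    axis=None uses one scale for the whole tensor (per-tensor);
    axis=0 uses one scale per column, i.e. per head_dim channel.
    """
    absmax = np.abs(x).max(axis=axis, keepdims=True)
    scale = np.maximum(absmax, 1e-8) / 7.0
    return np.clip(np.round(x / scale), -8, 7) * scale


def rel_rmse(ref, approx):
    """Root-mean-square error relative to the RMS of the reference."""
    return float(np.sqrt(np.mean((ref - approx) ** 2) / np.mean(ref ** 2)))


rng = np.random.default_rng(0)
k = rng.standard_normal((1024, 128)).astype(np.float32)  # tokens x head_dim
k[:, :4] *= 20.0  # synthetic assumption: a few channels carry much larger values

err_tensor = rel_rmse(k, fake_quant_int4(k, axis=None))
err_channel = rel_rmse(k, fake_quant_int4(k, axis=0))
print(f"per-tensor  relative RMS error: {err_tensor:.3f}")
print(f"per-channel relative RMS error: {err_channel:.3f}")
```

With a single scale, the large channels set the step size and most ordinary channels round to zero. Per-channel scales cost one extra number per channel and avoid that collapse.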

When to skip cache quant

Short-context use cases (<2K tokens), where the cache is small. Quality-critical tasks where you cannot afford a drop of around 1%. Memory-rich systems where compute (not memory) is the bottleneck. Quantize the cache when you are long-context or memory-constrained.

FP8 KV cache is the production default. INT4 for memory-constrained CPU. Per-channel scaling beats per-tensor.