FP8 (8-bit floating point) landed with NVIDIA Hopper (H100) and is now a common inference precision for high-throughput serving. It pairs INT8-class speed and memory savings with floating point's dynamic range, which greatly reduces the outlier-clipping problem that plagues INT8 activations.
Two formats: E4M3 and E5M2
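E4M3: 1 sign bit, 4-bit exponent, 3-bit mantissa. Higher precision, with a largest finite value of 448. Typically used for weights and activations in the forward pass.
E5M2: 1 sign bit, 5-bit exponent, 2-bit mantissa. Wider dynamic range, with a largest finite value of 57344. Typically used for gradients in the backward pass during training.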
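The bit layouts above can be checked with a small decoder. This is a minimal sketch, assuming the usual sign | exponent | mantissa layout with exponent biases of 7 (E4M3) and 15 (E5M2); the function name `decode_fp8` is made up for illustration, and the special encodings (NaN in E4M3, infinity and NaN in E5M2) are deliberately not handled.

```python
def decode_fp8(byte: int, exp_bits: int, man_bits: int, bias: int) -> float:
    """Decode one FP8 byte laid out as sign | exponent | mantissa.

    Illustrative only: NaN and infinity encodings are not handled.
    """
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    frac = man / (1 << man_bits)
    if exp == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        return sign * frac * 2.0 ** (1 - bias)
    # Normal: implicit leading 1.
    return sign * (1.0 + frac) * 2.0 ** (exp - bias)


# Largest finite encodings: 0x7E for E4M3, 0x7B for E5M2.
print(decode_fp8(0x7E, exp_bits=4, man_bits=3, bias=7))   # 448.0
print(decode_fp8(0x7B, exp_bits=5, man_bits=2, bias=15))  # 57344.0
```

The extra exponent bit in E5M2 buys roughly seven more powers of two at the top of the range, at the cost of halving the number of mantissa steps per power of two.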
Why it beats INT8 for activations
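INT8 spaces its 256 levels evenly, so a single large outlier stretches the scale and pushes small values toward zero unless you add careful per-channel scaling or clip the outlier. FP8 still uses a scale factor, but its exponent keeps relative precision roughly constant across magnitudes, so outliers and small values can share one tensor. Quality typically stays close to FP16, often without SmoothQuant-style tricks.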
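The effect of a single outlier can be sketched in pure Python. This is an illustration, not a production kernel: both paths use simple per-tensor absmax scaling, the helper names are hypothetical, the activation values are made up, and the FP8 path only simulates E4M3 rounding in software.

```python
import math

E4M3_MAX = 448.0  # largest finite E4M3 magnitude


def fake_quant_int8(xs):
    """Symmetric per-tensor INT8: a single scale and evenly spaced levels."""
    scale = max(abs(x) for x in xs) / 127.0
    return [round(x / scale) * scale for x in xs]


def round_to_e4m3(x):
    """Round x to the nearest E4M3 value (3 mantissa bits), saturating at +/-448."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) == e - 1.
    exp = max(math.frexp(mag)[1] - 1, -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)                # spacing between neighbours at this exponent
    return math.copysign(round(mag / step) * step, x)


def fake_quant_fp8(xs):
    """Per-tensor FP8: scale so the largest magnitude lands on E4M3_MAX, then round."""
    scale = max(abs(x) for x in xs) / E4M3_MAX
    return [round_to_e4m3(x / scale) * scale for x in xs]


activations = [0.01, -0.02, 0.03, 0.05, -0.04, 60.0]  # made-up values, one outlier
int8_out = fake_quant_int8(activations)
fp8_out = fake_quant_fp8(activations)

for x, qi, qf in zip(activations, int8_out, fp8_out):
    print(f"{x:>8.3f}  int8={qi:>8.4f}  fp8={qf:>8.4f}")
```

With these inputs the INT8 path rounds every small value to zero because the outlier sets the step size, while the simulated FP8 path keeps each small value within a few percent of the original.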
Throughput vs FP16
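On H100/H200: up to about 2x peak tensor-core throughput over FP16, since Hopper already has native FP8 tensor cores. On Blackwell: FP8 keeps roughly the same 2x ratio over FP16 on the same chip; the 4x figure applies to FP4, not FP8. Weight and activation memory traffic is halved, which matters most in memory-bound decode. Real end-to-end speedups are usually below the peak ratio and depend on batch size and kernel support.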
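The memory-traffic side of this can be estimated with back-of-the-envelope arithmetic. The sketch below assumes small-batch decode is limited by reading every weight once per generated token, and uses a hypothetical 8-billion-parameter model with 3.35 TB/s of memory bandwidth (roughly the published H100 SXM figure). It ignores KV-cache reads, activations, and kernel overhead, so it gives a lower bound rather than a prediction.

```python
def min_decode_ms(params: float, bytes_per_param: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token decode latency if every weight is read once per token."""
    return params * bytes_per_param / bandwidth_bytes_per_s * 1e3


PARAMS = 8e9          # hypothetical model size (assumption)
BANDWIDTH = 3.35e12   # bytes per second, approximate H100 SXM HBM bandwidth (assumption)

fp16_ms = min_decode_ms(PARAMS, 2, BANDWIDTH)
fp8_ms = min_decode_ms(PARAMS, 1, BANDWIDTH)

print(f"FP16 lower bound: {fp16_ms:.2f} ms/token")
print(f"FP8  lower bound: {fp8_ms:.2f} ms/token")
```

Halving the bytes per weight halves this bound, which is why FP8 helps decode even when the tensor cores are far from saturated.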
KV cache in FP8
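Storing K and V tensors in FP8 cuts KV cache memory by 2x vs FP16, usually with a small quality drop (often reported as under 1% on standard benchmarks, though this varies by model and task). This enables longer contexts or larger batches at the same memory budget. Supported as an option in vLLM 0.6+ and TensorRT-LLM.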
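The 2x saving follows directly from the cache-size formula. The sketch below uses a hypothetical model shape (32 layers, 8 KV heads, head dimension 128), chosen only for illustration, and ignores the small overhead of any FP8 scale factors.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    """Total KV cache size: K and V (hence the factor 2) for every layer and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical model shape and workload (assumptions, not from any specific model card).
shape = dict(layers=32, kv_heads=8, head_dim=128)
workload = dict(seq_len=32768, batch=8)

fp16_bytes = kv_cache_bytes(**shape, **workload, bytes_per_elem=2)
fp8_bytes = kv_cache_bytes(**shape, **workload, bytes_per_elem=1)

print(f"FP16 KV cache: {fp16_bytes / 2**30:.0f} GiB")
print(f"FP8  KV cache: {fp8_bytes / 2**30:.0f} GiB")
```

For this shape and workload the cache drops from 32 GiB to 16 GiB. In vLLM the FP8 cache is selected with the `kv_cache_dtype="fp8"` engine argument; check the documentation for your installed version, since defaults and supported variants have changed across releases.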
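When to use FP8 vs INT4
FP8: easier to deploy, near-zero quality drop in most reported cases, but requires GPUs with FP8 tensor cores (Hopper, Ada, or newer). INT4 weight-only quantization (for example with Marlin kernels; note that Blackwell's native 4-bit tensor-core format is FP4, not INT4): smaller memory footprint than FP8, a slight quality cost, and broader hardware support. Both have a place; FP8 is the safer default on supported hardware.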