FP8 Inference Explained — Belgavi.AI Lab

FP8, the 8-bit floating-point format, arrived in NVIDIA's data-center GPUs with Hopper (H100) and has become a common precision for high-throughput inference serving. It pairs INT8-class speed with floating point's dynamic range, which greatly reduces the outlier-clipping problem that plagues INT8 activations.

Two formats: E4M3 and E5M2

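Both formats spend one bit on the sign. E4M3: 4-bit exponent, 3-bit mantissa. Higher precision, with a maximum finite value of 448. Used for weights and forward activations. E5M2: 5-bit exponent, 2-bit mantissa. Wider dynamic range (maximum finite value 57344) at lower precision. Used for backward gradients during training.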
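
To make the bit layouts concrete, here is a minimal Python sketch that decodes an 8-bit pattern into its value. The helper name decode_fp8 is hypothetical, and the exponent biases (7 for E4M3, 15 for E5M2) follow the commonly used FP8 definitions; NaN and infinity patterns are deliberately not handled.

```python
def decode_fp8(byte: int, exp_bits: int, man_bits: int, bias: int) -> float:
    """Decode one FP8 bit pattern (finite values only; NaN/inf not handled)."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    frac = man / (1 << man_bits)
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed exponent of (1 - bias).
        return sign * frac * 2.0 ** (1 - bias)
    return sign * (1.0 + frac) * 2.0 ** (exp - bias)


# Largest finite values. In E4M3 the all-ones pattern is NaN, so the maximum
# has mantissa 110. In E5M2 the all-ones exponent is reserved for inf/NaN.
E4M3_MAX = decode_fp8(0b0_1111_110, exp_bits=4, man_bits=3, bias=7)
E5M2_MAX = decode_fp8(0b0_11110_11, exp_bits=5, man_bits=2, bias=15)

print(E4M3_MAX)  # 448.0
print(E5M2_MAX)  # 57344.0
```

The extra exponent bit buys E5M2 a range more than 100 times wider, while the extra mantissa bit gives E4M3 twice as many steps between consecutive powers of two.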

Why it beats INT8 for activations

INT8 needs careful per-channel scaling to avoid clipping outliers. FP8's exponent gives small and large values similar relative precision, so a single per-tensor scale usually copes with outliers. Quality typically stays close to FP16 without SmoothQuant-style tricks.

Throughput vs FP16

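On H100/H200: up to 2x peak tensor-core throughput, using Hopper's native FP8 tensor cores. On Blackwell (B100 and later): FP8 is again about 2x FP16 on the same chip; a 4x figure holds only against Hopper-generation FP16 or for Blackwell's FP4. Weight memory traffic is halved, which is what speeds up bandwidth-bound decoding. These are real speedups, not benchmark theater, though end-to-end gains usually land below the peak ratio.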
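
A back-of-the-envelope bound shows where the decode-side gain comes from. The sketch below assumes batch size 1, a hypothetical 8-billion-parameter model, and 3.35 TB/s of memory bandwidth (roughly an H100 SXM); it ignores KV-cache reads, activations, and kernel overhead, so it is a lower bound on latency rather than a prediction.

```python
def min_decode_ms_per_token(n_params: float, bytes_per_param: float,
                            bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token latency when every weight is read once per token."""
    return n_params * bytes_per_param / bandwidth_bytes_per_s * 1e3


N_PARAMS = 8e9       # assumed model size, for illustration only
BANDWIDTH = 3.35e12  # assumed memory bandwidth in bytes per second

fp16_ms = min_decode_ms_per_token(N_PARAMS, 2, BANDWIDTH)
fp8_ms = min_decode_ms_per_token(N_PARAMS, 1, BANDWIDTH)

print(f"FP16: {fp16_ms:.2f} ms/token, FP8: {fp8_ms:.2f} ms/token")
print(f"ratio: {fp16_ms / fp8_ms:.1f}x")
```

Under these assumptions the bound drops from roughly 4.8 ms to 2.4 ms per token. In practice the measured ratio is smaller, because not all of the per-token time is spent streaming weights.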

KV cache in FP8

Storing K and V tensors in FP8 halves KV-cache memory versus FP16, with a minimal quality drop (<1%). That enables longer contexts or larger batches at the same memory budget. Supported as an option in vLLM 0.6+ and TensorRT-LLM.
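
The saving is easy to estimate by hand. The following sketch computes KV-cache size for an assumed model shape (32 layers, 8 KV heads, head dimension 128, similar to an 8B-class model with grouped-query attention) at an assumed context length of 8192 and batch size of 16. The function name and all of these numbers are illustrative, and any per-tensor scaling factors stored alongside an FP8 cache are ignored as negligible.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int) -> int:
    """Total KV-cache size: one K and one V tensor per layer (the factor 2)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Assumed, illustrative model shape and workload.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192, batch=16)

fp16_bytes = kv_cache_bytes(**shape, bytes_per_elem=2)
fp8_bytes = kv_cache_bytes(**shape, bytes_per_elem=1)

GIB = 2 ** 30
print(f"FP16 KV cache: {fp16_bytes / GIB:.1f} GiB")  # 16.0 GiB
print(f"FP8 KV cache:  {fp8_bytes / GIB:.1f} GiB")   # 8.0 GiB
```

The 8 GiB freed in this example could instead hold a second batch of 16 sequences or double the context length for the same batch.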

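When to use FP8 vs INT4

FP8: easier to deploy, near-zero quality drop, requires GPUs with FP8 tensor cores (H100 and newer, plus Ada-generation cards). INT4 weight-only quantization (for example with Marlin kernels): smaller memory than FP8, slight quality cost, broader hardware support, including Ampere. Note that Blackwell's native 4-bit format is FP4, not INT4. Both have a place; FP8 is the safer default on supported hardware.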

FP8 = INT8 speed + FP dynamic range. A sensible default precision on H100+. An FP8 KV cache is often the biggest memory win for long contexts and large batches.