Mixed-Precision Inference — Belgavi.AI Lab

Modern LLM inference is not 'one precision throughout'. Different layers and tensors use different precisions to balance speed, memory, and quality. Knowing what runs at what precision helps you reason about quality regressions and pick deployment configurations.

Advertisement

The typical mix

Weights: INT4 (most params). KV cache: FP8 (compressing the dominant memory cost). Activations: FP8 or BF16 (depending on stack). Output projection: BF16 (numerical stability). Logits: FP32 (for sampling stability). Five precisions in one forward pass; standard.

Why not all one precision

All-INT4 hurts quality on certain layers (especially output, certain attention paths). All-FP16 wastes memory on quantizable weights. The mixed precision is the cost-quality optimum; pure precisions are corners.

Advertisement

Stack support

vLLM 0.6+, SGLang, TensorRT-LLM all do mixed precision automatically per layer. You configure target weight precision; they pick the right activation precision per op. Manual fine-tuning rarely needed.

Hardware constraints

Hopper (H100/H200): FP8 tensor cores. INT4 via Marlin kernels. Ampere (A100): FP16/BF16 tensor cores, INT4 via dequantize-then-FP16. Blackwell (B100/B200): native FP4. Pick precisions your hardware accelerates; mismatch wastes performance.

Debugging

If a quantized model is producing garbage: bisect — which layer's precision change broke it. Tools (NVIDIA's nv-cusparselt, llm-compressor's per-layer eval) make this manageable. Don't blame the whole quantization recipe; one layer often is the culprit.

Five precisions in one model is normal. Configure target; stack picks per-layer. Hardware sets ceiling. Bisect on quality regressions.