LLM quantization went mainstream once GPT-class models had to fit on consumer GPUs. Choosing between INT8, INT4, and lower precisions is no longer a 'research vs prod' question; it's a daily inference decision. The quality/size tradeoff is now well characterized.

INT8: nearly-free downsizing

Weights and activations are both quantized to 8 bits. Expect roughly a 0.5% quality drop on benchmarks, about half the memory of FP16, and ~1.5-2x faster inference on modern GPUs (Tensor Core INT8 paths). This is the default choice for production inference unless you're memory-constrained.
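
To make the 8-bit mapping concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. The absmax scaling scheme, the function names, and the random Gaussian "weights" are assumptions for illustration; the article does not specify a scheme, and production INT8 kernels typically use finer-grained (per-channel or per-token) scales.

```python
import random

def quantize_int8(values):
    """Symmetric per-tensor quantization: map [-absmax, absmax] onto [-127, 127]."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    # Round to the nearest integer step, then clamp to the signed 8-bit range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate real values from integer codes."""
    return [qi * scale for qi in q]

random.seed(0)
# Stand-in for a weight tensor (assumption: roughly Gaussian values).
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

fp16_bytes = 2 * len(weights)  # 2 bytes per FP16 value
int8_bytes = 1 * len(weights)  # 1 byte per INT8 value
print(f"size ratio: {fp16_bytes / int8_bytes:.1f}x")
print(f"max abs error: {max_err:.2e} (bound scale/2 = {scale / 2:.2e})")
```

The rounding error per value is bounded by half a quantization step, which is why 8 bits cost so little when the value range is well behaved.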

INT4: aggressive but viable for weights

Weights go to 4 bits while activations stay at 8 or 16 bits (mixed precision). Expect a ~1-3% quality drop with GPTQ or AWQ, and weights ~4x smaller than FP16. This is the right choice for fitting 70B models on 48GB GPUs.
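
The 70B-on-48GB claim follows from simple arithmetic, sketched below. The helper name is hypothetical, and the group size of 128 with one FP16 scale per group is an assumed (though common) configuration, not something the article states. The estimate covers weights only; KV cache and activations need additional room.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in GB (10^9 bytes) at a given bit width."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 70e9  # the 70B model from the text

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(N_PARAMS, bits):.0f} GB")

# Assumption: group-wise INT4 with one FP16 scale per 128 weights,
# which adds 16 / 128 = 0.125 bits per weight of metadata.
effective_bits = 4 + 16 / 128
int4_grouped_gb = weight_memory_gb(N_PARAMS, effective_bits)
headroom_gb = 48 - int4_grouped_gb
print(f"INT4 (group size 128): {int4_grouped_gb:.1f} GB")
print(f"headroom on a 48 GB GPU: {headroom_gb:.1f} GB")
```

At FP16 the weights alone take 140 GB, at INT8 70 GB, and at INT4 35 GB, so 4-bit is the first width at which a 70B model fits on a single 48GB card with space left over.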

INT3, INT2, binary: not free

Below 4 bits, quality drops sharply. INT2 with QuIP or similar methods can preserve ~80% of baseline performance, but most workloads can't tolerate a drop that large. These formats suit research, not production.
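
A rough way to see why error grows so quickly at low bit widths is to quantize the same values at 8, 4, 3, and 2 bits and compare reconstruction error. The sketch below uses naive round-to-nearest with a single per-tensor scale on random Gaussian values; both are assumptions for illustration. It is not QuIP, GPTQ, or AWQ, and weight reconstruction error is only a proxy for task quality.

```python
import random

def rtn_relative_error(weights, bits):
    """Round-to-nearest symmetric quantization; returns RMS error / RMS weight."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit, 3 for 3-bit, 1 for 2-bit
    scale = max(abs(w) for w in weights) / qmax
    sq_err = 0.0
    sq_w = 0.0
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))
        sq_err += (w - q * scale) ** 2
        sq_w += w ** 2
    return (sq_err / sq_w) ** 0.5

random.seed(0)
# Stand-in for a weight tensor (assumption: unit Gaussian values).
weights = [random.gauss(0.0, 1.0) for _ in range(8192)]

errors = {bits: rtn_relative_error(weights, bits) for bits in (8, 4, 3, 2)}
for bits, err in errors.items():
    print(f"INT{bits}: relative RMS error = {err:.3f}")
```

Each bit removed roughly doubles the step size, and with only three levels at 2 bits most values collapse to zero. Methods built for sub-4-bit quantization exist to fight exactly this effect.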

Calibration matters more than algorithm

GPTQ, AWQ, and SmoothQuant differ only at the margin. The bigger lever is calibration data: 1024-2048 samples representative of your inference distribution. Poorly matched calibration data causes a big quality hit regardless of algorithm.
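
The effect of mismatched calibration can be illustrated with a toy static activation scale. In this sketch the scale is fixed from calibration samples and then applied to inference-time values. The function names, the max-based calibration rule, and the Gaussian distributions standing in for "generic" and "in-domain" activations are all assumptions for illustration.

```python
import random

def calibrate_scale(samples, qmax=127):
    """Static scale from the largest magnitude seen during calibration."""
    return max(abs(x) for x in samples) / qmax

def quant_rms_error(values, scale, qmax=127):
    """RMS error after quantize/dequantize with a fixed scale (out-of-range values clip)."""
    total = 0.0
    for x in values:
        q = max(-qmax, min(qmax, round(x / scale)))
        total += (x - q * scale) ** 2
    return (total / len(values)) ** 0.5

random.seed(0)
# Hypothetical stand-ins: in-domain activations have a wider spread than generic ones.
in_domain_calib = [random.gauss(0.0, 4.0) for _ in range(2048)]
generic_calib = [random.gauss(0.0, 1.0) for _ in range(2048)]
inference_acts = [random.gauss(0.0, 4.0) for _ in range(8192)]

err_matched = quant_rms_error(inference_acts, calibrate_scale(in_domain_calib))
err_mismatched = quant_rms_error(inference_acts, calibrate_scale(generic_calib))
print(f"matched calibration RMS error:    {err_matched:.4f}")
print(f"mismatched calibration RMS error: {err_mismatched:.4f}")
```

With the mismatched scale, a large share of inference values fall outside the calibrated range and get clipped, and no choice of rounding algorithm can recover what clipping throws away.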

Practical guidance

Production: INT8 if you have the GPU memory, INT4 if you don't. Use AWQ or GPTQ as the algorithm. Calibrate on your domain. Validate on your own evals, not generic benchmarks: quantization sometimes hurts long-context or reasoning tasks more than benchmark numbers show.
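
One way to act on the "validate on your evals" advice is a per-task regression gate that compares quantized scores against the full-precision baseline. The sketch below is a hypothetical helper, not part of any library. The task names, the 2% threshold, and the scores are invented for illustration and are not measured results.

```python
def regressions(baseline, quantized, max_relative_drop=0.02):
    """Return tasks whose quantized score fell more than max_relative_drop below baseline."""
    flagged = {}
    for task, base_score in baseline.items():
        drop = (base_score - quantized[task]) / base_score
        if drop > max_relative_drop:
            flagged[task] = round(drop, 4)
    return flagged

# Hypothetical scores, for illustration only (not measured results).
baseline = {
    "generic_benchmark": 0.70,
    "long_context_qa": 0.60,
    "multi_step_reasoning": 0.50,
}
int4 = {
    "generic_benchmark": 0.69,
    "long_context_qa": 0.55,
    "multi_step_reasoning": 0.47,
}

flagged = regressions(baseline, int4)
for task, drop in flagged.items():
    print(f"{task}: relative drop {drop:.1%}")
```

In this made-up example the generic benchmark passes while the long-context and reasoning tasks are flagged, which is the pattern the paragraph above warns about: an aggregate number can hide task-specific damage.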

INT8 default. INT4 when fitting bigger models. Always calibrate on your data and eval on your tasks.