INT8 weight quantization is nearly free. INT8 activation quantization is where it gets interesting: activations have outliers that wreck naive quantization quality. SmoothQuant addresses this; the technique is now standard but still worth understanding.

The outlier problem

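LLM activations have a small number of channels with much larger magnitudes than the rest (often 100x). With naive per-tensor INT8 quantization, those outliers set the scale, so the remaining channels are squeezed into a handful of integer levels; tighten the range instead and the outliers get clipped. Either way quality suffers. Per-channel activation quantization would fix the range problem, but a separate scale per input channel cannot be factored out of the matrix multiply, so it maps poorly onto INT8 GEMM kernels.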
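To make the range problem concrete, here is a minimal sketch in plain Python. The activation row is made up for illustration (15 ordinary channels plus one channel about 100x larger), and `quantize_int8` is a hypothetical helper, not a library function. It compares the rounding error on the ordinary channels when the outlier sets the scale against the error when it does not.

```python
import random

def quantize_int8(values, scale):
    """Symmetric INT8 round-trip: quantize with `scale`, then dequantize."""
    return [max(-127, min(127, round(v / scale))) * scale for v in values]

def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

random.seed(0)
# Hypothetical activation row: 15 ordinary channels and one outlier channel.
normal = [random.uniform(-1.0, 1.0) for _ in range(15)]
row = normal + [100.0]

# Per-tensor scale is set by the largest magnitude, i.e. by the outlier.
per_tensor_scale = max(abs(v) for v in row) / 127
restored = quantize_int8(row, per_tensor_scale)
err_with_outlier = mean_abs_error(normal, restored[:15])

# Same ordinary channels, with a scale that ignores the outlier.
normal_scale = max(abs(v) for v in normal) / 127
err_without_outlier = mean_abs_error(normal, quantize_int8(normal, normal_scale))

print(f"step size with outlier:    {per_tensor_scale:.4f}")
print(f"step size without outlier: {normal_scale:.4f}")
print(f"mean error on ordinary channels: "
      f"{err_with_outlier:.4f} vs {err_without_outlier:.4f}")
```

With the outlier in the tensor, the quantization step is close to the full range of the ordinary channels, so they collapse onto roughly three integer levels.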

SmoothQuant's trick

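Migrate the difficulty from activations (hard) to weights (easy) via a mathematical equivalence: divide each activation input channel by a factor s_j and multiply the matching input channel of the weights by the same s_j, so the layer output is unchanged. The activation outliers are smoothed, and the weights, which were easy to quantize to begin with, absorb the extra magnitude without much trouble. The transform itself is exact; the accuracy loss after INT8 quantization is typically small rather than zero.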
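The equivalence can be checked on a toy layer. The sketch below uses the per-channel factor from the SmoothQuant paper, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), with alpha = 0.5 as the migration strength. The matrices and the helper names (`matmul`, `smoothing_factors`) are made up for illustration, and the weight matrix is assumed to be stored with one row per input channel.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def smoothing_factors(act_absmax, weight, alpha=0.5):
    """s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), one factor per input channel j."""
    w_absmax = [max(abs(w) for w in w_row) for w_row in weight]
    return [(a ** alpha) / (w ** (1 - alpha)) for a, w in zip(act_absmax, w_absmax)]

# Toy layer: 2 tokens, 3 input channels, 2 output channels.
# Input channel 2 carries the outliers.
X = [[0.5, -1.0, 80.0],
     [1.0, 0.25, -100.0]]
W = [[0.2, -0.1],   # rows are input channels
     [0.4, 0.3],
     [-0.3, 0.5]]

act_absmax = [max(abs(row[j]) for row in X) for j in range(3)]
s = smoothing_factors(act_absmax, W)

X_smooth = [[x / sj for x, sj in zip(row, s)] for row in X]      # X * diag(s)^-1
W_smooth = [[w * sj for w in w_row] for w_row, sj in zip(W, s)]  # diag(s) * W

Y = matmul(X, W)
Y_smooth = matmul(X_smooth, W_smooth)

print("largest |activation| before:", max(abs(v) for row in X for v in row))
print("largest |activation| after: ", max(abs(v) for row in X_smooth for v in row))
print("largest |weight| after:     ", max(abs(v) for row in W_smooth for v in row))
```

With alpha = 0.5 each channel ends up with the same maximum magnitude on the activation side and the weight side, which is why the difficulty is described as shared rather than simply moved. In a real model the division of the activations is usually folded into the preceding layer's parameters, so it costs nothing at run time.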

Calibration data choice

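Use on the order of 1000 samples representative of your inference workload. Random web text is fine as a starting baseline; domain data matters for domain-fine-tuned models. Calibration only needs forward passes to record per-channel activation maxima, with no gradients or retraining, so it is quick to iterate on.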
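The statistic that calibration collects is small: one running absolute maximum per activation channel. A minimal sketch, assuming activations arrive as batches of token rows; the data and the helper name `update_channel_absmax` are hypothetical, and in practice the values would be captured by hooks on each linear layer's input.

```python
def update_channel_absmax(running, batch):
    """Fold one batch of activation rows into per-channel running |max| values."""
    for row in batch:
        for j, value in enumerate(row):
            running[j] = max(running[j], abs(value))
    return running

# Hypothetical calibration batches: each is a list of token rows with 3 channels.
calibration_batches = [
    [[0.2, -1.1, 40.0], [0.9, 0.3, -75.0]],
    [[-0.5, 0.8, 98.0], [0.1, -1.4, 60.0]],
]

channel_absmax = [0.0, 0.0, 0.0]
for batch in calibration_batches:
    update_channel_absmax(channel_absmax, batch)

print(channel_absmax)
```

Because the result is a per-channel maximum, unrepresentative calibration data shows up as a channel whose real-workload outliers were never observed, which is the main reason domain data matters.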

Where it sits in the stack

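SmoothQuant is a post-training method: it is applied to an already-trained model, with no retraining. It is supported in optimum-intel, vLLM, and TensorRT-LLM. At inference, both activations and weights run in INT8 (W8A8); mixed-precision variants with INT4 weights are also common.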
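For orientation, this is roughly what a W8A8 linear layer computes at inference: quantize both operands, accumulate in integers, and rescale once at the end. It is a simplified sketch with a single scale per tensor and made-up inputs. The function names are hypothetical and do not correspond to any of the libraries above, and production kernels often use finer-grained scales such as per-token or per-output-channel.

```python
def quantize_per_tensor(matrix):
    """Symmetric INT8 quantization with a single scale for the whole matrix."""
    scale = max(abs(v) for row in matrix for v in row) / 127
    q = [[max(-127, min(127, round(v / scale))) for v in row] for row in matrix]
    return q, scale

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def w8a8_linear(x, w):
    xq, x_scale = quantize_per_tensor(x)
    wq, w_scale = quantize_per_tensor(w)
    # Integer multiply-accumulate; on hardware this is the INT8 GEMM
    # with wider integer accumulators.
    acc = matmul(xq, wq)
    # Per-tensor scales factor out of the sum, so one rescale per output suffices.
    return [[v * x_scale * w_scale for v in row] for row in acc]

# Hypothetical activations and weights with comparable ranges.
x = [[0.4, -0.6, 7.0],
     [0.45, 0.6, -7.07]]
w = [[0.45, -0.22],
     [0.63, 0.47],
     [-4.2, 7.07]]

y_int8 = w8a8_linear(x, w)
y_ref = matmul(x, w)

for row_q, row_f in zip(y_int8, y_ref):
    print([round(v, 3) for v in row_q], "vs", [round(v, 3) for v in row_f])
```

The final rescale works because the scales are constant along the summed dimension. A per-input-channel activation scale would sit inside the sum, which is the constraint that makes smoothing preferable to per-channel activation quantization.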

Limits

It doesn't help with quantizing attention scores (which have their own outlier story; an FP8 K/V cache is the modern answer). It also doesn't help with very low bit-width activations: INT4 activations are still hard.

SmoothQuant migrates outliers from activations to weights, which makes INT8 weights plus INT8 activations (W8A8) nearly free in quality terms. It is standard in modern inference servers.