Quantization-Aware Training

Post-training quantization (PTQ) is fast and good enough for most LLM use cases. Quantization-aware training (QAT) is slower and overkill for most of them. But for aggressive quantization (INT3, INT2, ternary) or specialized domains, QAT recovers quality that PTQ cannot.

PTQ — fast and usually good

Quantize after training and calibrate with ~1K samples. It takes hours to apply. Expect a ~1-2% quality drop for INT4 on standard LLMs. PTQ is the default for LLM inference; QLoRA fine-tuning also uses PTQ on the base model.
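A minimal sketch of the PTQ mechanics, assuming the simplest scheme: symmetric round-to-nearest with an absmax scale. The helper names (`absmax_scale`, `quantize`, `dequantize`) and the random stand-in data are illustrative only; production LLM PTQ methods are considerably more elaborate.

```python
import random

def absmax_scale(values, bits):
    """Symmetric scale that maps the largest magnitude onto the top integer level."""
    qmax = 2 ** (bits - 1) - 1              # 7 for INT4, 127 for INT8
    peak = max(abs(v) for v in values)
    return peak / qmax if peak > 0 else 1.0

def quantize(values, scale, bits):
    """Round to the nearest integer level and clamp to the signed range."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    return [max(qmin, min(qmax, round(v / scale))) for v in values]

def dequantize(codes, scale):
    """Map integer codes back to floats."""
    return [c * scale for c in codes]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]   # stand-in for one weight row

# Calibration: observe activations on a small sample set (~1K values here)
# and derive the activation scale from the range that was actually seen.
calibration_acts = [random.gauss(0.0, 1.0) for _ in range(1000)]
act_scale = absmax_scale(calibration_acts, bits=8)

# Weights need no data: their range is known once training is done.
w_scale = absmax_scale(weights, bits=4)
w_codes = quantize(weights, w_scale, bits=4)
w_hat = dequantize(w_codes, w_scale)

max_err = max(abs(w - h) for w, h in zip(weights, w_hat))
print(f"INT4 weight scale={w_scale:.5f}, max round-trip error={max_err:.5f}")
print(f"INT8 activation scale from calibration={act_scale:.5f}")
```

With an absmax scale nothing is clamped, so the round-trip error per weight is bounded by half a quantization step.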

QAT — train with fake-quantize ops

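During training, simulate quantization in the forward pass while keeping the gradients and the underlying weights in FP; the backward pass treats rounding as identity (the straight-through estimator). The model learns weights that are robust to the quantization error. Training is slower (1.5-2x), but QAT recovers most of the lost quality.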
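One training step, sketched by hand to show where the fake-quantize op sits. This assumes a single-weight model y = w * x, a fixed hypothetical scale of 0.25, and 3-bit signed levels; `fake_quantize` and `ste_grad` are illustrative helpers, not any library's API.

```python
def fake_quantize(w, scale, bits):
    """Quantize then immediately dequantize: a float that sits on the integer grid."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    code = max(qmin, min(qmax, round(w / scale)))
    return code * scale

def ste_grad(w, scale, bits, upstream_grad):
    """Straight-through estimator: treat round() as identity in the backward
    pass, and zero the gradient only where the forward pass clamped."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    return upstream_grad if qmin <= round(w / scale) <= qmax else 0.0

w_latent = 0.30            # full-precision weight the optimizer actually updates
scale, bits = 0.25, 3      # hypothetical fixed scale; grid is -1.0 ... 0.75
x, target, lr = 2.0, 1.0, 0.05

# Forward: the model only ever sees the quantized weight.
w_q = fake_quantize(w_latent, scale, bits)       # 0.30 snaps to 0.25
pred = w_q * x
loss = (pred - target) ** 2

# Backward: gradient w.r.t. the quantized weight, passed straight through.
grad_wq = 2 * (pred - target) * x
grad_w = ste_grad(w_latent, scale, bits, grad_wq)

# Update the FP latent weight; it moves to 0.40, which now snaps to 0.50.
w_latent -= lr * grad_w
print(w_latent, fake_quantize(w_latent, scale, bits))
```

The latent weight moves by small FP steps, and the quantized weight only changes when the latent value crosses a rounding boundary. That is why the FP copy has to be kept around during training.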

When QAT wins

Aggressive bit widths (INT3, INT2, ternary): PTQ can lose 10% or more, and QAT recovers most of it. Edge inference where every percent of accuracy matters. Specialized domains that the PTQ calibration data underrepresents.
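To see why low bit widths are the hard case, here is a rough sketch that measures the round-trip error of plain round-to-nearest at several bit widths. It assumes synthetic Gaussian weights and a per-tensor absmax scale, and it reports weight error only; that is a proxy, not the task-quality numbers quoted above.

```python
import math
import random

def roundtrip_rel_error(values, bits):
    """Relative RMS error after symmetric round-to-nearest quantization."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    scale = max(abs(v) for v in values) / qmax
    err_sq = 0.0
    ref_sq = 0.0
    for v in values:
        code = max(qmin, min(qmax, round(v / scale)))
        err_sq += (v - code * scale) ** 2
        ref_sq += v * v
    return math.sqrt(err_sq / ref_sq)

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]   # synthetic stand-in

errors = {bits: roundtrip_rel_error(weights, bits) for bits in (8, 4, 3, 2)}
for bits, err in errors.items():
    print(f"INT{bits}: relative RMS weight error = {err:.3f}")
```

Each bit removed roughly halves the number of levels, so the error grows quickly below INT4. That is the regime where letting the model adapt during training starts to pay off.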

LLM-specific complications

Pretraining-time QAT is impractical (pretraining is expensive enough already). Fine-tuning-time QAT works: take a pretrained model and fine-tune it with fake-quantize ops. The result loses some quality vs an FP fine-tune, but the model is quantization-ready at the end.
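The fine-tuning recipe can be sketched on a toy linear model. Everything here is an assumption for illustration: the "pretrained" weights, the fine-tuning target `task_w`, the 3-bit width, and the per-tensor absmax scale. It shows the mechanics only and says nothing about how much quality a real LLM keeps.

```python
import random

def fake_quantize(ws, bits):
    """Symmetric per-tensor fake quantization: snap weights to the integer grid
    set by an absmax scale, but return floats."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    peak = max(abs(w) for w in ws)
    scale = peak / qmax if peak > 0 else 1.0
    return [max(qmin, min(qmax, round(w / scale))) * scale for w in ws]

def mse(ws, data):
    """Mean squared error of a linear model over (inputs, target) pairs."""
    return sum((sum(w * x for w, x in zip(ws, xs)) - y) ** 2
               for xs, y in data) / len(data)

def fine_tune(start, data, bits=None, steps=200, lr=0.05):
    """Gradient descent on a linear model. With bits set, the forward pass uses
    fake-quantized weights and the gradient updates the FP latent copy (STE).
    The absmax scale never clamps, so the STE here is a plain pass-through."""
    latent = list(start)
    for _ in range(steps):
        used = fake_quantize(latent, bits) if bits else latent
        grads = [0.0] * len(latent)
        for xs, y in data:
            err = sum(w * x for w, x in zip(used, xs)) - y
            for i, x in enumerate(xs):
                grads[i] += 2 * err * x / len(data)
        latent = [w - lr * g for w, g in zip(latent, grads)]
    return fake_quantize(latent, bits) if bits else latent

random.seed(0)
pretrained = [0.9, -0.35, 0.2]     # hypothetical pretrained weights
task_w = [0.7, -0.5, 0.25]         # hypothetical fine-tuning target
data = []
for _ in range(200):
    xs = [random.gauss(0.0, 1.0) for _ in task_w]
    data.append((xs, sum(w * x for w, x in zip(task_w, xs))))

fp_weights = fine_tune(pretrained, data)             # plain FP fine-tune
qat_weights = fine_tune(pretrained, data, bits=3)    # fine-tune with fake-quantize
fp_loss, qat_loss = mse(fp_weights, data), mse(qat_weights, data)
print(f"FP fine-tune loss:  {fp_loss:.6f}")
print(f"QAT fine-tune loss: {qat_loss:.6f} (weights already on the 3-bit grid)")
```

The QAT run ends with weights that are already on the quantization grid, so no separate PTQ pass is needed. The FP run fits the task more closely, which is the quality gap described above.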

Tooling state

PyTorch FX graph mode quantization, Intel's NNCF, and NVIDIA's TensorRT-LLM stack (via the TensorRT Model Optimizer) all support QAT. The tooling is less mature than for PTQ but workable. Most teams will never need QAT; if you do, plan for the longer iteration cycle.

PTQ for INT8/INT4 (default). QAT for INT3/INT2 or accuracy-critical edge deployments. QAT during fine-tuning, not pretraining.