Post-training quantization (PTQ) is fast and good enough for most LLM use cases. Quantization-aware training (QAT) is slower and overkill for most of them. But for aggressive quantization (INT3, INT2, ternary) or specialized domains, QAT recovers quality that PTQ can't.
PTQ — fast and usually good
Quantize after training, calibrating on ~1K samples. It takes hours to apply and costs a ~1-2% quality drop for INT4 on standard LLMs. This is the default for LLM inference; QLoRA fine-tuning also applies PTQ to the frozen base model.
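The quantize-then-restore round trip can be sketched in a few lines. This is a minimal illustration, assuming symmetric per-tensor INT4 with the scale taken from the largest magnitude seen; the function names and the five example weights are hypothetical, and real PTQ methods use the calibration samples to choose scales more carefully.

```python
def calibrate_scale(samples, bits=4):
    """Pick a symmetric scale so the largest observed magnitude maps to qmax."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4
    max_abs = max(abs(x) for x in samples)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, bits=4):
    """Map floats to integer codes, clamped to the signed range for `bits`."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [q * scale for q in codes]


# Hypothetical weights standing in for one tensor of a trained model.
weights = [0.70, -0.33, 0.12, -0.04, 0.26]
scale = calibrate_scale(weights, bits=4)
codes = quantize(weights, scale, bits=4)
restored = dequantize(codes, scale)

for w, q, r in zip(weights, codes, restored):
    print(f"{w:+.2f} -> code {q:+d} -> {r:+.2f}")
```

Each restored value lands within half a quantization step of the original, which is the error the model has to tolerate after PTQ.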
QAT — train with fake-quantize ops
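During training, simulate quantization on the forward pass while keeping gradients in floating point (FP). Rounding has zero gradient almost everywhere, so the backward pass typically uses a straight-through estimator that treats the rounding step as the identity. The model learns weights that are robust to the quantization error. Training is slower (1.5-2x), but most of the lost quality comes back.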
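A toy version of the training loop shows the mechanics. This is a sketch under stated assumptions, not any framework's implementation: a single scalar weight, a fixed hypothetical scale of 0.25, plain gradient descent, and a straight-through estimator (the gradient is taken with respect to the quantized weight and applied unchanged to the latent FP weight). The names `fake_quantize` and `train_qat` are made up for illustration.

```python
def fake_quantize(w, scale, bits=4):
    """Round-trip w through the integer grid; the result is still a float."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


def train_qat(data, w_init, scale, lr=0.05, steps=200, bits=4):
    """Fit y = w * x, using the fake-quantized weight in the forward pass."""
    w = w_init  # latent full-precision weight, the thing the optimizer updates
    for _ in range(steps):
        w_q = fake_quantize(w, scale, bits)  # forward pass sees quantized weight
        grad = 0.0
        for x, y in data:
            err = w_q * x - y
            # Straight-through estimator: d(w_q)/d(w) is treated as 1,
            # so the gradient w.r.t. w_q is applied directly to w.
            grad += 2 * err * x
        w -= lr * grad / len(data)
    return w, fake_quantize(w, scale, bits)


# Toy data generated by a "true" weight of 0.62, which is not on the 0.25 grid.
data = [(1.0, 0.62), (2.0, 1.24), (-1.0, -0.62)]
w_latent, w_quant = train_qat(data, w_init=0.0, scale=0.25)
print(f"latent weight: {w_latent:.3f}, quantized weight: {w_quant}")
```

The latent weight settles near the target while the weight actually used in the forward pass stays on the quantization grid, so the loss the optimizer sees already includes the quantization error.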
When QAT wins
Aggressive bit widths (INT3, INT2, ternary): PTQ loses 10%+ quality; QAT recovers most of it. Edge inference, where every percent of accuracy matters. Specialized domains that the PTQ calibration data underrepresents.
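One way to see why the low bit widths are harder is to measure how much a plain round-to-nearest round trip distorts the weights as the grid gets coarser. The sketch below assumes symmetric absmax scaling and uses synthetic Gaussian values as stand-in weights (both are illustrative assumptions). It measures weight error only, not model quality.

```python
import math
import random


def roundtrip_rmse(weights, bits):
    """RMS error after quantizing to a symmetric `bits`-wide grid and back."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    sq_err = 0.0
    for w in weights:
        # Symmetric clamp: with bits=2 this leaves three levels (a ternary grid).
        q = max(-qmax, min(qmax, round(w / scale)))
        sq_err += (w - q * scale) ** 2
    return math.sqrt(sq_err / len(weights))


rng = random.Random(0)  # fixed seed so the run is repeatable
weights = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # synthetic stand-in weights

errors = {bits: roundtrip_rmse(weights, bits) for bits in (8, 4, 3, 2)}
for bits, err in errors.items():
    print(f"INT{bits}: round-trip RMSE = {err:.4f}")
```

Every bit removed roughly halves the number of grid levels, so the error grows quickly below INT4. That gap is what QAT tries to close by letting the weights adapt to the grid during training.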
LLM-specific complications
Pretraining-time QAT is impractical because pretraining is already expensive enough without the added overhead. Fine-tuning-time QAT works: take a pretrained model and fine-tune it with fake-quantize ops. The result loses some quality relative to an FP fine-tune, but it is quantization-ready at the end.
Tooling state
PyTorch's FX graph mode quantization, Intel's NNCF, and NVIDIA's TensorRT-LLM all support QAT. The tooling is less mature than for PTQ but workable. Most teams will never need QAT; if you do, plan for the longer iteration cycle.