AWQ (Activation-aware Weight Quantization) and GPTQ are the two dominant algorithms for post-training 4-bit quantization. Both are excellent; they fit different workloads. The 2026 picture is clearer than when they were both new.

GPTQ — column-wise greedy

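GPTQ quantizes each weight matrix one column at a time. After each column is rounded, it adjusts the columns not yet quantized to compensate for the error just introduced, using second-order (Hessian) statistics computed from calibration activations. It is therefore calibration-data-driven; a run on a 70B model takes roughly 1-2 hours. Quality is good, tooling support is broad, and the ecosystem is mature.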
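The column-by-column error propagation can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function names, the per-row min/max 4-bit grid, and the 1% damping value are assumptions made for the example, and it omits the grouping, blocking, and activation-ordering tricks that real GPTQ tools add.

```python
import numpy as np

def gptq_quantize(W, X, bits=4, damp=0.01):
    """Simplified GPTQ for one linear layer (illustrative sketch).

    W: (out_features, in_features) weights.
    X: (in_features, n_samples) calibration inputs to the layer.
    Returns the dequantized weights after quantization.
    """
    W = W.astype(np.float64).copy()
    qmax = 2**bits - 1

    # Assumption: one asymmetric min/max grid per output row, fixed up front.
    lo, hi = W.min(axis=1), W.max(axis=1)
    scale = np.maximum((hi - lo) / qmax, 1e-8)
    zero = np.round(-lo / scale)

    # Hessian of the layer-wise squared error, damped for stability.
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])

    # Upper Cholesky factor U of H^-1, so that H^-1 = U.T @ U.
    U = np.linalg.cholesky(np.linalg.inv(H)).T

    Q = np.zeros_like(W)
    for i in range(W.shape[1]):
        # Round column i onto the 4-bit grid.
        codes = np.clip(np.round(W[:, i] / scale) + zero, 0, qmax)
        Q[:, i] = (codes - zero) * scale
        # Push the scaled rounding error onto the columns not yet quantized.
        err = (W[:, i] - Q[:, i]) / U[i, i]
        W[:, i + 1:] -= np.outer(err, U[i, i + 1:])
    return Q

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 32))
X = rng.normal(size=(32, 256))
Q = gptq_quantize(W, X)
print("GPTQ layer output MSE:", np.mean((Q @ X - W @ X) ** 2))
```

The loop is the part that makes GPTQ slower to apply than simple rounding: every column depends on the corrections made by the columns before it.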

AWQ — protect important channels

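AWQ uses activation statistics to identify the salient weight channels, then scales those weights up (and the matching activations down) before quantization so that they lose less precision. It is faster to apply (~30 min on a 70B model) and gives slightly better quality than GPTQ on most benchmarks.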
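A rough NumPy sketch of the scaling idea follows. It assumes a per-input-channel scale of the form s = mean(|x|)^alpha with alpha chosen by a small grid search, and plain round-to-nearest quantization with one grid per output row; the function names and the grid size are hypothetical, and real AWQ tools add group-wise quantization, scale normalization, and weight clipping on top.

```python
import numpy as np

def rtn_quantize(W, bits=4):
    """Round-to-nearest with one asymmetric min/max grid per output row."""
    qmax = 2**bits - 1
    lo = W.min(axis=1, keepdims=True)
    hi = W.max(axis=1, keepdims=True)
    scale = np.maximum((hi - lo) / qmax, 1e-8)
    zero = np.round(-lo / scale)
    codes = np.clip(np.round(W / scale) + zero, 0, qmax)
    return (codes - zero) * scale

def awq_search(W, X, bits=4, n_grid=20):
    """Pick the per-channel scaling that minimizes layer output error.

    W: (out_features, in_features), X: (in_features, n_samples).
    Returns (best_error, best_alpha, best_scales).
    """
    ref = W @ X
    act_mag = np.abs(X).mean(axis=1)  # average magnitude per input channel
    best = (np.inf, None, None)
    for alpha in np.linspace(0.0, 1.0, n_grid + 1):
        s = act_mag ** alpha          # alpha = 0 gives s = 1 (plain rounding)
        Wq = rtn_quantize(W * s, bits)        # scale weights up, then quantize
        out = Wq @ (X / s[:, None])           # scale activations down to match
        err = np.mean((out - ref) ** 2)
        if err < best[0]:
            best = (err, alpha, s)
    return best

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X = rng.normal(size=(16, 128))
X[3] *= 20.0  # make one activation channel dominant
err, alpha, s = awq_search(W, X)
rtn_err = np.mean((rtn_quantize(W) @ X - W @ X) ** 2)
print("plain rounding MSE:", rtn_err, "scaled MSE:", err, "alpha:", alpha)
```

Because the scaling is undone on the activation side, the unquantized layer output is unchanged; only the rounding error is redistributed away from the channels that matter most. There is no sequential dependency between columns, which is one reason this is quicker to apply than GPTQ.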

Quality comparison

On standard LLM benchmarks, AWQ scores about 0.5% higher than GPTQ on average, and some tasks (long-context, reasoning) show larger gaps. Both stay within 2% of the FP16 baseline.

Inference speed

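AWQ checkpoints generally run on faster inference kernels. Their regular group layout avoids the per-row group-index lookup that GPTQ checkpoints quantized with activation reordering (act-order) require. On GPU, AWQ is roughly 10-30% faster. For latency-critical serving, AWQ is the better default in 2026.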
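Speed differences depend on the kernel, GPU, and batch size, so it is worth measuring on your own setup. The helper below is a small, engine-agnostic sketch using only the standard library; `measure_latency` and the `fn` callable are hypothetical names, and `fn` is assumed to block until the request has finished (for raw PyTorch code that means calling `torch.cuda.synchronize()` inside it).

```python
import statistics
import time

def measure_latency(fn, warmup=3, runs=20):
    """Time a blocking callable and report median and p95 latency in seconds."""
    for _ in range(warmup):
        fn()  # warm up caches, lazy initialization, compiled kernels
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    p95_index = int(0.95 * (len(samples) - 1))
    return {
        "median_s": statistics.median(samples),
        "p95_s": samples[p95_index],
    }

# Stand-in workload; replace with one generation request per engine.
print(measure_latency(lambda: sum(range(100_000))))
```

Run it once against the AWQ build and once against the GPTQ build with the same prompt and output length, and compare the two dictionaries.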

Tooling notes

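vLLM, TensorRT-LLM, and lmdeploy all support both. In the Hugging Face stack, transformers loads both formats (via GPTQConfig and AwqConfig), and the optimum library provides a GPTQ quantizer. For calibration, use data from your own domain rather than a generic corpus such as C4; about 1,000 samples is enough.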
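Building the calibration set is mostly a sampling exercise. The sketch below assumes your domain data is available as an iterable of strings; the function name, the minimum-length filter, and the fixed seed are illustrative choices rather than requirements of any particular tool.

```python
import random

def sample_calibration_texts(corpus, n_samples=1000, min_chars=200, seed=0):
    """Draw a reproducible, de-duplicated calibration sample from domain text."""
    seen = set()
    candidates = []
    for text in corpus:
        text = text.strip()
        # Skip very short snippets and exact duplicates.
        if len(text) >= min_chars and text not in seen:
            seen.add(text)
            candidates.append(text)
    if len(candidates) <= n_samples:
        return candidates  # small corpus: use everything that passed the filter
    return random.Random(seed).sample(candidates, n_samples)

# Hypothetical corpus standing in for real domain documents.
corpus = [f"ticket {i}: " + "details " * 40 for i in range(5000)]
calib = sample_calibration_texts(corpus)
print(len(calib), "calibration texts selected")
```

The resulting list of strings can then be handed to whichever quantization tool you use; several accept raw text for calibration, but check the expected input format for your tool.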

Bottom line: choose AWQ for new deployments, where its faster inference and slightly better quality pay off. Stay with GPTQ when you already have a pipeline built around it.