QLoRA combines 4-bit base model quantization with LoRA adapters: fine-tune a 70B model on a single 48GB GPU. The technique made domain fine-tuning accessible to small teams and is now the standard for cost-conscious adaptation.
The base + adapter pattern
Base model frozen at 4-bit. LoRA adapters (low-rank trainable matrices) attached to attention layers. Training updates only adapters: 0.1-1% of total parameters. Memory savings: ~10x vs full fine-tuning.
NF4: the better 4-bit format
NormalFloat4 — quantization optimized for normally-distributed weights (which transformer weights mostly are). Better quality than uniform INT4 for the same memory. Default in bitsandbytes.
Double quantization
Quantizes the quantization constants themselves (the scales used per-block). ~0.4 GB savings on a 70B model. Free quality (no measurable drop).
Paged optimizers
Optimizer states (Adam keeps 2 state tensors per param) are paged between GPU and CPU memory like virtual memory. Smooths over training spikes; avoids OOM crashes during long training runs.
Where it falls short
Pretraining-from-scratch (use full precision). Aggressive fine-tuning that wants base-weight updates too (DoRA, full fine-tune). Tasks requiring INT4 inference but FP16 training (quantize separately after fine-tuning).