QLoRA Fine-Tuning Explained

QLoRA combines 4-bit quantization of the base model with LoRA adapters, which is enough to fine-tune a 65B-70B model on a single 48GB GPU. The technique made domain fine-tuning accessible to small teams and has become a common default for cost-conscious adaptation.

The base + adapter pattern

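The base model is frozen and stored in 4-bit. LoRA adapters (pairs of small low-rank trainable matrices) are attached to the attention projections, and often to the other linear layers as well. During the forward and backward pass the 4-bit weights are dequantized to a 16-bit compute type, but gradients update only the adapters, which typically amount to 0.1-1% of total parameters. Memory use drops by roughly an order of magnitude compared with full 16-bit fine-tuning.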
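To make the 0.1-1% figure concrete, here is a rough parameter count. The model shape, the rank, and the choice of four adapted matrices per layer are illustrative assumptions, not the dimensions of any specific model.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_in x d_out weight matrix."""
    # A is rank x d_in and B is d_out x rank; the frozen weight adds nothing.
    return rank * (d_in + d_out)


# Hypothetical 70B-class shape (assumptions for illustration only).
HIDDEN = 8192
LAYERS = 80
ADAPTED_PER_LAYER = 4  # e.g. the q, k, v, o attention projections
RANK = 16
TOTAL_PARAMS = 70e9

trainable = lora_params(HIDDEN, HIDDEN, RANK) * ADAPTED_PER_LAYER * LAYERS
fraction = trainable / TOTAL_PARAMS

print(f"trainable adapter params: {trainable:,}")
print(f"share of all parameters:  {fraction:.2%}")
```

Under these assumptions the adapters hold about 84 million parameters, close to the low end of the range. Raising the rank or adapting the MLP layers as well moves the share toward 1%.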

NF4: the better 4-bit format

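NormalFloat4 is a 4-bit data type whose 16 levels are placed at quantiles of a normal distribution, which is how transformer weights are mostly distributed. Weights are scaled per block by their absolute maximum and then snapped to the nearest level. At the same memory cost it generally preserves quality better than uniform INT4. It is implemented in bitsandbytes, where it has to be selected explicitly as the "nf4" quant type (the library's default 4-bit type is FP4).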
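A minimal pure-Python sketch of block-wise NF4 quantization follows. The level values are the NF4 codebook rounded to four decimals (an approximation; bitsandbytes stores them at higher precision), the block size of 64 is an assumption taken from common QLoRA setups, and the function names are hypothetical. Real implementations pack two codes per byte and run on the GPU.

```python
import bisect
import random

# NF4 levels rounded to 4 decimals (approximation of the library codebook).
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]
# Midpoints between adjacent levels act as nearest-level decision boundaries.
BOUNDARIES = [(a + b) / 2 for a, b in zip(NF4_LEVELS, NF4_LEVELS[1:])]


def quantize_block(block):
    """Return (absmax, codes): one scale plus a 4-bit index per weight."""
    absmax = max(abs(w) for w in block) or 1.0  # all-zero block: avoid 0 division
    codes = [bisect.bisect_left(BOUNDARIES, w / absmax) for w in block]
    return absmax, codes


def dequantize_block(absmax, codes):
    """Map 4-bit indices back to approximate weights."""
    return [NF4_LEVELS[c] * absmax for c in codes]


def quantize(weights, block_size=64):
    """Quantize a flat list of weights block by block."""
    return [
        quantize_block(weights[i:i + block_size])
        for i in range(0, len(weights), block_size)
    ]


# Demo on synthetic, normally distributed weights.
random.seed(0)
demo_weights = [random.gauss(0.0, 0.02) for _ in range(256)]
demo_blocks = quantize(demo_weights)
demo_restored = [w for absmax, codes in demo_blocks
                 for w in dequantize_block(absmax, codes)]
mean_abs_error = sum(abs(a - b) for a, b in zip(demo_weights, demo_restored)) / 256
print(f"mean absolute round-trip error: {mean_abs_error:.6f}")
```

Each block stores one scale (the absmax) and 64 four-bit codes. The levels are densest near zero, where most normally distributed weights fall, and sparsest near the extremes.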

Double quantization

Double quantization quantizes the quantization constants themselves (the per-block scales). This saves about 0.37 bits per parameter, or roughly 3 GB on a 70B model, with no measurable drop in quality.
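The saving can be estimated from the block sizes alone. The sketch below assumes the setup described in the QLoRA paper: one scale per 64 weights, with those scales stored in 8 bits and grouped into blocks of 256 that each share one FP32 second-level scale. The variable and function names are hypothetical.

```python
BLOCK = 64     # weights per first-level block (assumed)
BLOCK2 = 256   # first-level scales per second-level block (assumed)

# Without double quantization: one FP32 scale per block of weights.
plain_bits = 32 / BLOCK

# With double quantization: an 8-bit scale per block of weights,
# plus one FP32 scale shared by every BLOCK2 of those 8-bit scales.
double_bits = 8 / BLOCK + 32 / (BLOCK * BLOCK2)

saved_bits = plain_bits - double_bits


def saved_gb(n_params):
    """Memory saved by double quantization, in gigabytes (10**9 bytes)."""
    return n_params * saved_bits / 8 / 1e9


print(f"overhead without double quantization: {plain_bits:.3f} bits/param")
print(f"overhead with double quantization:    {double_bits:.3f} bits/param")
print(f"saving on a 70B model:                {saved_gb(70e9):.2f} GB")
```

The overhead of the scales falls from 0.5 to about 0.127 bits per parameter, which is where the figure of roughly 0.37 bits per parameter comes from.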

Paged optimizers

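Optimizer states (Adam keeps two state tensors per trainable parameter) are allocated in memory that can be paged between GPU and CPU, much like virtual memory pages between RAM and disk. When the GPU runs short, states are evicted to CPU RAM and brought back when the optimizer step needs them. This absorbs transient memory spikes, such as a batch of unusually long sequences, and avoids OOM crashes during long training runs.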

Where it falls short

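QLoRA is not a fit for pretraining from scratch, which needs full- or mixed-precision weights that can actually be updated. It is also the wrong tool when fine-tuning has to change the base weights themselves, which calls for a full fine-tune; adapter variants such as DoRA still keep the base frozen and do not address this. Finally, if the goal is FP16 training with INT4 only at inference time, fine-tune in 16-bit and quantize the result separately afterwards.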

QLoRA = NF4 4-bit base + double quantization + LoRA adapters + paged optimizer. It is the standard recipe for fine-tuning 30B-70B models on consumer/prosumer GPUs.
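With the Hugging Face stack (transformers, peft, bitsandbytes), the recipe maps onto configuration roughly as follows. This is a sketch, not a tuned setup: the function name, output directory, rank, alpha, dropout, batch settings, and the Llama-style target module names are all assumptions to adjust for a given model.

```python
def build_qlora_configs(output_dir="qlora-out"):
    """Return (quantization config, LoRA config, training arguments)."""
    # Imported lazily so loading this file does not require the GPU stack.
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig, TrainingArguments

    # 4-bit NF4 base weights with double quantization; matmuls run in bf16.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    # Low-rank adapters; module names below assume a Llama-style model.
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )

    # Paged AdamW keeps optimizer states in memory that can spill to CPU RAM.
    training_args = TrainingArguments(
        output_dir=output_dir,
        optim="paged_adamw_32bit",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        gradient_checkpointing=True,
        bf16=True,
    )
    return bnb_config, lora_config, training_args
```

In a typical run, bnb_config is passed as quantization_config to AutoModelForCausalLM.from_pretrained, the loaded model goes through peft's prepare_model_for_kbit_training and get_peft_model with lora_config, and training_args is handed to the trainer.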