Classification (labeling inputs into discrete categories) is one of the highest-volume LLM use cases: intent detection, content moderation, ticket routing. For this narrow task, a small model fine-tuned on domain data almost always beats zero-shot GPT-4 on cost-adjusted quality.
Pre-LLM classification still has a place
DistilBERT (~66M params) or MiniLM (~22-33M params) fine-tuned for classification: millisecond-scale inference on a GPU. Fast and cheap. This is the right answer for high-volume, simple-label classification. Don't reach for LLMs reflexively.
When a small generative LLM wins
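A small generative LLM wins in three situations: the labels are complex (multi-class with subtle distinctions); there is no training data, so classification has to be few-shot or zero-shot; or the output needs reasoning ('this looks like fraud because...'). In these cases a 1-3B LLM with prompting beats DistilBERT.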
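A minimal sketch of the few-shot case, assuming a plain "Message / Label" prompt layout. The layout, the label names, and the example messages are all illustrative assumptions, not a standard format. The one design point it shows is ending the prompt on an open "Label:" so the next tokens the model produces are the label itself.

```python
from typing import List, Sequence, Tuple


def build_prompt(labels: Sequence[str],
                 examples: Sequence[Tuple[str, str]],
                 text: str) -> str:
    """Assemble a few-shot classification prompt (hypothetical layout)."""
    parts: List[str] = [
        "Classify the message into exactly one of: " + ", ".join(labels) + "."
    ]
    for example_text, example_label in examples:
        # Guard against demonstrations that teach the model an invalid label.
        if example_label not in labels:
            raise ValueError(f"unknown label in examples: {example_label!r}")
        parts.append(f"Message: {example_text}\nLabel: {example_label}")
    # End on an open "Label:" so the completion starts with the label.
    parts.append(f"Message: {text}\nLabel:")
    return "\n\n".join(parts)


# Hypothetical labels and demonstrations, for illustration only.
LABELS = ["billing", "bug_report", "account_access"]
EXAMPLES = [
    ("I was charged twice this month.", "billing"),
    ("The export button crashes the app.", "bug_report"),
]
prompt = build_prompt(LABELS, EXAMPLES, "I can't reset my password.")
```

With zero examples the same function produces a zero-shot prompt, which makes it easy to compare the two settings on the same held-out set.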
Constrained output for accuracy
Force the LLM to output one of the valid labels via constrained decoding (or logit bias). This eliminates outputs that are close to a label but not exactly one. Most inference servers support it, yet it is underused.
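The idea can be sketched without any particular inference server. The example below assumes, for simplicity, that each label is a single token and that the model's next-token logits are available as a dictionary. Both are simplifying assumptions, and `constrained_label` is a hypothetical helper, not a library function. Real servers apply the same masking inside the decoding loop and handle multi-token labels.

```python
import math
from typing import Dict, Sequence, Tuple


def constrained_label(logits: Dict[str, float],
                      labels: Sequence[str]) -> Tuple[str, Dict[str, float]]:
    """Pick the best valid label and renormalize probabilities over labels only."""
    # Masking step: drop every token that is not a valid label
    # (equivalent to setting its logit to -infinity).
    allowed = {label: logits[label] for label in labels if label in logits}
    if not allowed:
        raise ValueError("none of the valid labels appear in the logits")
    # Softmax over the surviving logits, shifted by the max for stability.
    peak = max(allowed.values())
    exps = {label: math.exp(value - peak) for label, value in allowed.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Made-up logits: the unconstrained argmax is "refunds", which is not a label.
next_token_logits = {"refunds": 3.1, "refund": 2.9, "spam": 0.4, "other": -1.0}
valid_labels = ["refund", "spam", "other"]

unconstrained = max(next_token_logits, key=next_token_logits.get)
best, probs = constrained_label(next_token_logits, valid_labels)
```

Here `unconstrained` is the near-miss string `"refunds"`, while `best` is the valid label `"refund"`. The renormalized `probs` are also a natural starting point for a confidence score.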
Fine-tuning is cheap
A 1-3B model, 1K labeled examples, and QLoRA: 2-4 hours on one GPU. Accuracy typically jumps 10-20 points over zero-shot. The investment pays back within days for any classifier processing more than 10K items/day.
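The payback claim is simple arithmetic, sketched below on the cost side only. Every price in the example is a made-up placeholder (GPU rate, labeling cost, per-item inference costs); substitute your own numbers. `break_even_days` is a hypothetical helper name.

```python
def break_even_days(one_time_cost: float,
                    items_per_day: int,
                    baseline_cost_per_item: float,
                    tuned_cost_per_item: float) -> float:
    """Days until per-item savings repay the one-time fine-tuning cost."""
    saving_per_day = items_per_day * (baseline_cost_per_item - tuned_cost_per_item)
    if saving_per_day <= 0:
        # The tuned model is not cheaper per item, so cost alone never pays back.
        return float("inf")
    return one_time_cost / saving_per_day


# All figures below are hypothetical placeholders, not measurements.
gpu_cost = 4 * 2.00          # 4 GPU-hours at an assumed $2.00/hour
labeling_cost = 1000 * 0.10  # 1K examples at an assumed $0.10 each
days = break_even_days(
    one_time_cost=gpu_cost + labeling_cost,
    items_per_day=10_000,
    baseline_cost_per_item=0.002,   # assumed large-model API cost per item
    tuned_cost_per_item=0.0002,     # assumed self-hosted small-model cost per item
)
```

Under these placeholder inputs the break-even lands at about six days. The calculation ignores the value of the accuracy gain, which would shorten the payback further.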
Confidence calibration
Read the token probabilities at the label position; that is the raw confidence. Calibrate it (Platt scaling, isotonic regression) against held-out data. This is useful for routing: high-confidence → auto-act, low-confidence → human review.
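A minimal sketch of Platt-style calibration followed by threshold routing, in pure Python. It assumes the raw confidence is the probability of the predicted label, and it fits a two-parameter sigmoid on the logit of that probability by gradient descent. That is one common variant, not the only way to do it, and isotonic regression is not shown. The held-out data, the 0.9 threshold, and the function names are illustrative assumptions.

```python
import math
from typing import Sequence, Tuple


def sigmoid(z: float) -> float:
    # Numerically safe for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(p: float, eps: float = 1e-6) -> float:
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return math.log(p / (1.0 - p))


def fit_platt(raw_conf: Sequence[float], correct: Sequence[int],
              lr: float = 0.1, steps: int = 2000) -> Tuple[float, float]:
    """Fit calibrated = sigmoid(a * logit(raw) + b) by minimizing log loss."""
    xs = [logit(p) for p in raw_conf]
    a, b = 1.0, 0.0  # start from the identity mapping
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, correct):
            err = sigmoid(a * x + b) - y
            grad_a += err * x
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(raw: float, a: float, b: float) -> float:
    return sigmoid(a * logit(raw) + b)


def route(raw: float, a: float, b: float, threshold: float = 0.9) -> str:
    """Auto-act on high calibrated confidence, otherwise send to a human."""
    return "auto" if calibrate(raw, a, b) >= threshold else "human_review"


# Made-up held-out set: raw confidences and whether the prediction was right.
held_out_conf = [0.99, 0.98, 0.97, 0.95, 0.95, 0.90, 0.90, 0.85, 0.80, 0.70]
held_out_correct = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]
a, b = fit_platt(held_out_conf, held_out_correct)
```

On this toy set the model is overconfident (60% accuracy at raw confidences of 0.70 and above), so the fitted mapping pulls confidences down before the threshold is applied. In practice the threshold would be chosen from the calibrated held-out scores to hit a target error rate for auto-acted items.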