Small Models for Classification

Classification, the task of assigning inputs to discrete categories, is one of the highest-volume LLM use cases (intent detection, content moderation, ticket routing). For this narrow task, a small model fine-tuned on domain data usually beats zero-shot GPT-4 on cost-adjusted quality.

Pre-LLM classification still has a place

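Fine-tuned encoders such as DistilBERT (~66M params) and MiniLM (~22-33M params, depending on the variant) classify a short input in roughly a millisecond to a few milliseconds on a GPU. Fast and cheap. They are the right answer for high-volume, simple-label classification. Don't reach for LLMs reflexively.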

When a small generative LLM wins

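A small generative LLM wins in three situations: the labels are complex (multi-class with subtle distinctions), there is no training data and the classifier has to work few-shot or zero-shot, or the output needs reasoning ('this looks like fraud because...'). In these cases a 1-3B LLM with prompting typically beats DistilBERT.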

Constrained output for accuracy

Force the LLM to output one of the valid labels via constrained decoding (or logit bias). This eliminates 'something close to but not quite a label' errors. Most inference servers support it, yet it is underused.
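The effect of constraining output can be sketched in a few lines. This is a simplification, not how any particular inference server implements it: it treats each candidate output as a single string with one score, whereas real constrained decoding masks logits token by token. The function name `constrained_label` and the example scores are hypothetical.

```python
import math

def constrained_label(candidate_logits, valid_labels):
    """Pick the best valid label, ignoring every other candidate.

    candidate_logits: hypothetical mapping from candidate output string
    to the model's raw score (logit) for that string.
    """
    # Mask: keep only candidates that are valid labels.
    allowed = {lab: candidate_logits[lab]
               for lab in valid_labels if lab in candidate_logits}
    if not allowed:
        raise ValueError("no score available for any valid label")

    # Softmax over the allowed set only, shifted by the max for stability.
    top = max(allowed.values())
    exps = {lab: math.exp(score - top) for lab, score in allowed.items()}
    total = sum(exps.values())
    probs = {lab: value / total for lab, value in exps.items()}

    best = max(probs, key=probs.get)
    return best, probs

# Made-up scores: the unconstrained favorite, "refunds", is not a valid label.
logits = {"refund": 2.0, "refunds": 2.5, "billing": 1.0, "shipping": -0.5}
label, probs = constrained_label(logits, ["refund", "billing", "shipping"])
print(label, probs)
```

Without the mask, the highest-scoring string here would be the near-miss "refunds". With it, the probability mass is renormalized over the three valid labels only.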

Fine-tuning is cheap

1-3B model + 1K labeled examples + QLoRA = roughly 2-4 hours on one GPU. Accuracy often jumps 10-20 points over zero-shot, though the gain depends on the task. For any classifier processing >10K items/day, the fine-tuning cost is typically recovered within days.
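The payback claim is simple arithmetic: divide the one-time fine-tuning cost by the daily saving. A minimal sketch follows. The function `break_even_days` and every dollar figure in it are hypothetical placeholders, not measured prices; substitute your own.

```python
def break_even_days(one_time_cost, cost_per_item_before,
                    cost_per_item_after, items_per_day):
    """Days until a one-time fine-tuning cost is recovered by per-item savings."""
    daily_saving = (cost_per_item_before - cost_per_item_after) * items_per_day
    if daily_saving <= 0:
        # The fine-tuned model is not cheaper per item, so it never pays back.
        return float("inf")
    return one_time_cost / daily_saving

# Placeholder inputs (assumptions, not real prices):
days = break_even_days(
    one_time_cost=50.0,            # GPU time plus labeling effort
    cost_per_item_before=0.002,    # large hosted model, per item
    cost_per_item_after=0.0002,    # small fine-tuned model, per item
    items_per_day=10_000,
)
print(f"break-even after {days:.1f} days")
```

Payback time scales inversely with volume, which is why throughput matters more than the training bill.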

Confidence calibration

Take the token probability at the label position as a raw confidence score. Raw scores are often over- or under-confident, so calibrate them (Platt scaling, isotonic regression) against held-out data. Calibrated confidence is useful for routing: high-confidence → auto-act, low-confidence → human review.
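As an illustration, Platt scaling fits a sigmoid `p = sigmoid(a * score + b)` on held-out pairs of (raw score, was the prediction correct). The sketch below fits it with plain gradient descent to stay dependency-free. The held-out data is made up, the scores are assumed to be log-odds of the label-token probability, and the helper names and the 0.9 threshold are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, correct, lr=0.1, steps=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, correct):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def route(score, a, b, threshold=0.9):
    """Auto-act on high calibrated confidence, otherwise send to a human."""
    p = sigmoid(a * score + b)
    return ("auto", p) if p >= threshold else ("human_review", p)

# Made-up held-out set: raw scores and whether the predicted label was correct.
held_out_scores = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0]
held_out_correct = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1]

a, b = fit_platt(held_out_scores, held_out_correct)
print(route(4.0, a, b))   # very confident raw score
print(route(0.5, a, b))   # borderline raw score
```

In practice a library implementation (for example scikit-learn's calibration utilities) is the more likely choice. The point here is only that calibration is a small model fitted on held-out outcomes, and that the routing threshold is applied to the calibrated probability rather than the raw score.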

In short: simple labels, DistilBERT; complex labels, a fine-tuned 1-3B LLM. Constrain decoding. Calibrate confidence. The result is cheap, accurate, and ready for production.