Model leaderboards are loud signal but bad decision-criterion. The actual choice depends on latency, cost at your traffic, terms of service, language support, and feature support (tool use, vision, streaming). Here's the practical framework.

Advertisement

Frontier vs cheap

If your task is genuinely hard (multi-step reasoning, long context, code generation that runs first-try): frontier (Claude, GPT-4-class). If it's narrow and high-volume (classification, extraction, summarization): cheaper (Haiku, Mini-class, or self-hosted SLM).

Self-host vs API

API: fastest to ship, predictable pricing per call. Self-host: cheaper at very high volume (>1M calls/day), full data control, latency control. Hybrid: API for hard tasks, self-host for high-volume narrow tasks.

Advertisement

Things teams forget

Output token cost is usually 3-10x input. Streaming reduces apparent latency but not total cost. Provider's data policy (are your inputs used for training?). Rate limits at the SKU you can actually buy.

Match model to task difficulty. Self-host at high volume. Re-check data policy and rate limits before committing.