Agent architectures often spend most of their cost on simple decisions: which tool to call, with what arguments, and when to stop. Small models fine-tuned on tool-calling format can match GPT-4-class models on this narrow task, at around 1% of the cost and with about 5x lower latency.

The pattern

The reasoner uses a strong model (GPT-4, Claude, Llama 70B) for hard planning. The tool router uses a small model (Phi-3, Qwen 1.5B) to answer one question: given this state, what is the next tool call? Its output is structured JSON.
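
To make "structured JSON" concrete, here is a minimal sketch of what the router's output might look like and how the calling code could check it. The `tool`/`arguments` field names, the `TOOL_CATALOG` contents, and the `parse_tool_call` helper are all assumptions for illustration, not a format the pattern prescribes.

```python
import json
from dataclasses import dataclass

# Hypothetical tool catalog: tool name -> names of required arguments.
TOOL_CATALOG = {
    "search": {"query"},
    "fetch": {"url"},
}


@dataclass(frozen=True)
class ToolCall:
    tool: str
    arguments: dict


def parse_tool_call(raw: str) -> ToolCall:
    """Parse the router's raw JSON output and check it against the catalog."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) if malformed
    if not isinstance(data, dict):
        raise ValueError("tool call must be a JSON object")

    tool = data.get("tool")
    args = data.get("arguments")
    if not isinstance(tool, str) or tool not in TOOL_CATALOG:
        raise ValueError(f"unknown tool: {tool!r}")
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")

    missing = TOOL_CATALOG[tool] - set(args)
    if missing:
        raise ValueError(f"missing arguments: {sorted(missing)}")
    return ToolCall(tool=tool, arguments=args)
```

A router reply such as `{"tool": "search", "arguments": {"query": "qlora"}}` parses into a `ToolCall`; anything outside the catalog raises `ValueError` before a tool is executed.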

Fine-tuning for tool format

Datasets such as Hercules, ToolBench, and Glaive provide 10K-100K tool-call traces. QLoRA can fine-tune a 3B-parameter model on them in hours. Output JSON consistency goes from 80% to 99%+. Domain-specific tool catalogs need domain-specific data.
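
As a rough sketch of the data-preparation step, the snippet below turns tool-call traces into prompt/completion pairs in JSONL form. The trace fields (`state`, `tool`, `arguments`), the prompt template, and both function names are assumptions for illustration; real datasets each have their own schema and would need their own adapter.

```python
import json


def trace_to_example(trace: dict, tool_names: list[str]) -> dict:
    """Turn one (hypothetical) trace record into a prompt/completion pair."""
    prompt = (
        "Available tools: " + ", ".join(tool_names) + "\n"
        "State: " + trace["state"] + "\n"
        "Next tool call (JSON):"
    )
    # Serialize every target the same way (sorted keys, no extra spaces)
    # so the model is trained on a single consistent output format.
    completion = json.dumps(
        {"tool": trace["tool"], "arguments": trace["arguments"]},
        sort_keys=True,
        separators=(",", ":"),
    )
    return {"prompt": prompt, "completion": completion}


def to_jsonl(traces: list[dict], tool_names: list[str]) -> str:
    """Render all traces as JSONL, one training example per line."""
    return "\n".join(
        json.dumps(trace_to_example(t, tool_names)) for t in traces
    )
```

Canonical serialization of the target is a design choice here: if the same call can appear with different key orders or spacing, the model has to learn several equivalent formats instead of one.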

Constrained decoding

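Force the output to follow a JSON schema with outlines, lm-format-enforcer, or vLLM's guided decoding. This eliminates the malformed-JSON failure mode entirely, because tokens that would break the schema are never sampled. It guarantees the format, not that the chosen tool or arguments are right. Small models with constrained decoding match big models with prompting on tool-call format compliance.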
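
The schema itself can be generated from the tool catalog. The sketch below only builds the JSON Schema as a plain dictionary; it does not call any of the libraries above, and the `TOOLS` catalog and `tool_call_schema` helper are hypothetical. Which schema keywords a given backend supports varies, so that is worth checking against its documentation.

```python
import json

# Hypothetical catalog: tool name -> JSON Schema for each argument.
TOOLS = {
    "search": {"query": {"type": "string"}},
    "fetch": {"url": {"type": "string"}},
}


def tool_call_schema(tools: dict) -> dict:
    """Build a JSON Schema that accepts exactly one call to one known tool."""
    variants = []
    for name, params in tools.items():
        variants.append({
            "type": "object",
            "properties": {
                # Single-value enum pins the tool name for this variant.
                "tool": {"enum": [name]},
                "arguments": {
                    "type": "object",
                    "properties": params,
                    "required": sorted(params),
                    "additionalProperties": False,
                },
            },
            "required": ["tool", "arguments"],
            "additionalProperties": False,
        })
    # The output must match one of the per-tool variants.
    return {"anyOf": variants}


schema_json = json.dumps(tool_call_schema(TOOLS), indent=2)
```

Because each variant ties a tool name to its own argument schema, a decoder enforcing this schema cannot emit `fetch` with a `query` argument, which a flat schema with a shared `arguments` object would allow.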

Why it works

Tool calling is a narrow distribution: structured output, finite tool set, short context. Small models excel at narrow distributions when trained on them. The 'general intelligence' of GPT-4 isn't needed for picking between 'search' and 'fetch'.

Architecture

Reasoner LLM → emit goal → Router (small, fine-tuned) → tool call → execute → result → Router (continue or hand back). The big model is called only when reasoning is needed. Most traces use the big model 2-3 times and the router 10-20 times.
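
The flow above can be sketched as a two-level loop. This is an illustration under assumptions, not a reference implementation: `reasoner` and `router` stand in for the big and small model calls, both signal "done" or "hand back" by returning `None`, and the step limits are arbitrary defaults.

```python
from typing import Callable, Optional


def run_agent(
    task: str,
    reasoner: Callable[[str, list], Optional[str]],
    router: Callable[[str, list], Optional[dict]],
    tools: dict,
    max_rounds: int = 3,
    max_router_steps: int = 20,
) -> list:
    """Alternate between a big reasoner and a small router until done."""
    history: list = []
    for _ in range(max_rounds):
        # Big model: emit the next goal, or None when the task is finished.
        goal = reasoner(task, history)
        if goal is None:
            break
        for _ in range(max_router_steps):
            # Small model: pick the next tool call, or None to hand back.
            call = router(goal, history)
            if call is None:
                break
            result = tools[call["tool"]](**call["arguments"])
            history.append({"goal": goal, "call": call, "result": result})
    return history
```

The outer loop bounds how often the expensive model runs, and the inner loop lets the router chain several tool calls per goal, which is where the 2-3 versus 10-20 call split comes from.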

In short: use a big model for reasoning and a fine-tuned small model for tool-call dispatch, with constrained decoding for the JSON. The result is a 10-100x cost reduction.