Structured Output with Small Models

Small models (1-3B) historically struggled with structured output: malformed JSON, missing fields, made-up enum values. Constrained decoding changed this. With the right setup, a 3B model produces 99%+ valid structured output — competitive with frontier models for narrow extraction tasks.

Advertisement

Why naive prompting fails

Small models have weaker instruction-following. 'Output JSON with fields X, Y, Z' fails 10-30% of the time at 1-3B scale: missing braces, wrong field names, mixed in prose. Production use needs structural enforcement, not prompt prayers.

Constrained decoding tools

outlines, lm-format-enforcer, guidance, llamacpp's grammar — all enforce structure at decode time. Token-by-token, only tokens valid for the next position are sampled. JSON schema becomes a hard constraint.

Advertisement

Effect on quality

Constrained decoding eliminates parse errors (0% malformed JSON). Quality of CONTENT (right values in right fields) remains a function of model capability. So: structure free, semantics still needs prompting + training.

Inference servers that support it

vLLM with guided decoding (outlines integration). SGLang with grammar constraints. llama.cpp with JSON grammar mode. TensorRT-LLM with grammar plugin. Standard feature now.

Schema design tips

Required fields > optional + post-validation. Enums over free-text where possible (restricts decoding tighter). Short field names (less for model to copy correctly). Avoid deeply nested schemas at 3B; flatter is more reliable.

Constrained decoding makes small models produce 99%+ valid structured output. Standard in modern inference servers. Schema design matters.