Small models (1-3B) historically struggled with structured output: malformed JSON, missing fields, made-up enum values. Constrained decoding changed this. With the right setup, a 3B model produces 99%+ valid structured output — competitive with frontier models for narrow extraction tasks.
Why naive prompting fails
Small models have weaker instruction-following. 'Output JSON with fields X, Y, Z' fails 10-30% of the time at 1-3B scale: missing braces, wrong field names, mixed in prose. Production use needs structural enforcement, not prompt prayers.
Constrained decoding tools
outlines, lm-format-enforcer, guidance, llamacpp's grammar — all enforce structure at decode time. Token-by-token, only tokens valid for the next position are sampled. JSON schema becomes a hard constraint.
Effect on quality
Constrained decoding eliminates parse errors (0% malformed JSON). Quality of CONTENT (right values in right fields) remains a function of model capability. So: structure free, semantics still needs prompting + training.
Inference servers that support it
vLLM with guided decoding (outlines integration). SGLang with grammar constraints. llama.cpp with JSON grammar mode. TensorRT-LLM with grammar plugin. Standard feature now.
Schema design tips
Required fields > optional + post-validation. Enums over free-text where possible (restricts decoding tighter). Short field names (less for model to copy correctly). Avoid deeply nested schemas at 3B; flatter is more reliable.