Training Data for SLMs — Belgavi.AI Lab

Big LLMs train on trillions of tokens of broad web data. An SLM trained from scratch on CPU can realistically process only billions, so data choice becomes critical. At small scale, curated high-quality data beats raw volume; the Phi models are the best-known evidence.

Open datasets

FineWeb (~15T tokens at release, filtered web). RedPajama (v1 ~1.2T tokens, broad LLM data). The Pile (~825 GiB of mixed text and code). For CPU training, 30B-50B tokens is already plenty. Pick a manageable subset.
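
One way to pick a manageable subset is to sample documents at random from a stream until a token budget is met. The sketch below is a minimal illustration, not a prescribed method: `sample_subset`, `keep_prob`, and the whitespace-based `count_tokens` default are all assumptions made for the example, and a real run would count tokens with the actual tokenizer.

```python
import random
from typing import Callable, Iterable, Iterator


def sample_subset(
    docs: Iterable[str],
    token_budget: int,
    keep_prob: float,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
    seed: int = 0,
) -> Iterator[str]:
    """Yield a random sample of docs, stopping once token_budget would be exceeded.

    count_tokens defaults to a whitespace word count (an assumption for this
    sketch); swap in the real tokenizer's length for an accurate budget.
    """
    rng = random.Random(seed)
    used = 0
    for doc in docs:
        # Skip documents at random so the sample spans the whole stream
        # instead of just taking its first few shards.
        if rng.random() >= keep_prob:
            continue
        n = count_tokens(doc)
        if used + n > token_budget:
            break
        used += n
        yield doc
```

Setting `keep_prob` to roughly budget / corpus size spreads the sample across the whole corpus; a value that is too high exhausts the budget early in the stream.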

Phi's recipe — synthetic high-quality

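Microsoft generated textbook-quality explanations and Q&A with GPT-class teacher models, combined them with heavily filtered web data, and trained the Phi series on the result. Phi models of roughly 1B-4B parameters (Phi-1 at 1.3B, Phi-2 at 2.7B, Phi-3-mini at 3.8B) compete with 7B-13B Llama models on many benchmarks. The lesson: curated > scaled. The approach is reproducible with access to any capable teacher LLM.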
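
The recipe can be sketched as a prompt grid fed to a teacher model. This is only an illustration of the idea, not Microsoft's actual pipeline: `PROMPT_TEMPLATE`, `build_prompts`, `generate_corpus`, and the `teacher` callable are hypothetical names, and `teacher` stands in for whatever LLM API is available.

```python
from typing import Callable, Iterable, List

# Hypothetical prompt wording; the real Phi prompts are not public in this form.
PROMPT_TEMPLATE = (
    "Write a short textbook-style section that teaches '{topic}' to a {audience}. "
    "Include one worked example, then end with a question and its answer."
)


def build_prompts(topics: Iterable[str], audiences: Iterable[str]) -> List[str]:
    """Cross topics with audiences to get varied prompts."""
    audiences = list(audiences)
    return [
        PROMPT_TEMPLATE.format(topic=topic, audience=audience)
        for topic in topics
        for audience in audiences
    ]


def generate_corpus(
    prompts: Iterable[str],
    teacher: Callable[[str], str],
    min_chars: int = 200,
) -> List[str]:
    """Call the teacher on each prompt and keep responses that pass a length check.

    teacher is any function mapping a prompt to generated text (an assumption:
    wrap your own LLM client here).
    """
    corpus = []
    for prompt in prompts:
        text = teacher(prompt).strip()
        if len(text) >= min_chars:  # drop refusals and truncated outputs
            corpus.append(text)
    return corpus
```

Varying topic and audience is a cheap way to get diversity; the generated text should still go through the same dedup and quality filters as web data.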

Tokenize once, reuse forever

# Tokenizer + dataset → tokens.bin file (int32)
# Train loop just mmap-reads spans of int32
# No tokenization on critical path

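Tokenizing inside the training loop wastes CPU cycles that should go to the model. Pre-tokenize the whole training corpus once into a flat binary file, then use mmap to read random spans from it. This is standard for serious SLM training and can be on the order of 10× faster than tokenizing online, depending on the tokenizer and hardware.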
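
A minimal sketch of that pipeline with NumPy follows. The byte-level `encode` function and the `EOS` id are assumptions made to keep the example self-contained; in practice you would call your real tokenizer there. `np.memmap` maps the file lazily, so only the spans you index are read from disk.

```python
import numpy as np

EOS = 256  # hypothetical end-of-document id, chosen outside the byte range


def encode(text):
    """Toy byte-level tokenizer (assumption): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def pretokenize(docs, path):
    """Write every document's tokens to one flat int32 file; return the token count."""
    total = 0
    with open(path, "wb") as f:
        for doc in docs:
            ids = np.asarray(encode(doc) + [EOS], dtype=np.int32)
            ids.tofile(f)  # raw int32 values, no header
            total += ids.size
    return total


def open_tokens(path):
    """Memory-map the token file read-only; nothing is loaded until indexed."""
    return np.memmap(path, dtype=np.int32, mode="r")


def random_span(tokens, span_len, rng):
    """Copy one random contiguous span of span_len tokens out of the mapping."""
    start = int(rng.integers(0, len(tokens) - span_len + 1))
    return np.array(tokens[start:start + span_len])
```

Typical use is `pretokenize(docs, "tokens.bin")` once, then `tokens = open_tokens("tokens.bin")` at the start of every training run and `random_span(tokens, span_len, rng)` per sample.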

Data ordering

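Pretraining: shuffle the data so that consecutive batches are not correlated. Some recent work suggests curriculum learning (easy → hard) can help stability. Fine-tuning: keep the examples within each batch IID; don't accidentally group by class, for example by reading a dataset that is sorted by label.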
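
For a pre-tokenized corpus, shuffling can be done on span offsets rather than on the data itself. The sketch below assumes fixed-length, non-overlapping spans and a per-epoch seed; `epoch_batches` and its parameters are illustrative names, not part of any library.

```python
import numpy as np


def epoch_batches(num_tokens, span_len, batch_size, seed):
    """Yield batches of span start offsets in a seeded random order.

    Spans are non-overlapping windows of span_len tokens; a final partial
    batch is dropped so every batch has the same size.
    """
    starts = np.arange(0, num_tokens - span_len + 1, span_len)
    rng = np.random.default_rng(seed)
    rng.shuffle(starts)  # in-place permutation of the offsets
    for i in range(0, len(starts) - batch_size + 1, batch_size):
        yield starts[i:i + batch_size]
```

Passing the epoch number as `seed` gives a different order each epoch while keeping runs reproducible.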

Filtering quality

Deduplication (exact and near-duplicate). Profanity/PII filters if relevant. Length filtering (drop ultra-short documents). Language filtering (if the model is monolingual). FineWeb's quality filters are open source; reuse them rather than writing your own.
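
The dedup and length steps can be sketched in plain Python as follows. This is a small-scale illustration under stated assumptions: near-duplicates are detected with pairwise Jaccard similarity over word 3-gram shingles, which is quadratic in the number of kept documents, so large corpora normally use MinHash or a similar approximation instead. The thresholds and function names are hypothetical.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text, k=3):
    """Set of overlapping k-word windows from already-normalized text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def filter_docs(docs, min_words=5, near_dup_threshold=0.8):
    """Drop ultra-short docs, exact duplicates, and near-duplicates (first copy wins)."""
    seen_hashes = set()
    kept_shingles = []
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # length filter
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        sh = shingles(norm)
        if any(jaccard(sh, other) >= near_dup_threshold for other in kept_shingles):
            continue  # near-duplicate of a document already kept
        seen_hashes.add(digest)
        kept_shingles.append(sh)
        kept.append(doc)
    return kept
```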

Summary: at this scale, 30B-50B curated tokens beat 1T raw tokens, as the Phi models suggest. Pre-tokenize once. Apply standard quality filtering (dedup, length).
