Large LLMs train on trillions of tokens of broad web data. An SLM trained from scratch on CPU can realistically process only billions, so data choice becomes critical. At small scale, curated high-quality data can beat raw volume, as the Phi models showed.
Open datasets
FineWeb (~15T tokens at release, filtered web). RedPajama (~1.2T tokens in v1, broad LLM data). The Pile (~825 GiB, mixed text/code). For CPU training, 30B-50B tokens is already enough, so pick a manageable subset rather than a full dataset.
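Picking a manageable subset can be as simple as reading documents until a token budget is used up. The sketch below assumes the corpus arrives as an iterable of strings; `take_token_budget` and `whitespace_token_count` are hypothetical names, and the whitespace counter is only a crude stand-in for a real tokenizer.

```python
from typing import Callable, Iterable, Iterator


def take_token_budget(
    docs: Iterable[str],
    count_tokens: Callable[[str], int],
    budget: int,
) -> Iterator[str]:
    """Yield documents in order until adding one more would exceed the budget."""
    used = 0
    for doc in docs:
        n = count_tokens(doc)
        if used + n > budget:
            break
        used += n
        yield doc


def whitespace_token_count(text: str) -> int:
    # Crude stand-in: a real tokenizer usually yields more tokens than words.
    return len(text.split())


if __name__ == "__main__":
    corpus = ["a b c", "d e", "f g h i", "j"]
    subset = list(take_token_budget(corpus, whitespace_token_count, budget=6))
    print(subset)  # ['a b c', 'd e']
```

Taking a prefix is only representative if the source is already shuffled; if it is ordered (for example by crawl date), sampling documents at random would be the safer choice.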
Phi's recipe: synthetic high-quality data
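Microsoft generated textbook-quality explanations and exercises with OpenAI GPT models (the Phi-1 paper used GPT-3.5 for generation and GPT-4 to label data quality). Phi-1/2/3 were trained on this synthetic data plus heavily filtered web data. The resulting ~1B-4B parameter Phi models are reported to compete with 7B-13B Llama models on several benchmarks. The lesson: curated > scaled. The approach can be reproduced with access to any capable teacher LLM.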
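A teacher-driven generation loop might look like the following sketch. It is not Microsoft's pipeline: the prompt template, `generate_synthetic`, and `fake_teacher` are all hypothetical, and `fake_teacher` exists only so the example runs offline. In practice `teacher` would wrap whatever LLM you have access to.

```python
import json
from typing import Callable, Iterable, Iterator

# Hypothetical prompt; the actual Phi prompts are not given in these notes.
PROMPT_TEMPLATE = (
    "Write a short textbook-style explanation of {topic} for a beginner, "
    "followed by one question and its answer."
)


def generate_synthetic(
    topics: Iterable[str],
    teacher: Callable[[str], str],
) -> Iterator[dict]:
    """Ask the teacher model for one training sample per topic."""
    for topic in topics:
        prompt = PROMPT_TEMPLATE.format(topic=topic)
        text = teacher(prompt).strip()
        if text:  # skip empty completions
            yield {"topic": topic, "prompt": prompt, "text": text}


def to_jsonl(records: Iterable[dict]) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def fake_teacher(prompt: str) -> str:
    # Stand-in so the sketch runs offline; replace with a real LLM call.
    return f"[teacher output for: {prompt}]"


if __name__ == "__main__":
    samples = generate_synthetic(["binary search", "hash tables"], fake_teacher)
    print(to_jsonl(samples))
```

Passing the teacher in as a plain callable keeps the loop independent of any particular provider, and makes it easy to add a quality filter between generation and serialization.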
Tokenize once, reuse forever
# Tokenizer + dataset → tokens.bin file (int32)
# Train loop just mmap-reads spans of int32
# No tokenization on critical path
Online tokenization wastes CPU cycles. Pre-tokenize the whole training corpus to a binary file once, then use mmap to read random spans. This is standard for serious SLM training and can be roughly 10× faster than tokenizing online.
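The pipeline outlined in the comments above could be sketched with the standard library alone, as below. The helper names are hypothetical, `toy_tokenize` is a byte-level stand-in for a real tokenizer, and the file layout (a flat run of native-endian int32 ids with no header) is an assumption.

```python
import mmap
import os
import random
import tempfile
from array import array

ITEM = array("i").itemsize  # size of a C int; 4 bytes on common platforms


def toy_tokenize(text):
    # Stand-in tokenizer (UTF-8 bytes as token ids); swap in your real one.
    return list(text.encode("utf-8"))


def write_tokens(docs, path):
    """Tokenize every document once and append the ids to a flat int32 file."""
    total = 0
    with open(path, "wb") as f:
        for doc in docs:
            ids = array("i", toy_tokenize(doc))
            ids.tofile(f)
            total += len(ids)
    return total


def open_tokens(path):
    """Memory-map the token file read-only; the map outlives the file handle."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def read_span(mm, start, length):
    """Copy `length` tokens starting at token index `start` out of the map."""
    out = array("i")
    out.frombytes(mm[start * ITEM:(start + length) * ITEM])
    return out


def random_span(mm, span_len, rng):
    """Read one span at a uniformly random token offset."""
    n_tokens = len(mm) // ITEM
    start = rng.randrange(0, n_tokens - span_len + 1)
    return read_span(mm, start, span_len)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "tokens.bin")
        print(write_tokens(["hello", " world"], path))  # 11
        mm = open_tokens(path)
        print(read_span(mm, 0, 5).tolist())  # [104, 101, 108, 108, 111]
        print(len(random_span(mm, 4, random.Random(0))))  # 4
        mm.close()
```

Only the pages backing the requested span are touched, so the training loop never loads the whole file. With NumPy available, `numpy.memmap` with `dtype=numpy.int32` is a common way to get the same effect with array slicing.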
Data ordering
Pretraining: shuffle the data so that consecutive batches are not correlated. Some recent work suggests curriculum learning (easy → hard) can help stability. Fine-tuning: keep the examples in each batch approximately IID; don't accidentally group them by class.
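For pretraining on a pre-tokenized file, shuffling can be done over span offsets rather than over the data itself. The sketch below is one possible scheme, assuming non-overlapping fixed-length spans; `epoch_order` and `batches` are hypothetical names, and deriving the seed from `seed + epoch` is just one simple way to get a different but reproducible order each epoch.

```python
import random


def epoch_order(n_tokens, span_len, epoch, seed=0):
    """Return shuffled start offsets of non-overlapping spans for one epoch."""
    starts = list(range(0, n_tokens - span_len + 1, span_len))
    random.Random(seed + epoch).shuffle(starts)
    return starts


def batches(starts, batch_size):
    """Group offsets into full batches, dropping any incomplete final batch."""
    for i in range(0, len(starts) - batch_size + 1, batch_size):
        yield starts[i:i + batch_size]


if __name__ == "__main__":
    order = epoch_order(n_tokens=100, span_len=10, epoch=0)
    for batch in batches(order, batch_size=5):
        print(batch)
```

The same idea applies to fine-tuning: shuffle example indices before batching so that examples stored class by class do not end up in the same batch.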
Filtering quality
Deduplication (exact and near-duplicate). Profanity/PII filters if relevant. Length filtering (drop ultra-short documents). Language filtering (if the model is monolingual). FineWeb's quality filters are open source; reuse them.
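A toy version of the length and deduplication filters is sketched below. It is not FineWeb's pipeline: the function names, the 3-word shingles, and the thresholds are illustrative assumptions, and the pairwise Jaccard comparison is quadratic, so it only suits small corpora. Large pipelines typically use MinHash-style hashing for near-duplicates instead.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text, k=3):
    """Set of k-word windows used to compare documents."""
    words = normalize(text).split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0


def filter_corpus(docs, min_words=5, near_dup_threshold=0.8):
    """Drop ultra-short documents, exact duplicates, and near-duplicates."""
    seen_hashes = set()
    kept, kept_shingles = [], []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # length filter
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        sh = shingles(doc)
        if any(jaccard(sh, other) >= near_dup_threshold for other in kept_shingles):
            continue  # near-duplicate of a document already kept
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(sh)
    return kept


if __name__ == "__main__":
    docs = [
        "The quick brown fox jumps over the lazy dog",
        "the quick  brown fox jumps over the lazy dog",
        "too short",
        "The quick brown fox jumps over the lazy dog today",
        "Pre-tokenize the corpus once and memory-map it during training",
    ]
    for doc in filter_corpus(docs):
        print(doc)
```

In the demo, the second document is removed as an exact duplicate, the third by the length filter, and the fourth as a near-duplicate (Jaccard 7/8 against the first), leaving the first and last documents.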