Putting it all together: train and serve a small (50M-350M) language model entirely on a CPU workstation. Practical and educational, even if you wouldn't deploy this at scale. Here's the full path.
Hardware target
# Reasonable workstation:
# - 32 GB RAM
# - 16-core CPU with AVX-512 (Intel 12th gen+ or AMD 7000+)
# - 1 TB NVMe SSD
# - No GPU (the point)
# What it can train:
# - 50-125M params from scratch (small but works)
# - 350M with grad checkpointing + BF16
#
# What it can run inference on:
# - Up to 7B Q4 quantized (~4 GB RAM)Modest hardware. CPU training is feasible for SLMs you can practice the full lifecycle on. Inference scales well past training size with quantization.
Training stack
# PyTorch with CPU backend
# - torch.compile for kernel fusion
# - torch.autocast(device='cpu', dtype=torch.bfloat16) for mixed precision
# - AdamW optimizer
# - cosine schedule with warmup
# - gradient checkpointing
#
# Data: streaming from disk via dataloader workers
# Tokenizer: tiktoken or SentencePiece pretrainedDon't roll your own framework. Use PyTorch with proper BF16 and OpenBLAS/MKL. Stream data; don't load all in RAM. Save checkpoints every N steps to disk.
Reference budget — 125M params
# Train tinyllama-style 125M on Wikipedia + Stories
# - batch_micro = 1, accumulation = 32, effective batch 32
# - seq = 1024
# - 3e-4 peak LR, 200-step warmup
# - 100K steps ~ 3 days on 16-core CPU
# - Final perplexity: ~25 on WikiText-103Won't be GPT-4; will produce coherent paragraph-length text. Educational value high. Inference for prompting: 10-30 tokens/sec at INT4 on the same hardware.
Inference stack
# Convert PyTorch weights to GGUF:
# python convert-hf-to-gguf.py model_dir/
# Quantize:
# ./llama-quantize model.gguf model-q4_k_m.gguf q4_k_m
# Serve:
# ./llama-server -m model-q4_k_m.gguf -ngl 0 -c 4096llama.cpp is the production-quality CPU inference path. Convert + quantize + serve in three commands. OpenAI-compatible API exposed. Good enough for prototypes, personal assistants, prototypes.
Beyond — where GPUs become necessary
Above ~350M params, CPU training time becomes weeks. Above 7B for inference (even Q4), CPU is too slow for interactive use. The crossover: rent a GPU hour ($1-5) when CPU training stalls. CPU is for: small-model lifecycle, edge inference, learning. GPU for: scale.