Advertisement
Per-token latency lower bound = model_bytes / RAM_bandwidth.
What you're seeing
At batch=1, must read all weights per token. RAM bandwidth is the floor on inference speed.
★ KEY TAKEAWAY
CPU SLM inference is memory-bandwidth-bound. Per-token time ≈ model_size / RAM_BW.
▶ WHAT TO TRY
- Pick a model size — see the BW-bound lower latency bound.
- Compare DDR5 (70 GB/s) vs Apple M3 Max (200 GB/s) — proportional speedup.
- This is why INT4 (4× smaller weights) is the standard for CPU inference.