CPU Inference Latency Breakdown

Advertisement

Model size RAM BW

Per-token latency lower bound = model_bytes / RAM_bandwidth.

At batch=1, must read all weights per token. RAM bandwidth is the floor on inference speed.

★ KEY TAKEAWAY

CPU SLM inference is memory-bandwidth-bound. Per-token time ≈ model_size / RAM_BW.

▶ WHAT TO TRY