▶ Interactive Lab

Cache Hits and Misses

How access latency stacks up across L1, L2, L3, and RAM.

Tiny working set: fits in L1, fast. Mid-sized: served from L2/L3. Large: spills to RAM, slow.

What you're seeing

Approximate sizes and latencies: L1 ~32 KB, ~1 ns; L2 ~512 KB, ~3 ns; L3 ~32 MB, ~10 ns; RAM ~80 ns. Model weights are larger than L3, so they cannot stay cached and each pass over them is served from RAM.
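
A minimal sketch of the model behind the slider, using the approximate figures quoted above. The step-function behavior is a simplifying assumption: real hardware shows gradual transitions because of associativity, prefetching, and caches shared between cores. The names `LEVELS` and `serving_level` are hypothetical, chosen for illustration.

```python
# Approximate capacities (bytes) and latencies (ns) from the text above.
LEVELS = [
    ("L1", 32 * 1024, 1.0),
    ("L2", 512 * 1024, 3.0),
    ("L3", 32 * 1024 * 1024, 10.0),
]
RAM_LATENCY_NS = 80.0


def serving_level(working_set_bytes):
    """Return (name, latency_ns) for the smallest level that holds the working set.

    Assumption: a working set is served entirely by the first level it fits in.
    """
    for name, capacity, latency_ns in LEVELS:
        if working_set_bytes <= capacity:
            return name, latency_ns
    return "RAM", RAM_LATENCY_NS


# Sweep a few working-set sizes, roughly like moving the slider from 1 KB to 1 GB.
for size in (1024, 256 * 1024, 8 * 1024 * 1024, 1024 ** 3):
    name, latency_ns = serving_level(size)
    print(f"{size:>12} B -> {name} (~{latency_ns:g} ns per access)")
```

Under these assumptions, anything above 32 MB falls through to the RAM branch, which is where model weights land.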

★ KEY TAKEAWAY
CPU memory hierarchy: L1 (32 KB, ~1 ns) → L2 (512 KB, ~3 ns) → L3 (32 MB, ~10 ns) → RAM (~80 ns). Small language model (SLM) weights are far larger than L3, so reading them pays the RAM cost.
▶ WHAT TO TRY
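  • Slide the Working set control from 1 KB to 1 GB.
  • Notice how the data 'lives' in successively slower levels as the working set grows.
  • This is why CPU SLM inference is memory-bandwidth-bound at batch=1: each generated token reads every weight once, with no reuse across a batch.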
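
The bandwidth-bound claim can be turned into a rough ceiling on generation speed. The sketch below assumes every weight is read from RAM once per generated token at batch=1, and it ignores compute time, the KV cache, and activations. The model size, precision, and memory bandwidth in the example are illustrative assumptions, not measurements, and `max_tokens_per_second` is a hypothetical helper.

```python
L3_BYTES = 32 * 1024 * 1024  # approximate L3 capacity quoted in this lab


def max_tokens_per_second(n_params, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound on tokens/s when each token must stream all weights from RAM."""
    weight_bytes = n_params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


# Illustrative assumptions: 1e9 parameters, 1 byte each, 50 GB/s memory bandwidth.
n_params = 1e9
bytes_per_param = 1
bandwidth = 50e9

weight_bytes = n_params * bytes_per_param
print(f"weights exceed L3: {weight_bytes > L3_BYTES}")
print(f"ceiling: {max_tokens_per_second(n_params, bytes_per_param, bandwidth):.1f} tokens/s")
```

Halving the bytes per parameter doubles this ceiling, which is one reason quantization helps CPU inference even when arithmetic throughput is unchanged.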