▶ Interactive Lab

Forward vs Backward FLOPs

The backward pass costs ~2× the forward pass, so a full training step costs ~3× the forward pass.

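Forward ≈ 2·params·seq FLOPs, where seq is the number of tokens processed. Backward ≈ 2× forward ≈ 4·params·seq FLOPs, because it computes gradients with respect to both the activations and the weights.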

What you're seeing

One training step costs ~3× the compute of an inference (forward-only) pass over the same tokens, plus the optimizer step.
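
The arithmetic behind that figure can be sketched in a few lines of Python. This is a rough estimate under the rule of thumb stated here, not the lab's own code: the function name step_flops and the example sizes are hypothetical, and the optimizer step is left out.

```python
def step_flops(params: int, seq: int) -> dict:
    """Estimate FLOPs for one training step using the 2-4-6 rule of thumb.

    params: number of model parameters
    seq:    number of tokens processed in the step
    The optimizer step is NOT included (assumption: it is small by comparison).
    """
    forward = 2 * params * seq      # one multiply and one add per parameter per token
    backward = 2 * forward          # gradients w.r.t. activations and w.r.t. weights
    return {
        "forward": forward,
        "backward": backward,
        "total": forward + backward,  # = 3 x forward = 6 * params * seq
    }


if __name__ == "__main__":
    # Hypothetical example: a 100M-parameter model on 512 tokens.
    for name, value in step_flops(params=100_000_000, seq=512).items():
        print(f"{name:>8}: {value:.3e} FLOPs")
```

The ratio total / forward is 3 for any choice of params and seq, which is where the "~3× inference" figure comes from.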

★ KEY TAKEAWAY
Forward FLOPs ≈ 2·params·seq. Backward ≈ 2× forward. Total step ≈ 3× forward ≈ 6·params·seq, plus the optimizer step.
▶ WHAT TO TRY
  • Pick a model size and sequence length.
  • At 50 GFLOP/s (CPU), a training step takes seconds even for small models.
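
To try the same experiment offline, the step time can be estimated as total FLOPs divided by throughput. The sketch below is a back-of-the-envelope calculation, not a benchmark: step_seconds and the model sizes are hypothetical, it assumes the hardware sustains its full 50 GFLOP/s, and it ignores the optimizer step and memory effects.

```python
def step_seconds(params: int, seq: int, flops_per_second: float = 50e9) -> float:
    """Estimate wall-clock time for one training step.

    Uses total step FLOPs ~= 6 * params * seq (forward 2x + backward 4x).
    flops_per_second defaults to 50 GFLOP/s, the CPU figure used in this lab.
    Assumes 100% utilization, so real steps will be slower.
    """
    total_flops = 6 * params * seq
    return total_flops / flops_per_second


if __name__ == "__main__":
    seq = 512  # hypothetical sequence length
    for params in (10_000_000, 100_000_000, 1_000_000_000):
        t = step_seconds(params, seq)
        print(f"{params:>13,} params, seq={seq}: ~{t:.2f} s per step")
```

With these assumptions, a 100M-parameter model on 512 tokens needs about 6 seconds per step, which matches the "seconds even for small models" observation.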