▶ Interactive Lab

FFN Expansion + Activation

d → d_ff → d. Two matmuls with an activation in between.

FFN: expand to d_ff, activate, project back. Roughly 2/3 of the params in each transformer block (embeddings not counted).

What you're seeing

Standard FFN: x → linear(d, d_ff) → activation → linear(d_ff, d). Params = 2·d·d_ff (biases ignored).
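
A minimal sketch of the standard FFN described above, in dependency-free Python. The toy sizes (d = 8, d_ff = 32), the random weights, and the helper names `linear`, `ffn`, and `ffn_params` are illustrative assumptions, not part of the lab. Biases are omitted, and GELU is picked arbitrarily as the activation.

```python
import math
import random

def gelu(z):
    # Exact GELU: z * Phi(z), where Phi is the standard normal CDF.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def linear(x, W):
    # x has n_in entries; W has n_in rows of n_out weights (no bias).
    n_in, n_out = len(W), len(W[0])
    return [sum(x[i] * W[i][j] for i in range(n_in)) for j in range(n_out)]

def ffn(x, W_in, W_out, act=gelu):
    h = [act(z) for z in linear(x, W_in)]  # expand d -> d_ff, then activate
    return linear(h, W_out)                # project d_ff -> d

def ffn_params(d, d_ff):
    # Two weight matrices: d x d_ff and d_ff x d.
    return 2 * d * d_ff

d, d_ff = 8, 32  # hypothetical toy sizes (4x expansion)
rng = random.Random(0)
W_in = [[rng.gauss(0.0, 0.1) for _ in range(d_ff)] for _ in range(d)]
W_out = [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(d_ff)]
x = [rng.gauss(0.0, 1.0) for _ in range(d)]

y = ffn(x, W_in, W_out)
print(len(y), ffn_params(d, d_ff))  # 8 512
```

Real implementations would use an array library and batched matmuls; the plain lists here only make the two projections and the shapes explicit.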

SwiGLU: 3 projections (gate, up, down), so params = 3·d·d_ff. d_ff is reduced from ~4·d to ~2.67·d (8/3·d) so the total matches the standard FFN's param budget.
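
A sketch of a SwiGLU block and the param-budget arithmetic, under the same assumptions as a bias-free standard FFN. The names `swiglu_ffn` and `swiglu_params` and the toy size d = 12 are hypothetical; the gate uses SiLU (z · sigmoid(z)), the usual choice in SwiGLU.

```python
import math
import random

def silu(z):
    # SiLU / Swish: z * sigmoid(z)
    return z / (1.0 + math.exp(-z))

def linear(x, W):
    # x has n_in entries; W has n_in rows of n_out weights (no bias).
    n_in, n_out = len(W), len(W[0])
    return [sum(x[i] * W[i][j] for i in range(n_in)) for j in range(n_out)]

def swiglu_ffn(x, W_gate, W_up, W_down):
    gate = [silu(z) for z in linear(x, W_gate)]  # d -> d_ff, activated
    up = linear(x, W_up)                         # d -> d_ff, linear
    h = [g * u for g, u in zip(gate, up)]        # elementwise gating
    return linear(h, W_down)                     # d_ff -> d

def swiglu_params(d, d_ff):
    # Three weight matrices: gate, up (d x d_ff each) and down (d_ff x d).
    return 3 * d * d_ff

d = 12                  # hypothetical toy size, divisible by 3
d_ff_std = 4 * d        # standard FFN width: 48
d_ff_glu = 8 * d // 3   # SwiGLU width at ~2.67x: 32

# Same budget: 3 * d * (8/3 * d) == 2 * d * (4 * d) == 8 * d^2
print(swiglu_params(d, d_ff_glu), 2 * d * d_ff_std)  # 1152 1152

rng = random.Random(0)
def init(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

x = [rng.gauss(0.0, 1.0) for _ in range(d)]
y = swiglu_ffn(x, init(d, d_ff_glu), init(d, d_ff_glu), init(d_ff_glu, d))
print(len(y))  # 12
```

When 8/3·d is not an integer, the width has to be rounded, so the match with the standard budget is only approximate.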

★ KEY TAKEAWAY
FFN expands hidden_dim by ~4× (or ~2.67× for SwiGLU), applies an activation, and projects back. It holds roughly 2/3 of the params in each transformer block (attention holds most of the rest; embeddings not counted).
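
Where the 2/3 figure comes from, as a back-of-the-envelope check. This assumes a standard attention block with four d × d projections (Q, K, V, output), and ignores biases, layer norms, and embeddings; `ffn_share` and d = 768 are illustrative choices.

```python
from fractions import Fraction

def ffn_share(d, ffn_weights):
    attn_weights = 4 * d * d  # Q, K, V and output projections (assumed)
    return Fraction(ffn_weights, attn_weights + ffn_weights)

d = 768  # hypothetical hidden size

standard = 2 * d * (4 * d)       # two matrices, d_ff = 4d      -> 8 d^2
swiglu = 3 * d * (8 * d // 3)    # three matrices, d_ff = 8d/3  -> 8 d^2

print(ffn_share(d, standard))  # 2/3
print(ffn_share(d, swiglu))    # 2/3
```

Under these assumptions the FFN has 8·d² weights against 4·d² for attention, which gives 8 / 12 = 2/3. Architectures that change the attention projections or the d_ff multiplier shift this ratio.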
▶ WHAT TO TRY
  • Toggle between ReLU / GELU / SwiGLU — see the activation curve.
  • Increase the d_ff multiplier to see how FFN params scale linearly with d_ff.
  • Switch to SwiGLU: it adds a 3rd projection (gate) but is reported to be empirically slightly better at a matched param budget.
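
To compare the activation curves outside the lab, a small sketch that tabulates each function at a few points. For SwiGLU it is assumed that the plotted curve is the SiLU gate, since SwiGLU itself is a gated product of two projections rather than a pointwise function.

```python
import math

def relu(z):
    return max(0.0, z)

def gelu(z):
    # Exact GELU: z * Phi(z), where Phi is the standard normal CDF.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def silu(z):
    # SiLU / Swish: z * sigmoid(z); used as the gate activation in SwiGLU.
    return z / (1.0 + math.exp(-z))

for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"z={z:+.1f}  relu={relu(z):.3f}  gelu={gelu(z):.3f}  silu={silu(z):.3f}")
```

ReLU is exactly zero for negative inputs, while GELU and SiLU dip slightly below zero there and approach the identity for large positive inputs.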