▶ Interactive Lab

Mixture of Experts Routing

A router picks K of N experts for each token. Watch which experts activate.

Each token activates K of N experts. Total params high; compute low.

What you're seeing

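MoE (Mixture of Experts): N specialist sub-networks ("experts"). Per token, a router scores all N experts and picks the top K (often a small number; Mixtral uses K = 2). Only those K run in the forward pass, and their outputs are combined using the router's weights.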
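The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular model implements it: the function names `route_top_k` and `moe_forward`, the toy experts, and the logit values are all hypothetical, and taking the softmax over only the selected scores is just one common convention.

```python
import math

def route_top_k(logits, k=2):
    """Return [(expert_index, gate_weight), ...] for one token.

    logits: one router score per expert (hypothetical values).
    Gate weights are a softmax over the K selected scores only,
    which is one common convention; others normalize before selecting.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the K largest scores, highest first (ties keep index order).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, logits, k=2):
    """Weighted sum of the K chosen experts' outputs; the others never run."""
    return sum(w * experts[i](x) for i, w in route_top_k(logits, k))

# Four toy "experts" that simply scale their input (assumed for illustration).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
token_logits = [0.1, 2.0, -1.0, 2.0]

chosen = route_top_k(token_logits, k=2)
output = moe_forward(1.0, experts, token_logits, k=2)
```

With these toy logits, experts 1 and 3 tie for the highest score, so each gets half the weight and the other two experts are skipped entirely.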

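Wins: total params can be many times (e.g. 10×) those of a dense model, while per-token compute stays close to that of the smaller dense model, plus a small router cost. Cost: memory stays high, because all experts must be loaded. Used by Mixtral and DeepSeek V3; GPT-4 is widely reported, though not officially confirmed, to use MoE.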
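The gap between stored and active parameters is simple arithmetic. The sketch below uses made-up sizes (8 experts of 100M parameters each, 50M shared parameters for attention, embeddings, and the router); the function name `moe_param_counts` and every number are assumptions for illustration, not figures from any real model.

```python
def moe_param_counts(n_experts, k, params_per_expert, shared_params):
    """Return (total, active) parameter counts for one hypothetical MoE model.

    total  : everything that must sit in memory (all experts loaded).
    active : what a single token actually passes through (K experts).
    """
    total = shared_params + n_experts * params_per_expert
    active = shared_params + k * params_per_expert
    return total, active

# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
total, active = moe_param_counts(
    n_experts=8, k=2, params_per_expert=100_000_000, shared_params=50_000_000
)
print(f"total params in memory: {total:,}")
print(f"active params per token: {active:,}")
print(f"ratio: {total / active:.1f}x")
```

Raising the expert count grows the first number only; raising K grows the second, which is why memory tracks N while compute tracks K.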

★ KEY TAKEAWAY
Router picks top-K experts per token. Memory: all experts loaded. Compute: only K active per token.
▶ WHAT TO TRY
  • Increase Experts and Top-K.
  • Click Route 12 tokens and watch how evenly the load spreads across experts.
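To reproduce the load-balance experiment offline, the sketch below routes a small batch of tokens using random router scores and counts how many tokens land on each expert. It is only a stand-in for the lab: the helper names `route_batch` and `imbalance` are hypothetical, and a real router produces learned scores, not Gaussian noise.

```python
import random
from collections import Counter

def route_batch(n_tokens, n_experts, k, seed=0):
    """Route n_tokens with random scores; return tokens assigned per expert."""
    rng = random.Random(seed)  # fixed seed so repeated runs match
    load = Counter({e: 0 for e in range(n_experts)})
    for _ in range(n_tokens):
        # Stand-in for learned router logits (an assumption of this sketch).
        logits = [rng.gauss(0.0, 1.0) for _ in range(n_experts)]
        top = sorted(range(n_experts), key=lambda i: logits[i], reverse=True)[:k]
        load.update(top)  # each chosen expert receives this token
    return [load[e] for e in range(n_experts)]

def imbalance(load):
    """Busiest expert's load divided by the mean load (1.0 = perfectly even)."""
    mean = sum(load) / len(load)
    return max(load) / mean

load = route_batch(n_tokens=12, n_experts=8, k=2)
print("tokens per expert:", load)
print("imbalance factor:", round(imbalance(load), 2))
```

Every token contributes K assignments, so the counts always sum to tokens × K. With only 12 tokens the spread is usually uneven, and some experts may receive nothing.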