Mixture of Experts Routing

Controls: Experts (N), Top-K (K)

Each token activates only K of N experts, so the total parameter count is high while per-token compute stays low.

What you're seeing

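MoE (Mixture of Experts): the layer holds N specialist sub-networks (experts). For each token, a router scores all N experts and picks the top K (often 2, as in Mixtral). Only those K experts run in the forward pass for that token, and their outputs are combined using the router's weights.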
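The routing step can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the names `route_top_k` and `moe_forward` are hypothetical, the experts are toy scalar functions, and renormalizing with a softmax over only the selected K scores is an assumption (one common choice among several).

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Assumption: weights are a softmax over the selected logits only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_forward(x, router_logits, experts, k):
    """Run only the k selected experts and mix their outputs by router weight."""
    out = 0.0
    for idx, weight in route_top_k(router_logits, k):
        out += weight * experts[idx](x)
    return out

# Toy experts (hypothetical): each is just a scalar function.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
router_logits = [0.1, 2.0, -1.0, 2.0]  # one score per expert, for one token

print(route_top_k(router_logits, k=2))
print(moe_forward(3.0, router_logits, experts, k=2))
```

With K = 2, experts 0 and 2 are never called for this token, which is where the compute saving comes from.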

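Wins: total parameters can be many times those of a dense model (e.g. 10×), while per-token compute stays close to that of a much smaller dense model, because only K experts run per token. The cost: memory stays high, since all N experts must be loaded. Used by Mixtral and DeepSeek V3; GPT-4 is widely reported to use MoE, though OpenAI has not confirmed it.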
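The gap between stored and active parameters is simple arithmetic. The sketch below uses made-up sizes purely for illustration; `moe_param_counts`, `expert_params`, and `shared_params` are hypothetical names, and real models also have router weights and per-layer details that this ignores.

```python
def moe_param_counts(n_experts, k, expert_params, shared_params):
    """Return (total, active) parameter counts for one simplified MoE model.

    total  -> what must sit in memory (every expert is loaded)
    active -> what a single token actually touches (only k experts)
    """
    total = shared_params + n_experts * expert_params
    active = shared_params + k * expert_params
    return total, active

# Hypothetical sizes, in arbitrary units.
total, active = moe_param_counts(n_experts=8, k=2,
                                 expert_params=1_000, shared_params=500)
print("loaded in memory:", total)   # 500 + 8 * 1000
print("used per token  :", active)  # 500 + 2 * 1000
print("ratio           :", total / active)
```

Raising N grows only the memory figure; raising K grows only the per-token figure.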

★ KEY TAKEAWAY

Router picks top-K experts per token. Memory: all experts loaded. Compute: only K active per token.

▶ WHAT TO TRY

Increase Experts and Top-K.
Click Route 12 tokens and watch how evenly the tokens spread across the experts (the load-balance pattern).
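To reproduce the load-balance idea offline, the following sketch routes 12 tokens with random router scores and counts how many tokens land on each expert. The random scores stand in for a trained router (an assumption made only for illustration), and `expert_load` is a hypothetical helper name.

```python
import random

def top_k_indices(scores, k):
    # Indices of the k largest scores.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def expert_load(token_scores, n_experts, k):
    """Count how many tokens are routed to each expert."""
    load = [0] * n_experts
    for scores in token_scores:
        for idx in top_k_indices(scores, k):
            load[idx] += 1
    return load

rng = random.Random(0)  # fixed seed so reruns give the same pattern
n_experts, k, n_tokens = 8, 2, 12

# Assumption: random Gaussian scores in place of a learned router.
token_scores = [[rng.gauss(0.0, 1.0) for _ in range(n_experts)]
                for _ in range(n_tokens)]

load = expert_load(token_scores, n_experts, k)
ideal = n_tokens * k / n_experts  # perfectly even split

print("tokens per expert:", load)
print("ideal per expert :", ideal)
```

The counts always sum to tokens × K; how far each expert sits from the ideal value is the imbalance the visualization shows.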