Each token activates K of N experts. Total params high; compute low.
What you're seeing
MoE (Mixture of Experts): N specialist sub-networks. Per token, a router scores all N experts and picks the top K (often 2, as in Mixtral). Only those K participate in the forward pass.
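A minimal sketch of top-K routing for a single token, in plain Python. The function names, the toy scalar "experts", and the router scores are all made up for illustration. Normalizing the gate weights with a softmax over only the selected logits is one common variant, assumed here; other designs normalize differently.

```python
import math

def top_k_route(logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their gate weights."""
    # Indices of the k largest router logits (ties broken by lower index).
    chosen = sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]
    # Softmax over the selected logits only, so the k gate weights sum to 1.
    m = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - m) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the k selected experts' outputs; the other experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Hypothetical toy setup: N = 4 "experts" that each scale a scalar input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 2.0]  # made-up router scores for one token

routes = top_k_route(router_logits, k=2)            # experts 1 and 3, equal weight
y = moe_forward(1.0, router_logits, experts, k=2)   # 0.5 * 2.0 + 0.5 * 4.0
```

Only the two selected experts are called inside `moe_forward`, which is where the compute saving comes from.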
Wins: total params can be 10× a dense model's, while per-token compute stays close to that of the smaller dense model. Cost: memory stays high, since all experts must be loaded. Used by Mixtral and DeepSeek V3, and reportedly by GPT-4 (not officially confirmed).
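The params-versus-compute gap is simple arithmetic. A rough sketch with hypothetical numbers (20 experts, top-2, one million parameters per expert), counting expert weights only and ignoring attention, embeddings, and the router:

```python
def moe_expert_params(n_experts, top_k, params_per_expert):
    """Return (total, active) expert parameter counts for one MoE layer."""
    total = n_experts * params_per_expert   # must all sit in memory
    active = top_k * params_per_expert      # actually used per token
    return total, active

# Hypothetical sizes, chosen so the ratio comes out to 10x.
total, active = moe_expert_params(n_experts=20, top_k=2, params_per_expert=1_000_000)
ratio = total / active  # memory footprint relative to per-token compute
```

With these numbers the layer stores 20M expert parameters but touches only 2M per token.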
★ KEY TAKEAWAY
Router picks top-K experts per token. Memory: all experts loaded. Compute: only K active per token.
▶ WHAT TO TRY
- Increase Experts and Top-K.
- Click Route 12 tokens to see the load-balance pattern across experts.
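To get a feel for the load-balance pattern outside the demo, here is a small simulation. It is only a sketch: random Gaussian scores stand in for a learned router, and the defaults (12 tokens, 8 experts, top-2, fixed seed) are assumptions, not the demo's actual settings.

```python
import random

def route_tokens(n_tokens=12, n_experts=8, k=2, seed=0):
    """Count how many tokens land on each expert under top-k routing."""
    rng = random.Random(seed)
    load = [0] * n_experts
    for _ in range(n_tokens):
        # Stand-in for the router: one random score per expert.
        logits = [rng.gauss(0.0, 1.0) for _ in range(n_experts)]
        # Keep the k best-scoring experts for this token.
        chosen = sorted(range(n_experts), key=lambda i: -logits[i])[:k]
        for e in chosen:
            load[e] += 1
    return load

load = route_tokens()
ideal = 12 * 2 / 8  # perfectly balanced load: 3 token slots per expert
print(load, "ideal per expert:", ideal)
```

Each token fills K slots, so the counts always sum to tokens × K; how far individual experts stray from `ideal` is the imbalance that load-balancing techniques try to reduce.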