Each model picks different depth/width/heads ratios.
What you're seeing
Phi: dense, no GQA. Qwen: deep with aggressive GQA. Gemma: huge vocab.
★ KEY TAKEAWAY
Phi: dense, no GQA. Qwen: deep, with 8× GQA (eight query heads share each key/value head). Gemma: huge vocab (256K tokens). Same Transformer recipe; different bets.
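To make the GQA bet concrete, here is a minimal sketch of the arithmetic, assuming a standard decoder that caches one key and one value vector per KV head per layer. The `AttnConfig` class and the numbers in it are illustrative placeholders, not the published configs of Phi, Qwen, or Gemma.

```python
from dataclasses import dataclass


@dataclass
class AttnConfig:
    """Hypothetical container for the attention-related hyperparameters."""
    n_layers: int
    n_heads: int      # query heads
    n_kv_heads: int   # key/value heads (equal to n_heads means no GQA)
    head_dim: int


def gqa_ratio(cfg: AttnConfig) -> int:
    """Number of query heads that share one key/value head."""
    if cfg.n_heads % cfg.n_kv_heads != 0:
        raise ValueError("n_heads must be a multiple of n_kv_heads")
    return cfg.n_heads // cfg.n_kv_heads


def kv_cache_bytes_per_token(cfg: AttnConfig, bytes_per_value: int = 2) -> int:
    """KV-cache size for one token: K and V, for every layer and KV head.

    bytes_per_value=2 assumes a 16-bit cache (an assumption, not a given).
    """
    return 2 * cfg.n_layers * cfg.n_kv_heads * cfg.head_dim * bytes_per_value


# Illustrative values only: same depth and width, different KV-head counts.
no_gqa = AttnConfig(n_layers=32, n_heads=32, n_kv_heads=32, head_dim=128)
gqa_8x = AttnConfig(n_layers=32, n_heads=32, n_kv_heads=4, head_dim=128)

for name, cfg in [("no GQA", no_gqa), ("8x GQA", gqa_8x)]:
    print(f"{name}: ratio={gqa_ratio(cfg)}, "
          f"KV cache per token={kv_cache_bytes_per_token(cfg)} bytes")
```

Under these assumptions the KV cache shrinks by the same factor as the GQA ratio, which is why a deeper model can afford aggressive GQA at long context lengths.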
▶ WHAT TO TRY
- Compare hyperparams: depth, heads, vocab.
- Keep in mind that architecture has stabilized; data and post-training now differentiate the leaders.
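One way to run the comparison suggested above is to line up a few derived numbers per model. The sketch below assumes each config is a plain dict with hypothetical keys (`num_layers`, `hidden_size`, `num_heads`, `num_kv_heads`, `vocab_size`); the three entries are made-up placeholders that stand in for "dense, no GQA", "deep with GQA", and "huge vocab", not real model configs.

```python
def summarize(cfg: dict) -> dict:
    """Derive a few comparable quantities from one (hypothetical) config dict."""
    d_model = cfg["hidden_size"]
    return {
        "depth": cfg["num_layers"],
        "width": d_model,
        # Width per layer: a rough "wide vs. deep" indicator.
        "width_per_layer": d_model / cfg["num_layers"],
        # Query heads per key/value head (1 means no GQA).
        "gqa_ratio": cfg["num_heads"] // cfg["num_kv_heads"],
        # Parameters in the input embedding table alone.
        "embedding_params": cfg["vocab_size"] * d_model,
    }


# Placeholder values for illustration only.
configs = {
    "dense_mha": {"num_layers": 32, "hidden_size": 3072,
                  "num_heads": 32, "num_kv_heads": 32, "vocab_size": 32_000},
    "deep_gqa": {"num_layers": 80, "hidden_size": 8192,
                 "num_heads": 64, "num_kv_heads": 8, "vocab_size": 152_000},
    "big_vocab": {"num_layers": 18, "hidden_size": 2048,
                  "num_heads": 8, "num_kv_heads": 8, "vocab_size": 256_000},
}

rows = {name: summarize(cfg) for name, cfg in configs.items()}

for name, row in rows.items():
    print(f"{name:10s} depth={row['depth']:3d} width={row['width']:5d} "
          f"gqa={row['gqa_ratio']}x emb_params={row['embedding_params']:,}")
```

Swapping in values read from a model's published configuration would show where that model sits on each axis; the key names would need to be adapted to whatever that config file uses.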