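Tied: lm_head = embedding.T, so the output projection reuses the input embedding matrix. Saves V·d params, where V is the vocabulary size and d is the model width.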
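A minimal sketch of what tying means in practice, using plain Python lists rather than a real framework. The helper names (`make_embedding`, `embed`, `tied_logits`) and the tiny sizes are hypothetical, chosen only for illustration: one V×d matrix is used both to look up the input vector and, transposed, to score every vocabulary entry.

```python
import random

def make_embedding(vocab_size, d_model, seed=0):
    """Build one V x d matrix of small random weights (illustrative init)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
            for _ in range(vocab_size)]

def embed(embedding, token_id):
    """Input side: look up the row for a token id."""
    return embedding[token_id]

def tied_logits(embedding, hidden):
    """Output side: lm_head = embedding.T, so the logit for token v
    is the dot product of the hidden state with embedding row v."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in embedding]

V, d = 8, 4                      # toy sizes, assumed for the example
E = make_embedding(V, d)
hidden = embed(E, 3)             # stand-in for a final hidden state
logits = tied_logits(E, hidden)  # one score per vocabulary entry

tied_params = V * d              # one shared matrix
untied_params = 2 * V * d        # separate embedding and lm_head
print(untied_params - tied_params)  # parameters saved by tying: V * d
```

An untied model would instead hold a second, independently trained d×V matrix for the output side, which is where the extra V·d parameters come from.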
What you're seeing
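For SLMs (small d), the saving is a significant share of the model: the embedding table grows linearly in d, while the transformer layers grow roughly as d², so V·d is a larger fraction of a small model. For 70B-scale models the share is small, and the embeddings are often kept untied.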
★ KEY TAKEAWAY
Tied embeddings: lm_head = embedding.T. Saves V·d params (often ~10% of an SLM). Standard for small models.
▶ WHAT TO TRY
- Slide vocab V from 8K to 200K: the saving is V·d, so it grows linearly with V.
- Larger Llama models keep them untied (they have params to spare); Gemma and the small Qwen models tie them, as do some Phi models.
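The vocab sweep suggested above can also be reproduced offline. This is a rough sketch under stated assumptions: the width `D_MODEL` and the non-embedding parameter count `OTHER_PARAMS` are made-up round numbers, not taken from any particular model, and the function names are hypothetical.

```python
def embedding_params(vocab_size, d_model):
    """Parameters in one V x d embedding matrix."""
    return vocab_size * d_model

def tying_saving_fraction(vocab_size, d_model, other_params):
    """Share of the untied model's total parameters removed by tying."""
    saved = embedding_params(vocab_size, d_model)
    untied_total = other_params + 2 * saved  # embedding + separate lm_head
    return saved / untied_total

D_MODEL = 2048                # assumed model width
OTHER_PARAMS = 1_000_000_000  # assumed non-embedding parameters

for vocab in (8_000, 32_000, 128_000, 200_000):
    saved = embedding_params(vocab, D_MODEL)
    share = tying_saving_fraction(vocab, D_MODEL, OTHER_PARAMS)
    print(f"V={vocab:>7,}  saved={saved / 1e6:8.1f}M  share={share:.1%}")
```

Raising `OTHER_PARAMS` toward tens of billions while holding V fixed shows the share shrinking, which matches the observation that large models have less to gain from tying.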