Small language models can rarely afford separate matrices for the input embedding (V × d) and the output projection (d × V). Most tie them: the output projection is the input embedding, transposed. This saves V·d parameters, often 5-10% of total model size. The math justifies it, and the quality cost is small.
Untied: separate W_out
E ∈ ℝ^(V × d) # input embedding
W_out ∈ ℝ^(d × V) # output projection (separate weights)
# logits[i] = h[i] · W_out
# Embedding cost: V * d
# W_out cost: V * d
# Total: 2 * V * d
Used by Llama 2/3 (which have enough parameters to spare). Slightly higher quality, at the price of V·d extra parameters that many tasks do not strictly need.
Tied: shared transposed
# Use the embedding matrix for output:
logits[i] = h[i] · Eᵀ
# Same matrix; one storage; one set of gradients
# Total params: V * d (not 2*V*d)
Used by GPT-2, the original T5, and the smaller Qwen models, among others. Saves substantial parameters at a slight quality cost (negligible for most tasks). At Phi-3-mini's dimensions (V=32K, d=3072), a single V × d matrix is ~98M parameters (~2.5% of the model).
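The parameter arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not taken from any library: the function names are hypothetical, "32K" is read as 32,000, and the 3.8B total model size is an assumed figure used only to turn the saving into a percentage.

```python
def embedding_params(vocab_size: int, d_model: int, tied: bool) -> int:
    """Parameters spent on the input embedding plus the output projection."""
    per_matrix = vocab_size * d_model
    # Tied: one shared V x d matrix. Untied: E and W_out are stored separately.
    return per_matrix if tied else 2 * per_matrix


def tying_savings(vocab_size: int, d_model: int, total_params: int) -> tuple[int, float]:
    """Return (parameters saved by tying, that saving as a fraction of total_params)."""
    untied = embedding_params(vocab_size, d_model, tied=False)
    tied = embedding_params(vocab_size, d_model, tied=True)
    saved = untied - tied
    return saved, saved / total_params


# Assumed figures: V = 32,000 ("32K"), d = 3072, and a 3.8B-parameter model.
saved, fraction = tying_savings(32_000, 3072, 3_800_000_000)
print(f"saved {saved:,} params ({fraction:.1%} of the model)")
```

The saving is always exactly one V × d matrix, so its share of the model grows as the model shrinks or the vocabulary grows, which is why tying matters most for small models.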
Why does it work?
The input embedding maps token → semantic vector. The output projection maps semantic vector → token. These are approximately inverse operations, so the two matrices should be related. Tying enforces this exactly: the logit for a token becomes the dot product of the hidden state with that token's own embedding. The empirical evidence: tied models trained on the same data score within 0.5% of untied counterparts.
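A toy example may make the "same matrix in both directions" idea concrete. The sketch below uses a made-up 4-token, 3-dimensional embedding table and hypothetical helper names. It skips the transformer layers that sit between lookup and projection in a real model, and it ignores the effect of embedding norms.

```python
# Toy tied model: V = 4 tokens, d = 3 dimensions. All values are made up.
E = [
    [1.0, 0.0, 0.0],  # token 0
    [0.0, 1.0, 0.0],  # token 1
    [0.0, 0.0, 1.0],  # token 2
    [0.7, 0.7, 0.0],  # token 3
]


def embed(token_id: int) -> list[float]:
    """Input side: look up row token_id of E."""
    return E[token_id]


def tied_logits(h: list[float]) -> list[float]:
    """Output side: logits = h · Eᵀ, one dot product per row of the same E."""
    return [sum(h_k * e_k for h_k, e_k in zip(h, row)) for row in E]


# A hidden state that points mostly along token 1's embedding.
h = [0.1, 0.9, 0.0]
logits = tied_logits(h)
# Index of the largest logit, i.e. the token the tied head would predict.
predicted = max(range(len(logits)), key=logits.__getitem__)
print(predicted, logits)
```

Because `embed` and `tied_logits` read the same table, any gradient update to a row of `E` changes both how that token is read in and how strongly it is predicted.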
Other parameter sharing
ALBERT (a BERT variant) shares weights across all transformer layers: massive parameter savings for a modest quality cost. This is not common in modern decoder-only LLMs. RecurrentGemma: shares some FFN weights. Sharing of this kind is mostly research-stage in 2026.
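To get a feel for the scale of cross-layer sharing, here is a rough counting sketch. It is an approximation under stated assumptions rather than any model's exact layout: each layer is assumed to hold 4·d² attention weights and 2·4·d² feed-forward weights, biases and layer norms are ignored, and the sizes (d = 768, 12 layers) are illustrative only. The function names are hypothetical.

```python
def layer_params(d_model: int, ffn_mult: int = 4) -> int:
    """Rough weight count for one transformer layer, ignoring biases and norms.

    Assumes 4 * d^2 for attention (Q, K, V, output) and 2 * ffn_mult * d^2
    for a two-matrix feed-forward block.
    """
    attention = 4 * d_model * d_model
    feed_forward = 2 * ffn_mult * d_model * d_model
    return attention + feed_forward


def stack_params(d_model: int, n_layers: int, shared: bool) -> int:
    """Layer-stack weights: one copy if shared across layers, else n_layers copies."""
    copies = 1 if shared else n_layers
    return copies * layer_params(d_model)


# Illustrative sizes only (d = 768, 12 layers).
unshared = stack_params(768, 12, shared=False)
shared = stack_params(768, 12, shared=True)
print(f"unshared: {unshared:,}  shared: {shared:,}")
```

Under these assumptions the layer stack shrinks by a factor equal to the number of layers, while compute per forward pass stays the same because every layer is still executed.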
Storage implication
Tied embeddings mean the checkpoint holds one matrix instead of two. Both Hugging Face and GGUF files handle this without duplicating storage: the shared tensor is written once, and the loader reuses it for the output projection. To verify, check config.tie_word_embeddings in HF, or look for a missing lm_head.weight tensor.
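The two checks mentioned above can be sketched with the standard library alone. This is an illustration under assumptions: `tying_signals` is a hypothetical helper, the config text and tensor names below are made-up sample inputs, and in practice the tensor names would come from whatever tool you use to list a checkpoint's contents.

```python
import json


def tying_signals(config_json: str, tensor_names: list[str]) -> dict:
    """Collect the two hints that a checkpoint uses tied embeddings.

    config_flag is the tie_word_embeddings value from the config text
    (None if the key is absent); lm_head_stored says whether a separate
    lm_head.weight tensor appears in the checkpoint's tensor list.
    """
    config = json.loads(config_json)
    return {
        "config_flag": config.get("tie_word_embeddings"),
        "lm_head_stored": "lm_head.weight" in tensor_names,
    }


# Made-up sample inputs for illustration.
sample_config = '{"vocab_size": 32000, "hidden_size": 3072, "tie_word_embeddings": true}'
sample_tensors = ["model.embed_tokens.weight", "model.norm.weight"]

signals = tying_signals(sample_config, sample_tensors)
print(signals)
```

A true flag together with no stored `lm_head.weight` is the consistent tied case. If the two signals disagree, it is worth inspecting the checkpoint more closely before trusting either one.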