Small language models often cannot afford both a separate input embedding (V·d parameters) and a separate output projection (another d·V). Most tie them: the output projection is the input embedding, transposed. This saves V·d parameters, often 5-10% of total model size. The intuition supports it, and the quality cost is small.


Untied: separate W_out

E       ∈ ℝ^(V × d)    # input embedding
W_out   ∈ ℝ^(d × V)    # output projection (separate weights)

# logits[i] = h[i] · W_out
# Embedding cost: V * d
# W_out cost:    V * d
# Total:        2 * V * d

Used by Llama 2 and Llama 3 (which have enough parameters to spare). Quality is slightly higher, at the cost of V·d extra parameters that many tasks do not strictly need.

Tied: shared matrix, transposed

# Use the embedding matrix for output:
logits[i] = h[i] · Eᵀ

# Same matrix; one storage; one set of gradients
# Total params: V * d  (not 2*V*d)
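
As a rough illustration of the tied case, the pure-Python sketch below uses one matrix for both the token lookup and the output projection. The class name `TiedHead`, its method names, and the toy sizes are invented for this example; a real model would use a tensor library rather than nested lists.

```python
import random


class TiedHead:
    """Toy tied embedding: one V x d matrix serves as input and output weights."""

    def __init__(self, vocab_size: int, dim: int, seed: int = 0):
        rng = random.Random(seed)
        # E has shape (V, d); it is the only weight matrix stored.
        self.E = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                  for _ in range(vocab_size)]

    def embed(self, token_id: int) -> list[float]:
        # Input side: row lookup, E[token_id].
        return self.E[token_id]

    def logits(self, h: list[float]) -> list[float]:
        # Output side: h · Eᵀ, i.e. one dot product per vocabulary row.
        return [sum(hk * ek for hk, ek in zip(h, row)) for row in self.E]

    def num_params(self) -> int:
        return len(self.E) * len(self.E[0])


head = TiedHead(vocab_size=5, dim=3)
h = [0.1, -0.2, 0.3]           # stand-in for a final hidden state
print(len(head.logits(h)))     # one logit per vocabulary entry
print(head.num_params())       # V * d, not 2 * V * d
```

Because `embed` and `logits` read the same `self.E`, any update to that matrix changes both the input and the output side at once, which is what "one set of gradients" refers to.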

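Used by GPT-2, Phi, Qwen, and T5 in at least some sizes or versions; the choice can vary within a model family, so check each checkpoint's config. Tying saves a substantial number of parameters at a slight quality cost (negligible for most tasks). For dimensions like Phi-3's (V=32K, d=3072), one V×d matrix is ~98M params (~2.5% of the model).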
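
The savings figure is simple arithmetic, sketched below. The helper name `tying_savings` is made up for this example, and the 3.8B total is an assumed model size used only to turn the count into a percentage.

```python
def tying_savings(vocab_size: int, dim: int, total_params: int) -> tuple[int, float]:
    """Return (parameters saved by tying, that count as a fraction of total_params)."""
    saved = vocab_size * dim          # one V x d matrix is no longer stored
    return saved, saved / total_params


# V = 32,000 and d = 3072 as in the text; the 3.8B total is an assumption.
saved, fraction = tying_savings(32_000, 3072, 3_800_000_000)
print(f"{saved:,} params saved ({fraction:.2%} of the model)")
```

The same function shows why the saving matters more for small models: holding V and d fixed, the fraction grows as `total_params` shrinks.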


Why does it work?

The input embedding maps a token to a semantic vector. The output projection maps a semantic vector back to a token. These are roughly inverse operations, so the two matrices should be related, and tying enforces that relationship exactly. The empirical evidence: tied models trained on the same data score within 0.5% of untied counterparts.
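
The "inverse operations" intuition can be checked on a toy example. In the sketch below, the 4-token vocabulary and its vectors are hypothetical, and the hidden state is simply the token's own embedding, which a real model would never produce exactly. With unit-length rows, projecting an embedding through Eᵀ gives its own token the highest logit.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Hypothetical 4-token vocabulary with d = 3; rows are scaled to unit length.
E = [normalize(row) for row in [
    [1.0, 0.2, 0.0],
    [0.0, 1.0, 0.3],
    [0.4, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]]


def tied_logits(h):
    # logits = h · Eᵀ: one dot product against each embedding row.
    return [sum(a * b for a, b in zip(h, row)) for row in E]


for token_id, row in enumerate(E):
    logits = tied_logits(row)          # feed the embedding straight back out
    predicted = max(range(len(E)), key=lambda j: logits[j])
    print(token_id, predicted)         # the two ids match for every token
```

This only illustrates the geometry: a row's dot product with itself is 1, and with any other non-parallel unit vector it is smaller.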

Other parameter sharing

ALBERT (a BERT variant) shares weights across all transformer layers, giving massive parameter savings for a modest quality cost. This is not common in modern decoder-only LLMs. RecurrentGemma shares some FFN weights. These schemes are mostly research-stage in 2026.
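
To get a feel for the scale of cross-layer sharing, here is a back-of-the-envelope sketch. It assumes a standard block with four d×d attention projections and a two-matrix FFN with 4× expansion, and it ignores biases, layer norms, and embeddings. The function names and the sizes (12 layers, d = 768) are illustrative, not taken from any specific checkpoint.

```python
def block_params(dim: int, ffn_mult: int = 4) -> int:
    """Approximate weights in one transformer block, ignoring biases and norms."""
    attention = 4 * dim * dim              # Q, K, V and output projections
    ffn = 2 * dim * (ffn_mult * dim)       # up- and down-projection
    return attention + ffn


def stack_params(num_layers: int, dim: int, shared: bool) -> int:
    # With cross-layer sharing, every layer reuses one block's weights.
    return block_params(dim) * (1 if shared else num_layers)


# Illustrative sizes only.
print(stack_params(12, 768, shared=False))
print(stack_params(12, 768, shared=True))
```

Under these assumptions the shared stack stores 1/12 of the block weights, although compute per forward pass is unchanged because every layer still runs.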

Storage implication

Tied embeddings mean the file holds one matrix instead of two. Both Hugging Face and GGUF formats handle this without duplicating storage: the output tensor is typically left out of the file and the loader reuses the embedding matrix. To verify, check config.tie_word_embeddings in HF, or look for a missing lm_head.weight.
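
Both checks can be done without loading the model. The sketch below reads the flag from a local config.json and lists tensor names from a single .safetensors file by parsing only its JSON header. The function names are invented here, and `lm_head.weight` is an assumed tensor name, since architectures name their output tensor differently.

```python
import json
import struct


def config_says_tied(config_path: str):
    """Return the tie_word_embeddings flag from config.json, or None if absent."""
    with open(config_path, "r", encoding="utf-8") as f:
        return json.load(f).get("tie_word_embeddings")


def safetensors_tensor_names(path: str) -> set[str]:
    """List tensor names by reading only the safetensors JSON header."""
    with open(path, "rb") as f:
        # The file starts with an 8-byte little-endian header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)       # optional free-form metadata entry
    return set(header)


def looks_tied(path: str, head_name: str = "lm_head.weight") -> bool:
    # head_name is an assumption; check the architecture's actual tensor name.
    return head_name not in safetensors_tensor_names(path)
```

A `None` from `config_says_tied` means the key is not written in the file, in which case the model class's default applies. For sharded checkpoints, the tensor names would need to be collected across all shards before concluding that the output tensor is missing.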

Tied embeddings: W_out = Eᵀ. Saves V·d params. Standard for SLMs. Verify in config.tie_word_embeddings.