After L transformer blocks, you have a hidden state h ∈ ℝ^d per token. To predict the next token, project h to vocabulary size and apply softmax. This final projection is typically the largest single matmul in an inference forward pass and is often a serving bottleneck.
Linear projection to vocab
h_final ∈ ℝ^(N × d) (after L blocks + final norm)
logits = h_final · W_out ∈ ℝ^(N × V)
W_out ∈ ℝ^(d × V)

V is the vocabulary size, 32K to 130K for modern LLMs. d is 768 to 4096. So W_out can reach 4096 × 130000 ≈ 530M parameters, often the single largest weight tensor in the model.
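To make the shapes concrete, here is a minimal pure-Python sketch of the projection. The helper name `project_to_vocab` and the toy sizes N=3, d=4, V=10 are assumptions chosen for illustration; a real model would use an optimized matmul kernel rather than nested lists.

```python
import random

def project_to_vocab(h_final, W_out):
    """Compute logits = h_final · W_out for h_final (N × d) and W_out (d × V)."""
    cols = list(zip(*W_out))  # V columns, each of length d
    return [[sum(h * w for h, w in zip(row, col)) for col in cols]
            for row in h_final]

rng = random.Random(0)
N, d, V = 3, 4, 10  # toy sizes, assumed for illustration only
h_final = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]
W_out = [[rng.gauss(0, 1) for _ in range(V)] for _ in range(d)]

logits = project_to_vocab(h_final, W_out)
print(len(logits), len(logits[0]))  # 3 10

# Parameter count at the upper end of the ranges quoted above.
params = 4096 * 130_000
print(params)  # 532480000
```

The output has one row of V logits per token position, which is why the cost scales with both N and V.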
Logits to probabilities
for each token position i:
probs[i] = softmax(logits[i]) ∈ ℝ^V
loss[i] = -log(probs[i, target_i])

During training, the loss is computed at every position. During inference, only the logits at the LAST position are needed, since the earlier tokens are already known. This is a big win: the V·d projection runs on 1 token instead of N.
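The softmax and loss above can be sketched in a few lines of Python. This version subtracts the row maximum before exponentiating (the usual log-sum-exp trick) so large logits do not overflow; the function names and toy values are assumptions for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one row of V logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def cross_entropy(logits_rows, targets):
    """Per-position loss[i] = -log(probs[i, target_i])."""
    return [-log_softmax(row)[t] for row, t in zip(logits_rows, targets)]

# Toy logits: N=2 positions, V=3 vocabulary entries (assumed values).
logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.0]]
targets = [0, 1]

# Training: one loss value per position.
losses = cross_entropy(logits, targets)

# Inference: only the last row is needed to choose the next token.
next_token_probs = [math.exp(lp) for lp in log_softmax(logits[-1])]
```

Working in log space also avoids taking the log of a probability that has underflowed to zero.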
Tied embeddings — share E and W_out
# Standard untied:
W_out (d × V) # separate from input embedding E (V × d)
# Tied:
W_out = Eᵀ # same matrix, transposed

Tying eliminates the separate W_out parameters entirely, saving d·V parameters. At Phi-3 dimensions (d=3072, V=32K), tying would save ~98M parameters (~3% of the model). Tying is common in SLMs, where it helps fit tight memory budgets.
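One way to picture tying: the same table E serves as a row lookup on the input side and as a set of dot-product targets on the output side. The sketch below assumes toy values and hypothetical helper names (`embed`, `tied_logits`); it is not taken from any particular model's code.

```python
def embed(E, token_id):
    """Input side: look up row token_id of E (V × d)."""
    return E[token_id]

def tied_logits(E, h):
    """Output side with W_out = Eᵀ: logit[v] = h · E[v], so no second matrix is stored."""
    return [sum(hi * ei for hi, ei in zip(h, row)) for row in E]

E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy table: V=3, d=2
h = [0.5, 2.0]                            # toy final hidden state

print(embed(E, 2))        # [1.0, 1.0]
print(tied_logits(E, h))  # [0.5, 2.0, 2.5]

# Parameters saved at the dimensions quoted above (V = 32064 exactly).
d, V = 3072, 32064
saved = d * V
print(saved)  # 98500608
```

Because E is stored as V × d, each logit reads one contiguous row, so the transpose never needs to be materialized.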
Sampling vs argmax
Greedy: token = argmax(logits). Deterministic and fast, but sometimes repetitive.
Sampling: token is drawn from softmax(logits / T), usually with top-k or top-p truncation. Adds randomness and tends to give better creative output.
The choice is per-deployment; arguments over output quality are often really about sampling defaults.
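A minimal sketch of both decoding strategies, using only the Python standard library. The function names, parameter names, and the order of operations (temperature, then top-k, then top-p) are assumptions for illustration; real inference libraries differ in defaults and details.

```python
import math
import random

def greedy(logits):
    """Return the index of the largest logit."""
    return max(range(len(logits)), key=logits.__getitem__)

def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample from softmax(logits / T), optionally truncated by top-k and/or top-p."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Candidate token ids, most probable first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept

    # random.choices renormalizes the surviving weights internally.
    return rng.choices(order, weights=[probs[i] for i in order], k=1)[0]

logits = [0.1, 3.0, -1.0, 2.5]
rng = random.Random(0)  # seeded so runs are repeatable
print(greedy(logits))
print(sample_token(logits, temperature=0.8, top_k=3, top_p=0.95, rng=rng))
```

Setting top_k=1 makes sampling collapse to greedy decoding, which is a handy sanity check.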
CPU-specific cost
Each generation step costs one V·d matrix-vector product. For Phi-3: 3072 × 32064 ≈ 98.5M multiplies per generated token, comparable to the FFN cost of a single transformer block. On CPU, this projection dominates if left unquantized; INT4 quantization brings its cost in line with the attention layers.
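The arithmetic behind these figures can be sketched as follows. The helper `lm_head_cost` is hypothetical, and the byte counts assume every weight is read once per step and ignore quantization scales and zero points, so treat them as rough lower bounds rather than measurements.

```python
def lm_head_cost(d, V, bits_per_weight):
    """Multiply-accumulates and weight bytes touched for one decoded token."""
    macs = d * V
    weight_bytes = macs * bits_per_weight // 8
    return macs, weight_bytes

d, V = 3072, 32064  # Phi-3 figures quoted above
for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    macs, nbytes = lm_head_cost(d, V, bits)
    print(f"{name}: {macs / 1e6:.1f}M MACs, {nbytes / 1e6:.1f} MB of weights per step")
```

Under these assumptions the projection touches about 197 MB of weights per step at fp16 and about 49 MB at INT4, a 4× reduction in memory traffic for the same number of multiplies.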