Each transformer block has an attention sub-block and an MLP (feed-forward) sub-block. The MLP is two linear layers with a nonlinearity between them. Despite its simplicity, it holds the majority of the model's parameters and accounts for much of the compute.

Standard MLP

FFN(x) = W_2 · activation(W_1 · x + b_1) + b_2

x   ∈ ℝ^d (one token's vector)
W_1 ∈ ℝ^(d_ff × d)
W_2 ∈ ℝ^(d × d_ff)
d_ff is usually 4 * d

Project to a hidden dimension d_ff (typically 4× d_model, where d_model is the d above), apply the nonlinearity, project back: two matmuls plus an activation. The size of the hidden dimension sets how much capacity the per-token feature mixing has.
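The project-up, activate, project-down pattern can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the sizes (d = 8, d_ff = 32), the random weights, and the function name `ffn` are all assumptions made for the example. Weights are stored as W_1 with shape (d_ff, d) and W_2 with shape (d, d_ff) so that W_1 · x is well-defined for a single token vector x; tokens are stacked as rows, so the code multiplies by the transposes.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def ffn(x, W1, b1, W2, b2, activation=relu):
    """Two-layer FFN. x holds one token vector per row: shape (n_tokens, d)."""
    hidden = activation(x @ W1.T + b1)  # project up:   (n_tokens, d_ff)
    return hidden @ W2.T + b2           # project back: (n_tokens, d)

rng = np.random.default_rng(0)
d, d_ff = 8, 32                         # illustrative sizes, d_ff = 4 * d
W1 = rng.normal(scale=0.1, size=(d_ff, d))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.1, size=(d, d_ff))
b2 = np.zeros(d)

x = rng.normal(size=(5, d))             # 5 hypothetical token vectors
y = ffn(x, W1, b1, W2, b2)
print(y.shape)                          # (5, 8)
```

The output has the same width as the input, which is what lets the block be added back onto the residual stream.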

Activation choices

ReLU(x)   = max(0, x)
GELU(x)   ≈ 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715*x³)))
SwiGLU(x, gate) = Swish(gate) * x   where Swish(x) = x * sigmoid(x)

ReLU: the original choice. GELU: smoother, used in BERT, GPT-2, and GPT-3. SwiGLU: a gated activation and the common choice in recent models; used in Llama, Phi, Mistral. Empirically, SwiGLU gives ~1% better perplexity at the same compute.
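To make the three formulas concrete, here is a small NumPy sketch that evaluates each on a few sample inputs. The function names are chosen for this example; `gelu_tanh` is the tanh approximation written above, not the exact erf-based GELU.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gelu_tanh(x):
    # tanh approximation of GELU, matching the formula in the text
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def swish(x):
    # x * sigmoid(x), written as x / (1 + exp(-x))
    return x / (1.0 + np.exp(-x))

def swiglu(x, gate):
    # the gate branch is squashed by Swish and multiplied elementwise into x
    return swish(gate) * x

xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
for name, fn in [("relu", relu), ("gelu", gelu_tanh), ("swish", swish)]:
    print(name, np.round(fn(xs), 4))
print("swiglu", np.round(swiglu(xs, gate=xs), 4))
```

Unlike ReLU, both GELU and Swish let small negative inputs through with a small negative output, which is the "smoother" behavior referred to above.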

SwiGLU full formula

FFN_SwiGLU(x) = W_2 · (Swish(W_gate · x) * (W_up · x))

Three linear layers instead of two (* is the elementwise product):
  W_gate ∈ ℝ^(d_ff × d)
  W_up   ∈ ℝ^(d_ff × d)
  W_2    ∈ ℝ^(d × d_ff)

SwiGLU has 3 projections (gate, up, down). To match the parameter budget of a standard FFN, d_ff is reduced to ~2.67× d_model (that is, 8/3, so that 3 * d * d_ff = 8 * d²). The cost is a small efficiency loss for a modest quality gain.
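A minimal NumPy sketch of the gated variant follows. The sizes are assumptions picked so that d_ff = 8d/3 is an integer (d = 12 gives d_ff = 32); the weights are random and biases are omitted, as in the formula above. Weights are stored with shape (d_ff, d) for the gate and up projections and (d, d_ff) for the down projection, and tokens are stacked as rows.

```python
import numpy as np

def swish(z):
    return z / (1.0 + np.exp(-z))       # z * sigmoid(z)

def ffn_swiglu(x, W_gate, W_up, W_2):
    """SwiGLU FFN. x holds one token vector per row: shape (n_tokens, d)."""
    gate = swish(x @ W_gate.T)          # (n_tokens, d_ff)
    up = x @ W_up.T                     # (n_tokens, d_ff)
    return (gate * up) @ W_2.T          # elementwise gate, then project down

rng = np.random.default_rng(0)
d = 12
d_ff = (8 * d) // 3                     # 32, i.e. about 2.67 * d
W_gate = rng.normal(scale=0.1, size=(d_ff, d))
W_up = rng.normal(scale=0.1, size=(d_ff, d))
W_2 = rng.normal(scale=0.1, size=(d, d_ff))

x = rng.normal(size=(5, d))
y = ffn_swiglu(x, W_gate, W_up, W_2)
n_params = W_gate.size + W_up.size + W_2.size
print(y.shape, n_params, 8 * d * d)     # (5, 12) 1152 1152
```

With d_ff set to 8d/3, the three matrices hold exactly as many weights as the two matrices of a standard FFN with d_ff = 4d.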

Parameter count

Standard FFN: 2 * d * d_ff = 8 * d²    (with d_ff = 4d)
SwiGLU:       3 * d * d_ff ≈ 8 * d²    (with d_ff ≈ 2.67d)

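For d=2048: ~33.6M params per FFN block (8 × 2048²). With L=24 layers: ~805M just for the FFNs. Attention's four d × d projections add about 4 * d² per block, so the FFN holds roughly two-thirds of each block's weights and dominates total parameters in most transformers. At inference the FFN is also often the slowest part of the block, because small-batch decoding is limited by the memory bandwidth needed to read its weights.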
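The arithmetic for d = 2048 and 24 layers can be checked directly. The helper `ffn_params` is a name made up for this sketch, biases are ignored, and the attention figure assumes four d × d projections (query, key, value, output) per block.

```python
def ffn_params(d, d_ff, n_proj=2):
    """Weight count of an FFN with n_proj matrices of size d x d_ff (biases ignored)."""
    return n_proj * d * d_ff

d, n_layers = 2048, 24

standard = ffn_params(d, 4 * d)                    # 8 * d^2
swiglu = ffn_params(d, round(8 * d / 3), n_proj=3) # d_ff rounded to an integer
total_ffn = n_layers * standard

attn = 4 * d * d  # assumption: Q, K, V and output projections, each d x d
ffn_share = standard / (standard + attn)

print(f"standard FFN per block: {standard:,}")     # 33,554,432
print(f"SwiGLU FFN per block:   {swiglu:,}")
print(f"FFN total over layers:  {total_ffn:,}")    # 805,306,368
print(f"FFN share of block:     {ffn_share:.3f}")  # 0.667
```

The share ignores embeddings and layer norms, so the fraction of the whole model is somewhat lower, especially for small models with large vocabularies.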

Per-token, per-position

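The FFN operates token-by-token: each token's vector goes through the same FFN independently, with no cross-token mixing. This makes it easy to parallelize across the sequence dimension. In practice the token vectors are stacked into a matrix, so the FFN over the whole sequence runs as one batched matmul rather than one matrix-vector product per token.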
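The independence claim is easy to check numerically. The sketch below uses a small bias-free ReLU FFN with random weights (all sizes and names are assumptions for the example) and compares one batched call against a per-token loop, then perturbs a single token to show that only its own output changes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff, n_tokens = 8, 32, 6            # illustrative sizes
W1 = rng.normal(scale=0.1, size=(d_ff, d))
W2 = rng.normal(scale=0.1, size=(d, d_ff))

def ffn(x):
    # works for a single vector (d,) or a stack of token rows (n_tokens, d)
    return np.maximum(0.0, x @ W1.T) @ W2.T

seq = rng.normal(size=(n_tokens, d))

batched = ffn(seq)                                   # one matmul over all tokens
one_by_one = np.stack([ffn(tok) for tok in seq])     # same FFN, token by token
same = np.allclose(batched, one_by_one)

# Perturb token 3 only: every other position's output stays the same.
perturbed = seq.copy()
perturbed[3] += 1.0
out = ffn(perturbed)
others = [i for i in range(n_tokens) if i != 3]
untouched = np.allclose(out[others], batched[others])
changed = not np.allclose(out[3], batched[3])

print(same, untouched, changed)
```

Attention would fail the second check, since changing one token there can alter the outputs at other positions.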

FFN: 2 (or 3 for SwiGLU) linear layers plus an activation, holding ~2/3 of the non-embedding params. SwiGLU is the modern default. Easily parallelized per token.