The Complete Transformer Block — Putting It Together

Every block of a modern transformer follows the same pattern: two sub-blocks (attention and FFN), each wrapped in a pre-norm and a residual connection. Knowing the full pseudocode helps you read most open-source LLM implementations.

Full pseudocode

def block(x):
    # Sub-block 1: attention with pre-norm + residual
    h = RMSNorm_1(x)              # norm_1: its own learned scale vector
    h = MultiHeadAttention(h)     # Q, K, V projections + attention + W_O
    x = x + h                     # residual add

    # Sub-block 2: FFN with pre-norm + residual
    h = RMSNorm_2(x)              # norm_2: separate weights from norm_1
    h = FFN_SwiGLU(h)             # 3 projections + Swish + element-wise product
    x = x + h                     # residual add

    return x

The model stacks L copies of this block sequentially. After the last block, one final RMSNorm is applied before the output projection W_out. The order (norm → sublayer → residual) is called pre-norm; this is the modern default.
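To make the pseudocode above concrete, here is a minimal runnable sketch in pure Python (lists of rows, no external libraries). It is an illustration under assumptions, not a reference implementation: the causal mask, the parameter names (borrowed from the storage section), the 0.02 initialization scale, and the helper names `matmul`, `init_params`, and `add` are all choices made for this example. It is only practical for toy sizes.

```python
import math
import random

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def rms_norm(x, weight, eps=1e-6):
    """Divide each row by its root-mean-square, then apply a learned gain."""
    out = []
    for row in x:
        rms = math.sqrt(sum(v * v for v in row) / len(row) + eps)
        out.append([w * v / rms for v, w in zip(row, weight)])
    return out

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, p, n_heads):
    """Causal multi-head self-attention: Q, K, V projections, softmax, W_O."""
    seq, d = len(x), len(x[0])
    hd = d // n_heads  # per-head dimension
    q, k, v = (matmul(x, p[name]) for name in ("w_q", "w_k", "w_v"))
    mixed = [[0.0] * d for _ in range(seq)]
    for head in range(n_heads):
        lo, hi = head * hd, (head + 1) * hd
        for t in range(seq):
            # Causal mask (an assumption here): position t sees positions 0..t.
            scores = [
                sum(a * b for a, b in zip(q[t][lo:hi], k[s][lo:hi])) / math.sqrt(hd)
                for s in range(t + 1)
            ]
            for s, prob in enumerate(softmax(scores)):
                for j in range(lo, hi):
                    mixed[t][j] += prob * v[s][j]
    return matmul(mixed, p["w_o"])

def ffn_swiglu(x, p):
    """SwiGLU feed-forward: down( swish(gate(x)) * up(x) )."""
    gate, up = matmul(x, p["w_gate"]), matmul(x, p["w_up"])
    # swish(g) = g * sigmoid(g) = g / (1 + exp(-g))
    hidden = [[g / (1.0 + math.exp(-g)) * u for g, u in zip(gr, ur)]
              for gr, ur in zip(gate, up)]
    return matmul(hidden, p["w_down"])

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def block(x, p, n_heads):
    x = add(x, attention(rms_norm(x, p["norm_1"]), p, n_heads))  # sub-block 1
    x = add(x, ffn_swiglu(rms_norm(x, p["norm_2"]), p))          # sub-block 2
    return x

def init_params(d, d_ff, seed=0):
    """Hypothetical helper: small random weights, norm gains set to 1."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]
    p = {name: mat(d, d) for name in ("w_q", "w_k", "w_v", "w_o")}
    p.update(w_gate=mat(d, d_ff), w_up=mat(d, d_ff), w_down=mat(d_ff, d))
    p.update(norm_1=[1.0] * d, norm_2=[1.0] * d)
    return p

params = init_params(d=8, d_ff=16)
x = [[0.1 * (i + j) for j in range(8)] for i in range(3)]  # 3 tokens, d = 8
y = block(x, params, n_heads=2)                            # same shape as x
```

Because each sub-block only adds to `x`, the output keeps the input shape, which is what lets L copies be stacked without any glue code.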

Parameter accounting per block

# For d = 2048, head dim = 64, 32 heads, d_ff ≈ 5.4k (SwiGLU), no biases:
attn:  4 * d²        = ~17M   (W_Q, W_K, W_V, W_O)
ffn:   3 * d * d_ff  = ~33M   (gate, up, down)
norms: 2 * d         = ~4K    (negligible)

Total per block ≈ 50M params

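The FFN holds ~2× as many parameters as attention. With 24 layers, the blocks alone account for ~1.2B params. Add the embedding table (~70M) and the final projection W_out (the same size again unless it is tied to the embedding), and the total comes to ~1.3B for a model in this size class.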
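The arithmetic above can be checked with a few lines of Python. This is a sketch under assumptions: `d_ff = 5461` (about 8/3 · d) and `vocab_size = 32_000` are illustrative values picked to land near the ~5.4k and ~70M figures quoted here, and the function names are hypothetical.

```python
def block_params(d, d_ff):
    """Parameter count of one pre-norm block (no biases, full multi-head attention)."""
    attn = 4 * d * d      # W_Q, W_K, W_V, W_O
    ffn = 3 * d * d_ff    # gate, up, down projections of SwiGLU
    norms = 2 * d         # one RMSNorm gain vector per sub-block
    return {"attn": attn, "ffn": ffn, "norms": norms, "total": attn + ffn + norms}

def model_params(d, d_ff, n_layers, vocab_size, tied_embeddings=False):
    """Blocks + embedding table + output projection + final RMSNorm."""
    per_block = block_params(d, d_ff)["total"]
    embedding = vocab_size * d
    w_out = 0 if tied_embeddings else vocab_size * d
    final_norm = d
    return n_layers * per_block + embedding + w_out + final_norm

# Assumed values: d_ff = 5461 and vocab_size = 32_000 are not from the text.
counts = block_params(d=2048, d_ff=5461)
total = model_params(d=2048, d_ff=5461, n_layers=24, vocab_size=32_000)
print(counts)                    # attn ~16.8M, ffn ~33.6M, total ~50.3M
print(f"{total / 1e9:.2f}B")     # prints 1.34B
```

With tied embeddings (`tied_embeddings=True`) the total drops by one vocabulary-sized matrix, which matters more for small models than for large ones.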

Memory bandwidth at inference

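Each block reads ~50M params from RAM at every token generation step. At 1 byte per parameter (INT8) that is 50 MB per block, so a budget of 1 ms per block requires 50 GB/s of weight reads; across 24 blocks this works out to roughly 24 ms per token. A typical DDR5 system delivers ~70 GB/s, so CPU decoding is bandwidth-bound rather than compute-bound, and FP16 weights (100 MB per block) would already exceed that budget. Quantization is therefore the main lever for fast CPU inference: INT4 weights are 4× smaller than FP16, which cuts the bytes read per token by the same factor.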
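A quick back-of-the-envelope calculator makes the bandwidth argument easy to replay for other precisions. This is a rough sketch: it assumes every weight is read exactly once per token, ignores caches, activations, and the KV cache, and uses hypothetical function names.

```python
def required_bandwidth_gb_s(n_params, bytes_per_param, seconds):
    """Bandwidth needed to stream all weights once within the time budget."""
    return n_params * bytes_per_param / seconds / 1e9

def min_time_ms(n_params, bytes_per_param, bandwidth_gb_s):
    """Lower bound on time to stream all weights once at a given bandwidth."""
    return n_params * bytes_per_param / (bandwidth_gb_s * 1e9) * 1e3

BLOCK_PARAMS = 50e6           # ~50M params per block, from the accounting above
MODEL_PARAMS = 24 * 50e6      # 24 blocks, ignoring embeddings

for name, width in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    per_block = required_bandwidth_gb_s(BLOCK_PARAMS, width, seconds=1e-3)
    per_token = min_time_ms(MODEL_PARAMS, width, bandwidth_gb_s=70)
    print(f"{name}: {per_block:.0f} GB/s for 1 ms per block, "
          f"at least {per_token:.1f} ms per token at 70 GB/s")
```

The second number is a floor, not a prediction: real throughput also depends on how close the kernel gets to peak memory bandwidth.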

Compute at training

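Forward: ~2 · params · seq_len FLOPs per sample (≈ 100B FLOPs for our 50M-param block × 1024 tokens); this counts the weight matmuls only and leaves out the attention-score computation. Backward: ~2× forward. Optimizer step: ~12 · params FLOPs for AdamW, which keeps two moment estimates per parameter; at ~0.6B FLOPs this is negligible next to the forward and backward passes. Total per training step: ~3× forward FLOPs, or ~4× if activations are recomputed during the backward pass.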
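The per-term estimates can be summed directly. The sketch below uses the rules of thumb quoted in this section (forward ≈ 2 · params · tokens, backward ≈ 2× forward, optimizer ≈ 12 · params); the `recompute` flag, which models activation checkpointing as one extra forward pass, and the function name are assumptions added for illustration. Forward plus backward alone come to 3× the forward cost.

```python
def training_flops(n_params, n_tokens, optimizer_flops_per_param=12, recompute=False):
    """Rough FLOP count for one training step on one sample (matmul terms only)."""
    forward = 2 * n_params * n_tokens
    backward = 2 * forward
    extra_forward = forward if recompute else 0   # activation checkpointing
    optimizer = optimizer_flops_per_param * n_params
    total = forward + backward + extra_forward + optimizer
    return {
        "forward": forward,
        "backward": backward,
        "optimizer": optimizer,
        "total": total,
        "total_over_forward": total / forward,
    }

# One 50M-param block, one 1024-token sample.
est = training_flops(n_params=50_000_000, n_tokens=1024)
print(est["forward"])                       # 102400000000 (~100B)
print(est["optimizer"])                     # 600000000 (~0.6B)
print(round(est["total_over_forward"], 2))  # 3.01
```

The optimizer term does not scale with sequence length or batch size, which is why it vanishes in the ratio for any realistic batch.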

Storage on disk

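Weights are stored as contiguous tensors per layer: w_q, w_k, w_v, w_o, w_gate, w_up, w_down, norm_1, norm_2. The GGUF and SafeTensors formats record each tensor's name, shape, dtype, and byte offset in a header, followed by the raw tensor data. Loaders typically mmap the file, so weights are paged in from disk as they are first accessed. For a small language model (SLM) on CPU, cold start is dominated by disk read speed.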
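To show why a header-plus-raw-bytes layout pairs well with mmap, here is a toy container modeled loosely on the SafeTensors layout (an 8-byte little-endian header length, a JSON header with per-tensor `dtype`, `shape`, and `data_offsets`, then the raw bytes). It is a simplified sketch, not the safetensors library: it handles float32 only, skips alignment and metadata, and `write_tensors` and `MappedTensors` are hypothetical names.

```python
import json
import mmap
import struct

def write_tensors(path, tensors):
    """Write {name: (shape, flat list of floats)} as header + raw float32 data."""
    header, blobs, offset = {}, [], 0
    for name, (shape, values) in tensors.items():
        blob = struct.pack(f"<{len(values)}f", *values)
        header[name] = {
            "dtype": "F32",
            "shape": list(shape),
            "data_offsets": [offset, offset + len(blob)],  # relative to data start
        }
        blobs.append(blob)
        offset += len(blob)
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte header length
        f.write(header_bytes)
        for blob in blobs:
            f.write(blob)

class MappedTensors:
    """Memory-map the file; only the pages of tensors you touch get read."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        (header_len,) = struct.unpack("<Q", self._map[:8])
        self.header = json.loads(self._map[8:8 + header_len].decode("utf-8"))
        self._data_start = 8 + header_len

    def get(self, name):
        start, end = self.header[name]["data_offsets"]
        raw = self._map[self._data_start + start:self._data_start + end]
        return list(struct.unpack(f"<{len(raw) // 4}f", raw))

    def close(self):
        self._map.close()
        self._file.close()
```

Opening the file parses only the small header; a call such as `MappedTensors(path).get("w_q")` touches just that tensor's byte range, which is why the first pass over a freshly loaded model runs at disk speed.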

Block = pre-norm + attn + residual + pre-norm + FFN + residual. ~50M params per block at d=2048. CPU inference is memory-bandwidth-bound.