Every block of a modern transformer follows the same pattern: two sub-blocks (attention and FFN), each wrapped in pre-norm and a residual connection. Knowing the full pseudocode helps you read nearly any open-source LLM implementation.
Full pseudocode
def block(x):
    # Sub-block 1: attention with pre-norm + residual
    h = RMSNorm(x)
    h = MultiHeadAttention(h)  # Q, K, V projections + attention + W_O
    x = x + h
    # Sub-block 2: FFN with pre-norm + residual
    h = RMSNorm(x)
    h = FFN_SwiGLU(h)          # 3 projections + Swish + element-wise product
    x = x + h
    return x

L copies of this block are stacked sequentially. After the last block: one final RMSNorm before W_out. The order (norm → sublayer → residual) is pre-norm; this is the modern default.
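To make the pseudocode concrete, here is a minimal NumPy sketch of one pre-norm block. It is an illustration under stated assumptions, not a reference implementation: the parameter dictionary `p`, the helper `init_params`, the tiny sizes (d = 64, d_ff = 172, 4 heads), and the causal mask are choices made for this example, and positional encoding, KV caching, and batching are left out.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Scale each token vector to unit root-mean-square, then apply a learned gain.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, p, n_heads):
    seq, d = x.shape
    hd = d // n_heads
    # Project, then split the model dimension into heads: (n_heads, seq, hd).
    q = (x @ p["w_q"]).reshape(seq, n_heads, hd).transpose(1, 0, 2)
    k = (x @ p["w_k"]).reshape(seq, n_heads, hd).transpose(1, 0, 2)
    v = (x @ p["w_v"]).reshape(seq, n_heads, hd).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(hd)  # (n_heads, seq, seq)
    # Causal mask: position i may only attend to positions <= i.
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    out = softmax(scores) @ v                         # (n_heads, seq, hd)
    out = out.transpose(1, 0, 2).reshape(seq, d)      # merge heads back to d
    return out @ p["w_o"]

def ffn_swiglu(x, p):
    gate = x @ p["w_gate"]
    swish = gate / (1.0 + np.exp(-gate))              # Swish/SiLU: z * sigmoid(z)
    return (swish * (x @ p["w_up"])) @ p["w_down"]    # gate the "up" path, project down

def block(x, p, n_heads):
    # Sub-block 1: pre-norm attention + residual
    x = x + multi_head_attention(rms_norm(x, p["norm_1"]), p, n_heads)
    # Sub-block 2: pre-norm SwiGLU FFN + residual
    x = x + ffn_swiglu(rms_norm(x, p["norm_2"]), p)
    return x

def init_params(d, d_ff, rng):
    # Hypothetical helper: small random weights, unit norm gains.
    def w(rows, cols):
        return rng.standard_normal((rows, cols)) * 0.02
    return {
        "w_q": w(d, d), "w_k": w(d, d), "w_v": w(d, d), "w_o": w(d, d),
        "w_gate": w(d, d_ff), "w_up": w(d, d_ff), "w_down": w(d_ff, d),
        "norm_1": np.ones(d), "norm_2": np.ones(d),
    }

rng = np.random.default_rng(0)
params = init_params(d=64, d_ff=172, rng=rng)
x = rng.standard_normal((10, 64))      # 10 tokens, model width 64
y = block(x, params, n_heads=4)
print(y.shape)  # (10, 64)
```

Note that each sub-block has its own norm gain (`norm_1`, `norm_2`), and that the residual adds the sub-block output to the un-normalized input, which is what distinguishes pre-norm from post-norm.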
Parameter accounting per block
# For d = 2048, head dim = 64, h = 32 heads, d_ff ≈ 5.4k (SwiGLU):
attn:   4 * d²        = ~17M
ffn:    3 * d * d_ff  = ~33M
norms:  2 * d         = ~4K (negligible)
Total per block ≈ 50M params

FFN is ~2× attention in parameter count. With 24 layers: 1.2B params just in blocks. Plus embedding (~70M) plus final projection. Total ~1.3B for a model in this size class.
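The accounting above can be checked with a few lines of arithmetic. This is a sketch that assumes bias-free projections and takes d_ff = 5400 as a stand-in for "≈ 5.4k"; the function name `block_params` is made up for the example.

```python
def block_params(d, d_ff):
    attn = 4 * d * d      # W_Q, W_K, W_V, W_O, each d x d (no biases assumed)
    ffn = 3 * d * d_ff    # w_gate and w_up (d x d_ff) plus w_down (d_ff x d)
    norms = 2 * d         # one RMSNorm gain vector per sub-block
    return {"attn": attn, "ffn": ffn, "norms": norms,
            "total": attn + ffn + norms}

d, d_ff, n_layers = 2048, 5400, 24   # d_ff = 5400 is an assumed exact value
per_block = block_params(d, d_ff)
for name, count in per_block.items():
    print(f"{name:>6}: {count / 1e6:8.2f}M")
print(f"blocks: {n_layers * per_block['total'] / 1e9:.2f}B")
```

With these inputs the attention term comes to about 16.8M, the FFN term to about 33.2M, and 24 blocks to about 1.2B parameters.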
Memory bandwidth at inference
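Each block reads its ~50M parameters from RAM at every token generation step. With 8-bit weights that is ~50 MB per block, so giving each block a 1 ms budget requires 50 GB/s of weight reads; FP16 weights double that to 100 GB/s. Dual-channel DDR5 delivers roughly 70 GB/s, so decoding is bandwidth-bound rather than compute-bound. Quantization (INT4 weights are 4× smaller than FP16) is the most direct path to fast CPU inference.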
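A back-of-the-envelope calculation shows how weight precision moves these numbers. The sketch below assumes batch size 1, every weight read exactly once per token, and no contribution from activations or the KV cache; the function names are illustrative, and 70 GB/s is the rough DDR5 figure used in the text.

```python
def weight_bytes(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8

def required_bandwidth_gbs(n_params, bits_per_weight, seconds):
    # GB/s needed to stream all the weights once within the time budget.
    return weight_bytes(n_params, bits_per_weight) / seconds / 1e9

def max_tokens_per_second(n_params, bits_per_weight, bandwidth_gbs):
    # Upper bound: each generated token reads every weight once.
    return bandwidth_gbs * 1e9 / weight_bytes(n_params, bits_per_weight)

BLOCK_PARAMS = 50e6               # one block, from the accounting above
MODEL_PARAMS = 24 * BLOCK_PARAMS  # blocks only; embeddings ignored
DDR5_GBS = 70                     # assumed sustained memory bandwidth

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    need = required_bandwidth_gbs(BLOCK_PARAMS, bits, 1e-3)
    ceiling = max_tokens_per_second(MODEL_PARAMS, bits, DDR5_GBS)
    print(f"{name}: {need:6.1f} GB/s per block at 1 ms, "
          f"at most {ceiling:6.1f} tokens/s for 24 blocks")
```

Under these assumptions the 50 GB/s figure corresponds to 8-bit weights, and halving the bits per weight doubles the tokens-per-second ceiling for the same memory system.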
Compute at training
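Forward: ~2 · params · seq_len FLOPs per sample (≈ 100B FLOPs for our 50M-param block × 1024 tokens). Backward: ~2× forward. Optimizer step: ~12 · params FLOPs (for AdamW, which keeps two moment estimates per parameter); this does not scale with seq_len, so it is negligible here. Total per training step: ~3× forward FLOPs, i.e. ~6 · params · seq_len.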
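Adding up the three terms for the 50M-parameter block makes the relative sizes explicit. This is a rough sketch: the 2 · params · seq_len rule counts only the weight matmuls (it ignores the attention-score products), the 12 FLOPs per parameter for the optimizer is taken from the text, and `training_flops` is a name invented for the example.

```python
def training_flops(n_params, seq_len, optimizer_flops_per_param=12):
    forward = 2 * n_params * seq_len               # one multiply-add per weight per token
    backward = 2 * forward                         # gradients w.r.t. inputs and weights
    optimizer = optimizer_flops_per_param * n_params  # once per step, not per token
    total = forward + backward + optimizer
    return {"forward": forward, "backward": backward,
            "optimizer": optimizer, "total": total,
            "total_over_forward": total / forward}

f = training_flops(n_params=50_000_000, seq_len=1024)
for name in ("forward", "backward", "optimizer", "total"):
    print(f"{name:>9}: {f[name] / 1e9:8.1f} GFLOPs")
print(f"total / forward = {f['total_over_forward']:.3f}")
```

The optimizer term is about 0.6 GFLOPs against roughly 102 GFLOPs for the forward pass, so forward plus backward accounts for essentially all of the step.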
Storage on disk
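Weights are stored as contiguous tensors per layer: w_q, w_k, w_v, w_o, w_gate, w_up, w_down, norm_1, norm_2. The GGUF and SafeTensors formats serialize these alongside a header that records each tensor's name, shape, dtype, and byte offset. Loaders typically mmap the file, so weights are paged in from disk as they are first accessed. For a small language model (SLM) on CPU, cold start is dominated by disk read speed.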
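The header-plus-mmap idea can be sketched with the standard library alone. The layout below is loosely modeled on the SafeTensors scheme (an 8-byte little-endian header length, a JSON header with per-tensor offsets, then raw data), but it is a simplified toy: `write_tensors` and `MappedWeights` are hypothetical names, only flat float32 tensors are handled, and this is not the API of any real loader.

```python
import json
import mmap
import os
import struct
import tempfile

def write_tensors(path, tensors):
    """Write {name: list of floats} as: u64 header length, JSON header, float32 data."""
    header, blobs, offset = {}, [], 0
    for name, values in tensors.items():
        blob = struct.pack(f"<{len(values)}f", *values)
        header[name] = {"dtype": "F32", "shape": [len(values)],
                        "data_offsets": [offset, offset + len(blob)]}
        blobs.append(blob)
        offset += len(blob)
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))
        f.write(header_bytes)
        for blob in blobs:
            f.write(blob)

class MappedWeights:
    def __init__(self, path):
        self._file = open(path, "rb")
        # Map the whole file read-only; nothing is read from disk yet.
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        (header_len,) = struct.unpack("<Q", self._map[:8])
        self.header = json.loads(self._map[8:8 + header_len].decode("utf-8"))
        self._data_start = 8 + header_len

    def tensor(self, name):
        begin, end = self.header[name]["data_offsets"]  # relative to data section
        # Slicing touches only these bytes; the OS pages them in on demand.
        raw = self._map[self._data_start + begin:self._data_start + end]
        return list(struct.unpack(f"<{(end - begin) // 4}f", raw))

    def close(self):
        self._map.close()
        self._file.close()

path = os.path.join(tempfile.mkdtemp(), "block0.bin")
write_tensors(path, {"norm_1": [1.0, 1.0, 1.0, 1.0],
                     "w_q": [0.5, -0.25, 2.0, 4.0]})
weights = MappedWeights(path)
print(weights.tensor("w_q"))  # [0.5, -0.25, 2.0, 4.0]
weights.close()
```

Because only the header and the requested byte range are touched, opening the file is cheap; the cost of reading from disk is paid the first time each tensor is accessed, which is why cold-start time tracks disk read speed.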