Every transformer operation reduces to a handful of linear-algebra primitives. Token embeddings are vectors. Weight matrices are 2D arrays. Attention is a pair of matrix multiplications with a softmax in between. Once these operations feel mechanical, the whole architecture does too.
Vectors and shapes
A token embedding x ∈ ℝᵈ is a vector of d real numbers, typically with d ∈ {512, 768, 1024, 2048, 4096}. A sequence of N tokens is X ∈ ℝᴺˣᵈ, a 2D tensor where each row is one token's embedding. A batch of B sequences lives in ℝᴮˣᴺˣᵈ: three dimensions, conventionally written batch-first.
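As a rough illustration of these three shapes, the sketch below builds them as nested Python lists. The sizes B, N, d and the `shape` helper are made up for this example; real code would normally use a tensor library rather than lists.

```python
# Toy sizes, chosen only for illustration (real d is in the hundreds or thousands).
B, N, d = 2, 3, 4

def shape(t):
    """Return the shape of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return tuple(dims)

x = [0.0] * d                                               # one token embedding: d numbers
X = [[0.0] * d for _ in range(N)]                           # one sequence: N rows of d numbers
batch = [[[0.0] * d for _ in range(N)] for _ in range(B)]   # B sequences, batch-first

print(shape(x), shape(X), shape(batch))  # (4,) (3, 4) (2, 3, 4)
```

Indexing follows the same order as the shape: `batch[b][n]` is the embedding of token n in sequence b.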
Matrix multiplication — the workhorse
Y = X · W
shape: [N, d] · [d, m] = [N, m]
Y[i,j] = sum over k of X[i,k] * W[k,j]
Every linear layer is one matmul. d input features, m output features, N tokens processed in parallel. CPU implementations call BLAS (OpenBLAS, MKL); GPU calls cuBLAS. Identical math, different hardware.
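The index formula for Y[i,j] can be written out as a plain triple loop. This is a minimal sketch for reading, not for speed: a BLAS library computes the same result with blocked, vectorized kernels. The function name `matmul` and the example values are chosen here for illustration.

```python
def matmul(X, W):
    """Multiply an [N, d] matrix by a [d, m] matrix, giving [N, m]."""
    N, d, m = len(X), len(W), len(W[0])
    assert all(len(row) == d for row in X), "inner dimensions must match"
    Y = [[0.0] * m for _ in range(N)]
    for i in range(N):          # one output row per token
        for j in range(m):      # one output column per output feature
            # Y[i,j] = sum over k of X[i,k] * W[k,j]
            Y[i][j] = sum(X[i][k] * W[k][j] for k in range(d))
    return Y

X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]   # N=2 tokens, d=3 input features
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]        # d=3 input features, m=2 output features

Y = matmul(X, W)
print(Y)  # [[4.0, 5.0], [10.0, 11.0]]
```

Each row of Y depends only on the matching row of X, which is why the N tokens can be processed in parallel.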
Dot product = similarity
⟨a, b⟩ = sum over i of a[i] * b[i]
= |a| · |b| · cos(θ)
The dot product is cosine similarity scaled by the lengths of the two vectors. For vectors of fixed length, a larger value means more closely aligned directions. Attention's core operation is a dot product between a query vector and many key vectors. Higher similarity → more attention weight.
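To make the link between dot products and attention weights concrete, here is a small sketch that scores one query against three keys and turns the scores into weights with softmax. The vectors are invented for the example, and the 1/√d scaling used in standard scaled dot-product attention is left out to keep the focus on the dot product itself.

```python
import math

def dot(a, b):
    """Sum over i of a[i] * b[i]."""
    return sum(ai * bi for ai, bi in zip(a, b))

def cosine(a, b):
    """Dot product divided by both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

q = [1.0, 0.0]                                   # query vector (illustrative)
keys = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]     # aligned, orthogonal, opposite

scores = [dot(q, k) for k in keys]               # [2.0, 0.0, -1.0]
weights = softmax(scores)                        # largest weight goes to the aligned key

print(scores)
print([round(w, 3) for w in weights])
```

The aligned key gets the highest score and therefore the largest weight, the orthogonal key scores zero, and the opposite key scores negative and gets the smallest weight.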
Element-wise operations
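Addition, ReLU, and sigmoid are applied element-by-element across the tensor. Layer norm is nearly as simple: it normalizes each token's vector using the mean and variance of its own d features. All of these are cheap on CPU (cache-friendly) and involve no matrix multiplies. Residual connections (output = input + sublayer(input)) are element-wise additions over the full sequence.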
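These operations can be sketched on a single token vector as follows. The `relu` call stands in for a real sublayer (attention or an MLP) purely for illustration, and the layer norm shown omits the learned gain and bias that real implementations include.

```python
import math

def relu(v):
    """Element-wise max(0, x)."""
    return [max(0.0, x) for x in v]

def add(a, b):
    """Element-wise addition, as used by residual connections."""
    return [x + y for x, y in zip(a, b)]

def layer_norm(v, eps=1e-5):
    """Normalize one token's features to zero mean and roughly unit variance.

    Unlike ReLU, this needs the mean and variance over the whole vector first.
    Learned gain and bias are omitted in this sketch.
    """
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

x = [1.0, -2.0, 3.0, -4.0]       # one token's features (illustrative values)
sub = relu(x)                    # stand-in for sublayer(input)
out = add(x, sub)                # residual: output = input + sublayer(input)
normed = layer_norm(out)

print(sub)   # [1.0, 0.0, 3.0, 0.0]
print(out)   # [2.0, -2.0, 6.0, -4.0]
```

Every function here touches each element a constant number of times, so the cost grows linearly with the tensor size rather than with the product of dimensions as in a matmul.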
Why all of this matters for SLMs
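Small language models (1-7B params) can run on CPU because their weights fit in ordinary RAM and every operation is either a matmul or an element-wise op. A 3B model has ~3 billion floating-point weights. Quantized to INT4 (half a byte per weight), that is about 1.5 GB, not counting the small overhead of quantization scales. That fits in RAM; inference is then bottlenecked by matmul speed (BLAS performance + memory bandwidth).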
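The memory figure is simple arithmetic: parameter count times bits per weight, divided by 8. The sketch below repeats it for a few common precisions. It counts weights only and ignores quantization scales, activations, and the KV cache; `weight_bytes` is a helper name made up for this example.

```python
def weight_bytes(n_params, bits_per_weight):
    """Bytes needed to store the weights alone at a given precision."""
    return n_params * bits_per_weight / 8

n_params = 3_000_000_000  # a 3B-parameter model

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gb = weight_bytes(n_params, bits) / 1e9
    print(f"{name}: {gb:.1f} GB")
# FP32: 12.0 GB
# FP16: 6.0 GB
# INT8: 3.0 GB
# INT4: 1.5 GB
```

Going from FP16 to INT4 cuts the weight footprint by a factor of four, and it also reduces the bytes that must be read from memory for each matmul.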