Every transformer operation reduces to a handful of linear-algebra primitives. Token embeddings are vectors. Weight matrices are 2D arrays. Attention is a series of matrix multiplications followed by softmax. If these operations feel mechanical, the whole architecture becomes mechanical.

Advertisement

Vectors and shapes

A token embedding x ∈ ℝᵈ is a vector of d real numbers — typically d ∈ {512, 768, 1024, 2048, 4096}. A sequence of N tokens is X ∈ ℝᴺˣᵈ, a 2D tensor where each row is one token's embedding. Batch of B sequences: ℝᴮˣᴺˣᵈ. Three dimensions; always written batch-first.

Matrix multiplication — the workhorse

Y = X · W
shape:  [N, d] · [d, m]  =  [N, m]

Y[i,j] = sum over k of X[i,k] * W[k,j]

Every linear layer is one matmul. d input features, m output features, N tokens processed in parallel. CPU implementations call BLAS (OpenBLAS, MKL); GPU calls cuBLAS. Identical math, different hardware.

Advertisement

Dot product = similarity

⟨a, b⟩ = sum over i of a[i] * b[i]
       = |a| · |b| · cos(θ)

Dot product is the unnormalized cosine similarity. Bigger value = more similar direction. Attention's core operation is a dot product between a query vector and many key vectors. Higher similarity → more attention weight.

Element-wise operations

Addition, ReLU, sigmoid, layer norm: applied element-by-element across the tensor. Cheap on CPU (cache-friendly). No matrix multiplies. Residual connections (output = input + sublayer(input)) are element-wise additions over the full sequence.

Why all of this matters for SLMs

Small language models (1-7B params) run on CPU because the operations are all matmul + element-wise. A 3B model has ~3 billion floating-point weights. INT4 quantized = 1.5 GB. Fits in RAM; inference is bottlenecked by matmul speed (BLAS performance + memory bandwidth).

Vectors, matrices, dot products, element-wise ops. The whole transformer is built from these four primitives.