Transformer Math & CPU SLM

Linear algebra, attention math, training, CPU inference, weight storage — depth-first.

50 Articles
50 Topics covered
Articles in this category

All 50 articles, sorted alphabetically

ARTICLE · 01

The Autoregressive Generation Loop

How an LLM produces text, one token at a time.

ARTICLE · 02

Backpropagation Through Transformers

How gradients flow back through the network — chain rule applied.

ARTICLE · 03

Beam Search

Sometimes greedy isn't enough.

ARTICLE · 04

The Complete Transformer Block

All the pieces assembled with pseudocode.

ARTICLE · 05

CPU Cache Hierarchy and Transformer Inference

Why memory bandwidth is the bottleneck for SLM inference on CPU.

ARTICLE · 06

CPU Inference Pipelines

The software stacks that make SLM CPU inference fast.

ARTICLE · 07

CPU Matmul Kernels and BLAS

Why your transformer is 10× faster than a naive Python loop.

ARTICLE · 08

CPU Memory Budget for SLM Training

Counting every byte: weights, gradients, optimizer, activations.

ARTICLE · 09

Cross-Entropy Loss for Next-Token Prediction

What loss to minimize and why the gradient is so clean.

ARTICLE · 10

Dataloader and Tokenization Pipeline

Streaming data efficiently to the training loop.

ARTICLE · 11

The Geometry of Dot Products

How attention measures similarity in embedding space.

ARTICLE · 12

End-to-End CPU SLM Recipe

From scratch: train and serve a small model on a workstation.

ARTICLE · 13

Feed-Forward Network

Two-thirds of model params live here. Worth understanding.

ARTICLE · 14

Fine-Tuning Math

How to adapt without retraining the whole model.

ARTICLE · 15

FlashAttention

Same math, very different memory pattern.

ARTICLE · 16

Gradient Accumulation

Effective large batches on small memory.

ARTICLE · 17

Gradient Clipping and Training Stability

Stopping spikes from blowing up training.

ARTICLE · 18

KV Cache

Why and how to cache keys and values.

ARTICLE · 19

KV Cache Quantization Deep Dive

Compress the biggest memory hog at inference.

ARTICLE · 20

LayerNorm

How LayerNorm stabilizes activations.

ARTICLE · 21

Learning Rate Schedules

Why LR changes during training and the canonical curves.

ARTICLE · 22

Linear Algebra for Transformers

Vectors, matrices, and operations you actually need.

ARTICLE · 23

Long Context Strategies

How models handle contexts of 128K to 1M tokens.

ARTICLE · 24

Diagnosing Loss Curves

What healthy vs unhealthy training looks like.

ARTICLE · 25

Mixed Precision Training

Half-precision for speed; full precision for stability.

ARTICLE · 26

Mixture of Experts

Active params vs total params, and what MoE means for CPU inference.

ARTICLE · 27

Multi-Head Attention

Why split into heads and how the math reassembles.

ARTICLE · 28

Multi-Token Prediction (MTP)

Predict the next 2-4 tokens at once for faster decoding.

ARTICLE · 29

Output Projection and Logits

From final hidden state to next-token probabilities.

ARTICLE · 30

Perplexity and Evaluation Metrics

How to measure if your model is any good.

ARTICLE · 31

Positional Encoding

How transformers know which token is where.

ARTICLE · 32

Quantization Layouts in Memory

Block-wise scaling, packed INT4, and the layout-vs-quality trade.

ARTICLE · 33

Residual Connections and Gradient Flow

Why every transformer block adds the input back.

ARTICLE · 34

RLHF and DPO

From a pretrained model to a useful assistant.

ARTICLE · 35

RMSNorm

Why most recent LLMs replaced LayerNorm with RMSNorm.

ARTICLE · 36

Extending RoPE to Longer Contexts

YaRN, NTK-aware scaling, and LongRoPE.

ARTICLE · 37

Sampling Strategies

How to pick the next token from the model's distribution.

ARTICLE · 38

Scaled Dot-Product Attention

The math of the attention operation.

ARTICLE · 39

SGD vs Adam vs AdamW Optimizer Math

The update rules — and why Adam dominates LLM training.

ARTICLE · 40

SIMD Instructions for Transformer Math

AVX-512, AMX, and how they vectorize matmul.

ARTICLE · 41

SLM Architectures

Hyperparams of the leading sub-7B models.

ARTICLE · 42

Softmax Derivation

Why softmax, what it preserves, and how temperature changes it.

ARTICLE · 43

Speculative Decoding

Use a fast draft model; verify with the big model.

ARTICLE · 44

The Future of CPU SLM in 2026 and Beyond

Where this is headed.

ARTICLE · 45

Tied Embeddings and Parameter Sharing

Why SLMs share weights between input embedding and output projection.

ARTICLE · 46

Token Embedding Lookup

How discrete tokens become continuous vectors.

ARTICLE · 47

Tokenizer Math

How text becomes a token sequence.

ARTICLE · 48

Training Data for SLMs

What to train on when you can't use the whole internet.

ARTICLE · 49

Weight Initialization

Why all-zeros doesn't work and what to do instead.

ARTICLE · 50

How Weights Are Stored on Disk

From PyTorch tensor to GGUF/SafeTensors bytes.
