Transformer Math & CPU SLM

ARTICLE · 01

The Autoregressive Generation Loop

How an LLM produces text, one token at a time.

ARTICLE · 02

Backpropagation Through Transformers

How gradients flow back through the network — chain rule applied.

ARTICLE · 03

Beam Search

Sometimes greedy isn't enough.

ARTICLE · 04

The Complete Transformer Block

All the pieces assembled with pseudocode.

ARTICLE · 05

CPU Cache Hierarchy and Transformer Inference

Why memory bandwidth is the SLM CPU inference bottleneck.

ARTICLE · 06

CPU Inference Pipelines

The software stacks that make SLM CPU inference fast.

ARTICLE · 07

CPU Matmul Kernels and BLAS

Why your transformer is 10× faster than a naive Python loop.

ARTICLE · 08

CPU Memory Budget for SLM Training

Counting every byte: weights, gradients, optimizer, activations.

ARTICLE · 09

Cross-Entropy Loss for Next-Token Prediction

What loss to minimize and why the gradient is so clean.

ARTICLE · 10

Dataloader and Tokenization Pipeline

Streaming data efficiently to the training loop.

ARTICLE · 11

The Geometry of Dot Products

How attention measures similarity in embedding space.

ARTICLE · 12

End-to-End CPU SLM Recipe

From scratch: train and serve a small model on a workstation.

ARTICLE · 13

Feed-Forward Network

Two-thirds of model params live here. Worth understanding.

ARTICLE · 14

Fine-Tuning Math

How to adapt without retraining the whole model.

ARTICLE · 15

FlashAttention

Same math, very different memory pattern.

ARTICLE · 16

Gradient Accumulation

Effective large batches on small memory.

ARTICLE · 17

Gradient Clipping and Training Stability

Stopping spikes from blowing up training.

ARTICLE · 18

KV Cache

Why and how to cache keys and values.

ARTICLE · 19

KV Cache Quantization Deep Dive

Compress the biggest memory hog at inference.

ARTICLE · 20

LayerNorm

How LayerNorm stabilizes activations.

ARTICLE · 21

Learning Rate Schedules

Why LR changes during training and the canonical curves.

ARTICLE · 22

Linear Algebra for Transformers

Vectors, matrices, and operations you actually need.

ARTICLE · 23

Long Context Strategies

How models handle contexts of 128K to 1M tokens.

ARTICLE · 24

Diagnosing Loss Curves

What healthy vs unhealthy training looks like.

ARTICLE · 25

Mixed Precision Training

Half-precision for speed; full precision for stability.

ARTICLE · 26

Mixture of Experts

Active params vs total params, and what MoE means for CPU inference.

ARTICLE · 27

Multi-Head Attention

Why split into heads and how the math reassembles.

ARTICLE · 28

Multi-Token Prediction (MTP)

Predict the next 2-4 tokens at once for free speed.

ARTICLE · 29

Output Projection and Logits

From final hidden state to next-token probabilities.

ARTICLE · 30

Perplexity and Evaluation Metrics

How to measure if your model is any good.

ARTICLE · 31

Positional Encoding

How transformers know which token is where.

ARTICLE · 32

Quantization Layouts in Memory

Block-wise scaling, packed INT4, and the layout-vs-quality trade.

ARTICLE · 33

Residual Connections and Gradient Flow

Why every transformer block adds the input back.

ARTICLE · 34

RLHF and DPO

From a pretrained model to a useful assistant.

ARTICLE · 35

RMSNorm

Why everyone moved away from LayerNorm.

ARTICLE · 36

Extending RoPE to Longer Contexts

YaRN, NTK-aware scaling, and LongRoPE.

ARTICLE · 37

Sampling Strategies

How to pick the next token from the model's distribution.

ARTICLE · 38

Scaled Dot-Product Attention

The math of the attention operation.

ARTICLE · 39

SGD vs Adam vs AdamW Optimizer Math

The update rules — and why Adam dominates LLM training.

ARTICLE · 40

SIMD Instructions for Transformer Math

AVX-512, AMX, and how they vectorize matmul.

ARTICLE · 41

SLM Architectures

Hyperparams of the leading sub-7B models.

ARTICLE · 42

Softmax Derivation

Why softmax, what it preserves, and how temperature changes it.

ARTICLE · 43

Speculative Decoding

Use a fast draft model; verify with the big model.

ARTICLE · 44

The Future of CPU SLM in 2026 and Beyond

Where this is headed.

ARTICLE · 45

Tied Embeddings and Parameter Sharing

Why SLMs share weights between input embedding and output projection.

ARTICLE · 46

Token Embedding Lookup

How discrete tokens become continuous vectors.

ARTICLE · 47

Tokenizer Math

How text becomes a token sequence.

ARTICLE · 48

Training Data for SLMs

What to train on when you can't use the whole internet.

ARTICLE · 49

Weight Initialization

Why all-zeros doesn't work and what to do instead.

ARTICLE · 50

How Weights Are Stored on Disk

From PyTorch tensor to GGUF/SafeTensors bytes.
