All 50 articles, sorted alphabetically
The Autoregressive Generation Loop
How an LLM produces text, one token at a time.
Backpropagation Through Transformers
How gradients flow back through the network — chain rule applied.
Beam Search
Sometimes greedy isn't enough.
The Complete Transformer Block
All the pieces assembled with pseudocode.
CPU Cache Hierarchy and Transformer Inference
Why memory bandwidth is the SLM CPU inference bottleneck.
CPU Inference Pipelines
The software stacks that make SLM CPU inference fast.
CPU Matmul Kernels and BLAS
Why your transformer is 10× faster than a naive Python loop.
CPU Memory Budget for SLM Training
Counting every byte: weights, gradients, optimizer, activations.
Cross-Entropy Loss for Next-Token Prediction
What loss to minimize and why the gradient is so clean.
Dataloader and Tokenization Pipeline
Streaming data efficiently to the training loop.
The Geometry of Dot Products
How attention measures similarity in embedding space.
End-to-End CPU SLM Recipe
From scratch: train and serve a small model on a workstation.
Feed-Forward Network
Two-thirds of model params live here. Worth understanding.
Fine-Tuning Math
How to adapt without retraining the whole model.
FlashAttention
Same math, very different memory pattern.
Gradient Accumulation
Effective large batches on small memory.
Gradient Clipping and Training Stability
Stopping spikes from blowing up training.
KV Cache
Why and how to cache keys and values.
KV Cache Quantization Deep Dive
Compress the biggest memory hog at inference.
LayerNorm
How LayerNorm stabilizes activations.
Learning Rate Schedules
Why LR changes during training and the canonical curves.
Linear Algebra for Transformers
Vectors, matrices, and operations you actually need.
Long Context Strategies
How models handle contexts of 128K to 1M tokens.
Diagnosing Loss Curves
What healthy vs unhealthy training looks like.
Mixed Precision Training
Half-precision for speed; full precision for stability.
Mixture of Experts
Active params vs total params, and what MoE means for CPU inference.
Multi-Head Attention
Why split into heads and how the math reassembles.
Multi-Token Prediction (MTP)
Predict the next 2-4 tokens at once for faster decoding.
Output Projection and Logits
From final hidden state to next-token probabilities.
Perplexity and Evaluation Metrics
How to measure if your model is any good.
Positional Encoding
How transformers know which token is where.
Quantization Layouts in Memory
Block-wise scaling, packed INT4, and the layout-vs-quality trade.
Residual Connections and Gradient Flow
Why every transformer block adds the input back.
RLHF and DPO
From a pretrained model to a useful assistant.
RMSNorm
Why everyone moved away from LayerNorm.
Extending RoPE to Longer Contexts
YaRN, NTK-aware scaling, and LongRoPE.
Sampling Strategies
How to pick the next token from the model's distribution.
Scaled Dot-Product Attention
The math of the attention operation.
SGD vs Adam vs AdamW Optimizer Math
The update rules — and why Adam dominates LLM training.
SIMD Instructions for Transformer Math
AVX-512, AMX, and how they vectorize matmul.
SLM Architectures
Hyperparams of the leading sub-7B models.
Softmax Derivation
Why softmax, what it preserves, and how temperature changes it.
Speculative Decoding
Use a fast draft model; verify with the big model.
The Future of CPU SLM in 2026 and Beyond
Where this is headed.
Tied Embeddings and Parameter Sharing
Why SLMs share weights between input embedding and output projection.
Token Embedding Lookup
How discrete tokens become continuous vectors.
Tokenizer Math
How text becomes a token sequence.
Training Data for SLMs
What to train on when you can't use the whole internet.
Weight Initialization
Why all-zeros doesn't work and what to do instead.
How Weights Are Stored on Disk
From PyTorch tensor to GGUF/SafeTensors bytes.