Articles in this category
All 20 articles, sorted alphabetically
ARTICLE · 01
Attention Is All You Need
What aged and what didn't.
ARTICLE · 02
Attention Is All You Need
The 2017 paper that started everything, re-explained for 2026.
ARTICLE · 03
Attention Variants in 2026
MHA, MQA, GQA, MLA — what's deployed.
ARTICLE · 04
Audio-Native Transformers: Understanding Models That Process Raw Sound Instead of Text-to-Speech
ARTICLE · 05
Beyond Transformers: Sparse Architectures and State-Space Models for Infinite Context
ARTICLE · 06
Causal vs Bidirectional Attention
Decoder, encoder, and prefix-LM uses.
ARTICLE · 07
FlashAttention Explained
How the attention kernel was rewritten to be 5-10x faster.
ARTICLE · 08
FlashAttention-3 and Beyond: How Hardware-Aware Algorithms are Making Models 10x Faster
ARTICLE · 09
KV Cache Explained
Why generation is fast: the attention KV cache and its memory implications.
ARTICLE · 10
Linear Transformers: Can We Finally Achieve O(n) Complexity for Infinite Context?
ARTICLE · 11
Long-Context Evaluation
Needle-in-haystack and what it misses.
ARTICLE · 12
Mixture of Depths
Routing tokens through different depths.
ARTICLE · 13
Multi-Token Prediction (MTP)
DeepSeek V3's MTP and the speed-quality trade.
ARTICLE · 14
Performance Analysis: Mamba vs. Transformer
ARTICLE · 15
Rotary Position Embeddings (RoPE)
The position encoding that replaced sinusoidal and enabled long-context extrapolation.
ARTICLE · 16
The Transformer Breakdown: A Deep Dive into Self-Attention, Key-Value Pairs, and Positional Encoding
ARTICLE · 17
The Vanishing Gradient Problem: How Transformers Solved What Killed Earlier RNNs
ARTICLE · 18
Tokenization Choices Compared
BPE, SentencePiece, Tiktoken — and why it matters.
ARTICLE · 19
Transformer Block Anatomy
What's inside one of the N layers.
ARTICLE · 20
Transformer Training Loss Curves
Reading them and what each shape means.