Transformers

Attention variants, RoPE, MoE, RMSNorm, MTP, sparse attention, FlashAttention.

20 Articles
20 Topics covered
Articles in this category

All 20 articles, sorted alphabetically

ARTICLE · 01

Attention Is All You Need

What aged and what didn't.

ARTICLE · 02

Attention Is All You Need

The 2017 paper that started everything, re-explained for 2026.

ARTICLE · 03

Attention Variants in 2026

MHA, MQA, GQA, MLA — what's deployed.

ARTICLE · 04

Audio-Native Transformers: Understanding Models That Process Raw Sound Instead of Text-to-Speech

ARTICLE · 05

Beyond Transformers: Sparse Architectures and State-Space Models for Infinite Context

ARTICLE · 06

Causal vs Bidirectional Attention

Where decoder, encoder, and prefix-LM models use each.

ARTICLE · 07

FlashAttention Explained

How the attention kernel was rewritten to be 5-10x faster.

ARTICLE · 08

FlashAttention-3 and Beyond: How Hardware-Aware Algorithms are Making Models 10x Faster

ARTICLE · 09

KV Cache Explained

Why generation is fast: the attention KV cache and its memory implications.

ARTICLE · 10

Linear Transformers: Can We Finally Achieve O(n) Complexity for Infinite Context?

ARTICLE · 11

Long-Context Evaluation

Needle-in-haystack and what it misses.

ARTICLE · 12

Mixture of Depths

Routing tokens through different depths.

ARTICLE · 13

Multi-Token Prediction (MTP)

DeepSeek V3's MTP and the speed-quality trade-off.

ARTICLE · 14

Performance Analysis: Mamba vs. Transformer

ARTICLE · 15

Rotary Position Embeddings (RoPE)

The position encoding that replaced sinusoidal encodings and enabled long-context extrapolation.

ARTICLE · 16

The Transformer Breakdown: A Deep Dive into Self-Attention, Key-Value Pairs, and Positional Encoding

ARTICLE · 17

The Vanishing Gradient Problem: How Transformers Solved What Killed Earlier RNNs

ARTICLE · 18

Tokenization Choices Compared

BPE, SentencePiece, and tiktoken, and why the choice matters.

ARTICLE · 19

Transformer Block Anatomy

What's inside one of the N layers.

ARTICLE · 20

Transformer Training Loss Curves

Reading them and what each shape means.
