Mixture of Experts (MoE) routes each token to a subset of expert sub-networks. Mixture of Depths (MoD) applies the same routing idea along the depth axis: each token passes through a different number of layers. Simple tokens skip layers; hard tokens go deeper. The goal is to save compute on easy tokens while preserving quality on hard ones.

The intuition

Not every token needs the full model. Function words (the, of, and), text copied from context, and structural punctuation are all usually easy to predict. Spending the same compute on them as on a token that requires multi-step reasoning is wasteful.

How MoD works

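A small router at each layer scores every token and decides: process this token, or skip it. Skipped tokens pass through unchanged on the residual connection. At training time, the layer's output is scaled by the router score, so gradient flows through the routing decision. At inference the skip is hard: the layer's matmuls are never run for a skipped token.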
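The route-or-skip step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it uses plain Python lists instead of tensors, a fixed gate threshold as a stand-in for the routing rule, and the names `mod_block`, `router_w`, and `block_fn` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mod_block(tokens, router_w, block_fn, threshold=0.5):
    """Run block_fn only on tokens whose router gate exceeds threshold.

    tokens:   list of hidden vectors (lists of floats)
    router_w: router weight vector, same length as one hidden vector
    block_fn: hypothetical stand-in for the attention + MLP computation
    """
    outputs, processed = [], []
    for x in tokens:
        gate = sigmoid(sum(w * xi for w, xi in zip(router_w, x)))
        if gate > threshold:
            update = block_fn(x)
            # Scaling by the gate is what would let gradient reach the router in training.
            outputs.append([xi + gate * ui for xi, ui in zip(x, update)])
            processed.append(True)
        else:
            # Skipped: the residual stream carries the token through unchanged.
            outputs.append(list(x))
            processed.append(False)
    return outputs, processed

tokens = [[1.0, 0.0], [-1.0, 0.0]]
outputs, processed = mod_block(tokens, router_w=[2.0, 0.0], block_fn=lambda x: [1.0, 1.0])
print(processed)   # [True, False]
print(outputs[1])  # [-1.0, 0.0]
```

In a real model the skipped branch is where the saving comes from: `block_fn` is simply never called for that token.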

Top-k routing

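Common variant: each layer processes only the k highest-scoring ('most important') tokens out of N. The remaining N - k skip the layer. k is chosen to hit a target compute budget, expressed as the average compute spent per token. Reported savings are around 30-50% of compute with minimal quality loss.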
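A minimal sketch of top-k selection, assuming router scores have already been computed for every token in the sequence. The helper names `top_k_indices` and `mod_top_k_block` are hypothetical, and `block_fn` again stands in for the layer's real computation.

```python
def top_k_indices(scores, k):
    """Indices of the k highest-scoring tokens, returned in sequence order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def mod_top_k_block(tokens, scores, block_fn, k):
    """Process the top-k tokens with block_fn; the other N - k skip via the residual."""
    keep = set(top_k_indices(scores, k))
    outputs = []
    for i, x in enumerate(tokens):
        if i in keep:
            update = block_fn(x)
            # Router score scales the update, as in the gated residual form.
            outputs.append([xi + scores[i] * ui for xi, ui in zip(x, update)])
        else:
            outputs.append(list(x))  # unchanged copy of the skipped token
    return outputs

scores = [0.1, 0.9, 0.4, 0.7]
tokens = [[1.0], [2.0], [3.0], [4.0]]
print(top_k_indices(scores, k=2))  # [1, 3]
outputs = mod_top_k_block(tokens, scores, block_fn=lambda x: [10.0], k=2)
```

Fixing k keeps the amount of work per layer constant, which is friendlier to hardware than a variable threshold. One caveat: ranking over the whole sequence looks at tokens that do not exist yet during autoregressive sampling, so the decision has to be approximated at generation time.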

Where it fits

Best for inference cost reduction on large contexts. Less benefit on short prompts, where there are few tokens to skip and most of them need full processing. Pairs naturally with sparse attention; both skip 'unimportant' work in different ways.
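Why context length matters can be shown with a back-of-the-envelope estimate. The sketch below assumes a common approximation for one transformer layer (about 24·n·d² FLOPs for the projections and a 4x-wide MLP, plus 4·n²·d for attention) and assumes that attention in a routed layer runs only among the selected tokens. The function names and the example sizes are illustrative, not taken from any paper.

```python
def block_flops(n_tokens, d_model):
    """Rough forward FLOPs for one transformer layer (approximation, see assumptions)."""
    linear = 24 * n_tokens * d_model ** 2      # QKV/output projections + MLP
    attention = 4 * n_tokens ** 2 * d_model    # QK^T scores + weighted sum over V
    return linear + attention

def mod_saving(n_tokens, d_model, capacity):
    """Fraction of layer FLOPs saved when only capacity * n_tokens are processed."""
    k = int(capacity * n_tokens)
    return 1.0 - block_flops(k, d_model) / block_flops(n_tokens, d_model)

short = mod_saving(n_tokens=128, d_model=1024, capacity=0.5)
long_ = mod_saving(n_tokens=131072, d_model=1024, capacity=0.5)
print(f"short prompt: {short:.1%} saved, long context: {long_:.1%} saved")
```

Under this model the per-layer saving starts near 1 - capacity for short sequences, where the linear terms dominate, and approaches 1 - capacity² for long ones, where attention dominates. The saving across a whole model is smaller if only some layers are routed.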

Adoption

The research results are strong (Google DeepMind's Mixture-of-Depths paper and follow-up work). Production adoption is nascent: inference engines need scheduling support for variable compute per token. Expect more in 2026-2027 model releases.

In short: some tokens skip layers, which saves compute on easy tokens. Existing MoE routing infrastructure can be adapted. A 30-50% saving is possible at limited quality cost.