Tokenization Choices Compared

Tokenization isn't sexy but shapes every other choice. Vocabulary size affects model capacity. Tokenizer choice affects multilingual coverage. Compression rate determines context window in practice.

Advertisement

BPE (Byte-Pair Encoding)

Greedy: merge most-frequent byte pairs iteratively. Result: subword units. Used in GPT family. Good English, OK other languages with sufficient training data.

SentencePiece

Language-agnostic, includes whitespace as part of tokens. Better multilingual coverage. Used in Llama, Mistral, T5. The current open-source default.

Advertisement

Compression matters

Same text in different tokenizers: GPT tokenizer might compress to 1000 tokens, another to 1500. 50% more compute for the same context. Tiktoken (GPT) is highly optimized for English; other tokenizers may be better for code or non-English text.

SentencePiece is the open default. Tokenizer choice impacts cost and multilingual quality. Measure compression on your data.