Tokenization isn't sexy but shapes every other choice. Vocabulary size affects model capacity. Tokenizer choice affects multilingual coverage. Compression rate determines context window in practice.
Advertisement
BPE (Byte-Pair Encoding)
Greedy: merge most-frequent byte pairs iteratively. Result: subword units. Used in GPT family. Good English, OK other languages with sufficient training data.
SentencePiece
Language-agnostic, includes whitespace as part of tokens. Better multilingual coverage. Used in Llama, Mistral, T5. The current open-source default.
Advertisement
Compression matters
Same text in different tokenizers: GPT tokenizer might compress to 1000 tokens, another to 1500. 50% more compute for the same context. Tiktoken (GPT) is highly optimized for English; other tokenizers may be better for code or non-English text.
SentencePiece is the open default. Tokenizer choice impacts cost and multilingual quality. Measure compression on your data.