Language of instruction

Instructions written in English often outperform instructions in the user's native language because of the training data mix. The trade-off: instructions written in the target language reduce accidental English leaks in the output.
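
One rough way to watch for English leaks, when the target language uses a non-Latin script, is to measure how many letters in the output are Latin. The sketch below is an illustration, not an established metric: the function name `latin_letter_share` and any threshold applied to its result are assumptions, and it says nothing useful for targets written in the Latin alphabet (Spanish, Vietnamese, and so on).

```python
import unicodedata


def latin_letter_share(text: str) -> float:
    """Return the fraction of alphabetic characters whose Unicode name mentions LATIN.

    Hypothetical helper: a high value on output that should be in, say,
    Japanese or Arabic hints that English leaked into the response.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0  # no letters at all, so nothing to measure
    latin = sum(1 for ch in letters if "LATIN" in unicodedata.name(ch, ""))
    return latin / len(letters)


# Example: a Japanese answer with one leaked English word.
share = latin_letter_share("これは test です")
```

Product names and code identifiers legitimately appear in Latin script, so a check like this is better treated as a flag for review than as a hard failure.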

Cross-lingual retrieval

Multilingual embeddings (BGE-M3, e5-mistral) let a query in language A retrieve documents in language B with no translation step. For many language pairs, this native cross-lingual retrieval beats translate-then-retrieve.
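
The idea above can be sketched as ranking by cosine similarity in a shared embedding space. This is a minimal sketch, not any library's API: `rank_documents` and the `embed` callable are hypothetical names, and the hand-written vectors in the example stand in for what a real multilingual embedding model would return.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(
    query: str,
    docs: Sequence[str],
    embed: Callable[[str], Vector],
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Embed the query and each document with the same model, then sort by similarity.

    No translation happens here: the query and documents may be in
    different languages as long as `embed` maps them into one space.
    """
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(doc)), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy stand-in for a multilingual model: hand-made 2-d vectors (assumption).
toy_vectors = {
    "Wie ist das Wetter?": [1.0, 0.0],
    "The weather is sunny today.": [0.9, 0.1],
    "Stock prices fell.": [0.0, 1.0],
}
results = rank_documents(
    "Wie ist das Wetter?",
    ["Stock prices fell.", "The weather is sunny today."],
    embed=toy_vectors.__getitem__,
)
```

In the toy example the German query ranks the English weather sentence first. Passing `embed` as a parameter keeps the ranking logic independent of whichever embedding model is plugged in.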

Translation as pivot

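For complex reasoning, one option is to pivot through English: translate the input to English, reason there, then translate the answer back. The round trip loses nuance. Modern models such as GPT-4 and Claude do fine without a pivot for most language pairs.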
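
The pivot pattern described above can be written as a three-step pipeline. This is a sketch under assumptions: `translate` and `reason` are hypothetical callables standing in for whatever translation and model calls are actually used, and their signatures are invented for illustration.

```python
from typing import Callable

# Hypothetical signatures: translate(text, source_lang, target_lang) and reason(prompt).
Translate = Callable[[str, str, str], str]
Reason = Callable[[str], str]


def answer_with_pivot(
    question: str,
    lang: str,
    translate: Translate,
    reason: Reason,
    pivot: str = "en",
) -> str:
    """Translate to the pivot language, reason there, translate the answer back."""
    if lang == pivot:
        return reason(question)  # already in the pivot language, nothing to translate
    pivot_question = translate(question, lang, pivot)
    pivot_answer = reason(pivot_question)
    return translate(pivot_answer, pivot, lang)


def answer_direct(question: str, reason: Reason) -> str:
    """No pivot: the model reasons in the question's own language."""
    return reason(question)
```

Each translation call is a place where nuance can be lost, so running `answer_direct` and `answer_with_pivot` on the same evaluation set is a reasonable way to check whether the pivot still helps for a given language pair.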

Right-to-left and non-Latin scripts

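Arabic, Hebrew, CJK: watch tokenization quality, since text in these scripts often splits into more tokens per character than English. Some models struggle with under-represented scripts and low-resource languages (e.g., Amharic, Burmese).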
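
A simple way to compare tokenization quality across scripts is tokens per character. The sketch below is illustrative: `tokens_per_char` is a hypothetical helper, and the default `utf8_byte_tokens` is not a real model tokenizer but a rough worst case for byte-level tokenizers (one token per UTF-8 byte). Swap in the actual tokenizer's encode function to measure a specific model.

```python
from typing import Callable, List


def utf8_byte_tokens(text: str) -> List[int]:
    """Worst-case stand-in tokenizer: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def tokens_per_char(
    text: str,
    tokenize: Callable[[str], List[int]] = utf8_byte_tokens,
) -> float:
    """Average number of tokens spent on each character of `text`."""
    if not text:
        return 0.0
    return len(tokenize(text)) / len(text)


# With the byte-level stand-in, cost tracks UTF-8 width per character.
samples = {
    "English": "hello",
    "Hebrew": "שלום",
    "Chinese": "中文",
    "Amharic": "ሰላም",
}
ratios = {name: tokens_per_char(text) for name, text in samples.items()}
```

With the byte-level stand-in the ratios come out as 1.0 for the English sample, 2.0 for Hebrew, and 3.0 for Chinese and Amharic. A trained tokenizer usually does better than this on well-represented scripts, so a ratio that stays near the byte-level figure is a sign the script is poorly covered.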