Long-Context Evaluation — Belgavi.AI Lab

Long-context models (128K or 1M tokens) are usually benchmarked on needle-in-haystack: hide a fact in the context and see whether the model retrieves it. Models score above 95% and ship. Real long-context use looks nothing like needle-in-haystack.

Needle-in-haystack: easy

Insert a distinctive fact (for example, key=ABC) into a long context, then ask for it. Most models score above 90% even at 1M tokens. This tests retrieval, not understanding.
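
A minimal sketch of how such a test could be assembled, assuming the haystack is a list of filler sentences and the score is a simple substring match. The function names, the `depth` parameter, and the scoring rule are illustrative assumptions, not taken from any specific benchmark.

```python
def build_needle_prompt(filler_sentences, needle, depth, question):
    """Place `needle` at a relative depth in the filler (0.0 = start, 1.0 = end).

    All names here are hypothetical; real benchmarks differ in how they
    build the haystack and phrase the question.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = int(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences) + "\n\n" + question


def is_retrieved(model_answer, expected):
    """Assumed scoring rule: case-insensitive substring match on the expected value."""
    return expected.lower() in model_answer.lower()


filler = ["Filler sentence number %d." % i for i in range(1000)]
prompt = build_needle_prompt(filler, "The key is ABC.", 0.5, "What is the key?")
# `prompt` would be sent to the model under test; the reply is scored with is_retrieved.
```

The check only asks whether one string comes back out, which is why a high score here says little about understanding.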

Multi-fact reasoning is hard

Here the model must combine facts spread through the context to answer a question. Accuracy drops below 50% for many models, even at 32K tokens. Benchmarks such as RULER, LongBench, and NIAH+ try to capture this.
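
As a rough illustration of the difference from a single needle, the sketch below scatters two linked facts at different depths and asks a question that needs both. The helper name, the example facts, and the depths are made up for illustration; benchmarks such as RULER and LongBench define their own tasks and formats.

```python
def build_multi_fact_prompt(filler_sentences, facts_with_depths, question):
    """Scatter several facts through the filler at the given relative depths.

    `facts_with_depths` is a list of (fact, depth) pairs with depth in [0.0, 1.0].
    Hypothetical helper for illustration only.
    """
    sentences = list(filler_sentences)
    n = len(filler_sentences)
    # Insert the deepest fact first so earlier insert positions are not shifted.
    for fact, depth in sorted(facts_with_depths, key=lambda fd: fd[1], reverse=True):
        sentences.insert(int(depth * n), fact)
    return " ".join(sentences) + "\n\n" + question


filler = ["f%d." % i for i in range(10)]
facts = [
    ("Alice's badge number is 17.", 0.2),
    ("The holder of badge 17 works in Lyon.", 0.8),
]
prompt = build_multi_fact_prompt(filler, facts, "In which city does Alice work?")
# Neither fact alone links Alice to a city; the model has to connect
# the two through the badge number.
```

Retrieving either sentence on its own is not enough, which is the property a single-needle test never exercises.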

Lost-in-the-middle

Information in the middle of the context is recalled less reliably than information at the start or end. The effect persists across model families. Practical impact: put critical context at the start or end, not buried in the middle.
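
One way to act on that advice, sketched under the assumption that the context is a list of documents already ranked most important first. The function name and the alternating scheme are illustrative choices, not a prescribed method.

```python
def place_at_edges(docs_ranked):
    """Reorder documents so the most important sit at the start and end.

    `docs_ranked` is assumed to be sorted most-important-first. Items are
    dealt alternately to the front and the back, so the least important
    ones end up in the middle of the context.
    """
    front, back = [], []
    for i, doc in enumerate(docs_ranked):
        if i % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    return front + back[::-1]


ordered = place_at_edges(["a", "b", "c", "d", "e"])
# ordered is ["a", "c", "e", "d", "b"]: "a" opens the context, "b" closes it,
# and the lowest-ranked item "e" lands in the middle.
```

This only helps when a usable importance ranking exists; with no ranking, the ordering is arbitrary.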

Needle-in-haystack is a smoke test, not a capability test. Multi-fact reasoning is the real bar. Lost-in-the-middle is real.