A RAG pipeline can fail at three points: retrieval misses the right document, generation ignores the document it was given, or the final answer is simply bad for the user. Folding these into one metric throws away the signal you need to find and fix the actual problem.

Retrieval metrics

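The standard metrics are Recall@K, MRR, and NDCG, measured against a curated gold set of (question, ideal docs) pairs. If recall@5 is low, the right documents are not reaching the generator, and the usual suspects are the embedding model or the chunking strategy. Fix this before tuning generation: no prompt can recover a document that was never retrieved.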
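As a rough illustration, the three retrieval metrics can be computed in a few lines of plain Python. This is a minimal sketch, assuming binary relevance (a doc is either in the gold set or not) and unique doc IDs in each ranked list. The `retrieve` callable, the `kb-…` IDs, and the toy index are hypothetical stand-ins for your own retriever and gold data.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the gold docs that show up in the top-k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first gold doc, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking over DCG of an ideal ranking."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

def evaluate_retrieval(gold_pairs, retrieve, k=5):
    """Average the metrics over (question, gold_doc_ids) pairs.

    `retrieve` is any callable mapping a question to a ranked list of doc IDs.
    """
    if not gold_pairs:
        raise ValueError("gold_pairs is empty")
    rows = []
    for question, relevant in gold_pairs:
        ranked = retrieve(question)
        rows.append((
            recall_at_k(ranked, relevant, k),
            reciprocal_rank(ranked, relevant),
            ndcg_at_k(ranked, relevant, k),
        ))
    n = len(rows)
    return {
        f"recall@{k}": sum(r[0] for r in rows) / n,
        "mrr": sum(r[1] for r in rows) / n,
        f"ndcg@{k}": sum(r[2] for r in rows) / n,
    }

# Toy gold set and a fake retriever (both hypothetical).
gold = [
    ("how do I rotate api keys?", ["kb-12"]),
    ("what is the refund window?", ["kb-40", "kb-41"]),
]
fake_index = {
    "how do I rotate api keys?": ["kb-12", "kb-03", "kb-77"],
    "what is the refund window?": ["kb-09", "kb-41", "kb-02"],
}
report = evaluate_retrieval(gold, lambda q: fake_index[q], k=3)
print(report)
```

Keeping the retriever behind a plain callable means the same gold set can be replayed against a new embedding model or chunking scheme without touching the metric code.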

Generation metrics

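Faithfulness asks whether the answer is supported by the retrieved docs; answer relevance asks whether it actually addresses the question. Both are typically scored by an LLM-as-judge against a rubric. Measuring them on their own catches hallucination separately from retrieval failure.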
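One way to wire up a rubric-based judge is sketched below. Everything here is an assumption for illustration: the rubric wording, the 1 to 5 scale, the `SCORE: n` reply format, and the `call_llm` parameter, which stands for whatever function sends a prompt to your judge model and returns its text. No particular provider API is implied.

```python
import re
from typing import Callable, Dict, List

# Hypothetical rubrics; tune the wording and scale to your own task.
FAITHFULNESS_RUBRIC = (
    "Rate from 1 to 5 how well every claim in the ANSWER is supported by the "
    "CONTEXT. 5 = fully supported, 1 = mostly unsupported or contradicted."
)
RELEVANCE_RUBRIC = (
    "Rate from 1 to 5 how directly the ANSWER addresses the QUESTION. "
    "5 = answers it completely, 1 = off-topic. Ignore factual accuracy."
)

def build_judge_prompt(rubric: str, question: str, context: str, answer: str) -> str:
    """Assemble one judge prompt; the reply must end with 'SCORE: <1-5>'."""
    return (
        f"{rubric}\n\n"
        f"QUESTION:\n{question}\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"ANSWER:\n{answer}\n\n"
        "Explain briefly, then finish with a line of the form 'SCORE: <1-5>'."
    )

def parse_score(raw: str) -> int:
    """Pull the integer score out of the judge's reply, or fail loudly."""
    match = re.search(r"SCORE:\s*([1-5])\b", raw)
    if match is None:
        raise ValueError(f"judge reply has no valid score: {raw!r}")
    return int(match.group(1))

def judge_generation(
    question: str,
    retrieved_docs: List[str],
    answer: str,
    call_llm: Callable[[str], str],
) -> Dict[str, int]:
    """Score faithfulness and answer relevance with two separate judge calls."""
    context = "\n\n".join(retrieved_docs)
    return {
        "faithfulness": parse_score(
            call_llm(build_judge_prompt(FAITHFULNESS_RUBRIC, question, context, answer))
        ),
        "answer_relevance": parse_score(
            call_llm(build_judge_prompt(RELEVANCE_RUBRIC, question, context, answer))
        ),
    }
```

Scoring the two dimensions in separate calls keeps a fluent but unsupported answer from earning a high faithfulness score just because it reads as on-topic.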

End-to-end metrics

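This is the user-facing measure: did the answer resolve the user's question? Score it with a task-specific rubric and an LLM judge calibrated against human labels. It is the number that matters for the product, but when it drops it does not tell you why; you need the upstream metrics to locate the fault.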
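"Calibrated" can be made concrete by checking the judge against a small human-labeled sample before trusting it at scale. The sketch below assumes pass/fail labels (1 or 0) from both a human reviewer and the judge on the same examples, and reports raw agreement plus Cohen's kappa, which corrects for agreement expected by chance. The label lists are made-up illustration data.

```python
from collections import Counter

def agreement_rate(human, judge):
    """Share of examples where judge and human gave the same label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def cohens_kappa(human, judge):
    """Chance-corrected agreement between two label lists."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Probability that both would pick the same label independently.
    expected = sum(
        (human_counts[label] / n) * (judge_counts[label] / n)
        for label in set(human) | set(judge)
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Made-up labels: 1 = answer acceptable, 0 = not acceptable.
human_labels = [1, 1, 0, 0, 1, 0]
judge_labels = [1, 1, 0, 1, 1, 0]
print(agreement_rate(human_labels, judge_labels))
print(cohens_kappa(human_labels, judge_labels))
```

If agreement is poor, the rubric or judge prompt is the thing to revise before reading anything into the end-to-end number.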

Three separate metrics: retrieval, generation, end-to-end. Each tells you where to fix. One number alone hides the bug.
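To show how the three signals combine, here is a small triage sketch that attributes each failed example to a stage. The thresholds (`recall_floor`, `faithfulness_floor`), the record fields, and the sample values are all assumptions chosen for illustration; pick cutoffs that match your own scales.

```python
from collections import Counter

def diagnose(recall, faithfulness, end_to_end_pass,
             recall_floor=0.5, faithfulness_floor=4):
    """Attribute one example to a pipeline stage.

    Checks retrieval first: a generator cannot be blamed for ignoring
    documents it never received.
    """
    if end_to_end_pass:
        return "ok"
    if recall < recall_floor:
        return "retrieval"
    if faithfulness < faithfulness_floor:
        return "generation"
    return "unclear"  # upstream looks fine; inspect rubric or question by hand

# Hypothetical per-example results gathered from the three evaluations.
examples = [
    {"recall": 1.0, "faithfulness": 5, "end_to_end_pass": True},
    {"recall": 0.0, "faithfulness": 2, "end_to_end_pass": False},
    {"recall": 1.0, "faithfulness": 2, "end_to_end_pass": False},
    {"recall": 1.0, "faithfulness": 5, "end_to_end_pass": False},
]
breakdown = Counter(
    diagnose(e["recall"], e["faithfulness"], e["end_to_end_pass"])
    for e in examples
)
print(breakdown)
```

A breakdown like this turns a single end-to-end pass rate into a count of failures per stage, which is what tells you where to spend the next fix.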