RAG pipelines fail in three different places: retrieval missed the right doc, generation ignored the retrieved doc, or end-to-end gave a bad answer. Mixing these into one metric loses the signal needed to fix the actual problem.
Retrieval metrics
Recall@K, MRR, NDCG. Measured against curated gold-standard pairs (question, ideal docs). If recall@5 is low, the embedding model or chunking is wrong. Fix this before tuning generation.
Generation metrics
Faithfulness (does answer come from retrieved docs?). Answer relevance (does it answer the question?). LLM-as-judge with rubric. Catches hallucination separately from retrieval failure.
End-to-end metrics
User-relevant: did this answer the user's question? Use task-specific rubric, calibrated LLM-judge. The number that matters for product. But you can't fix what it measures without the upstream metrics.