RAG Evaluation Metrics That Actually Matter
Moving beyond basic recall — measuring faithfulness, relevance, and answer quality in retrieval-augmented systems.
The Measurement Problem
Most RAG evaluations test retrieval recall and call it a day. But retrieval quality is necessary, not sufficient: a production RAG system can retrieve perfect context and still generate incorrect answers.
The Three Pillars
1. Retrieval Quality
Traditional IR metrics still apply, but with nuance:
def retrieval_metrics(retrieved, ground_truth):
    # retrieved: ranked list of document IDs; ground_truth: set of relevant IDs
    return {
        "recall@5": len(set(retrieved[:5]) & ground_truth) / len(ground_truth),
        "precision@5": len(set(retrieved[:5]) & ground_truth) / 5,
        # Reciprocal rank of the first relevant document, 0 if none is retrieved
        "mrr": next(
            (1 / (i + 1) for i, doc in enumerate(retrieved) if doc in ground_truth),
            0,
        ),
    }
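For instance, with a hypothetical ranked list and relevance labels (the document IDs below are made up for illustration):

retrieved = ["doc_3", "doc_7", "doc_1", "doc_9", "doc_4", "doc_2"]
ground_truth = {"doc_1", "doc_4", "doc_8"}

metrics = retrieval_metrics(retrieved, ground_truth)
# recall@5 = 2/3, precision@5 = 2/5, mrr = 1/3 (first relevant hit is at rank 3)
print(metrics)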
2. Faithfulness
Does the generated answer actually follow from the retrieved context? This catches hallucinations that sound plausible but aren't grounded in the source material.
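A common way to operationalize faithfulness is LLM-as-judge: split the answer into individual claims and ask a judge model whether each claim is supported by the retrieved context. Below is a minimal sketch, assuming a generic judge_llm(prompt) callable that returns the model's text response; it is not tied to any specific library, and the sentence splitting is deliberately naive.

def faithfulness_score(answer: str, context: str, judge_llm) -> float:
    """Fraction of answer claims the judge deems supported by the context.

    judge_llm is assumed to be a callable taking a prompt string and
    returning the model's text response; wire in whichever client you use.
    """
    # Naive sentence split; swap in a proper segmenter for production use.
    claims = [s.strip() for s in answer.split(".") if s.strip()]
    if not claims:
        return 0.0

    supported = 0
    for claim in claims:
        prompt = (
            "Context:\n" + context + "\n\n"
            "Claim: " + claim + "\n"
            "Is the claim fully supported by the context? Answer yes or no."
        )
        if judge_llm(prompt).strip().lower().startswith("yes"):
            supported += 1
    return supported / len(claims)

The score is the fraction of claims the judge accepts, so a fully grounded answer scores 1.0 and a fully hallucinated one scores 0.0.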
3. Answer Relevance
Is the generated answer actually addressing the user's question? High faithfulness with low relevance means your system is accurately summarizing the wrong information.
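One simple proxy for relevance is embedding similarity between the user's question and the generated answer; more elaborate schemes regenerate questions from the answer and compare those to the original. Here is a minimal sketch of the simple proxy, assuming a generic embed(text) function that returns a vector; it is not tied to any particular embedding provider.

import math

def answer_relevance(question: str, answer: str, embed) -> float:
    """Cosine similarity between question and answer embeddings.

    embed is assumed to be a callable returning a list of floats for a
    piece of text; plug in your embedding model of choice.
    """
    q_vec, a_vec = embed(question), embed(answer)
    dot = sum(q * a for q, a in zip(q_vec, a_vec))
    q_norm = math.sqrt(sum(q * q for q in q_vec))
    a_norm = math.sqrt(sum(a * a for a in a_vec))
    if q_norm == 0 or a_norm == 0:
        return 0.0
    return dot / (q_norm * a_norm)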
Building an Evaluation Pipeline
Automate evaluation as part of your CI/CD pipeline. Every change to chunking, retrieval, or prompting should trigger evaluation against a curated test set.
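One lightweight way to wire this in is a test that runs the metrics over the curated test set and fails the build when scores drop below agreed thresholds. The sketch below uses pytest; the module names, test set path, and run_rag helper are hypothetical placeholders for your own pipeline, and the thresholds are illustrative.

import json
import pytest

# Hypothetical imports: replace with your own pipeline entry points.
from my_rag_app import run_rag                     # returns (answer, retrieved_doc_ids)
from my_rag_app.eval import retrieval_metrics, faithfulness_score, judge_llm

# Curated test set built from real user queries (path is illustrative).
with open("eval/test_set.json") as f:
    TEST_CASES = json.load(f)

@pytest.mark.parametrize("case", TEST_CASES)
def test_rag_quality(case):
    answer, retrieved = run_rag(case["question"])

    retrieval = retrieval_metrics(retrieved, set(case["relevant_doc_ids"]))
    faithfulness = faithfulness_score(answer, "\n".join(case["contexts"]), judge_llm)

    # Thresholds are illustrative; calibrate them against your current baseline.
    assert retrieval["recall@5"] >= 0.8
    assert faithfulness >= 0.9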
Key Takeaways
Measure faithfulness and relevance alongside retrieval quality. Automate evaluation. Build test sets from real user queries, not synthetic data.