December 15, 2024

RAG Evaluation Metrics That Actually Matter

Moving beyond basic recall — measuring faithfulness, relevance, and answer quality in retrieval-augmented systems.

The Measurement Problem

Most RAG evaluations test retrieval recall and call it a day. But retrieval quality is necessary, not sufficient. A production RAG system can retrieve perfect context and still generate incorrect answers.

The Three Pillars

1. Retrieval Quality

Traditional IR metrics still apply, but with nuance:

python
def retrieval_metrics(retrieved, ground_truth):
    """Standard IR metrics over a ranked list of retrieved doc IDs."""
    return {
        # Fraction of all relevant docs that appear in the top 5 results
        "recall@5": len(set(retrieved[:5]) & ground_truth) / len(ground_truth),
        # Fraction of the top 5 results that are actually relevant
        "precision@5": len(set(retrieved[:5]) & ground_truth) / 5,
        # Reciprocal rank of the first relevant doc (0 if none was retrieved)
        "mrr": next(
            (1 / (i + 1) for i, doc in enumerate(retrieved) if doc in ground_truth),
            0,
        ),
    }

2. Faithfulness

Does the generated answer actually follow from the retrieved context? This catches hallucinations that sound plausible but aren't grounded in the source material.
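
The most common way to score this is an LLM-as-judge check. Here is a minimal sketch: the prompt wording, the model name, and the binary pass/fail scoring are illustrative assumptions, not the API of any particular evaluation framework.

python
# Minimal LLM-as-judge faithfulness check.
# Assumptions: an OpenAI-style client, a model named "gpt-4o-mini",
# and a binary yes/no verdict; all three are placeholders for your setup.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Context:
{context}

Answer:
{answer}

Does every claim in the answer follow from the context? Reply with only "yes" or "no"."""

def faithfulness_score(answer: str, context: str, model: str = "gpt-4o-mini") -> float:
    """Return 1.0 if the judge deems the answer grounded in the context, else 0.0."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(context=context, answer=answer)}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().lower()
    return 1.0 if verdict.startswith("yes") else 0.0

In practice you would score claims individually rather than the whole answer, but even this coarse check catches answers that drift entirely away from the retrieved context.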

3. Answer Relevance

Is the generated answer actually addressing the user's question? High faithfulness with low relevance means your system is accurately summarizing the wrong information.
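
A cheap proxy for relevance is the embedding similarity between the question and the generated answer. The sketch below assumes a sentence-transformers model ("all-MiniLM-L6-v2" is just a placeholder); dedicated frameworks use more elaborate approaches, such as regenerating questions from the answer, but cosine similarity is the simplest useful signal.

python
# Rough answer-relevance proxy: cosine similarity between question and answer
# embeddings. The model name is an assumption; any sentence-embedding model works.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def answer_relevance(question: str, answer: str) -> float:
    """Higher scores mean the answer stays on-topic for the question."""
    q_emb, a_emb = embedder.encode([question, answer], convert_to_tensor=True)
    return util.cos_sim(q_emb, a_emb).item()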

Building an Evaluation Pipeline

Automate evaluation as part of your CI/CD pipeline. Every change to chunking, retrieval, or prompting should trigger evaluation against a curated test set.
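
One way to wire this up is a script that runs the three metrics above over the test set and fails the build on regression. Everything stack-specific here is an assumption: the test-set path, the hypothetical rag_pipeline() entry point, the .id / .text attributes on retrieved documents, and the thresholds themselves.

python
# CI gate sketch: evaluate a curated test set and exit non-zero on regression.
# rag_pipeline(), the doc attributes, the file path, and the thresholds are
# placeholders for your own stack, not a real API.
import json
import statistics
import sys

THRESHOLDS = {"recall@5": 0.8, "faithfulness": 0.9, "relevance": 0.7}

def run_eval(test_set_path: str = "eval/test_set.jsonl") -> None:
    scores = {name: [] for name in THRESHOLDS}
    with open(test_set_path) as f:
        for line in f:
            case = json.loads(line)  # expects {"question", "relevant_doc_ids"}
            answer, retrieved = rag_pipeline(case["question"])  # hypothetical entry point
            ir = retrieval_metrics([d.id for d in retrieved],
                                   set(case["relevant_doc_ids"]))
            scores["recall@5"].append(ir["recall@5"])
            context = "\n".join(d.text for d in retrieved)
            scores["faithfulness"].append(faithfulness_score(answer, context))
            scores["relevance"].append(answer_relevance(case["question"], answer))

    failures = {name: statistics.mean(vals)
                for name, vals in scores.items()
                if statistics.mean(vals) < THRESHOLDS[name]}
    if failures:
        print(f"Evaluation regressions: {failures}")
        sys.exit(1)  # non-zero exit fails the CI job

if __name__ == "__main__":
    run_eval()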

Key Takeaways

Measure faithfulness and relevance alongside retrieval quality. Automate evaluation. Build test sets from real user queries, not synthetic data.