SePer: Measure Retrieval Utility Through The Lens Of Semantic Perplexity Reduction
Journal: arXiv
Published Date: Mar 3, 2025
Abstract
Large Language Models (LLMs) have demonstrated improved generation
performance by incorporating externally retrieved knowledge, a process known as
retrieval-augmented generation (RAG). Despite the potential of this approach,
existing studies evaluate RAG effectiveness by 1) assessing retrieval and
generation components jointly, which obscures retrieval's distinct
contribution, or 2) examining retrievers using traditional metrics such as
NDCG, which creates a gap in understanding retrieval's true utility in the
overall generation process. To address these limitations, we
introduce an automatic evaluation method that measures retrieval quality
through the lens of information gain within the RAG framework. Specifically, we
propose Semantic Perplexity (SePer), a metric that captures the LLM's internal
belief about the correctness of the retrieved information. We quantify the
utility of retrieval by the extent to which it reduces semantic perplexity
post-retrieval. Extensive experiments demonstrate that SePer not only aligns
closely with human preferences but also offers a more precise and efficient
evaluation of retrieval utility across diverse RAG scenarios.
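
The idea of scoring retrieval by how much it shifts the model's belief can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: belief is approximated as the probability mass that sampled answers place on the reference answer, semantic equivalence is replaced by a crude string normalization (`normalize`), and the names `belief_in_reference` and `retrieval_utility` are hypothetical, not the authors' API.

```python
from collections import defaultdict


def normalize(text):
    """Crude stand-in for a semantic-equivalence check (assumption).

    A real system would likely use an entailment or similarity model here.
    """
    return " ".join(text.lower().strip().rstrip(".").split())


def belief_in_reference(samples, reference):
    """Probability mass on sampled answers that match the reference.

    samples: list of (answer_text, probability) pairs from the LLM.
    """
    total = sum(p for _, p in samples)
    if total <= 0:
        raise ValueError("sample probabilities must sum to a positive value")
    # Group answers that normalize to the same string and pool their mass.
    clusters = defaultdict(float)
    for answer, p in samples:
        clusters[normalize(answer)] += p / total
    return clusters.get(normalize(reference), 0.0)


def retrieval_utility(samples_before, samples_after, reference):
    """Change in belief about the reference answer after adding retrieval."""
    return (belief_in_reference(samples_after, reference)
            - belief_in_reference(samples_before, reference))


# Toy, made-up samples for one question, without and with a retrieved passage.
before = [("Paris", 0.3), ("Lyon", 0.5), ("paris.", 0.2)]
after = [("Paris", 0.7), ("paris", 0.2), ("Lyon", 0.1)]
utility = retrieval_utility(before, after, "Paris")
print(f"utility of retrieval: {utility:+.2f}")
```

A positive value means the retrieved passage moved the model toward the reference answer; a negative value means it moved the model away, which a ranking metric such as NDCG would not reveal on its own.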