Correctness Coverage Evaluation for Medical Multiple-Choice Question Answering Based on the Enhanced Conformal Prediction Framework
Journal: arXiv
Published Date: Mar 7, 2025
Abstract
Large language models (LLMs) are increasingly adopted in medical
question-answering (QA) scenarios. However, LLMs can generate hallucinations
and nonfactual information, undermining their trustworthiness in high-stakes
medical tasks. Conformal Prediction (CP) provides a statistically rigorous
framework with marginal (average) coverage guarantees but remains largely
unexplored in medical QA. This paper proposes an enhanced CP framework for
medical multiple-choice question-answering (MCQA) tasks. By tying the
nonconformity score to the frequency score of the correct option and
leveraging self-consistency, the framework sidesteps the opacity of model
internals and incorporates a risk control strategy with a monotonic loss
function. Evaluated on the MedMCQA, MedQA, and MMLU datasets with four
off-the-shelf LLMs, the proposed method meets the specified error-rate
guarantees, and its average prediction set size decreases as the risk level
increases, offering a promising uncertainty evaluation metric for LLMs.
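
To make the frequency-based idea concrete, the following is a minimal sketch of standard split conformal prediction for MCQA, not the paper's exact method. It assumes that each question is answered several times by the LLM (self-consistency sampling), that the frequency score of an option is the fraction of samples choosing it, and that the nonconformity score is one minus the frequency of the correct option. The function names, the form of the score, and the quantile rule are illustrative assumptions; the risk control strategy with a monotonic loss function mentioned above is not reproduced here.

```python
import math
from collections import Counter


def frequency_scores(samples, options):
    """Fraction of sampled answers that chose each option (assumed frequency score)."""
    counts = Counter(samples)
    n = len(samples)
    return {opt: counts.get(opt, 0) / n for opt in options}


def calibrate_threshold(cal_samples, cal_labels, options, alpha):
    """Finite-sample conformal quantile of the calibration nonconformity scores.

    Assumed nonconformity score: 1 - frequency of the correct option.
    alpha is the target error rate (risk level).
    """
    scores = sorted(
        1.0 - frequency_scores(samples, options)[label]
        for samples, label in zip(cal_samples, cal_labels)
    )
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        # Too few calibration points for this alpha: keep every option.
        return 1.0
    return scores[k - 1]


def prediction_set(samples, options, threshold):
    """Options whose nonconformity score does not exceed the threshold."""
    freqs = frequency_scores(samples, options)
    return [opt for opt in options if 1.0 - freqs[opt] <= threshold]
```

Under the usual exchangeability assumption between calibration and test questions, sets built this way contain the correct option with probability at least 1 - alpha on average. Raising alpha lowers the threshold, so the sets shrink, which matches the trend in average prediction set size reported in the abstract.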