Trustworthy Medical Question Answering: An Evaluation-Centric Survey
Journal: arXiv
Published Date: Jun 4, 2025
Abstract
Trustworthiness in healthcare question-answering (QA) systems is essential
to ensuring patient safety, clinical effectiveness, and user confidence. As
large language models (LLMs) become increasingly integrated into medical
settings, the reliability of their responses directly influences clinical
decision-making and patient outcomes. However, achieving comprehensive
trustworthiness in medical QA poses significant challenges due to the inherent
complexity of healthcare data, the critical nature of clinical scenarios, and
the multifaceted dimensions of trustworthy AI. In this survey, we
systematically examine six key dimensions of trustworthiness in medical QA,
i.e., Factuality, Robustness, Fairness, Safety, Explainability, and
Calibration. We review how each dimension is evaluated in existing LLM-based
medical QA systems. We compile and compare major benchmarks designed to assess
these dimensions and analyze evaluation-guided techniques that drive model
improvements, such as retrieval-augmented grounding, adversarial fine-tuning,
and safety alignment. Finally, we identify open challenges (such as scalable
expert evaluation, integrated multi-dimensional metrics, and real-world
deployment studies) and propose future research directions to advance the safe,
reliable, and transparent deployment of LLM-powered medical QA.