Evaluation of ChatGPT-generated medical responses: A systematic review and meta-analysis.

Journal: Journal of Biomedical Informatics
Published Date:

Abstract

OBJECTIVE: Large language models (LLMs) such as ChatGPT are increasingly explored in medical domains. However, the absence of standard guidelines for performance evaluation has led to methodological inconsistencies. This study aims to summarize the available evidence on evaluating ChatGPT's performance in answering medical questions and provide direction for future research.

Authors

  • Qiuhong Wei
    Big Data Center for Children's Medical Care, Children's Hospital of Chongqing Medical University, Chongqing, China; Children Nutrition Research Center, Children's Hospital of Chongqing Medical University, Chongqing, China; National Clinical Research Center for Child Health and Disorders, Ministry of Education Key Laboratory of Child Development and Disorders, China International Science and Technology Cooperation Base of Child Development and Critical Disorders, Chongqing Key Laboratory of Child Neurodevelopment and Cognitive Disorders, Chongqing, China.
  • Zhengxiong Yao
    Department of Neurology, Children's Hospital of Chongqing Medical University, Chongqing, China.
  • Ying Cui
    Department of Medicine Chemistry, Logistics College of Chinese People's Armed Police Forces, Tianjin, 300309, China.
  • Bo Wei
    Department of General Surgery, Chinese PLA General Hospital, Beijing 100853, China.
  • Zhezhen Jin
    Mailman School of Public Health, Columbia University in the City of New York, New York, NY 10027, USA.
  • Ximing Xu
    Department of Pharmaceutics, School of Pharmacy, Jiangsu University, Zhenjiang, People's Republic of China.