Qualitative metrics from the biomedical literature for evaluating large language models in clinical decision-making: a narrative review.

Journal: BMC Medical Informatics and Decision Making
Published Date:

Abstract

BACKGROUND: Large language models (LLMs) released since November 30, 2022, most notably ChatGPT, have drawn growing attention to their use in medicine, particularly for supporting clinical decision-making. However, there is little consensus in the medical community on how LLM performance in clinical contexts should be evaluated.

Authors

  • Cindy N Ho
    Diabetes Technology Society, Burlingame, CA, USA.
  • Tiffany Tian
    Diabetes Technology Society, Burlingame, CA, USA.
  • Alessandra T Ayers
    Diabetes Technology Society, Burlingame, CA, USA.
  • Rachel E Aaron
    Diabetes Technology Society, Burlingame, CA, USA.
  • Vidith Phillips
    School of Medicine, Johns Hopkins University, Baltimore, MD, USA.
  • Risa M Wolf
    Department of Pediatric Endocrinology and Diabetes, Johns Hopkins University School of Medicine, Baltimore, MD, USA.
  • Nestoras Mathioudakis
    School of Medicine, Johns Hopkins University, Baltimore, MD, USA.
  • Tinglong Dai
    Hopkins Business of Health Initiative, Johns Hopkins University, Washington, DC, USA.
  • David C Klonoff
    Mills-Peninsula Medical Center, San Mateo, CA, USA.