Performance Evaluation of Lightweight Open-source Large Language Models in Pediatric Consultations: A Comparative Analysis
Journal: arXiv
Published Date: Jul 16, 2024
Abstract
Large language models (LLMs) have demonstrated potential applications in
medicine, yet data privacy and computational burden limit their deployment in
healthcare institutions. Open-source and lightweight versions of LLMs emerge as
potential solutions, but their performance, particularly in pediatric settings,
remains underexplored. In this cross-sectional study, 250 patient consultation
questions were randomly selected from a public online medical forum, with 10
questions from each of 25 pediatric departments, spanning from December 1,
2022, to October 30, 2023. Two lightweight open-source LLMs, ChatGLM3-6B and
Vicuna-7B, along with a larger-scale model, Vicuna-13B, and the widely used
proprietary ChatGPT-3.5, independently answered these questions in Chinese
between November 1, 2023, and November 7, 2023. To assess reproducibility, each
inquiry was replicated once. We found that ChatGLM3-6B demonstrated higher
accuracy and completeness than Vicuna-13B and Vicuna-7B (P < .001), but all
were outperformed by ChatGPT-3.5. The proportion of responses receiving the
highest accuracy rating was 65.2% for ChatGPT-3.5, compared with 41.2% for
ChatGLM3-6B, 11.2% for Vicuna-13B, and 4.4% for Vicuna-7B. Similarly, ChatGPT-3.5
led in the proportion of responses receiving the highest completeness rating
(78.4%), followed by ChatGLM3-6B (76.0%), Vicuna-13B (34.8%), and Vicuna-7B
(22.0%). ChatGLM3-6B matched ChatGPT-3.5 in readability, both outperforming
Vicuna models (P < .001). In terms of empathy, ChatGPT-3.5 outperformed the
lightweight LLMs (P < .001). In safety, all models performed comparably well
(P > .05), with over 98.4% of responses rated as safe. Repetition of
inquiries confirmed these findings. In conclusion, lightweight LLMs show
promise for application in pediatric healthcare. However, the observed gap
between lightweight and large-scale proprietary LLMs underscores the need for
continued development efforts.
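
The sampling design described above (10 questions drawn at random from each of 25 pediatric departments, 250 in total) can be illustrated with a minimal sketch. This is not the authors' code: the function name `stratified_sample`, the record fields `department` and `text`, the fixed seed, and the synthetic question pool are all hypothetical placeholders introduced here for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(questions, per_department=10, seed=0):
    """Draw `per_department` questions at random from every department.

    `questions` is a list of dicts with a "department" key (hypothetical schema).
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable

    # Group the candidate questions by department.
    by_department = defaultdict(list)
    for question in questions:
        by_department[question["department"]].append(question)

    selected = []
    for department in sorted(by_department):
        pool = by_department[department]
        if len(pool) < per_department:
            raise ValueError(f"{department} has only {len(pool)} questions")
        # Sample without replacement within the department.
        selected.extend(rng.sample(pool, per_department))
    return selected


# Synthetic placeholder pool: 25 departments with 40 candidate questions each.
pool = [
    {"department": f"dept_{d:02d}", "text": f"question {i}"}
    for d in range(25)
    for i in range(40)
]

sample = stratified_sample(pool)
print(len(sample))  # 25 departments x 10 questions = 250
```

Sampling within each department, rather than from the pooled forum posts, guarantees equal representation of all 25 departments regardless of how many questions each one received.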