Benchmarking large language models for biomedical natural language processing applications and recommendations.

Journal: Nature Communications
PMID:

Abstract

The rapid growth of biomedical literature poses challenges for manual knowledge curation and synthesis. Biomedical Natural Language Processing (BioNLP) automates this process. While Large Language Models (LLMs) have shown promise in general domains, their effectiveness in BioNLP tasks remains unclear due to limited benchmarks and practical guidelines. We perform a systematic evaluation of four LLMs (GPT and LLaMA representatives) on 12 BioNLP benchmarks across six applications. We compare their zero-shot, few-shot, and fine-tuning performance with the traditional fine-tuning of BERT or BART models. We examine inconsistencies, missing information, and hallucinations in the outputs, and perform a cost analysis. Here, we show that traditional fine-tuning outperforms zero- or few-shot LLMs in most tasks. However, closed-source LLMs like GPT-4 excel in reasoning-related tasks such as medical question answering. Open-source LLMs still require fine-tuning to close performance gaps. We find issues such as missing information and hallucinations in LLM outputs. These results offer practical insights for applying LLMs in BioNLP.

Authors

  • Qingyu Chen
    Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA.
  • Yan Hu
    Department of Thoracic Surgery, The Second Xiangya Hospital of Central South University, Changsha, Hunan, China.
  • Xueqing Peng
    Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA.
  • Qianqian Xie
    Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA.
  • Qiao Jin
    National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
  • Aidan Gilson
    Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University, New Haven, CT, USA.
  • Maxwell B Singer
    Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University, New Haven, CT, USA.
  • Xuguang Ai
    Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA.
  • Po-Ting Lai
    National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
  • Zhizheng Wang
    National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
  • Vipina K Keloth
    Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA.
  • Kalpana Raja
    Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA.
  • Jimin Huang
    Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA.
  • Huan He
    Department of Artificial Intelligence & Informatics, Mayo Clinic, Rochester, MN, USA.
  • Fongci Lin
    Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA.
  • Jingcheng Du
    University of Texas Health Science Center at Houston, Houston, TX, USA.
  • Rui Zhang
    Department of Cardiology, Zhongda Hospital, Medical School of Southeast University, Nanjing, China.
  • W Jim Zheng
    McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA.
  • Ron A Adelman
    Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University, New Haven, CT, USA.
  • Zhiyong Lu
    National Center for Biotechnology Information, Bethesda, MD, USA.
  • Hua Xu
    Department of Urology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China.