Benchmarking large language models for biomedical natural language processing applications and recommendations.

Journal: Nature communications

PMID: 40188094

Abstract

The rapid growth of biomedical literature poses challenges for manual knowledge curation and synthesis. Biomedical Natural Language Processing (BioNLP) automates the process. While Large Language Models (LLMs) have shown promise in general domains, their effectiveness in BioNLP tasks remains unclear due to limited benchmarks and practical guidelines. We perform a systematic evaluation of four LLMs (GPT and LLaMA representatives) on 12 BioNLP benchmarks across six applications. We compare their zero-shot, few-shot, and fine-tuning performance with the traditional fine-tuning of BERT or BART models. We also examine inconsistencies, missing information, and hallucinations, and perform a cost analysis. Here, we show that traditional fine-tuning outperforms zero- or few-shot LLMs in most tasks. However, closed-source LLMs like GPT-4 excel in reasoning-related tasks such as medical question answering. Open-source LLMs still require fine-tuning to close performance gaps. We find issues like missing information and hallucinations in LLM outputs. These results offer practical insights for applying LLMs in BioNLP.

Authors

Qingyu Chen
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA.

Yan Hu
Department of Thoracic Surgery, The Second Xiangya Hospital of Central South University, Changsha, Hunan, China.

Xueqing Peng
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA.

Qianqian Xie
Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT 06510, United States.

Qiao Jin
National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Aidan Gilson
Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University, New Haven, CT, USA.

Maxwell B Singer
Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University, New Haven, CT, USA.

Xuguang Ai
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA.

Po-Ting Lai
National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Zhizheng Wang
National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Vipina K Keloth
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA.

Kalpana Raja
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA.

Jimin Huang
Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT 06510, United States.

Huan He
Department of Artificial Intelligence & Informatics, Mayo Clinic, Rochester, MN, United States.

Fongci Lin
Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT 06510, United States.

Jingcheng Du
University of Texas Health Science Center at Houston, Houston, Texas, USA.

Rui Zhang
Department of Cardiology, Zhongda Hospital, Medical School of Southeast University, Nanjing, China.

W Jim Zheng
McWilliams School of Biomedical Informatics, University of Texas Health Science at Houston, Houston, TX, USA.

Ron A Adelman
Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University, New Haven, CT, USA.

Zhiyong Lu
National Center for Biotechnology Information, Bethesda, MD 20894 USA.

Hua Xu
Department of Urology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China.

Keywords

Benchmarking; Humans; Large Language Models; Natural Language Processing

