A study of large language models for patient information extraction: Model architecture, fine-tuning strategy, and multi-task instruction tuning.

Journal: Journal of biomedical informatics
Published Date:

Abstract

BACKGROUND: Natural language processing (NLP) is a key technology for extracting patient information from clinical narratives to support healthcare applications. The rapid development of large language models (LLMs) has revolutionized patient information extraction in the clinical domain, yet the strategies needed to adopt LLMs effectively for optimal performance need further exploration. This study examines the effectiveness of LLMs in patient information extraction, focusing on LLM architectures, fine-tuning strategies, and multi-task instruction tuning techniques for developing robust and generalizable patient information extraction systems.

METHODS: We explored key strategies for adopting LLMs for clinical concept extraction (CE) and relation extraction (RE), including: (1) encoder-only versus decoder-only LLMs, (2) prompt-based parameter-efficient fine-tuning (PEFT) algorithms, and (3) the effect of multi-task instruction tuning on few-shot learning performance. We benchmarked a suite of LLMs, including encoder-only LLMs (e.g., BERT, GatorTron) and decoder-only LLMs (e.g., GatorTronGPT, Llama 3.1, GatorTronLlama), across five widely used benchmarking datasets. We compared traditional full-size fine-tuning with prompt-based PEFT. We also explored a multi-task instruction tuning framework that combines both tasks across four datasets, and evaluated zero-shot and few-shot learning performance using a leave-one-dataset-out strategy.

RESULTS: For single-task clinical CE, two decoder-only LLMs (Llama 3.1 and GatorTronLlama) achieved the best performance, with average F1 scores of 0.8964 and 0.8981, respectively, across the five datasets, outperforming the other LLMs by 0.7 to 3.3% in average F1. Encoder-only LLMs with prompt-based learning outperformed those implemented using classification. For RE, the prompt-based PEFT strategy demonstrated remarkable performance, with an F1 improvement of up to 15.9% over traditional fine-tuning on all datasets. All three decoder-only LLMs outperformed encoder-only LLMs, increasing the average F1 score by 1.8 to 6.6%, with GatorTronLlama achieving the best average F1 score of 0.8978. Multi-task instruction tuning boosted zero-shot and few-shot F1 scores by 1.1 to 37.8% compared with models without multi-task fine-tuning. Notably, generative LLMs with multi-task instruction tuning using only 20% of the full dataset achieved performance comparable to full-size fine-tuning (F1 difference < 0.005).

CONCLUSIONS: Our findings support generative LLMs with PEFT as a cost-effective solution for patient information extraction. In addition, multi-task instruction tuning significantly improves zero-shot and few-shot performance, contributing to better generalizability. This study provides practical guidelines for developing scalable, adaptable, and high-performing LLM-based patient information extraction systems.
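
The leave-one-dataset-out evaluation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the dataset names (`dataset_A` to `dataset_D`), the string records, and the helper names `leave_one_dataset_out` and `few_shot_subset` are hypothetical placeholders, and the 20% sampling fraction simply mirrors the figure quoted in the results. The assumed setup is that the model is instruction-tuned on the pooled training datasets and then evaluated on the held-out dataset, either directly (zero-shot) or after adaptation on a small sample of it (few-shot).

```python
import random
from typing import Dict, Iterator, List, Tuple


def leave_one_dataset_out(
    datasets: Dict[str, List[str]],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_name, pooled_training_examples, held_out_examples)."""
    for held_out in datasets:
        # Pool every dataset except the held-out one for multi-task tuning.
        train_pool = [
            example
            for name, examples in datasets.items()
            if name != held_out
            for example in examples
        ]
        yield held_out, train_pool, datasets[held_out]


def few_shot_subset(examples: List[str], fraction: float, seed: int = 0) -> List[str]:
    """Sample a fixed fraction of the held-out dataset for few-shot adaptation."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    k = max(1, round(len(examples) * fraction))
    return rng.sample(examples, k)


# Hypothetical placeholder data: four datasets with ten records each.
datasets = {
    f"dataset_{c}": [f"{c}-note-{i}" for i in range(10)] for c in "ABCD"
}

for held_out, train_pool, target in leave_one_dataset_out(datasets):
    shots = few_shot_subset(target, fraction=0.2)
    print(held_out, len(train_pool), len(shots))
```

In the zero-shot setting `shots` would be left unused and the tuned model applied to the held-out dataset directly; the few-shot setting would continue fine-tuning on `shots` before evaluation.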

Authors
