Development and evaluation of large language models (LLMs) for oncology: A scoping review.

Journal: PLOS Digital Health
Published Date:

Abstract

Large language models (LLMs), a significant development in artificial intelligence (AI), continue to demonstrate substantial improvements in performance across various text analysis and generation tasks. However, there are few systematic studies of LLM applications developed or evaluated for oncology. Our scoping review explores applications of LLMs in oncology to determine (1) the nature of LLM applications relevant to a cancer/tumor type, (2) the phases of cancer care addressed by the LLMs, (3) which LLMs were used in these applications, (4) the sources and pre-processing of the datasets used, (5) the techniques used to optimize LLM performance, (6) the methods of evaluation, and (7) the common limitations noted by the authors of these LLM applications, and to study their implications for research and practice. A librarian-assisted search was performed across the following databases: Association for Computing Machinery (ACM), Embase, Engineering Village, IEEE Xplore, Medline, Scopus, SPIE, and Web of Science, up to January 12, 2024. Pre-prints from this search were considered if they were published/accepted by February 29, 2024. From the initial search of 14,863 articles, 60 were finally included. Our results demonstrated that LLMs were mostly evaluated, rather than developed, across a diverse set of oncology-related applications. Generative pre-trained transformer (GPT)-based LLMs were the most commonly used. In the subset of studies where the phase(s) of cancer care was/were provided or implied, treatment and diagnosis were the most frequently addressed phases. Data for development and evaluation ranged from patient health records, synthetic patient records, and research and professional society publications to social media. Prompt design and engineering were performed as data pre-processing steps in several studies. Clinicians, trainees, researchers, and patients were among the variety of users targeted by the applications. In the 17% of studies that developed LLMs for oncological aspects, domain adaptation through pre-training and fine-tuning was often performed and resulted in performance improvements. Evaluation of LLM performance involved standard, validated, non-standardized, and/or customized performance measures covering a variety of constructs beyond accuracy. Six primary themes emerged as limitations, including limited generalizability/applicability, small sample sizes, bias and subjectivity, and evaluation metrics. This review highlights that LLMs specific to oncological aspects are less common than general-purpose LLMs. The application areas were heterogeneous, used diverse data sources, were directed towards a variety of users, and resulted in a variety of evaluation methods. Despite the diversity of LLM applications in oncology, future research needs to address the limited generalizability of these applications, the mitigation of bias and subjectivity, and the standardization of evaluation methodologies. Future applications of LLMs in oncology should include developing oncology-specific LLMs that can mitigate knowledge gaps and extend to diverse areas of oncology training and practice not considered so far.

Authors

  • Namya Mehan
    Integrated Biomedical Engineering and Health Sciences, McMaster University, Hamilton, Ontario, Canada.
  • Teshan Dias Desinghe
    Global Health Program, Faculty of Health Sciences, McMaster University, Hamilton, Ontario, Canada.
  • Ashirbani Saha
    Department of Radiology, Duke University School of Medicine, 2424 Erwin Road, Suite 302, Durham, NC, 27705, USA. ashirbani.saha@duke.edu.
