Efficient Training Corpus Retrieval for Large Language Model Fine Tuning: A Case Study in Cancer.

Journal: Studies in health technology and informatics
Published Date:

Abstract

The objective is to create an automated knowledge extraction tool for cancer research that builds high-quality academic corpora for LLM fine-tuning, and to investigate its effectiveness in two domains: interleukin-6 (IL-6) and bladder cancer. To address the current gap in knowledge retrieval techniques for cancer research data collection, we propose KnowledgePipeline, a novel automated tool that incorporates diverse aspects of academic papers and their metadata. The tool integrates content, co-citation, and co-authorship networks to construct domain-specific academic corpora suitable for fine-tuning LLMs. We fine-tune two LLMs (GPTJ-6.7B and Galactica-30B) on domain-specific question-answer pairs from the refined data. The evaluation covers both the quality of the extracted knowledge and the performance of the fine-tuned models on open-ended question-answering tasks. KnowledgePipeline offers a scalable, automated framework for domain-specific knowledge retrieval and fine-tuned applications in cancer research, advancing literature discovery and addressing critical biomedical challenges. It achieved relevance scores of 68% for IL-6 and 74.5% for bladder cancer, and the fine-tuned Galactica-30B model demonstrated promising capabilities.
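The abstract does not say how the content, co-citation, and co-authorship signals are combined. As an illustration only, the sketch below shows one simple way such signals could be merged into a single relevance score against a set of seed papers. The `Paper` fields, the Jaccard overlap measure, the weights, and the threshold are all assumptions made for this example and are not taken from KnowledgePipeline.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Paper:
    """Hypothetical record; field names are assumptions for this sketch."""
    pid: str
    terms: set[str]                                   # content keywords
    authors: set[str]                                 # author names
    cited_by: set[str] = field(default_factory=set)   # ids of citing papers


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relevance(candidate: Paper, seeds: list[Paper],
              weights: tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Weighted mix of the three signals, averaged over the seed papers.

    The weights are illustrative, not values reported by the authors.
    """
    if not seeds:
        return 0.0
    w_content, w_cocite, w_coauthor = weights
    total = 0.0
    for seed in seeds:
        content = jaccard(candidate.terms, seed.terms)
        # Two papers are co-cited when the same later paper cites both.
        cocite = jaccard(candidate.cited_by, seed.cited_by)
        coauthor = jaccard(candidate.authors, seed.authors)
        total += w_content * content + w_cocite * cocite + w_coauthor * coauthor
    return total / len(seeds)


def build_corpus(candidates: list[Paper], seeds: list[Paper],
                 threshold: float = 0.2) -> list[str]:
    """Return ids of candidates scoring at or above the threshold, best first."""
    scored = [(relevance(p, seeds), p.pid) for p in candidates]
    return [pid for score, pid in sorted(scored, reverse=True) if score >= threshold]
```

Under these assumptions, a candidate that shares keywords, citing papers, and an author with a seed paper scores well above an unrelated one, and only the former is kept in the corpus. A real system would likely replace the keyword overlap with a learned text similarity and tune the weights on labeled relevance judgments.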

Authors

  • Avisha Das
    Arizona Advanced AI & Innovation (A3I) Hub, Mayo Clinic Arizona, Phoenix, AZ, USA.
  • Chiamaka Diala
    University of Texas Health Science Center, Houston, TX, USA.
  • Guocai Chen
    Center for Computational Biomedicine, School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA, Department of Public Health Science, Medical University of South Carolina, 135 Cannon Street, Suite 303, Charleston, SC 29425, USA and Department of Investigational Cancer Therapeutics, Institute for Personalized Cancer Therapy, UT-MD Anderson Cancer Center, 1400 Holcombe Blvd., FC8.3044, Houston, TX 77030, USA.
  • Zhao Li
    Research Center for Data Hub and Security, Zhejiang Lab, Hangzhou, China. lzjoey@gmail.com.
  • Rongbin Li
    School of Biomedical Informatics, University of Texas Health Science Center at Houston; Yale University; Melax Technologies, Houston.
  • Omer Anjum
    University of Texas Health Science Center, Houston, TX, USA.
  • W Jim Zheng
    McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA.