Efficient Training Corpus Retrieval for Large Language Model Fine-Tuning: A Case Study in Cancer.
Journal:
Studies in Health Technology and Informatics
Published Date:
Aug 7, 2025
Abstract
Our objective is to create an automated knowledge-extraction tool for cancer research that builds high-quality academic corpora for LLM fine-tuning, and to investigate its effectiveness in the interleukin-6 (IL-6) and bladder cancer domains. To address the current gap in knowledge-retrieval techniques for cancer research data collection, we propose KnowledgePipeline, a novel automated tool that incorporates diverse aspects of academic papers and their metadata. The tool integrates content, co-citation, and co-authorship networks to construct domain-specific academic corpora suitable for fine-tuning LLMs. We fine-tune two LLMs (GPT-J-6.7B and Galactica-30B) on domain-specific question-answer pairs derived from the refined data. Evaluation covers both the quality of the extracted knowledge and the performance of the fine-tuned models on open-ended question-answering tasks. KnowledgePipeline achieved relevance scores of 68% for IL-6 and 74.5% for bladder cancer, and the fine-tuned Galactica-30B model demonstrated promising capabilities. The pipeline thus offers a scalable, automated framework for domain-specific knowledge retrieval and fine-tuning applications in cancer research, advancing literature discovery and addressing critical biomedical challenges.
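The abstract describes combining three signals (content, co-citation, and co-authorship networks) to select papers for a domain-specific corpus. The paper does not specify its scoring function, so the following is only a minimal sketch under assumptions: the `Paper` structure, the Jaccard-style signal definitions, the weights, and the threshold are all illustrative, not the authors' method.

```python
# Illustrative sketch of a KnowledgePipeline-style relevance scorer.
# ASSUMPTIONS: signal definitions, weights, threshold, and toy data are
# hypothetical; the paper does not publish its scoring formula.
from dataclasses import dataclass, field


@dataclass
class Paper:
    pid: str
    keywords: set                                  # content signal
    cited_by: set = field(default_factory=set)     # IDs of citing papers
    authors: set = field(default_factory=set)      # author IDs


def content_score(paper, query_keywords):
    """Jaccard overlap between the paper's keywords and the query."""
    union = paper.keywords | query_keywords
    return len(paper.keywords & query_keywords) / len(union) if union else 0.0


def cocitation_score(paper, seed):
    """Fraction of the seed paper's citers that also cite this paper."""
    if not seed.cited_by:
        return 0.0
    return len(paper.cited_by & seed.cited_by) / len(seed.cited_by)


def coauthor_score(paper, seed):
    """Jaccard overlap of author sets with a seed paper."""
    union = paper.authors | seed.authors
    return len(paper.authors & seed.authors) / len(union) if union else 0.0


def relevance(paper, query_keywords, seed, w=(0.6, 0.25, 0.15)):
    """Weighted combination of the three signals (weights are illustrative)."""
    return (w[0] * content_score(paper, query_keywords)
            + w[1] * cocitation_score(paper, seed)
            + w[2] * coauthor_score(paper, seed))


# Toy corpus construction: keep papers above a relevance threshold.
seed = Paper("seed", {"il-6", "inflammation"}, cited_by={"c1", "c2"},
             authors={"a1"})
papers = [
    Paper("p1", {"il-6", "bladder", "cancer"}, cited_by={"c1"},
          authors={"a1"}),
    Paper("p2", {"weather"}),
]
query = {"il-6", "cancer"}
corpus = [p.pid for p in papers if relevance(p, query, seed) >= 0.2]
# corpus → ["p1"]: the off-topic paper p2 scores 0 on every signal.
```

In a real pipeline the content signal would more likely come from dense embeddings and the citation graph from a bibliographic API, but the weighted-combination structure is the part this sketch is meant to convey.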