PathGene: Benchmarking Driver Gene Mutations and Exon Prediction Using Multicenter Lung Cancer Histopathology Image Dataset
Journal: arXiv
Published Date: May 30, 2025
Abstract
Accurately predicting gene mutations, mutation subtypes and their exons in
lung cancer is critical for personalized treatment planning and prognostic
assessment. Given regional disparities in medical resources and the high
cost of genomic assays, using artificial intelligence to infer these mutations
and exon variants from routine histopathology images could greatly facilitate
precision therapy. Although some prior studies have shown that deep learning
can accelerate the prediction of key gene mutations from lung cancer pathology
slides, their performance remains suboptimal and has so far been limited mainly
to early screening tasks. To address these limitations, we have assembled
PathGene, which comprises histopathology images paired with next-generation
sequencing reports from 1,576 patients at the Second Xiangya Hospital, Central
South University, and 448 TCGA-LUAD patients. This multi-center dataset links
whole-slide images to driver gene mutation status, mutation subtypes, exons, and
tumor mutational burden (TMB) status, with the goal of leveraging pathology
images to predict mutations, subtypes, exon locations, and TMB for early
genetic screening and to advance precision oncology. Unlike existing datasets,
PathGene pairs histopathology images with molecular-level information to
facilitate the development of biomarker prediction models. We
benchmarked 11 multiple-instance learning methods on PathGene for mutation,
subtype, exon, and TMB prediction tasks. These benchmarked methods offer
valuable alternatives for early genetic screening of lung cancer patients and
can help clinicians quickly develop personalized, targeted treatment plans.
Code and data are available at
https://github.com/panliangrui/NIPS2025/.
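
The abstract does not name the 11 multiple-instance learning methods it benchmarks, so the following is only a minimal sketch of one common formulation (attention-based pooling over patch embeddings), not the authors' implementation. It shows how per-patch features from a whole-slide image might be aggregated into one slide-level vector and mapped to a mutation probability. The function names, the weights `v`, `w`, `clf_weights`, and `clf_bias`, and the toy feature values are all hypothetical.

```python
import math


def attention_mil_pool(patch_features, v, w):
    """Aggregate patch embeddings into one slide embedding via attention.

    patch_features: list of N feature vectors (length D), one per patch.
    v: hypothetical hidden projection, H rows of length D.
    w: hypothetical scoring weights, length H.
    Returns (slide_embedding, attention_weights).
    """
    # Score each patch as w . tanh(V h).
    scores = []
    for h in patch_features:
        hidden = [math.tanh(sum(vi * hi for vi, hi in zip(row, h))) for row in v]
        scores.append(sum(wj * hj for wj, hj in zip(w, hidden)))

    # Softmax over patches (subtract the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]

    # Slide embedding is the attention-weighted sum of patch features.
    dim = len(patch_features[0])
    slide = [sum(a * h[d] for a, h in zip(attn, patch_features)) for d in range(dim)]
    return slide, attn


def predict_mutation_probability(slide_embedding, clf_weights, clf_bias):
    """Hypothetical linear head with a sigmoid for a binary mutation label."""
    logit = sum(c * s for c, s in zip(clf_weights, slide_embedding)) + clf_bias
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    # Toy bag of three 2-dimensional patch embeddings (made-up values).
    bag = [[0.2, 1.0], [0.9, -0.3], [0.1, 0.4]]
    slide, attn = attention_mil_pool(bag, v=[[0.5, -0.5], [0.3, 0.8]], w=[1.0, -1.0])
    prob = predict_mutation_probability(slide, clf_weights=[0.7, -0.2], clf_bias=0.1)
    print("attention weights:", attn)
    print("mutation probability:", prob)
```

In this formulation only the slide-level label (for example, a driver gene's mutation status from the sequencing report) is needed for training; the attention weights indicate which patches contributed most to the prediction.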