A novel open access multimodal dataset of nodule imaging and circulating proteome from a lung cancer screening cohort

Journal: medRxiv
Published Date:

Abstract

Low-dose computed tomography (LDCT) lung cancer screening has significantly enhanced early detection and patient survival rates in the population at risk. Current screening methods, which rely primarily on LDCT imaging, will very likely benefit from molecular biomarkers to achieve a more comprehensive, accurate, personalized and non-invasive risk assessment leveraging multimodal tools. We present a novel open access multimodal (imaging, proteomics and demographic) dataset designed as a research resource for LDCT-based early lung cancer detection. The dataset includes annotated screening LDCT scans and plasma proteomics generated with the proximity extension assay (Olink) platform. The dataset integrates data from screened control individuals without nodules or with benign nodules, and individuals with LDCT-diagnosed lung cancer, matched by sex, age and time between image and sample collection. Both radiological and molecular signatures were collected within a six-month window, providing detailed insights into disease progression. Nodules were considered lung cancer cases if biopsy-confirmed lung cancer was diagnosed within 5 years after imaging, enabling the study of longitudinal biomarker evolution and its correlation with imaging findings. To complement the dataset, clinical and demographic data are also available in open access, providing a detailed overview of patient characteristics. The informed consent signed by the participants allows unrestricted open access for requests directly or indirectly related to lung cancer research. The dataset consists of annotated screening LDCT scans and plasma proteomics data measured with most of the Olink Target 96 panels (1078 individual proteins across 12 panels, each focused on a specific area of disease or biology) for a total of 211 screening participants.
The cohort comprises 67 lung cancer patients, 68 matched controls with benign pulmonary nodules, 71 matched controls without nodules and 5 participants with surgically excised false-positive lesions. Experiments were performed to assess the technical quality of the dataset and to provide a proof of concept of its usability, showing alignment with findings from previously published studies. This comprehensive dataset aims to facilitate research towards the development of personalized multimodal artificial intelligence models. We also aim to support the investigation of the relationship between imaging and molecular data, paving the way for a more accurate understanding of early lung cancer biology. Finally, our open access dataset may help to develop or validate individualized risk prediction models that could significantly advance early lung cancer detection and intervention strategies.

Authors

  • Miriam Cobo; Diego Serrano; Jennifer Barranco; Andrea Pasquier; Juan Pablo de-Torres; Javier J Zulueta; José Ignacio Echeveste; Ana Ezponda; Jesús Pueyo; Allan Argueta; Julián Sanz-Ortega; Juan Bertó; Ana Belén Alcaide; Madeleine Di Frisco; Carmen Felgueroso; Arantza Campo; Alejandra de la Fuente; Ana Escobar; Karmele Valencia; Daniel Orive; Maria del Mar Ocón; Hanna Beata Globacka; Maria Antonia Fortuño; Valerio Perna; María Rodríguez; María Dolores Lozano; Alfonso Calvo; Ruben Pio; Rayjean J. Hung; Luis M Seijo; Wilson Silva; Gorka Bastarrika; Lara Lloret Iglesias; Luis M. Montuenga

Categories