GLIO-Select: Machine Learning-Based Feature Selection and Weighting of Tissue and Serum Proteomic and Metabolomic Data Uncovers Sex Differences in Glioblastoma.
Journal:
International Journal of Molecular Sciences
Published Date:
May 2, 2025
Abstract
Glioblastoma (GBM) is a fatal brain cancer known for its rapid and aggressive growth, and some studies indicate that females may have better survival outcomes than males. Although sex differences in GBM have been observed, the underlying biological mechanisms remain poorly understood. Feature selection reduces the dimensionality of high-dimensional medical datasets, which can identify key discriminative biomarkers and improve machine learning model performance, explainability, and interpretability. It can also uncover unique sex-specific biomarkers, determinants, and molecular profiles in patients with GBM. We analyzed high-dimensional proteomic and metabolomic profiles from serum biospecimens obtained from 109 patients with pathology-proven GBM on NIH IRB-approved protocols with full clinical annotation (local dataset). Serum proteomic analysis was performed using SomaLogic aptamer-based technology (measuring 7289 proteins), and serum metabolomic analysis was performed using the University of Florida's SECIM (Southeast Center for Integrated Metabolomics) platform (measuring 6015 metabolites). Machine learning-based feature selection was employed to identify proteins and metabolites associated with male and female labels in these high-dimensional datasets. Results were compared with publicly available proteomic and metabolomic datasets (CPTAC and TCGA) analyzed using the same methodology, including TCGA data previously structured for glioma grading. Employing a hybrid machine learning-based feature selection approach that combines LASSO and mRMR with a rank-based weighting method (i.e., GLIO-Select), we linked proteomic and metabolomic data to clinical data to reduce the feature set and identify molecular biomarkers associated with biological sex in patients with GBM, and we used a separate TCGA set to explore possible linkages between biological sex and mutations associated with tumor grading.
Serum proteomic and metabolomic analyses identified several hundred features associated with the male/female class label in the GBM datasets. In the local serum-based dataset of 109 patients, 17 features (100% accuracy, ACC) and 16 features (92% ACC) were identified for the proteomic and metabolomic datasets, respectively. In the CPTAC tissue-based dataset (8828 proteomic and 59 metabolomic features), 5 features (99% ACC) and 13 features (80% ACC) were identified for the proteomic and metabolomic datasets, respectively. Proteomic data, whether from serum or tissue (CPTAC), achieved the highest accuracy (100% and 99%, respectively), followed by the serum metabolome and the tissue metabolome. The local serum data yielded several clinically known features (PSA, PZP, HCG, and FSH) that were distinct from those in the CPTAC tissue data (RPS4Y1 and DDX3Y); both sets provide methodological validation, and PZP and defensins (DEFA3 and DEFB4A) were proteomic features shared between serum and tissue. The metabolomic features shared between serum and tissue were homocysteine and pantothenic acid. Several signals emerged that are known to be associated with glioma or GBM but were not previously known to be associated with biological sex and require further research, as did several novel signals not previously linked to either biological sex or glioma. In the TCGA glioma grading set, three features (EGFR, FAT4, and BCOR) were identified, with 64% ACC. GLIO-Select substantially reduced feature dimensionality across different dataset types (e.g., serum- and tissue-based), bringing the relevant features down to fewer than twenty biomarkers for each GBM dataset. Serum biospecimens appear to be highly effective for identifying biologically relevant sex differences in GBM.
These findings suggest that noninvasive serum-based analyses may provide more accurate and clinically detailed insights into sex as a biological variable (SABV) than analyses of other biospecimens. Several signals link sex differences and glioma pathology via immune response, amino acid metabolism, and cancer hallmark pathways, and these require further research. Our results underscore the importance of biospecimen choice and feature selection in enhancing the interpretation of omics data for understanding sex-based differences in GBM. These findings hold significant potential for enhancing personalized treatment plans and patient outcomes.
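
The hybrid selection idea described in the abstract (LASSO and mRMR rankings merged by a rank-based weighting) can be illustrated with a minimal sketch. This is not the authors' GLIO-Select implementation: the abstract does not specify the weighting scheme, so the linear weight `k - position` is an assumption, the mRMR step uses absolute Pearson correlation in place of mutual information, the data are synthetic, and all function names (`lasso_coefs`, `mrmr_order`, `rank_weighted_union`) are hypothetical.

```python
import random

def standardize(col):
    """Center a column and scale it to unit (population) variance."""
    n = len(col)
    mean = sum(col) / n
    sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
    return [(v - mean) / sd if sd > 0 else 0.0 for v in col]

def abs_corr(a, b):
    """Absolute Pearson correlation of two already-standardized columns."""
    return abs(sum(x * y for x, y in zip(a, b)) / len(a))

def soft_threshold(value, alpha):
    """Shrink value toward zero by alpha (the LASSO proximal step)."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0

def lasso_coefs(cols, y, alpha=0.1, n_sweeps=50):
    """Coordinate-descent LASSO; assumes standardized columns and target."""
    n = len(y)
    coefs = [0.0] * len(cols)
    resid = list(y)  # residual y - X @ coefs, kept up to date
    for _ in range(n_sweeps):
        for j, col in enumerate(cols):
            rho = sum(x * (r + coefs[j] * x) for x, r in zip(col, resid)) / n
            new = soft_threshold(rho, alpha)
            delta = new - coefs[j]
            if delta != 0.0:
                resid = [r - delta * x for r, x in zip(resid, col)]
                coefs[j] = new
    return coefs

def mrmr_order(cols, y, k):
    """Greedy mRMR variant: relevance minus mean redundancy (correlation-based)."""
    relevance = [abs_corr(c, y) for c in cols]
    chosen, remaining = [], set(range(len(cols)))
    while remaining and len(chosen) < k:
        def score(j):
            if not chosen:
                return relevance[j]
            redundancy = sum(abs_corr(cols[j], cols[s]) for s in chosen) / len(chosen)
            return relevance[j] - redundancy
        best = max(sorted(remaining), key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen

def rank_weighted_union(rankings, k):
    """Assumed weighting: a feature at position pos in a top-k list earns k - pos."""
    scores = {}
    for ranking in rankings:
        for pos, feat in enumerate(ranking[:k]):
            scores[feat] = scores.get(feat, 0) + (k - pos)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Synthetic stand-in for an omics matrix: 80 samples, 12 features, binary label.
random.seed(0)
n, p, k = 80, 12, 5
y_raw = [i % 2 for i in range(n)]
raw = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
raw[0] = [3 * lab + random.gauss(0, 0.5) for lab in y_raw]  # strongly label-linked
raw[1] = [2 * lab + random.gauss(0, 1.0) for lab in y_raw]  # weaker, partly redundant
cols = [standardize(c) for c in raw]
y = standardize(y_raw)

coefs = lasso_coefs(cols, y)
lasso_rank = [j for j in sorted(range(p), key=lambda j: -abs(coefs[j]))
              if coefs[j] != 0.0]  # keep only features LASSO retained
mrmr_rank = mrmr_order(cols, y, k)
weighted = rank_weighted_union([lasso_rank, mrmr_rank], k)
selected = [feat for feat, _ in weighted]
print(weighted)
```

Features ranked highly by both methods accumulate the largest weights, so the merged list favors signals on which the two selectors agree. In a real analysis, the reduced feature set would then be evaluated with a classifier on held-out data to obtain accuracy figures like those reported above.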