Construction of a 26‑feature gene support vector machine classifier for smoking and non‑smoking lung adenocarcinoma sample classification.

Journal: Molecular medicine reports
Published Date:

Abstract

The present study aimed to identify the feature genes associated with smoking in lung adenocarcinoma (LAC) samples and to explore the underlying mechanism. Three gene expression datasets of LAC samples were downloaded from the Gene Expression Omnibus database according to pre‑set criteria, and the expression data were processed using meta‑analysis. Differentially expressed genes (DEGs) between LAC samples from smokers and non‑smokers were identified using the limma package in R. The classification accuracy of the selected DEGs was visualized using hierarchical clustering analysis in R. A protein‑protein interaction (PPI) network was constructed for the DEGs using gene interaction data from the Human Protein Reference Database. Betweenness centrality (BC) was calculated for each node in the network, and the genes with the greatest BC values were used to construct the support vector machine (SVM) classifier. The dataset GSE43458 was used as the training dataset, and the other two datasets (GSE12667 and GSE10072) were used as validation datasets. The performance of the classifier was assessed using sensitivity, specificity, positive predictive value, negative predictive value and area under the curve, computed with the pROC package in R. The feature genes in the SVM classifier were subjected to pathway enrichment analysis using Fisher's exact test. A total of 347 genes were identified as differentially expressed between samples from smokers and non‑smokers. The PPI network of the DEGs comprised 202 nodes and 300 edges. An SVM classifier comprising 26 feature genes was constructed to distinguish between smoking and non‑smoking LAC samples, with prediction accuracies of 100, 100 and 94.83% for the GSE43458, GSE12667 and GSE10072 datasets, respectively.
Furthermore, the 26 feature genes were significantly enriched in nine overrepresented biological pathways, including extracellular matrix‑receptor interaction, proteoglycans in cancer, cell adhesion molecules, p53 signaling pathway, microRNAs in cancer and apoptosis, and were identified as smoking‑related genes in LAC. In conclusion, an SVM classifier with a high prediction accuracy for smoking and non‑smoking samples was obtained. The genes in the classifier may be feature genes associated with the development of LAC in patients who smoke.
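
The pathway enrichment step named in the abstract can be illustrated with a minimal sketch of a one‑sided Fisher's exact test for overrepresentation, written here in Python with only the standard library. This is not the authors' implementation (the abstract names only R packages for other steps). The function name `fisher_enrichment_p` and all counts in the example (background size, pathway size, overlap) are hypothetical values chosen for illustration; only the figure of 26 feature genes comes from the abstract.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test p-value for overrepresentation.

    k: feature genes that fall in the pathway
    n: total number of feature genes
    K: pathway genes present in the background
    N: total number of background genes

    Returns P(X >= k), where X follows the hypergeometric distribution
    for drawing n genes without replacement from a background of N
    genes, K of which belong to the pathway.
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(n, K)):
        raise ValueError("inconsistent counts")
    total = comb(N, n)  # number of ways to choose the n feature genes
    upper = min(n, K)   # largest overlap that is possible
    # Sum the probabilities of overlaps at least as large as k.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical example: 26 feature genes (from the abstract), an assumed
# background of 20000 genes, an assumed pathway of 80 genes, and an
# assumed overlap of 4 feature genes with that pathway.
p_value = fisher_enrichment_p(k=4, n=26, K=80, N=20000)
print(f"enrichment p-value: {p_value:.3g}")
```

In practice one such p‑value would be computed per pathway and then corrected for multiple testing; the abstract does not state which correction, if any, was applied.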

Authors

  • Lei Yang
    George Mason University.
  • Lu Sun
    The Affiliated Mental Health Center of Jiangnan University, Wuxi Mental Health Center, Wuxi 214151, Jiangsu, China.
  • Wei Wang
    State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau, Macau 999078, China.
  • Hao Xu
    Department of Nuclear Medicine, the First Affiliated Hospital, Jinan University, Guangzhou 510632, P.R. China. gdhyx2012@126.com
  • Yi Li
    Wuhan Zoncare Bio-Medical Electronics Co., Ltd, Wuhan, China.
  • Jia-Ying Zhao
    Department of Thoracic Surgery, The Second Affiliated Hospital of Harbin Medical University, Harbin, Heilongjiang 150086, P.R. China.
  • Da-Zhong Liu
    Department of Thoracic Surgery, The Second Affiliated Hospital of Harbin Medical University, Harbin, Heilongjiang 150086, P.R. China.
  • Fei Wang
    Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY, United States.
  • Lin-You Zhang
    Department of Thoracic Surgery, The Second Affiliated Hospital of Harbin Medical University, Harbin, Heilongjiang 150086, P.R. China.