Prediction of the disease causal genes based on heterogeneous network and multi-feature combination method.

Journal: Computational biology and chemistry
Published Date:

Abstract

At present, the prediction of disease causal genes is mainly based on heterogeneous networks. Research shows that heterogeneous networks contain more information and yield better prediction results. In this paper, we construct a heterogeneous network with four node types: disease, gene, phenotype, and gene ontology. On this basis, we use a machine learning algorithm to predict disease-causing genes. The algorithm is divided into three steps: preprocessing and training sample extraction, feature extraction and combination, and model training and prediction. In the feature extraction and combination step, network representation methods generate a representation vector for each node, which serves as the node's embedding features. We also extract the structural features of each node in the network and then combine the embedding features with the structural features. The training and prediction results show that the prediction algorithm based on all features combined achieves the best prediction performance. Moreover, combining each network representation method's embedding features with the structural features also improves performance. In the training sample extraction step, we propose three improvements according to the network structure and data set distribution. First, we propose a positive sampling algorithm based on network connectivity, which tries to keep the whole heterogeneous graph connected during sampling to avoid a negative impact on the extraction of embedding features. Second, the influence of the positive sample ratio on the experimental results was tested over the range 0 to 1 with a step size of 0.1. Third, the influence of different ratios of negative to positive samples on the results was also tested. These improvements are intended to enhance the balance and robustness of the method.
When the positive sample ratio is 0.1 and the ratio of negative to positive samples is 3, the model achieves its best result, with an AUC of 0.9887 and an accuracy of 94.55%, both significantly higher than those of other models.
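
As an illustration only, the connectivity-aware positive sampling and the negative-to-positive ratio mentioned in the abstract might be sketched as follows. The abstract does not describe the actual algorithm, so everything here is an assumption: the spanning-forest trick used to protect connectivity, the function names (find, count_components, sample_training_pairs), the parameter names (pos_ratio, neg_per_pos), and the toy node identifiers are all hypothetical. Edges are assumed to be unique (source, target) tuples with disease-gene edges written as (disease, gene).

```python
import random


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def count_components(nodes, edges):
    """Number of connected components of an undirected graph."""
    parent = {n: n for n in nodes}
    for u, v in edges:
        parent[find(parent, u)] = find(parent, v)
    return len({find(parent, n) for n in nodes})


def sample_training_pairs(edges, positive_edges, pos_ratio=0.1,
                          neg_per_pos=3, seed=0):
    """Hypothetical sampler: draw positives without splitting the graph."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)

    # Grow a random spanning forest. Its edges are protected from sampling,
    # so removing any other edges cannot split a connected component.
    parent = {n: n for edge in shuffled for n in edge}
    protected = set()
    for u, v in shuffled:
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv:
            parent[ru] = rv
            protected.add((u, v))

    # Positive samples: disease-gene edges outside the protected forest.
    removable = [e for e in positive_edges if e not in protected]
    n_pos = min(len(removable), round(pos_ratio * len(positive_edges)))
    positives = rng.sample(removable, n_pos)

    # Negative samples: disease-gene pairs with no known association.
    known = set(positive_edges)
    diseases = sorted({d for d, _ in positive_edges})
    genes = sorted({g for _, g in positive_edges})
    candidates = [(d, g) for d in diseases for g in genes
                  if (d, g) not in known]
    n_neg = min(len(candidates), neg_per_pos * n_pos)
    negatives = rng.sample(candidates, n_neg)
    return positives, negatives


# Toy heterogeneous graph; all identifiers are made up for illustration.
positive_edges = [("d1", "g1"), ("d1", "g2"), ("d2", "g2"), ("d2", "g3"),
                  ("d3", "g3"), ("d3", "g4"), ("d1", "g4"), ("d2", "g1")]
other_edges = [("d1", "p1"), ("d3", "p1"), ("g1", "go1"), ("g3", "go1")]
edges = positive_edges + other_edges
nodes = {n for edge in edges for n in edge}

# The toy graph is tiny, so larger-than-default ratios are used here.
positives, negatives = sample_training_pairs(
    edges, positive_edges, pos_ratio=0.25, neg_per_pos=2)
held_out = set(positives)
remaining = [e for e in edges if e not in held_out]
print(len(positives), len(negatives))
print(count_components(nodes, edges) == count_components(nodes, remaining))
```

The defaults pos_ratio=0.1 and neg_per_pos=3 mirror the best-performing setting reported above. Protecting a spanning forest is a conservative choice: it guarantees connectivity is preserved, at the cost of never sampling the protected edges.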

Authors

  • Lexiang Wang
    School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China.
  • Mingxiao Wu
    School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China.
  • Yulin Wu
    School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China.
  • Xiaofeng Zhang
    College of Medicine, Xi'an International University, Shaanxi, P. R. China.
  • Sen Li
    Department of Chemical and Biochemical Engineering, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, Fujian, China.
  • Ming He
    State Key Laboratory Breeding Base of Green Pesticide and Agricultural Bioengineering, Key Laboratory of Green Pesticide and Agricultural Bioengineering, Ministry of Education, Guizhou University, Guiyang, PR China.
  • Fan Zhang
    Department of Anesthesiology, Bishan Hospital of Chongqing Medical University, Chongqing, China.
  • Yadong Wang
    The Biofoundry, Department of Biomedical Engineering, Cornell University, Ithaca, NY, United States.
  • Junyi Li
    School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China. Electronic address: lijunyi@hit.edu.cn.