Prediction of the disease causal genes based on heterogeneous network and multi-feature combination method.

Journal: Computational biology and chemistry

Published Date: Feb 9, 2022

Abstract

At present, the prediction of disease causal genes is mainly based on heterogeneous networks. Research shows that heterogeneous networks contain more information and yield better prediction results. In this paper, we construct a heterogeneous network with four node types: disease, gene, phenotype and gene ontology. On this basis, we use a machine learning algorithm to predict disease-causing genes. The algorithm is divided into three steps: preprocessing and training sample extraction, feature extraction and combination, and model training and prediction. In the feature extraction and combination step, network representation methods are used to generate a representation vector for each node, which serves as that node's embedding features. We also extract the structural features of each node in the network and then combine the embedding features with the structural features. The training and prediction results show that the prediction algorithm based on all features combined achieves the best prediction performance. Moreover, combining the embedding features of each individual network representation method with the structural features also improves performance. For training sample extraction, we propose three improvements based on the network structure and the distribution of the data set. First, we propose a positive-sample selection algorithm based on network connectivity, which tries to preserve the connectivity of the whole heterogeneous graph during sampling so that the extraction of embedding features is not adversely affected. Second, we test the influence of the positive sample ratio on the experimental results over the range 0-1 with a step size of 0.1. Third, we test the influence of different proportions of negative to positive samples on the results. These improvements are intended to enhance the balance and robustness of the method. When the positive sample ratio is 0.1 and the proportion of negative to positive samples is 3, the model achieves its best result, with an AUC of 0.9887 and an accuracy of 94.55%, both significantly higher than those of the other models.
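The sampling step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' actual algorithm, so everything below is an assumption made for illustration: the function names, the representation of edges as `(disease, gene)` tuples, and the use of a union-find spanning forest to decide which disease-gene edges can be held out without changing the connectivity of the graph. Negative samples are drawn here uniformly from disease-gene pairs with no known association, which is also an assumption.

```python
import random
from itertools import product


def _find(parent, x):
    # Union-find root lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def count_components(edges):
    # Number of connected components among the nodes that appear in `edges`.
    parent = {}
    for u, v in edges:
        parent.setdefault(u, u)
        parent.setdefault(v, v)
        parent[_find(parent, u)] = _find(parent, v)
    return len({_find(parent, x) for x in parent})


def sample_positive_edges(all_edges, disease_gene_edges, ratio, seed=0):
    """Hold out up to `ratio` of the disease-gene edges as positive samples,
    choosing only edges whose removal leaves connectivity unchanged
    (hypothetical helper, not the authors' implementation)."""
    rng = random.Random(seed)
    candidates = list(disease_gene_edges)
    rng.shuffle(candidates)
    candidate_set = set(candidates)
    others = [e for e in all_edges if e not in candidate_set]

    parent = {}
    removable = []
    # Non-candidate edges go first so they form the spanning forest when possible.
    for u, v in others + candidates:
        parent.setdefault(u, u)
        parent.setdefault(v, v)
        ru, rv = _find(parent, u), _find(parent, v)
        if ru != rv:
            parent[ru] = rv            # spanning-forest edge: must stay
        elif (u, v) in candidate_set:
            removable.append((u, v))   # closes a cycle: safe to hold out
    k = min(int(ratio * len(candidates)), len(removable))
    return removable[:k]


def sample_negative_pairs(diseases, genes, known_edges, n_positive,
                          neg_per_pos=3, seed=0):
    """Draw `neg_per_pos` unlinked (disease, gene) pairs per positive sample,
    capped by the number of unlinked pairs available."""
    rng = random.Random(seed)
    known = set(known_edges)
    pool = [(d, g) for d, g in product(diseases, genes) if (d, g) not in known]
    return rng.sample(pool, min(neg_per_pos * n_positive, len(pool)))
```

Because only edges outside the spanning forest are held out, any subset of them can be removed while every node stays reachable from the same nodes as before. With `ratio=0.1` and `neg_per_pos=3` the sketch mirrors the setting that the abstract reports as optimal, although the number of positives actually returned can be smaller when few edges are safely removable.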

Authors

Lexiang Wang

School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China.
Mingxiao Wu

School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China.
Yulin Wu

School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China.
Xiaofeng Zhang

College of Medicine, Xi'an International University, Shaanxi, P. R. China.
Sen Li

Department of Chemical and Biochemical Engineering, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, Fujian, China.
Ming He

State Key Laboratory Breeding Base of Green Pesticide and Agricultural Bioengineering, Key Laboratory of Green Pesticide and Agricultural Bioengineering, Ministry of Education, Guizhou University, Guiyang, PR China.
Fan Zhang

Department of Anesthesiology, Bishan Hospital of Chongqing Medical University, Chongqing, China.
Yadong Wang

The Biofoundry, Department of Biomedical Engineering, Cornell University, Ithaca, NY, United States.
Junyi Li

School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China. Electronic address: lijunyi@hit.edu.cn.

Keywords

Algorithms; Machine Learning

External Resources

PubMed ID: 35217251
