iORI-ENST: identifying origin of replication sites based on elastic net and stacking learning.

Journal: SAR and QSAR in environmental research
PMID:

Abstract

DNA replication is both the basis of biological inheritance and the most fundamental process in all living organisms. It plays a crucial role in the cell-division cycle and in the regulation of gene expression. Hence, accurate identification of origin of replication sites (ORIs) is of great significance for understanding the regulatory mechanisms of gene expression and for treating genetic diseases. In this paper, a novel model, namely iORI-ENST, is designed for identifying ORIs. First, we extract features by combining mono-nucleotide binary encoding with dinucleotide-based spatial autocorrelation. Next, elastic net is used as the feature selection method to select the optimal feature set. Stacking learning, which combines random forest, AdaBoost, gradient boosting decision tree, extra trees and support vector machine, is then employed to distinguish ORIs from non-ORIs. On the two benchmark datasets, the model achieves accuracies of 91.41% and 95.07%, respectively. An independent dataset is also used to verify the validity and transferability of the model, on which its accuracy reaches 91.10%. Compared with state-of-the-art methods, our model achieves better performance. The results show that our model is a feasible, effective and powerful tool for identifying ORIs. The source code and datasets are available at https://github.com/YingyingYao/iORI-ENST.
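
As an illustration of the first feature type named in the abstract, the following is a minimal sketch of mono-nucleotide binary (one-hot) encoding. The function name `binary_encode` and the column order A, C, G, T are assumptions made for this example. The abstract does not specify the mapping, and this is not taken from the authors' repository.

```python
# Assumed one-hot column order; the abstract does not state the mapping.
NUCLEOTIDES = "ACGT"


def binary_encode(seq):
    """Return a flat list of 4 * len(seq) binary values for a DNA sequence.

    Each nucleotide becomes a 4-dimensional indicator vector, e.g. under the
    assumed order A -> (1, 0, 0, 0) and T -> (0, 0, 0, 1).
    """
    features = []
    for base in seq.upper():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected nucleotide: {base!r}")
        # Append one indicator per nucleotide in the assumed order.
        features.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return features


if __name__ == "__main__":
    print(binary_encode("ACGT"))
```

Under this scheme a sequence of length n yields a 4n-dimensional vector. In the pipeline described above, such a vector would be combined with the autocorrelation features before elastic net selection.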

Authors

  • Y Yao
    School of Mathematics and Statistics, Xidian University, Xi'an, P. R. China.
  • S Zhang
    Department of Pathology, the First Affiliated Hospital, Fujian Medical University, Fuzhou 350005, China.
  • Y Liang
    State Key Laboratory of Quality Research in Chinese Medicines & Faculty of Information Technology, Macau University of Science and Technology, Taipa, Macau, China. yliang@must.edu.mo