EnsembleSE: identification of super-enhancers based on ensemble learning.

Journal: Briefings in functional genomics
PMID:

Abstract

Super-enhancers (SEs) are typically located in the regulatory regions of genes, where they drive high-level gene expression. Identifying SEs is crucial for a deeper understanding of gene regulatory networks, disease mechanisms, and the developmental and physiological processes of organisms, and therefore has a profound impact on research and applications in the life sciences. Traditional experimental methods for identifying SEs are costly and time-consuming. Existing methods that predict SEs from sequence data alone use deep learning for feature representation and have achieved good results. However, they overlook biological features related to physicochemical properties, which leads to low interpretability. Additionally, their complex model structures often require extensive labeled data for training, which limits their further application to biological data. In this paper, we integrate the strengths of different models and propose EnsembleSE, an ensemble model based on an integration strategy, to enhance generalization ability. We design a multi-angle feature representation method that combines local structure and global information to extract high-dimensional abstract relationships and key low-dimensional biological features from sequences. This enhances the effectiveness and interpretability of the model's input features, providing technical support for discovering cell-specific and species-specific patterns of SEs. We evaluated performance on both mouse and human datasets using five metrics, including area under the receiver operating characteristic curve, accuracy, and others. Compared to the latest models, EnsembleSE achieved an average improvement of 4.5% in F1 score and an average improvement of 8.05% in recall, demonstrating the robustness and adaptability of the model on a unified test set. Source code is available at https://github.com/2103374200/EnsembleSE-main.
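
The general ideas named in the abstract (sequence-derived features, combining several models, and scoring by recall and F1) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the feature set, the base models, or the integration strategy, so k-mer frequencies, GC content, and soft voting are assumed here as common stand-ins, and every function name below is hypothetical.

```python
from itertools import product


def kmer_frequencies(seq, k=2):
    """Normalized k-mer counts over ACGT (an assumed local-structure feature)."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing N or other symbols
            counts[window] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on short input
    return [counts[m] / total for m in kmers]


def gc_content(seq):
    """Fraction of G and C bases (an assumed simple global feature)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def soft_vote(prob_lists, threshold=0.5):
    """Average per-sample probabilities from several base models, then threshold.

    Soft voting is an assumption; the paper's integration strategy may differ.
    """
    n_models = len(prob_lists)
    averaged = [sum(ps) / n_models for ps in zip(*prob_lists)]
    return [1 if p >= threshold else 0 for p in averaged]


def recall_and_f1(y_true, y_pred):
    """Recall and F1 score for binary labels (1 = SE, 0 = non-SE)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, f1
```

In such a setup, each base model would be trained on the concatenated feature vector (for example `kmer_frequencies(seq) + [gc_content(seq)]`), and `soft_vote` would combine the per-model probabilities before `recall_and_f1` scores the result against held-out labels.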

Authors

  • Wenying He
    School of Computer Science and Technology, Tianjin University, Tianjin, China.
  • Jialu Xu
    College of Biosystems Engineering and Food Science, Zhejiang University, Hangzhou, Zhejiang, China.
  • Yun Zuo
    Department of Mathematics, Dalian Maritime University, No. 1 Linghai Road, Dalian 116026, China.
  • Yude Bai
    School of Software, Tiangong University, Tianjin 300387, China.
  • Fei Guo
    School of Electronic Information Engineering, Tianjin University, Tianjin 300072, China. Electronic address: gfjy001@yahoo.com.