GSR-ST: A generalized spatial-temporal framework for genomic signals and regions prediction using multi-scale feature fusion.

Journal: Computational Biology and Chemistry

Abstract

Genomic DNA sequences contain diverse functional genomic signals and regions (GSRs) that are crucial for regulating gene expression. Precise identification of these GSRs is fundamental to elucidating genomic architecture and understanding regulatory mechanisms. However, owing to the complexity and heterogeneity of the data, current computational methods remain limited in predictive accuracy. In this work, we propose GSR-ST, a generalized spatial-temporal deep learning framework for efficiently identifying three kinds of GSRs: polyadenylation signals (PAS), translation initiation sites (TIS), and promoters. GSR-ST improves predictive performance and generalization ability by integrating multi-scale information from DNA sequences through pre-trained DNA Bidirectional Encoder Representations from Transformers (DNABERT) embeddings and diverse handcrafted features. The framework employs a dual-channel parallel spatial-temporal network architecture to comprehensively capture sequence characteristics. Experimental results demonstrate that GSR-ST substantially outperforms state-of-the-art computational methods in predicting PAS and TIS across multiple eukaryotic species, as well as in predicting promoters in diverse bacterial species. Its superior performance on independent test sets and its robustness in cross-species validation further confirm its effectiveness. The fusion of pre-trained DNABERT embeddings with multiple handcrafted features within a spatial-temporal network enables GSR-ST to extract both global and local DNA sequence features, making it a versatile framework for diverse GSR recognition tasks.
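
To make the idea of multi-scale input concrete, the following is a minimal sketch of how a DNA window might be turned into the two kinds of input the abstract mentions. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say which handcrafted features GSR-ST uses, so one-hot encoding and dinucleotide frequencies are stand-ins chosen here, and the function names (`kmer_tokens`, `one_hot`, `kmer_frequencies`, `handcrafted_features`) are hypothetical. The overlapping k-mer tokenization reflects how the original DNABERT model is generally fed sequences; the value k=6 is likewise an assumption.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_tokens(seq, k=6):
    """Split a DNA sequence into overlapping k-mers with stride 1.

    Assumption: this mimics the k-mer token input of a DNABERT-style model;
    the k used by GSR-ST is not given in the abstract.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def one_hot(seq):
    """Per-position one-hot encoding; bases outside ACGT map to all zeros."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]


def kmer_frequencies(seq, k=2):
    """Normalized frequency of every possible k-mer, in lexicographic order."""
    tokens = kmer_tokens(seq, k)
    counts = Counter(tokens)
    total = max(len(tokens), 1)  # avoid division by zero on short input
    return [counts["".join(p)] / total for p in product(ALPHABET, repeat=k)]


def handcrafted_features(seq):
    """Concatenate a local descriptor (one-hot) with a global one (2-mer frequencies).

    Assumption: illustrative feature set only, not the one used in the paper.
    """
    flat_one_hot = [value for row in one_hot(seq) for value in row]
    return flat_one_hot + kmer_frequencies(seq, k=2)


if __name__ == "__main__":
    window = "AATAAAGC"  # toy window containing the canonical AATAAA PAS hexamer
    print(kmer_tokens(window, k=6))
    print(len(handcrafted_features(window)))
```

In a full model, the k-mer tokens would be passed to a pre-trained encoder to obtain embeddings, and those embeddings would be combined with the handcrafted vector before the downstream network; how GSR-ST performs that fusion is not specified in the abstract.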
