Optimizing lipocalin sequence classification with ensemble deep learning models.
Journal: PLoS One
PMID: 40238838
Abstract
Deep learning (DL) has become a powerful tool for the recognition and classification of biological sequences. However, conventional single-architecture models often suffer from suboptimal predictive performance and high computational costs. To address these challenges, we present EnsembleDL-Lipo, an ensemble deep learning framework that combines Convolutional Neural Networks (CNNs) and Deep Neural Networks (DNNs) to improve the identification of lipocalin sequences. Lipocalins are multifunctional extracellular proteins involved in various diseases and stress responses. Their low sequence similarity, which places them in the 'twilight zone' of sequence alignment, presents a significant hurdle for accurate classification and calls for efficient computational methods to complement traditional, labor-intensive experimental approaches. EnsembleDL-Lipo addresses these issues by training a large ensemble of deep learning models on multiple feature representations derived from position-specific scoring matrices (PSSMs), optimizing classification performance across diverse sequence patterns. On the training dataset, the model achieved an accuracy (ACC) of 97.65%, recall of 97.10%, Matthews correlation coefficient (MCC) of 0.95, and area under the curve (AUC) of 0.99. Validation on an independent test set further confirmed the robustness of the model, yielding an ACC of 95.79%, recall of 90.48%, MCC of 0.92, and AUC of 0.97. These results demonstrate that EnsembleDL-Lipo is a highly effective and computationally efficient tool for lipocalin sequence identification, significantly outperforming existing methods and offering strong potential for applications in biomarker discovery.
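As an illustration of how ensemble predictions and the reported metrics (ACC, recall, MCC) relate, the following minimal Python sketch averages the positive-class probabilities of two models and scores the result. It is a sketch under stated assumptions, not the authors' implementation: the abstract does not say how EnsembleDL-Lipo combines its members, so soft voting (probability averaging) and the 0.5 threshold are assumed here, and the names `soft_vote`, `binary_metrics`, `cnn_probs`, and `dnn_probs`, as well as the toy values, are hypothetical.

```python
import math


def soft_vote(prob_lists):
    """Average per-sample positive-class probabilities across models.

    Assumption: the ensemble is combined by soft voting; the abstract
    does not specify the combination rule.
    """
    n_models = len(prob_lists)
    return [sum(sample) / n_models for sample in zip(*prob_lists)]


def binary_metrics(y_true, y_pred):
    """Compute accuracy, recall, and MCC from binary labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    acc = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"acc": acc, "recall": recall, "mcc": mcc}


# Toy, made-up probabilities from two hypothetical ensemble members.
cnn_probs = [0.9, 0.2, 0.6, 0.4]
dnn_probs = [0.7, 0.4, 0.3, 0.8]
labels = [1, 0, 1, 1]  # 1 = lipocalin, 0 = non-lipocalin

avg_probs = soft_vote([cnn_probs, dnn_probs])
preds = [1 if p >= 0.5 else 0 for p in avg_probs]  # assumed 0.5 threshold
metrics = binary_metrics(labels, preds)
print(preds, metrics)
```

AUC is omitted from the sketch because it is computed from the ranked probabilities rather than from thresholded predictions.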