Carmna: classification and regression models for nitrogenase activity based on a pretrained large protein language model.
Journal: Briefings in Bioinformatics
PMID: 40273431
Abstract
Nitrogen-fixing microorganisms play a critical role in the global nitrogen cycle by converting atmospheric nitrogen into ammonia through the action of nitrogenase (EC 1.18.6.1). In this study, we used six machine learning algorithms to build classification and regression models of nitrogenase activity (Carmna). Carmna uses the pretrained protein language model ProtT5 to extract features from nitrogenase sequences and incorporates additional features, such as gene expression and codon preference, for model training. The optimal classification model, based on XGBoost, achieved an average area under the receiver operating characteristic curve (AUC) of 0.9365 and an F1 score of 0.85 in five-fold cross-validation. For regression, the best-performing model was a stacking approach based on support vector regression, with an average R² of 0.5572 and a mean absolute error (MAE) of 0.3351. Interpretability analysis of the optimal regression model revealed that nitrogenase activity was associated not only with the proportions and codon preferences of the standard amino acids, but also with the expression levels and spatial distance of nitrogenase genes. We also identified the minimal nitrogen-fixing nif cluster. This study deepens our understanding of the complex mechanisms regulating nitrogenase activity and contributes to the development of efficient bio-fertilizers.
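
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how AUC, F1, R² and MAE can be computed and averaged over five folds. It is not the authors' code: the function names and the contiguous fold split are assumptions made for illustration, and the abstract does not say how the folds were drawn or which software was used.

```python
from statistics import mean


def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous (train, test) index lists.

    Assumption: contiguous, unshuffled folds, chosen only for simplicity.
    """
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((train, test))
        start = stop
    return folds


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Requires both classes to be present.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def average_over_folds(metric, y_true, y_pred, k=5):
    """Evaluate `metric` on each held-out fold and return the mean score."""
    per_fold = []
    for _, test in k_fold_indices(len(y_true), k):
        per_fold.append(metric([y_true[i] for i in test], [y_pred[i] for i in test]))
    return mean(per_fold)
```

In an actual cross-validation run, the predictions for each fold would come from a model trained on the other four folds; `average_over_folds` here only illustrates the per-fold scoring and averaging step.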