Predicting bacterial phenotypic traits through improved machine learning using high-quality, curated datasets.
Journal: Communications Biology
Published Date: Jun 7, 2025
Abstract
Predicting prokaryotic phenotypes (observable traits that govern functionality, adaptability, and interactions) holds significant potential for fields such as biotechnology, environmental sciences, and evolutionary biology. In this study, we use machine learning to explore the relationship between prokaryotic genotypes and phenotypes. Using the highly standardized datasets in the BacDive database, we model eight physiological properties based on protein family inventories, evaluate model performance using multiple metrics, and examine the biological implications of our predictions. The high confidence values achieved underscore the importance of data quality and quantity for reliably inferring bacterial phenotypes. Our approach generates 50,396 completely new datapoints for 15,938 strains, now openly available in the BacDive database, thereby enriching existing phenotypic resources and enabling further research. The open-source software we provide can be readily applied to other datasets, such as those from metagenomic studies, and to various applications, including assessing the potential of soil bacteria for bioremediation.
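As an illustration only, the following minimal sketch shows how a protein family inventory might be encoded as a binary presence/absence vector and how predictions for one binary trait could be scored with several metrics at once. It is not the authors' software: the family names (FAM_A, FAM_B, FAM_C), the function names, the choice of metrics (precision, recall, F1, Matthews correlation coefficient), and the toy labels are all assumptions made for this example, since the abstract does not specify them.

```python
from math import sqrt


def encode_inventory(families, vocabulary):
    """Turn a strain's set of protein families into a 0/1 vector.

    `vocabulary` is a hypothetical ordered list of all family names.
    """
    present = set(families)
    return [1 if name in present else 0 for name in vocabulary]


def evaluate(y_true, y_pred):
    """Score binary predictions (1 = trait present, 0 = trait absent)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


# Toy data (invented for illustration, not taken from BacDive).
vocabulary = ["FAM_A", "FAM_B", "FAM_C"]
features = encode_inventory(["FAM_B", "FAM_C"], vocabulary)

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
scores = evaluate(y_true, y_pred)
print(features, scores)
```

Reporting several metrics together is a common choice when trait classes are imbalanced, because a single figure such as accuracy can look high even when the rarer class is predicted poorly.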