Environmental adaptations in metagenomes revealed by deep learning.

Journal: BMC Biology
Published Date:

Abstract

BACKGROUND: Deep learning has emerged as a powerful tool for analysing biological data, including large metagenomic datasets. However, its application remains limited by high computational costs, model complexity, and the difficulty of extracting biological insights from artificial neural networks (ANNs). In this study, we applied a transfer learning approach, combining the ESM-2 protein structure prediction model with our own smaller ANN, to classify proteins containing the domain of unknown function 3494 (DUF3494) by their source environments. DUF3494 is found in a diverse group of putative ice-binding and substrate-binding proteins from prokaryotic and eukaryotic microorganisms across a range of environments. These proteins present a compelling test case for exploring the balance between prediction accuracy and interpretability in sequence classification.
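
The transfer learning idea described above (a frozen pretrained encoder feeding a small trainable classifier) can be illustrated with a minimal sketch. This is not the authors' pipeline. The function frozen_embed is a hypothetical stand-in for a pretrained model such as ESM-2 and simply returns amino-acid composition; the two environment classes, the toy sequences, and all hyperparameters are assumptions made for illustration only.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frozen_embed(seq):
    """Hypothetical stand-in for a frozen pretrained encoder (e.g. ESM-2).

    Returns a fixed-length vector (amino-acid composition). Nothing in this
    function is trained, mirroring the frozen part of transfer learning.
    """
    n = max(len(seq), 1)
    return [seq.count(a) / n for a in AMINO_ACIDS]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def logits(W, b, x):
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def train_head(X, y, n_classes, epochs=200, lr=1.0, seed=0):
    """Train only the small head (softmax regression) on frozen embeddings."""
    dim = len(X[0])
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    order = list(range(len(X)))
    shuffler = random.Random(seed)
    for _ in range(epochs):
        shuffler.shuffle(order)  # stochastic gradient descent, one sample at a time
        for i in order:
            probs = softmax(logits(W, b, X[i]))
            for k in range(n_classes):
                grad = probs[k] - (1.0 if k == y[i] else 0.0)
                b[k] -= lr * grad
                for j in range(dim):
                    W[k][j] -= lr * grad * X[i][j]
    return W, b


def predict(W, b, x):
    scores = logits(W, b, x)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data (assumption): two made-up "environments" whose sequences differ
# in amino-acid usage. Real sequences and labels would replace these.
rng = random.Random(0)


def toy_sequences(alphabet, count, length=60):
    return ["".join(rng.choices(alphabet, k=length)) for _ in range(count)]


train_seqs = toy_sequences("AGSTAGSTKE", 20) + toy_sequences("KELDKELDAG", 20)
train_labels = [0] * 20 + [1] * 20
test_seqs = toy_sequences("AGSTAGSTKE", 10) + toy_sequences("KELDKELDAG", 10)
test_labels = [0] * 10 + [1] * 10

X_train = [frozen_embed(s) for s in train_seqs]
W, b = train_head(X_train, train_labels, n_classes=2)

preds = [predict(W, b, frozen_embed(s)) for s in test_seqs]
accuracy = sum(p == t for p, t in zip(preds, test_labels)) / len(test_labels)
print(f"held-out accuracy: {accuracy:.2f}")
```

In a realistic setting, frozen_embed would be replaced by embeddings extracted from the pretrained model, and the softmax head by a small neural network. The split between a fixed encoder and a lightweight trainable head is the part this sketch is meant to show.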

Authors

  • Johanna C Winder
    School of Environmental Sciences, University of East Anglia, Norwich Research Park, Norwich, NR4 7TJ, UK. j.winder@uea.ac.uk.
  • Simon Poulton
    School of Biological Sciences, University of East Anglia, Norwich Research Park, Norwich, NR4 7TJ, UK.
  • Taoyang Wu
    School of Computing Sciences, University of East Anglia, Norwich Research Park, Norwich, NR4 7TJ, UK.
  • Thomas Mock
    School of Environmental Sciences, University of East Anglia, Norwich Research Park, Norwich, NR4 7TJ, UK.
  • Cock van Oosterhout
    School of Environmental Sciences, University of East Anglia, Norwich Research Park, Norwich, NR4 7TJ, UK.