ProkBERT family: genomic language models for microbiome applications.

Journal: Frontiers in Microbiology

Abstract

BACKGROUND: In the evolving landscape of microbiology and microbiome analysis, machine learning is crucial for understanding complex microbial interactions and for predicting and recognizing novel functionalities within extensive datasets. However, the effectiveness of these methods in microbiology is limited by the complex and heterogeneous nature of microbial data, further complicated by low signal-to-noise ratios, context dependency, and a significant shortage of appropriately labeled datasets. This study introduces the ProkBERT model family, a collection of large language models designed for genomic tasks. It provides a generalizable representation of nucleotide sequences, learned from unlabeled genome data. This approach helps overcome the above-mentioned limitations, thereby improving our understanding of microbial ecosystems and their impact on health and disease.

Authors

  • Balázs Ligeti
    Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary.
  • István Szepesi-Nagy
    Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary.
  • Babett Bodnár
    Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary.
  • Noémi Ligeti-Nagy
    Language Technology Research Group, HUN-REN Hungarian Research Centre for Linguistics, Budapest, Hungary.
  • János Juhász
    Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary.
