Genomic language models: opportunities and challenges.

Journal: Trends in Genetics (TIG)
PMID:

Abstract

Large language models (LLMs) are having transformative impacts across a wide range of scientific fields, particularly in the biomedical sciences. Just as the goal of natural language processing is to understand sequences of words, a major objective in biology is to understand biological sequences. Genomic language models (gLMs), which are LLMs trained on DNA sequences, have the potential to significantly advance our understanding of genomes and how DNA elements at various scales interact to give rise to complex functions. To showcase this potential, we highlight key applications of gLMs, including functional constraint prediction, sequence design, and transfer learning. Despite notable recent progress, however, developing effective and efficient gLMs presents numerous challenges, especially for species with large, complex genomes. Here, we discuss major considerations for developing and evaluating gLMs.
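To make the functional-constraint application concrete, below is a minimal sketch (not taken from the article) of how a masked genomic language model could score a single-nucleotide variant: the site is masked, the model predicts a distribution over nucleotides, and the log-likelihood ratio of the alternate versus the reference allele serves as a constraint score. The checkpoint name "example/dna-mlm", the HuggingFace-style interface, and the single-nucleotide token vocabulary are illustrative assumptions, not details from the paper.

    # Hedged sketch: zero-shot constraint scoring with a masked DNA language model.
    # Assumes a HuggingFace-style masked LM with single-nucleotide tokens; the
    # checkpoint name below is a placeholder, not a real model.
    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("example/dna-mlm")          # placeholder checkpoint
    model = AutoModelForMaskedLM.from_pretrained("example/dna-mlm").eval()

    def variant_llr(sequence: str, pos: int, ref: str, alt: str) -> float:
        """Log-likelihood ratio log P(alt) - log P(ref) at `pos`, with that site masked.
        Strongly negative values suggest the site is under functional constraint."""
        assert sequence[pos] == ref, "reference allele must match the sequence"
        masked = sequence[:pos] + tokenizer.mask_token + sequence[pos + 1:]
        inputs = tokenizer(masked, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        mask_idx = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero()[0, 0]
        log_probs = torch.log_softmax(logits[0, mask_idx], dim=-1)
        ref_id = tokenizer.convert_tokens_to_ids(ref)   # vocabulary assumed to contain A/C/G/T tokens
        alt_id = tokenizer.convert_tokens_to_ids(alt)
        return (log_probs[alt_id] - log_probs[ref_id]).item()

    # Usage (hypothetical): score a C>T change at the center of a 101-bp window.
    # window = reference_genome[chrom][start:start + 101]
    # score = variant_llr(window, pos=50, ref="C", alt="T")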

Authors

  • Gonzalo Benegas
    Computer Science Division, University of California, Berkeley, CA, USA.
  • Chengzhong Ye
    Department of Statistics, University of California, Berkeley, CA 94720, USA.
  • Carlos Albors
    Broad Institute of MIT and Harvard, Cambridge, MA, USA.
  • Jianan Canal Li
    National Science Foundation Molecule Maker Lab Institute, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA.
  • Yun S. Song
    Computer Science Division, University of California, Berkeley, CA, USA.