Effect of tokenization on transformers for biological sequences.

Journal: Bioinformatics (Oxford, England)

Published Date: Mar 29, 2024

Abstract

MOTIVATION: Deep-learning models are transforming biological research, including many bioinformatics and comparative genomics algorithms, such as sequence alignment, phylogenetic tree inference, and automatic classification of protein functions. Among these deep-learning algorithms, models for processing natural languages, developed in the natural language processing (NLP) community, were recently applied to biological sequences. However, biological sequences differ from natural languages such as English and French, in which segmenting text into separate words is relatively straightforward. Moreover, biological sequences are characterized by extremely long sentences, which hamper their processing by current machine-learning models, notably the transformer architecture. In NLP, one of the first processing steps is to transform the raw text into a list of tokens. Deep-learning applications to biological sequence data mostly segment proteins and DNA into single characters. In this work, we study the effect of alternative tokenization algorithms on eight different tasks in biology, from predicting the function of proteins and their stability, through nucleotide sequence alignment, to classifying proteins into specific families.
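
To make the contrast in the abstract concrete, the following is a minimal sketch of single-character tokenization next to one possible alternative. Byte-pair encoding (BPE) is used here purely as an illustrative assumption; the abstract does not name the algorithms that were studied. The function names (`char_tokenize`, `learn_bpe_merges`, `apply_merge`, `bpe_tokenize`) and the toy DNA corpus are hypothetical and are not taken from the paper.

```python
from collections import Counter


def char_tokenize(seq):
    """Single-character tokenization: one token per residue or nucleotide."""
    return list(seq)


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` BPE merges from a toy corpus (illustrative only)."""
    corpus = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Pick the most frequent adjacent pair; ties are broken by the pair
        # itself so that the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def bpe_tokenize(seq, merges):
    """Tokenize `seq` by applying the learned merges in order."""
    tokens = list(seq)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


if __name__ == "__main__":
    toy_corpus = ["ACGTACGT", "ACGACG"]  # hypothetical toy data
    merges = learn_bpe_merges(toy_corpus, num_merges=2)
    print(char_tokenize("ACGTACGT"))
    print(bpe_tokenize("ACGTACGT", merges))
```

In this sketch the merged tokens cover several characters each, so the same sequence is represented by fewer tokens than under single-character tokenization. That shorter input is the kind of effect that matters for long biological sequences processed by transformers.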

Authors

Edo Dotan
The Henry and Marilyn Taub Faculty of Computer Science, Technion - Israel Institute of Technology, Haifa 3200003, Israel.

Gal Jaschek
Department of Genetics, Yale University School of Medicine, New Haven, CT 06510, United States.

Tal Pupko
Department of Earth and Planetary Science, UC Berkeley, Berkeley, CA, 94720, USA.

Yonatan Belinkov
The Henry and Marilyn Taub Faculty of Computer Science, Technion - Israel Institute of Technology, Haifa 3200003, Israel.

Keywords

Algorithms; Computational Biology; Deep Learning; Natural Language Processing; Proteins; Sequence Alignment; Sequence Analysis, Protein

External Resources

PubMed ID: 38608190
