Learning a functional grammar of protein domains using natural language word embedding techniques.

Journal: Proteins

Published Date: Nov 25, 2019

Abstract

In this paper, using Word2vec, a widely used natural language processing method, we demonstrate that protein domains may have a learnable implicit semantic "meaning" in the context of their functional contributions to the multi-domain proteins in which they are found. Word2vec is a group of models that can be used to produce semantically meaningful embeddings of words or tokens in a fixed-dimension vector space. In this work, we treat multi-domain proteins as "sentences" in which domain identifiers are tokens that may be considered "words." Using all InterPro (Finn et al. 2017) Pfam domain assignments, we observe that the embedding could be used to suggest putative GO assignments for Pfam (Finn et al. 2016) domains of unknown function.
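The "proteins as sentences" idea can be illustrated with a minimal sketch. It is not the authors' pipeline: the protein names, domain architectures, window size, and the nearest-neighbour voting rule for GO suggestion are all assumptions made for illustration. The first two functions turn domain assignments into sentences and into (target, context) pairs of the kind a skip-gram Word2vec model is trained on; the last shows one plausible way that trained vectors might be used to propose GO terms for an unannotated domain.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy input: protein name -> ordered Pfam domain identifiers.
# These architectures are invented for illustration, not taken from InterPro.
proteins = {
    "protA": ["PF00001", "PF00002", "PF00003"],
    "protB": ["PF00002", "PF00003"],
    "protC": ["PF00004"],  # single-domain protein: offers no context
}

def domain_sentences(assignments):
    """Keep multi-domain proteins; each one becomes a 'sentence' of domain tokens."""
    return [domains for domains in assignments.values() if len(domains) > 1]

def skipgram_pairs(sentences, window=2):
    """Build (target, context) pairs within a symmetric window, skip-gram style."""
    pairs = []
    for sentence in sentences:
        for i, target in enumerate(sentence):
            start = max(0, i - window)
            stop = min(len(sentence), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    pairs.append((target, sentence[j]))
    return pairs

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def suggest_go(query_vec, annotated, k=2):
    """Assumed heuristic: pool GO terms from the k most similar annotated domains.

    `annotated` maps a domain identifier to (embedding vector, set of GO terms).
    Returns GO terms ordered by how many of the k neighbours carry them.
    """
    ranked = sorted(
        annotated.values(),
        key=lambda entry: cosine(query_vec, entry[0]),
        reverse=True,
    )
    votes = Counter()
    for _, terms in ranked[:k]:
        votes.update(terms)
    return [term for term, _ in votes.most_common()]
```

In practice the pairs would be consumed by a Word2vec implementation rather than inspected directly, and the vectors passed to `suggest_go` would come from the trained model; single-domain proteins are dropped here only because they contribute no context pairs.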

Authors

Daniel W A Buchan

Department of Computer Science, University College London, London, UK.

David T Jones

Department of Computer Science, Bioinformatics Group, University College London, Gower Street, London, WC1E 6BT, United Kingdom. d.t.jones@ucl.ac.uk.

Keywords

Databases, Protein; Datasets as Topic; Gene Ontology; Humans; Molecular Sequence Annotation; Natural Language Processing; Protein Domains; Proteins; Semantics

External Resources

PubMed ID: 31703152
