DeepAdd: Protein function prediction from k-mer embedding and additional features.

Journal: Computational biology and chemistry
Published Date:

Abstract

With the application of new high-throughput sequencing technology, a large number of protein sequences are becoming available. Determining the functional characteristics of these proteins experimentally is expensive and time-consuming. Furthermore, at the organismal level, such experimental functional analyses can be conducted for only a few selected model organisms. Computational function prediction methods can be used to fill this gap. The functions of proteins are classified by the Gene Ontology (GO), which contains more than 40,000 classes in three domains: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). Additionally, since proteins have many functions, function prediction is a multi-label, multi-class problem. We developed a new method to predict protein function from sequence. To this end, a natural language model was used to generate word embeddings of the sequence, features were learned from these embeddings by deep learning, and additional features were used to locate every protein. Our method uses the dependencies between GO classes as background information to construct a deep learning model. We evaluated our method using the standards established by the Critical Assessment of Functional Annotation (CAFA) and observed a noticeable improvement over several algorithms, such as FFPred, DeepGO, GoFDR, and other methods compared on the CAFA3 datasets.
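
As an illustration of two ideas mentioned in the abstract, the sketch below shows how a protein sequence could be split into k-mer "words" before embedding, and how a multi-label target could be made consistent with dependencies between GO classes. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the value of k, whether k-mers overlap, or how the GO hierarchy enters the model. The function names, the choice of overlapping 3-mers, and the toy hierarchy with placeholder class names are all hypothetical.

```python
from typing import Dict, List, Set


def kmer_tokens(sequence: str, k: int = 3) -> List[str]:
    """Split a sequence into overlapping k-mers (k=3 is an assumption)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def propagate_labels(terms: Set[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the annotated classes plus all of their ancestors."""
    closed: Set[str] = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            # Walk upward through the (hypothetical) parent map.
            stack.extend(parents.get(term, ()))
    return closed


def multi_hot(terms: Set[str], vocabulary: List[str]) -> List[int]:
    """Encode a set of classes as a 0/1 vector over a fixed class order."""
    return [1 if term in terms else 0 for term in vocabulary]


# Toy data: placeholder class names, not real GO identifiers.
toy_parents = {"leaf": {"mid"}, "mid": {"root"}, "other": {"root"}}
toy_vocabulary = ["root", "mid", "leaf", "other"]

tokens = kmer_tokens("MKVLA", k=3)
labels = multi_hot(propagate_labels({"leaf"}, toy_parents), toy_vocabulary)
print(tokens)
print(labels)
```

The token list would then be fed to a word-embedding model, and the propagated label vector would serve as the multi-label training target, so that a protein annotated with a specific class is also marked positive for every more general class above it.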

Authors

  • Zhihua Du
    Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Guangdong Province, PR China. Electronic address: duzh@szu.edu.cn.
  • Yufeng He
    Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Guangdong Province, PR China.
  • Jianqiang Li
    School of Software Engineering, Beijing University of Technology, Beijing, China. Electronic address: lijianqiang@bjut.edu.cn.
  • Vladimir N Uversky
    Department of Molecular Medicine, Morsani College of Medicine, University of South Florida, 12901 Bruce B. Downs Blvd. MDC07, Tampa, FL, USA; USF Health Byrd Alzheimer's Research Institute, Morsani College of Medicine, University of South Florida, 12901 Bruce B. Downs Blvd. MDC07, Tampa, FL, USA; Laboratory of New Methods in Biology, Institute for Biological Instrumentation, Russian Academy of Sciences, Institutskaya Str., 7, Pushchino, Moscow Region, 142290, Russia. Electronic address: vuversky@health.usf.edu.