Protein classification using modified n-grams and skip-grams.

Journal: Bioinformatics (Oxford, England)

Abstract

MOTIVATION: Classification by supervised machine learning greatly facilitates the annotation of protein characteristics from their primary sequence. However, the feature generation step in this process requires detailed knowledge of attributes used to classify the proteins. Lack of this knowledge risks the selection of irrelevant features, resulting in a faulty model. In this study, we introduce a supervised protein classification method with a novel means of automating the work-intensive feature generation step via a Natural Language Processing (NLP)-dependent model, using a modified combination of n-grams and skip-grams (m-NGSG).
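To make the terminology concrete, the following is a minimal sketch of plain character n-gram and skip-gram counting on a protein sequence. It is not the authors' m-NGSG implementation: the abstract does not describe the modifications, so none are reproduced here. The function names (ngrams, skipgrams, featurize), the "_" gap marker, the default window sizes, and the example sequence are all assumptions chosen for illustration.

```python
from collections import Counter


def ngrams(seq, n):
    """Return every contiguous run of n residues in seq."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def skipgrams(seq, k):
    """Return residue pairs with exactly k positions skipped between them.

    Each skipped position is written as "_" (an illustrative convention),
    so the pair M..V with one skipped residue becomes "M_V".
    """
    gap = k + 1
    return [f"{seq[i]}{'_' * k}{seq[i + gap]}" for i in range(len(seq) - gap)]


def featurize(seq, max_n=2, max_skip=2):
    """Count all n-grams up to max_n and all skip-grams up to max_skip.

    The resulting Counter can serve as a sparse feature vector for any
    supervised classifier; max_n and max_skip are hypothetical defaults.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(seq, n))
    for k in range(1, max_skip + 1):
        counts.update(skipgrams(seq, k))
    return counts


if __name__ == "__main__":
    # Toy sequence, not taken from the paper.
    for feature, count in sorted(featurize("MKVLMK").items()):
        print(feature, count)
```

Because features are generated directly from the sequence, no prior knowledge of the protein family is needed to build the feature set, which is the automation the abstract refers to; feature selection and the classifier itself are outside this sketch.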

Authors

  • S M Ashiqul Islam
    Institute of Biomedical Studies, Baylor University, Waco, TX, USA. S_Islam@Baylor.edu.
  • Benjamin J Heil
    Department of Computer Science.
  • Christopher Michel Kearney
    Institute of Biomedical Studies, Baylor University, Waco, TX, USA. Chris_Kearney@Baylor.edu.
  • Erich J Baker
    Institute of Biomedical Studies, Baylor University, Waco, TX, USA. Erich_Baker@Baylor.edu.