Amino acid encoding for deep learning applications.

Journal: BMC bioinformatics
Published Date:

Abstract

BACKGROUND: Deep learning algorithms are increasingly applied in bioinformatics because they usually outperform classical approaches, especially when larger training datasets are available. In deep learning applications, discrete data, e.g. words or n-grams in language, or amino acids or nucleotides in bioinformatics, are generally represented as continuous vectors through an embedding matrix. Recently, learning this embedding matrix directly from the data, as part of the iterative optimization of the model for the target prediction (a process called 'end-to-end learning'), has led to state-of-the-art results in many fields. Although the use of embeddings is well described in the bioinformatics literature, the potential of end-to-end learning for single amino acids, as compared to more classical manually curated encoding strategies, has not been systematically addressed. To this end, we compared classical encoding matrices, namely one-hot, VHSE8 and BLOSUM62, to end-to-end learning of amino acid embeddings for two different prediction tasks using three widely used architectures, namely recurrent neural networks (RNN), convolutional neural networks (CNN), and the hybrid CNN-RNN.
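
To make the contrast between a fixed encoding and a learned embedding concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the 20 standard amino acids in an arbitrary alphabetical order; the function names, the example peptide, and the embedding dimension of 8 are illustrative choices. A fixed scheme such as one-hot uses a matrix that never changes, whereas in end-to-end learning the matrix starts from random values and would be updated by the optimizer together with the rest of the network (the training loop is omitted here).

```python
import random

# The 20 standard amino acids in one-letter code.
# Alphabetical ordering is an arbitrary, illustrative choice.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_matrix():
    """Fixed 20 x 20 identity matrix: every residue gets its own axis."""
    n = len(AMINO_ACIDS)
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def random_embedding_matrix(dim, seed=0):
    """20 x dim matrix of small random values.

    In end-to-end learning these entries would be trainable parameters;
    here they are only initialized, not trained.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in AMINO_ACIDS]


def encode(sequence, matrix):
    """Look up one matrix row per residue: len(sequence) x dim output."""
    return [matrix[AA_TO_INDEX[aa]] for aa in sequence]


peptide = "SIINFEKL"  # hypothetical example sequence
fixed = encode(peptide, one_hot_matrix())
learned = encode(peptide, random_embedding_matrix(dim=8))
```

Both encodings are row lookups into a 20-row matrix, so a fixed matrix such as VHSE8 or BLOSUM62 could be passed to the same `encode` function; the schemes differ only in where the matrix values come from and whether they change during training.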

Authors

  • Hesham ElAbd
    Genetics & Bioinformatics, Institute of Clinical Molecular Biology, Christian-Albrechts-University of Kiel, Kiel, Germany.
  • Yana Bromberg
  • Adrienne Hoarfrost
    Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, NJ, USA.
  • Tobias Lenz
    Research Group for Evolutionary Immunogenomics, Max Planck Institute for Evolutionary Biology, 24306, Plön, Germany.
  • Andre Franke
    Institute of Clinical Molecular Biology, Christian-Albrechts-University of Kiel, Kiel, Germany.
  • Mareike Wendorff
    Genetics & Bioinformatics, Institute of Clinical Molecular Biology, Christian-Albrechts-University of Kiel, Kiel, Germany.