Alignment-Free Sequence Comparison: A Systematic Survey From a Machine Learning Perspective.

Journal: IEEE/ACM transactions on computational biology and bioinformatics
PMID:

Abstract

The convergence of the large amounts of biological sequence data generated over the last decades with improvements in algorithms and hardware has made it possible to apply machine learning techniques in bioinformatics. While the machine learning community is aware of the need to rigorously distinguish data transformation from data comparison and to adopt reasonable combinations thereof, this awareness is often lacking in the field of comparative sequence analysis. As the disadvantages of alignments for sequence comparison have become apparent, typical applications increasingly use so-called alignment-free approaches. In light of this development, we present a conceptual framework for alignment-free sequence comparison, which highlights the delineation of: 1) the sequence data transformation, comprising adequate mathematical sequence coding and feature generation, from 2) the subsequent (dis-)similarity evaluation of the transformed data by means of problem-specific but mathematically consistent proximity measures. We consider coding to be a lossless data transformation that yields an appropriate representation, whereas feature generation is inevitably lossy, with the intention of extracting just the task-relevant information. This distinction sheds light on the plethora of methods available and assists in identifying suitable machine learning and data analysis methods for comparing sequences under these premises.
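
The two-stage separation described in the abstract can be illustrated with a minimal sketch. The specific choices here (one-hot coding, k-mer counting, Euclidean distance over a DNA alphabet) and all function names are assumptions made for illustration only; the abstract does not prescribe them.

```python
from collections import Counter
from itertools import product
import math

ALPHABET = "ACGT"  # assumed DNA alphabet, for illustration only


def one_hot_encode(seq):
    """Coding step: each symbol becomes a unit vector, so no information is lost."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq]


def one_hot_decode(coded):
    """Inverse of one_hot_encode; its existence shows the coding is lossless."""
    return "".join(ALPHABET[row.index(1)] for row in coded)


def kmer_counts(seq, k=2):
    """Feature generation step: count overlapping k-mers.

    This is lossy: the positions of the k-mers are discarded, so different
    sequences can map to the same feature vector.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product(ALPHABET, repeat=k)]


def euclidean(u, v):
    """Proximity measure applied to the transformed data, not to raw sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Coding can be inverted exactly.
restored = one_hot_decode(one_hot_encode("ACAGA"))

# Two distinct sequences share the same 2-mer profile, so their distance is zero.
collision_distance = euclidean(kmer_counts("ACAGA"), kmer_counts("AGACA"))
```

In this sketch the transformation and the comparison are separate functions, so either one can be replaced independently, for example by a different feature map or a different dissimilarity measure suited to the task.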

Authors

  • Katrin Sophie Bohnsack
  • Marika Kaden
    Computational Intelligence, University of Applied Sciences Mittweida, 09648 Mittweida, Germany.
  • Julia Abel
  • Thomas Villmann
    Department of Mathematics, University of Applied Sciences Mittweida, Mittweida, Germany.