Quantifying and Predicting the Difficulty of Multiple Sequence Alignment with AlDiScore

Journal: bioRxiv
Published Date:

Abstract

Multiple Sequence Alignment (MSA) constitutes an important and frequent operation in molecular sequence data analysis. There exist numerous tools, algorithms, and criteria to infer an MSA. This plethora of available approaches to MSA may induce an ensemble of divergent MSAs for the same underlying unaligned sequence set. Even a single MSA tool may infer distinct MSAs when varying the input parameters. Hence, when using a diversified set of MSA algorithms and parameterizations, the observed dispersion within an MSA ensemble expresses the difficulty of inferring a robust alignment. We refer to this notion as MSA difficulty. As downstream analyses heavily rely on the MSA, characterizing MSA difficulty for a given unaligned sequence set is critical. First, we show that measures of dispersion within diversified MSA ensembles can reliably predict MSA difficulty. We then assess the adequacy of these measures by computing the average reference-based distance between the MSAs in the MSA ensemble and the corresponding structural reference MSA, and subsequently comparing this distance to the corresponding reference-free average distance over all MSA pairs in the ensemble. We find that Blackburne and Whelan's dpos alignment metric is most appropriate, as its reference-free counterpart most accurately approximates the reference-based difficulty computed on BAliBASE reference data. We therefore use the average pairwise distance measured by dpos to quantify MSA difficulty on a scale from 0 (easy) to 1 (difficult) given an MSA ensemble. Next, we introduce the AlDiScore open-source tool, which uses machine learning to directly and reliably predict reference-free difficulty scores from unaligned sequence sets, thereby avoiding expensive MSA computations entirely. The underlying regression model relies upon a large set of features, including sampling-based measures of transitive consistency.
We trained our AlDiScore models on a diverse collection of empirical datasets from BAliBASE, TreeBASE, and published studies. Subsequently, we demonstrate that AlDiScore attains an R² of 0.89 and 0.84 on unseen AA and DNA sequence sets, respectively, extracted from the PANDIT v17 database. Finally, we show that there is no correlation between MSA difficulty and the corresponding phylogenetic difficulty of the respective MSA.
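The two ensemble averages described in the abstract (reference-free over all MSA pairs, and reference-based against a structural reference) can be illustrated with a minimal Python sketch. This is not the AlDiScore implementation, and the distance used here is not dpos: `pair_distance` is a simple stand-in (one minus the Jaccard overlap of aligned residue pairs) chosen only because it also ranges from 0 to 1. All function names and the dictionary-based alignment representation are assumptions made for illustration.

```python
from itertools import combinations
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical representation: sequence name -> gapped alignment row.
Alignment = Dict[str, str]


def aligned_pairs(msa: Alignment, gap: str = "-") -> Set[Tuple[str, int, str, int]]:
    """Collect all residue pairs that the MSA places in the same column."""
    names = sorted(msa)
    length = len(msa[names[0]])
    residue_index = {name: 0 for name in names}  # ungapped position per sequence
    pairs = set()
    for col in range(length):
        present = []
        for name in names:
            if msa[name][col] != gap:
                present.append((name, residue_index[name]))
                residue_index[name] += 1
        for (n1, i1), (n2, i2) in combinations(present, 2):
            pairs.add((n1, i1, n2, i2))
    return pairs


def pair_distance(x: Alignment, y: Alignment) -> float:
    """Stand-in distance in [0, 1]; NOT the dpos metric named in the abstract."""
    px, py = aligned_pairs(x), aligned_pairs(y)
    union = px | py
    if not union:
        return 0.0
    return 1.0 - len(px & py) / len(union)


def reference_free_difficulty(
    ensemble: List[Alignment],
    distance: Callable[[Alignment, Alignment], float] = pair_distance,
) -> float:
    """Average distance over all unordered MSA pairs in the ensemble."""
    if len(ensemble) < 2:
        raise ValueError("need at least two MSAs in the ensemble")
    dists = [distance(a, b) for a, b in combinations(ensemble, 2)]
    return sum(dists) / len(dists)


def reference_based_difficulty(
    ensemble: List[Alignment],
    reference: Alignment,
    distance: Callable[[Alignment, Alignment], float] = pair_distance,
) -> float:
    """Average distance between each ensemble MSA and a reference MSA."""
    if not ensemble:
        raise ValueError("ensemble must not be empty")
    return sum(distance(m, reference) for m in ensemble) / len(ensemble)


# Toy ensemble: two different alignments of the sequences "ACG" and "AG".
msa_a = {"s1": "ACG", "s2": "A-G"}
msa_b = {"s1": "ACG", "s2": "AG-"}
toy_score = reference_free_difficulty([msa_a, msa_b, msa_a])
```

Passing a different `distance` callable would let the same averaging be used with any alignment metric that returns values between 0 and 1; with the stand-in above, an ensemble of identical MSAs scores 0.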

Authors

  • Bodynek, M.
  • Martin-Fernandez, L.
  • Bettisworth, B.
  • Haag, J.
  • Stamatakis, A.

Categories