Multiple versus pairwise sequence alignments for protein phylogenetics using foundation models
Journal: bioRxiv
Published Date: May 29, 2026
Abstract
Phylogenetic inference is a common task in molecular and evolutionary biology and has conventionally required a multiple sequence alignment (MSA), a statistical model of amino acid substitutions, and an optimality principle. Recently, global models of amino acid substitutions have been inferred from millions of MSAs using transformer-based deep learning, resulting in protein foundation models (pFMs), also known as protein language models (PLMs). Training pFMs on MSAs hypothetically enables them to encode residue dependencies and the phylogenetic structure of the MSA collection. In contrast, pFMs trained on individual sequences lack access to such phylogenetic structure. Here, we assess the gains in phylogenetic inference offered by training pFMs on MSAs by comparing the accuracies of phylogenies inferred using two types of pFMs: one trained on a large collection of MSAs (msat-pFM [1]) and the other trained on a collection of single sequences (esm-pFM [2]). For the msat-pFM analysis, we inferred neighbor-joining trees using pairwise distances estimated directly from the sequence attention matrices. For esm-pFM, pairwise distances were obtained from the correlation of attentions of homologous residues, where pairwise sequence alignments (PSAs) were used to establish residue homologies. Surprisingly, MSA phylogenies inferred using msat-pFM were less accurate than those inferred using esm-pFM. This pattern was seen across datasets spanning both small and large numbers of species and proteins. Also, PSA phylogenies obtained using residue attentions from early esm-pFM layers were much more accurate. These results suggest that the MSA step, which is obligatory to establish residue homologies across multiple sequences, may not add information when evolutionary distances are based on attentions in pFMs.
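To make the PSA-based distance concrete, the following is a minimal sketch of one way a correlation of attentions over homologous residues could be turned into a pairwise distance. It is not the authors' implementation. The function name `attention_distance`, the per-residue feature layout, and the conversion "one minus the mean Pearson correlation" are all assumptions made for illustration; the abstract states only that the correlation of attentions of homologous residues was used.

```python
import numpy as np


def attention_distance(feat_a, feat_b, aligned_pairs):
    """Hypothetical distance: 1 - mean Pearson correlation over aligned residues.

    feat_a, feat_b: arrays of shape (sequence_length, n_features), one row of
        attention-derived features per residue (assumed layout).
    aligned_pairs: iterable of (i, j) index pairs that a pairwise sequence
        alignment marks as homologous; gapped positions are simply absent.
    """
    feat_a = np.asarray(feat_a, dtype=float)
    feat_b = np.asarray(feat_b, dtype=float)
    correlations = []
    for i, j in aligned_pairs:
        # Skip constant rows, for which the Pearson correlation is undefined.
        if np.std(feat_a[i]) == 0 or np.std(feat_b[j]) == 0:
            continue
        correlations.append(np.corrcoef(feat_a[i], feat_b[j])[0, 1])
    if not correlations:
        raise ValueError("no aligned residue pairs with a defined correlation")
    return 1.0 - float(np.mean(correlations))


# Toy usage with random features standing in for real attention values.
rng = np.random.default_rng(0)
seq_a = rng.normal(size=(6, 16))   # 6 residues, 16 features each
seq_b = rng.normal(size=(5, 16))   # 5 residues, 16 features each
pairs = [(0, 0), (1, 1), (3, 2), (4, 3), (5, 4)]  # residue 2 of seq_a is gapped
d_ab = attention_distance(seq_a, seq_b, pairs)
```

Computing such a distance for every pair of sequences gives a symmetric matrix that can be passed to any neighbor-joining implementation. Because only pairwise alignments enter the calculation, no MSA is needed at this step.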