A machine learning framework for interpreting phylogenetic tree patterns in interkingdom horizontal gene transfer

Journal: bioRxiv
Published Date:

Abstract

Horizontal gene transfer (HGT), the movement of genetic material between unrelated organisms, is widely recognized as an important driver of genome evolution in bacteria. In eukaryotes, however, the evolutionary impact of HGT remains debated. The identification of interkingdom HGT (iHGT) is especially challenging due to the lack of gold-standard methods. Traditionally, iHGT identification has relied on manual inspection of phylogenetic trees, a process that is subjective, difficult to reproduce, and not scalable to large datasets. In this study, we present a computational framework that formalizes phylogenetic tree interpretation as a supervised machine-learning problem. We define five recurrent phylogenetic patterns: iHGT, NoHGT, Limited donor evidence, Multiple major clades (Multiple MC), and Patchy phylogeny. Together, these patterns capture both clear and ambiguous evolutionary scenarios. To operationalize them, we developed a feature-extraction pipeline that quantifies taxonomic composition and phylogenetic topology using seven biological descriptors derived from gene trees. These features were used to train and evaluate multiple machine-learning models, among which a Random Forest (RF) classifier achieved the best performance (AUC-ROC = 0.98; accuracy = 0.89). Model interpretability analyses revealed that topological distance to additional clades and lineage diversity are the most informative predictors, reflecting key signals used in expert-driven phylogenetic interpretation. The RF model was further validated using 1,000 simulated phylogenies and 1,438 real iHGT candidates, achieving low misclassification rates (7.8% and 10.43%, respectively). Benchmarking against AVP (Alienness vs. Predictor), a comparable tool for iHGT detection, demonstrated improved performance across all evaluation metrics, highlighting the advantages of incorporating global phylogenetic structure into the classification process.
This study provides a reproducible and scalable framework for phylogenetic pattern classification that captures complex evolutionary signals while maintaining biological interpretability. Beyond improving iHGT detection, the approach offers a more nuanced representation of evolutionary scenarios by explicitly accounting for inconclusive cases, supporting more robust inference in comparative genomics.
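
To make the two most informative predictors named above more concrete, the following is a minimal sketch of how a topological distance and a lineage-diversity score could be computed from a gene tree. It is not the authors' pipeline: the abstract does not define the seven descriptors, so the toy tree, the leaf names, the lineage labels, and the functions `topological_distance`, `min_distance_to_lineage`, and `lineage_diversity` are all hypothetical. The sketch assumes that distance is counted in edges and that diversity is measured as Shannon entropy.

```python
import math
from collections import Counter

# Hypothetical toy gene tree stored as child -> parent links (not real data).
PARENT = {
    "query": "n1", "fungus_A": "n1",
    "n1": "n2", "fungus_B": "n2",
    "n2": "root", "n3": "root",
    "bacterium_A": "n3", "bacterium_B": "n3",
}

# Hypothetical lineage label for every leaf except the query sequence.
LINEAGE = {
    "fungus_A": "Fungi", "fungus_B": "Fungi",
    "bacterium_A": "Bacteria", "bacterium_B": "Bacteria",
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def topological_distance(a, b):
    """Count the edges on the path between nodes a and b (assumed metric)."""
    steps_from_a = {n: i for i, n in enumerate(path_to_root(a))}
    for steps_from_b, n in enumerate(path_to_root(b)):
        if n in steps_from_a:  # first shared ancestor on b's path
            return steps_from_a[n] + steps_from_b
    raise ValueError("nodes are not in the same tree")


def min_distance_to_lineage(query, lineage):
    """Smallest edge distance from the query to any leaf of one lineage."""
    return min(
        topological_distance(query, leaf)
        for leaf, label in LINEAGE.items()
        if label == lineage
    )


def lineage_diversity(labels):
    """Shannon entropy (bits) of lineage labels; an assumed diversity score."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    print(min_distance_to_lineage("query", "Fungi"))
    print(min_distance_to_lineage("query", "Bacteria"))
    print(lineage_diversity(LINEAGE.values()))
```

In a supervised setting such as the one described, values like these would form part of a per-tree feature vector passed to a classifier. Counting edges rather than summing branch lengths is a simplification made here for illustration.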

Authors

  • Aguirre-Carvajal, K.
  • Armijos-Jaramillo, V.
  • Munteanu, C. R.

Categories