upsML: A high-accuracy machine learning classifier for predicting Plasmodium falciparum var gene upstream groups.

Journal: PLOS ONE

Abstract

Plasmodium falciparum erythrocyte membrane protein 1 (PfEMP1), encoded by the hypervariable var gene family, is central to malaria pathogenesis, influencing both disease severity and immune evasion. Classifying var genes into upstream groups (upsA, upsB, upsC, upsE) is important for understanding parasite biology and clinical outcomes, but remains challenging, especially with partial sequences, such as the DBLα tag or RNA-Seq assemblies. We developed upsML, a machine-learning-based classifier trained on 2,530 curated var genes, to accurately assign upstream groups based on sequence features from different partial gene regions. We compared seven methods, including support vector machines, random forests, XGBoost, and HMMER models. Several models in upsML achieve accuracies of 83% for DBLα-tag sequences and 92% for full-length PfEMP1 sequences, thereby significantly outperforming existing tools. Additionally, we developed a model to distinguish internal from subtelomeric var genes, which we applied to a global collection of P. falciparum genomes, revealing a higher frequency of internal var genes in Asia. upsML is available at https://github.com/sii-scRNA-Seq/upsML, providing a robust and efficient resource for large-scale var gene analysis. It can classify var genes from 20 genomes in under one second.
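As a rough illustration of what assigning an upstream group from sequence features can look like, the sketch below builds normalized k-mer frequency profiles and assigns a query to the group with the nearest mean profile. It is not the upsML implementation: the abstract does not specify the feature encoding, the nearest-centroid rule is a deliberately simple stand-in for the SVM, random forest, XGBoost, and HMMER models that were compared, and every function name and the toy sequences are hypothetical.

```python
from collections import Counter
import math


def kmer_profile(seq, k=2):
    """Return normalized k-mer frequencies of a sequence as a sparse dict."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def centroid(profiles):
    """Average a non-empty list of sparse k-mer profiles."""
    acc = Counter()
    for profile in profiles:
        for kmer, value in profile.items():
            acc[kmer] += value
    return {kmer: value / len(profiles) for kmer, value in acc.items()}


def distance(p, q):
    """Euclidean distance between two sparse profiles."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in keys))


def train(labelled):
    """Build one centroid per group from (sequence, group) pairs."""
    by_group = {}
    for seq, group in labelled:
        by_group.setdefault(group, []).append(kmer_profile(seq))
    return {group: centroid(ps) for group, ps in by_group.items()}


def predict(model, seq):
    """Return the group whose centroid is closest to the query profile."""
    profile = kmer_profile(seq)
    return min(model, key=lambda group: distance(profile, model[group]))


# Toy training data: made-up strings, NOT real var gene sequences.
toy_training = [
    ("AAAAAAAA", "upsA"),
    ("AAAAAAAC", "upsA"),
    ("WWWWWWWW", "upsB"),
    ("WWWWWWWY", "upsB"),
]
model = train(toy_training)
print(predict(model, "AAAACAAA"))
```

A real classifier would be trained on curated, labelled var genes and evaluated on held-out sequences. The toy data here only demonstrates the mechanics.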
