The effect of taxonomic, host-dependent features and sample bias on virus host prediction using machine learning and short sequence k-mers.

Journal: Scientific Reports
Published Date:

Abstract

Metaviromic studies of potential reservoirs of emerging infections have led to the discovery of many novel viruses. Since metaviromes contain viruses from the target host, its food, or other sources, fast and robust approaches are needed to predict the hosts of unknown viruses from their genome data. Four machine learning algorithms (random forest, two gradient boosting machines, and a support vector machine) were used here to predict the hosts of RNA viruses that infect mammals, insects and plants. Prediction efficiency depended largely on the dataset composition. In the more challenging task of predicting hosts of unknown virus genera, a median weighted F1-score of 0.79 was achieved using a support vector machine and 4-mer frequencies, a notable improvement over baseline methods (median weighted F1-scores of 0.68 for the homology-based tBLASTx and 0.72 for ML trained on mono-, di- and trinucleotide frequencies). More complicated features and feature combinations gave worse results. When predicting the hosts of short virus sequence fragments, prediction quality decreased, but training on same-length fragments instead of full genomes consistently improved it. Therefore, short k-mers carry sufficient information to predict the hosts of novel RNA virus genera. This approach can be useful for rapid analysis of metaviromic data to highlight potential biological threats.
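
As an illustration of the k-mer features mentioned in the abstract, the following minimal sketch shows one way to turn a nucleotide sequence into a 4-mer frequency vector, and to cut a genome into same-length fragments. It is not the authors' implementation. The function names `kmer_frequencies` and `split_into_fragments`, the normalization by the number of valid windows, the skipping of windows with ambiguous bases, and the non-overlapping fragmentation are all assumptions made for this example.

```python
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return relative frequencies of all 4**k k-mers (hypothetical helper).

    Assumptions: U is treated as T, windows containing any other
    character (for example N) are skipped, and counts are divided by
    the number of valid windows.
    """
    seq = sequence.upper().replace("U", "T")
    # Fixed lexicographic order (A < C < G < T) so vectors are comparable.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


def split_into_fragments(sequence, length):
    """Cut a sequence into non-overlapping fragments of equal length.

    A trailing remainder shorter than `length` is dropped (assumption).
    """
    return [
        sequence[i:i + length]
        for i in range(0, len(sequence) - length + 1, length)
    ]


# Toy usage: a 4-mer vector has 4**4 = 256 entries.
features = kmer_frequencies("ACGUACGU", k=4)
fragments = split_into_fragments("ACGTACGTAC", 4)
```

Vectors of this kind could be passed as feature rows to any of the classifiers named in the abstract; computing them on fragments from `split_into_fragments` rather than on full genomes mirrors the same-length training setup described there.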

Authors

  • Fedor S Perelygin
    Martsinovsky Institute of Medical Parasitology, Tropical and Vector Borne Diseases, First Moscow State Medical University (Sechenov University), Moscow, 119435, Russian Federation.
  • Alexander N Lukashev
    Martsinovsky Institute of Medical Parasitology, Tropical and Vector Borne Diseases, First Moscow State Medical University (Sechenov University), Moscow, 119435, Russian Federation.
  • Yulia A Aleshina
    Martsinovsky Institute of Medical Parasitology, Tropical and Vector Borne Diseases, First Moscow State Medical University (Sechenov University), Moscow, 119435, Russian Federation. vjulia94@gmail.com.