VHost-Classifier: virus-host classification using natural language processing.

Journal: Bioinformatics (Oxford, England)
Published Date:

Abstract

MOTIVATION: When analyzing viral metagenomic sequences, it is often desirable to filter the results of a BLAST analysis by the host species of the virus. VHost-Classifier automates this procedure using a natural language processing algorithm written in Python 3. As input, it takes a list of taxonomic identifiers (taxids) returned from a BLAST query of viral sequences. The taxids are binned by the evolutionary lineage of their hosts, based on string matching of the words in the viruses' English names. If VHost-Classifier cannot identify a host, it attempts to bin the sequences by the environment from which the sample originated. Because VHost-Classifier predicts the evolutionary lineage of the host from the virus name and does not rely on referencing taxids against a database, it is not constrained by the size of a database and can classify the hosts of newly characterized viruses.
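
The name-based binning described above can be illustrated with a minimal Python sketch. This is not the VHost-Classifier implementation: the keyword tables, the function names classify_host and bin_viruses, and the placeholder taxids are all assumptions made for illustration. The sketch only shows the general idea of matching words in a virus name to a host lineage, with a fallback to an environment label.

```python
import re

# Hypothetical keyword tables (assumptions for illustration only; the real
# tool's vocabulary and lineage format are not given in the abstract).
HOST_KEYWORDS = {
    "tobacco": ("Eukaryota", "Viridiplantae"),
    "escherichia": ("Bacteria",),
}
ENVIRONMENT_KEYWORDS = {
    "marine": "marine",
    "soil": "soil",
}


def classify_host(virus_name):
    """Return (kind, label) for a virus name using simple word matching."""
    words = re.findall(r"[a-z]+", virus_name.lower())
    # First try to match a host keyword anywhere in the name.
    for word in words:
        if word in HOST_KEYWORDS:
            return ("host", HOST_KEYWORDS[word])
    # If no host is found, fall back to an environment keyword.
    for word in words:
        if word in ENVIRONMENT_KEYWORDS:
            return ("environment", ENVIRONMENT_KEYWORDS[word])
    return ("unclassified", None)


def bin_viruses(names_by_taxid):
    """Group taxids by the (kind, label) bin predicted from each name."""
    bins = {}
    for taxid, name in names_by_taxid.items():
        bins.setdefault(classify_host(name), []).append(taxid)
    return bins


if __name__ == "__main__":
    # Placeholder taxids (1, 2, 3), not real NCBI identifiers.
    example = {
        1: "Tobacco mosaic virus",
        2: "Escherichia phage T4",
        3: "uncultured marine virus",
    }
    for bin_key, taxids in bin_viruses(example).items():
        print(bin_key, taxids)
```

In this sketch the lookup needs only the virus name, which mirrors the abstract's point that no taxid-to-host database is consulted; a name containing no known keyword simply lands in the unclassified bin.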

Authors

  • Ezra Kitson
    Department of Microbiology and Immunology, University of British Columbia, Vancouver, BC, Canada.
  • Curtis A Suttle
    Department of Microbiology and Immunology, University of British Columbia, Vancouver, BC, Canada.