XVir: A Transformer-Based Architecture for Identifying Viral Reads from Cancer Samples.

Journal: Journal of Computational Biology: A Journal of Computational Molecular Cell Biology
Published Date:

Abstract

An estimated 15% of cancers worldwide can be linked to viral infections. The viruses that can cause or increase the risk of cancer include human papillomavirus, hepatitis B and C viruses, Epstein-Barr virus, and human immunodeficiency virus, among others. Advances in sequencing technologies have enabled the collection of massive amounts of tumor DNA data, and computational analysis of these data has allowed studies of the potential association between cancers and viral pathogens. However, the high diversity of oncoviral families makes reliable detection of viral DNA difficult and makes the training of machine learning models for such analysis computationally challenging. We introduce XVir, a data pipeline that deploys a transformer-based deep learning architecture to reliably identify viral DNA present in human tumors. XVir is trained on a mix of sequencing reads from viral and human genomes, resulting in a model capable of robust detection of potentially mutated viral DNA across a range of experimental settings. Results on semi-experimental data demonstrate that XVir achieves high classification accuracy, generally outperforming state-of-the-art competing methods. In particular, it retains high accuracy even when faced with diverse viral populations, while being significantly faster to train than other large deep learning-based classifiers.
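To make the training setup described above more concrete, the following is a minimal sketch of how a labeled mix of human and viral reads, with substitutions applied to the viral reads to mimic mutated viral DNA, might be assembled and tokenized before being passed to a sequence classifier. It is not the XVir implementation: the function names `sample_reads` and `kmer_tokens`, the read length of 150, the k-mer size of 6, the 5% substitution rate, and the random toy genomes are all assumptions chosen for illustration, and the abstract does not state how XVir tokenizes reads.

```python
import random

BASES = "ACGT"


def sample_reads(genome, read_len, n, label, rng, mutation_rate=0.0):
    """Draw n fixed-length reads from genome and attach a class label.

    Each base is substituted with probability mutation_rate, a crude
    stand-in for mutated viral DNA (hypothetical helper, not from XVir).
    """
    reads = []
    for _ in range(n):
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        for i, base in enumerate(read):
            if rng.random() < mutation_rate:
                # Replace with one of the three other bases.
                read[i] = rng.choice([b for b in BASES if b != base])
        reads.append(("".join(read), label))
    return reads


def kmer_tokens(read, k):
    """Split a read into overlapping k-mers (assumed tokenization)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


rng = random.Random(0)

# Toy placeholder genomes; real data would come from reference sequences.
human_genome = "".join(rng.choice(BASES) for _ in range(2000))
viral_genome = "".join(rng.choice(BASES) for _ in range(2000))

# Label 0 = human, label 1 = viral (with simulated substitutions).
dataset = sample_reads(human_genome, 150, 50, 0, rng)
dataset += sample_reads(viral_genome, 150, 50, 1, rng, mutation_rate=0.05)
rng.shuffle(dataset)

# Token sequences of this kind could be embedded and fed to a transformer.
tokenized = [(kmer_tokens(read, 6), label) for read, label in dataset]
```

Each entry of `tokenized` pairs a list of k-mer tokens with a binary label, which is the general shape of input a transformer-based read classifier would consume after mapping tokens to embeddings.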

Authors

  • Shorya Consul
    Chandra Family Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas, USA.
  • John Robertson
    Chandra Family Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas, USA.
  • Haris Vikalo
