HViLM: A Foundation Model for Viral Genomics Enables Multi-Task Prediction of Pathogenicity, Transmissibility, and Host Tropism
Journal: bioRxiv
Published Date: Mar 20, 2026
Abstract
Motivation: The emergence of novel viral pathogens poses critical threats to global health, yet current computational approaches for viral risk assessment are predominantly virus-specific and require extensive retraining for each new threat. Methods that can rapidly characterize emerging viruses across multiple epidemiologically relevant dimensions (pathogenicity, host tropism, and transmissibility) are urgently needed to inform public health responses and guide experimental prioritization.
Results: We present HViLM (Human Virome Language Model), the first foundation model for pan-viral genomic analysis, built by continued pre-training of DNABERT-2 on 5 million non-redundant viral sequences. These sequences were obtained by clustering 25 million sequence chunks with MMseqs2 at 80% identity, and they span 9,000 species across more than 45 viral families from the VIRION database. We introduce the Human Virome Understanding Evaluation (HVUE) benchmark, comprising seven curated datasets across three prediction tasks: pathogenicity classification, host tropism prediction, and transmissibility assessment. With parameter-efficient fine-tuning using LoRA, HViLM achieves state-of-the-art performance, with average accuracies of 95.32% for pathogenicity, 96.25% for host tropism, and 97.36% for transmissibility assessment. The model demonstrates robust cross-family generalization, substantially outperforming sequence-similarity baselines and general genomic foundation models. Attention-based interpretability analysis reveals that HViLM captures biologically meaningful pathogenicity determinants based on molecular mimicry of host regulatory elements, including eight independent sequences that convergently evolved to target Interferon Regulatory Factor 1 (Irf1) for immune evasion.
Availability: The HVUE benchmark datasets, training scripts, and complete implementation are publicly available on GitHub (https://github.com/duttaprat/HViLM). Pre-trained HViLM-base model weights and fine-tuned task-specific variants are available on Hugging Face (https://huggingface.co/duttaprat/HViLM-base).
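For readers unfamiliar with LoRA, the parameter-efficient fine-tuning step mentioned in the abstract can be illustrated with a toy example. The following is a minimal sketch of the general low-rank adaptation idea, not the authors' training code: the matrix sizes, rank, scaling factor, and all function names are hypothetical choices made for illustration. It shows that a frozen weight matrix W is adjusted by a low-rank product, W + (alpha / r) * B A, so that only the small matrices A and B need to be trained.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the weight a LoRA layer effectively applies."""
    scale = alpha / rank
    delta = matmul(b, a)  # (d_out x rank) @ (rank x d_in) -> d_out x d_in
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def count_params(matrix):
    """Count the scalar entries of a matrix stored as a list of rows."""
    return sum(len(row) for row in matrix)


random.seed(0)
d_out, d_in, rank, alpha = 64, 64, 4, 8  # hypothetical toy sizes, not from the paper

# Frozen pre-trained weight (random here; in practice it comes from the base model).
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable low-rank factors: A gets a small random init, B starts at zero,
# so the adapted layer initially behaves exactly like the frozen one.
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
b = [[0.0] * rank for _ in range(d_out)]

w_eff = lora_effective_weight(w, a, b, alpha, rank)
trainable = count_params(a) + count_params(b)
frozen = count_params(w)
print(f"trainable: {trainable}, frozen: {frozen}")
```

In this toy setting the adapter trains 512 values against 4,096 frozen ones, and the gap widens as the layer dimensions grow relative to the rank. Libraries such as Hugging Face PEFT apply this kind of update to selected layers of a pre-trained transformer; the abstract does not state which implementation HViLM uses.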