Decoding the interconnected splicing patterns of hepatitis B virus and host using large language and deep learning models

Journal: bioRxiv
Published Date:

Abstract

Hepatitis B virus (HBV) infection causes approximately one million deaths annually and remains a major driver of hepatocellular carcinoma. Despite its compact 3.2-kb genome, HBV exhibits extensive alternative splicing. Functionally, HBV splice variants contribute to immune evasion and reduce the likelihood of achieving a functional cure. Here, we show that HBV splicing efficiency (quantified from 279 RNA-seq libraries of HBV-associated liver biopsies and cultured cells) correlates more strongly with disease progression than the overall proportion of spliced HBV RNAs, an emerging biomarker. All HBV splice sites are embedded within protein-coding regions, forming a genetic architecture distinct from typical host splice sites. To decode the sequence determinants of HBV splicing, we apply transformer-based and deep learning models to 4,707 HBV genomes. These models reveal that more highly used HBV splice sites are more conserved and share features with host splice sites that are less frequently used but still functional. This similarity likely reflects constraints imposed by HBV’s compact genome, which must accommodate overlapping protein-coding regions. HBV may have evolved to exploit suboptimal but spliceable host-like motifs without disrupting its genetic architecture. Further analysis of splicing propensity across HBV genomes reveals genotype-specific patterns, indicating regulation by sequence context in a site- and genotype-dependent manner. HBV genotypes may have coevolved with their human hosts to fine-tune splicing through host-like features, supporting mechanisms of viral persistence and immune evasion. This study demonstrates the utility of AI in decoding viral splicing architectures and provides a framework for investigating co-transcriptional processes in other clinically important viruses.

Hepatitis B virus (HBV) is a major global health concern, causing 1.1 million deaths in 2022 and 1.2 million new infections each year. It is a leading cause of serious liver conditions, including cirrhosis and liver cancer. Although vaccines can prevent HBV infection, there is currently no cure. HBV produces different types of genetic messages (RNAs), including spliced versions that are processed by the host cell’s machinery. These spliced RNAs help the virus evade the immune system and make it harder for treatments to fully clear the infection. In this study, we analysed 279 HBV samples from liver tissues and lab-grown cells and found that the efficiency of host-mediated splicing of viral RNA reflects the severity of disease. Using advanced artificial intelligence tools, we mapped the splicing patterns in both the virus and human host, and investigated over 4,700 HBV genomes. We discovered that HBV splice sites resemble host splice sites that are used less frequently but remain functional, suggesting the virus has evolved RNA sequences that are compatible with the host cell’s splicing machinery while accommodating its compact genome. Insights into this viral adaptation may help researchers identify new biomarkers for disease severity and develop therapeutic strategies that disrupt the virus’s ability to exploit the host cell’s machinery.

The raw RNA-sequencing (RNA-seq) libraries analysed in this study were previously published and are available in the Gene Expression Omnibus under accession number GSE155983. These libraries form part of a curated collection of 279 RNA-seq datasets derived from HBV-associated liver biopsy tissues and cultured cells. Further details and metadata are available in the associated GitHub repository: https://github.com/lcscs12345/HBV_splicing_paper_2025.

Authors

  • Chun Shen Lim; Chris M. Brown