Novel Machine Learning-based Approach to Identify Viral Biomarkers of Human Respiratory Emissions from Oral and Nasal Metagenomes
Journal:
bioRxiv
Published Date:
Jan 1, 2025
Abstract
Humans spend approximately 90% of their lives in built environments, making virus transmission indoors a key determinant of health. Environmental sampling of respiratory viral pathogens is often challenging because of frequent non-detect measurements. A non-detect measurement does not differentiate samples containing little or no pathogen from samples that simply lack respiratory expulsions altogether. This ambiguity can be resolved by scanning samples for a biomarker of human respiratory emissions. To do so, reliable biomarkers for environmental monitoring need to be identified. Ideal biomarkers are prevalent across individuals, abundant, and unique to the human respiratory tract. Here, we present a new machine learning-based approach to query publicly available metagenomes for suitable biomarker candidates and apply it to identify viral biomarkers of healthy oral and nasal microbiomes. Twelve viral biomarker candidates were selected from 1,232 curated viral operational taxonomic units. Individual candidates reached up to 63% prevalence across respiratory metagenomes, and prevalence increased further to 77-81% when two or three biomarkers were combined. Quantitative PCR confirmed that these viral biomarkers were prevalent and abundant in nasal swabs and saliva samples. Notably, top candidate biomarkers remained stable and detectable through multiple lab purification steps, increasing confidence in their viral origins and demonstrating their suitability for environmental monitoring. These findings demonstrate that existing metagenomes can be used to identify effective biomarker candidates for environmental sampling.
Developing non-pharmaceutical interventions to reduce virus transmission indoors relies on robust environmental monitoring methods. Monitoring viral pathogens is challenging because frequent non-detect measurements introduce uncertainty. For instance, a non-detect measurement could indicate either the absence of the pathogen or simply a lack of human respiratory activity, and thus of exposure. To help distinguish these scenarios, this study uses publicly available sequencing data to identify viruses from the human respiratory tract that can be incorporated into environmental monitoring as biomarkers of human respiratory activity. These viral biomarkers will improve indoor monitoring and help enact interventions to mitigate virus transmission. Furthermore, our approach to identifying biomarkers from existing metagenomes can be adapted for future biomarker identification in any system.
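The reported gain in prevalence from combining two or three biomarkers can be illustrated with a minimal sketch. It assumes that a combination counts as detected in a sample when at least one of its members is present, which the abstract does not state explicitly. The function names (`prevalence`, `best_combination`), the marker labels, and the presence/absence values are hypothetical and are not taken from the study.

```python
from itertools import combinations


def prevalence(presence, markers):
    """Fraction of samples in which at least one of the given markers is present."""
    n_samples = len(next(iter(presence.values())))
    hits = sum(
        any(presence[m][i] for m in markers)
        for i in range(n_samples)
    )
    return hits / n_samples


def best_combination(presence, k):
    """Exhaustively search for the k-marker set with the highest combined prevalence."""
    return max(
        combinations(sorted(presence), k),
        key=lambda combo: prevalence(presence, combo),
    )


# Toy presence/absence calls for 10 samples (illustrative only, NOT study data).
toy = {
    "vOTU_A": [1, 1, 1, 0, 0, 0, 1, 0, 1, 1],
    "vOTU_B": [0, 0, 1, 1, 1, 0, 0, 0, 1, 0],
    "vOTU_C": [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
}

single = prevalence(toy, ["vOTU_A"])
pair = best_combination(toy, 2)
pair_prevalence = prevalence(toy, pair)
triple_prevalence = prevalence(toy, best_combination(toy, 3))
```

In this toy example, the best single marker covers 6 of 10 samples, the best pair covers 8, and all three together cover 9. An exhaustive search is feasible for a dozen candidates; a larger candidate pool would call for a greedy or otherwise approximate selection.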