HR-VILAGE-3K3M: A Human Respiratory Viral Immunization Longitudinal Gene Expression Dataset for Systems Immunity
Journal: arXiv
Published Date: May 19, 2025
Abstract
Respiratory viral infections pose a global health burden, yet the cellular
immune responses driving protection or pathology remain unclear. Natural
infection cohorts often lack pre-exposure baseline data and structured temporal
sampling. In contrast, inoculation and vaccination trials generate insightful
longitudinal transcriptomic data. However, the scattering of these datasets
across platforms, along with inconsistent metadata and preprocessing procedures,
hinders AI-driven discovery. To address these challenges, we developed the
Human Respiratory Viral Immunization LongitudinAl Gene Expression
(HR-VILAGE-3K3M) repository: an AI-ready, rigorously curated dataset that
integrates 14,136 RNA-seq profiles from 3,178 subjects across 66 studies
encompassing over 2.56 million cells. Spanning vaccination, inoculation, and
mixed exposures, the dataset includes microarray, bulk RNA-seq, and single-cell
RNA-seq from whole blood, PBMCs, and nasal swabs, sourced from GEO, ImmPort,
and ArrayExpress. We harmonized subject-level metadata, standardized outcome
measures, applied unified preprocessing pipelines with rigorous quality
control, and aligned all data to official gene symbols. To demonstrate the
utility of HR-VILAGE-3K3M, we performed predictive modeling of vaccine
responders and evaluated batch-effect correction methods. Beyond these initial
demonstrations, it supports diverse systems immunology applications and
benchmarking of feature selection and transfer learning algorithms. Its scale
and heterogeneity also make it ideal for pretraining foundation models of the
human immune response and for advancing multimodal learning frameworks. As the
largest longitudinal transcriptomic resource for human respiratory viral
immunization, it provides an accessible platform for reproducible AI-driven
research, accelerating systems immunology and vaccine development against
emerging viral threats.
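
As an illustration of what aligning data to official gene symbols can involve, the following is a minimal sketch and not the authors' pipeline. It assumes a toy alias table (`ALIAS_TO_OFFICIAL`), a hypothetical helper name (`align_to_official_symbols`), and an averaging rule for rows that collapse onto the same symbol; the abstract does not state how such duplicates were handled, so that rule is only one plausible choice.

```python
from collections import defaultdict

# Illustrative alias table (assumption): a real pipeline would load a
# complete alias-to-official mapping from a curated nomenclature resource.
ALIAS_TO_OFFICIAL = {"SEPT2": "SEPTIN2", "MARCH1": "MARCHF1"}


def align_to_official_symbols(expr, alias_map):
    """Rename genes to official symbols and merge duplicated rows.

    expr maps a gene label to a list of per-sample expression values.
    Rows that resolve to the same official symbol are averaged per sample
    (an assumed rule, chosen here only for simplicity).
    """
    grouped = defaultdict(list)
    for gene, values in expr.items():
        label = gene.strip()
        official = alias_map.get(label, label)
        grouped[official].append(values)

    aligned = {}
    for symbol, rows in grouped.items():
        # zip(*rows) walks the rows sample by sample.
        aligned[symbol] = [sum(col) / len(col) for col in zip(*rows)]
    return aligned


# Toy expression matrix: two samples, with one gene listed under both
# an outdated alias and its official symbol.
toy = {"SEPT2": [2.0, 4.0], "SEPTIN2": [4.0, 6.0], "IFI27": [1.0, 9.0]}
aligned = align_to_official_symbols(toy, ALIAS_TO_OFFICIAL)
```

In this toy input, the `SEPT2` and `SEPTIN2` rows are merged into a single `SEPTIN2` row, while `IFI27` passes through unchanged. Other merge rules, such as keeping the row with the highest mean expression, would fit the same structure.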