A DNA Methylation Classification Model Predicts Organ and Disease Site
Journal: arXiv
Published Date: May 30, 2025
Abstract
Cell-free DNA (cfDNA) analysis is a powerful, minimally invasive tool for
monitoring disease progression, treatment response, and early detection. A
major challenge, however, is accurately determining the tissue of origin,
especially in complex or heterogeneous disease contexts. To address this, we
developed a machine learning framework that leverages tissue-specific DNA
methylation signatures to classify both tissue and disease origin from cfDNA
data. Our model integrates methylation datasets across diverse epigenomic
platforms, including Whole Genome Bisulfite Sequencing (WGBS), Illumina
Infinium Bead Arrays, and Enzymatic Methyl-seq (EM-seq). To account for
platform variability and data sparsity, we applied imputation strategies and
harmonized CpG features to enable cross-platform learning. Dimensionality
reduction revealed clear tissue-specific clustering of methylation profiles. A
random forest classifier trained on these features achieved consistent
classification performance (accuracy of 0.75 to 0.8 across test sets and platforms).
Notably, our model distinguished clinically relevant tissues such as inflamed
synovium and peripheral blood mononuclear cells (PBMCs) in arthritis patients
and deconvoluted synthetic cfDNA mixtures mimicking real-world liquid biopsy
samples. The predicted tissue proportions closely matched the true values,
demonstrating the model's potential for both classification and quantitative
inference. These results support the feasibility of using cross-platform
methylation data and machine learning for scalable, generalizable cfDNA
diagnostics and lay the groundwork for future integration of disease-specific
epigenetic features to guide clinical decision-making in precision medicine.
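
The harmonization, imputation, and mixture-deconvolution steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not name the imputation strategy or the deconvolution method, so this example assumes per-CpG mean imputation and swaps in a reference-based non-negative least squares fit (multiplicative updates) in place of the paper's model. The random forest classifier is not reproduced here. All CpG IDs, beta values, tissue profiles, and function names are hypothetical.

```python
from statistics import mean

def harmonize_cpgs(profiles):
    """Return the sorted CpG IDs shared by every profile (platform intersection)."""
    return sorted(set.intersection(*(set(p) for p in profiles)))

def impute_and_stack(profiles, cpgs):
    """Build one dense row per profile, mean-imputing missing (None) values per CpG."""
    col_mean = {}
    for c in cpgs:
        seen = [p[c] for p in profiles if p[c] is not None]
        col_mean[c] = mean(seen) if seen else 0.5  # arbitrary fallback (assumption)
    return [[p[c] if p[c] is not None else col_mean[c] for c in cpgs]
            for p in profiles]

def deconvolve(reference, mixture, n_iter=2000, eps=1e-12):
    """Estimate non-negative tissue proportions from a mixed methylation profile.

    Minimizes ||mixture - sum_i w_i * reference[i]||^2 subject to w >= 0 using
    multiplicative updates, then rescales the weights to sum to 1.
    """
    k, n = len(reference), len(mixture)
    # Precompute R m and the Gram matrix R R^T.
    rm = [sum(reference[i][j] * mixture[j] for j in range(n)) for i in range(k)]
    gram = [[sum(reference[a][j] * reference[b][j] for j in range(n))
             for b in range(k)] for a in range(k)]
    w = [1.0 / k] * k
    for _ in range(n_iter):
        w = [w[i] * rm[i] / (sum(gram[i][b] * w[b] for b in range(k)) + eps)
             for i in range(k)]
    total = sum(w)
    return [x / total for x in w]

# Hypothetical reference profiles (beta values in [0, 1]) from different platforms.
tissues = ["synovium", "pbmc", "liver"]
profiles = [
    {"cg01": 0.9, "cg02": 0.1, "cg03": 0.2, "cg04": 0.8, "cg05": 0.4},  # extra site
    {"cg01": 0.1, "cg02": 0.9, "cg03": None, "cg04": 0.2},              # missing value
    {"cg01": 0.2, "cg02": 0.2, "cg03": 0.9, "cg04": 0.5},
]

cpgs = harmonize_cpgs(profiles)              # sites covered by all platforms
reference = impute_and_stack(profiles, cpgs)

# Synthetic cfDNA mixture with known proportions, mimicking a liquid biopsy sample.
true_props = [0.6, 0.3, 0.1]
mixture = [sum(w * row[j] for w, row in zip(true_props, reference))
           for j in range(len(cpgs))]

estimated = deconvolve(reference, mixture)
for name, est, truth in zip(tissues, estimated, true_props):
    print(f"{name}: estimated={est:.3f} true={truth:.3f}")
```

Because the mixture here is noise-free and built from the same reference rows, the fit recovers the proportions almost exactly; real cfDNA data would add sequencing noise, coverage gaps, and tissues absent from the reference, so agreement would be looser.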