A Generalizable Machine Learning Framework for cfDNA based Early Detection of Hepatocellular Carcinoma: a Feasibility Study with Preclinical Validation

Journal: bioRxiv
Published Date:

Abstract

Early detection of hepatocellular carcinoma (HCC) is critical for improving patient outcomes, yet current screening tools lack sufficient sensitivity and specificity. We demonstrate a flexible machine learning framework for HCC detection using methylation profiles from bisulfite sequencing across multiple assay platforms and sample types. The framework supports a “split-and-filter” approach that routes each sequenced sample to an assay-matched classifier without requiring cross-assay feature compatibility. We constructed assay-specific classifiers using four independent public methylation datasets (∼2,500 total samples) representing distinct bisulfite-sequencing technologies: GSE93203 (MCB-targeted hypermethylation), GSE63775 (MCTA-Seq tandem-repeat hypermethylation), PRJCA001372 (HBV-integration–associated hypomethylation), and the HCC subset of GSE149438 (EpiPanGIDX; DMR-level hyper- and hypomethylation). Separate models were trained using the biologically relevant features for each dataset and evaluated on two independent blind validation datasets: published tissue WGBS (24 samples: 12 early-stage HCC, 12 matched controls; PRJNA984754) and a new preclinical plasma cfDNA WGBS dataset (12 samples) generated at a commercial sequencing laboratory. Limited feature overlap among assays precluded a single unified model. Instead, the features that did overlap enabled construction of a proof-of-concept meta-classifier for sample routing across assay-specific models. Assay-specific cfDNA models, trained independently on CpG sites from the original publications, were evaluated using the biopsy (tissue) dataset and the new plasma dataset as blind validation. All four assay-specific models generalized well to the validation data, with accuracies of 83.5%–100%.
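The routing idea behind the "split-and-filter" approach can be illustrated with a minimal Python sketch. This is not the authors' meta-classifier: it substitutes a simple feature-coverage heuristic for a trained model, and the panel names, CpG identifiers, `route_sample` function, and `min_coverage` threshold are all hypothetical, chosen only for illustration.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical assay panels. In practice each panel would hold the CpG sites
# or regions reported in that assay's original publication.
PANELS: Dict[str, Set[str]] = {
    "assay_A": {"cg0001", "cg0002", "cg0003", "cg0004"},
    "assay_B": {"cg0003", "cg0101", "cg0102"},
}


def route_sample(
    sample: Dict[str, float],
    panels: Dict[str, Set[str]],
    min_coverage: float = 0.75,
) -> Tuple[Optional[str], float]:
    """Pick the assay whose panel is best covered by the sample's features.

    `sample` maps feature IDs to methylation values. Returns (assay, coverage);
    assay is None when no panel reaches `min_coverage`, i.e. the sample is
    filtered out instead of being sent to a poorly matched classifier.
    """
    best_assay: Optional[str] = None
    best_coverage = 0.0
    for assay, features in panels.items():
        if not features:
            continue  # skip empty panels to avoid division by zero
        covered = sum(1 for feature in features if feature in sample)
        coverage = covered / len(features)
        if coverage > best_coverage:  # ties keep the first panel seen
            best_assay, best_coverage = assay, coverage
    if best_coverage < min_coverage:
        return None, best_coverage
    return best_assay, best_coverage


# Illustrative samples with made-up methylation values.
matched_sample = {"cg0003": 0.8, "cg0101": 0.7, "cg0102": 0.9}
unmatched_sample = {"cg9999": 0.5}

routed = route_sample(matched_sample, PANELS)
filtered = route_sample(unmatched_sample, PANELS)
```

In this sketch a sample is only scored by a model whose input features it actually contains, which is why no cross-assay feature compatibility is needed. A trained meta-classifier on the shared features, as described in the abstract, would replace the coverage heuristic.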
In validation on the plasma cfDNA samples, the best-performing classifier (among XGBoost, Random Forest, and Logistic Regression) for each public dataset achieved 80–100% sensitivity and 86–100% specificity, with all Stage 2 cases correctly detected across models. The single Stage 1A case showed methylation levels overlapping with those of cirrhotic controls, consistent with biological expectations. Despite this overlap, a couple of the models classified this case correctly, suggesting that these models are more sensitive to Stage 1 cancer. We describe a generalizable framework for early detection of HCC composed of assay-specific classifiers and a meta-classifier. This architecture readily accommodates the addition of new assays via feature-matched models and meta-classification. Larger, prospectively collected studies are necessary to confirm performance and enable clinical translation.
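For readers less familiar with the reported metrics, the following minimal Python sketch shows how sensitivity and specificity are computed from true and predicted labels. The `sensitivity_specificity` helper and the label vectors are hypothetical and invented for illustration; they are not the study's data.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels (1 = HCC, 0 = control)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cases detected
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cases missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # controls cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # controls flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one control")
    sensitivity = tp / (tp + fn)  # fraction of true cases called positive
    specificity = tn / (tn + fp)  # fraction of true controls called negative
    return sensitivity, specificity


# Made-up example: 4 cases and 6 controls, with one missed case and one
# false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

With a cohort this small, a single misclassified sample shifts either metric by many percentage points, which is one reason larger prospective studies are needed to confirm performance.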

Authors

  • Mythili Subharam; Ryan Koehler; Tejas Sreedhar