Sample Size Requirements for Machine Learning Classification of Binary Outcomes in Bulk RNA-Seq Data

Journal: medRxiv
Published Date:

Abstract

Bulk RNA sequencing data are often used to build machine learning (ML)-based predictive models for classifying disease groups or subtypes, but the sample size needed to adequately train these models is unknown. We collected 27 experimental datasets from the Gene Expression Omnibus and the Cancer Genome Atlas. For 24 of the 27 datasets, pseudo-data were simulated using Bayesian Network Generation. Three ML algorithms were assessed: XGBoost (XGB), Random Forest (RF), and Neural Networks (NN). Learning curves were fit, and the sample size needed to reach the full-dataset AUC minus 0.02 was determined and compared across datasets and algorithms. Multivariable negative binomial regression models quantified the relationships between dataset-level characteristics and required sample size within each algorithm. These models were validated in independent experimental datasets. Across the datasets studied, median required sample sizes were 480 (XGB), 190 (RF), and 269 (NN). Higher effect sizes, less class imbalance, less dispersion, and less complex data were associated with lower required sample sizes. Validation demonstrated that predictions were accurate in new data. Compared with sample sizes obtained from power analysis methods for differential analysis, ML methods generally required larger sample sizes. In conclusion, incorporating ML-based sample size planning alongside traditional power analysis can provide more robust results.
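
The learning-curve criterion described above (the sample size at which AUC reaches the full-dataset AUC minus 0.02) can be illustrated with a minimal sketch. The abstract does not state the functional form of the learning curves, so the inverse power law AUC(n) = a - b * n^(-c) used here is an assumption, as are the function names, the grid-search fitting procedure, and the use of the fitted (rather than observed) AUC at the full sample size.

```python
import math


def _ols(xs, ys):
    """Simple least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def fit_inverse_power_law(ns, aucs, c_grid=None):
    """Fit AUC(n) = a - b * n**(-c) (assumed form, not taken from the paper).

    For each candidate exponent c the model is linear in n**(-c), so a and b
    come from ordinary least squares; the c with the smallest squared error wins.
    """
    if c_grid is None:
        c_grid = [i / 100 for i in range(5, 201)]  # hypothetical search range
    best = None
    for c in c_grid:
        xs = [n ** (-c) for n in ns]
        a, slope = _ols(xs, aucs)
        sse = sum((a + slope * x - y) ** 2 for x, y in zip(xs, aucs))
        if best is None or sse < best[0]:
            best = (sse, a, -slope, c)
    _, a, b, c = best
    return a, b, c


def required_sample_size(a, b, c, n_full, margin=0.02):
    """Smallest integer n whose fitted AUC is at least AUC(n_full) - margin."""
    if b <= 0 or c <= 0:
        raise ValueError("expected an increasing learning curve (b > 0, c > 0)")
    target = (a - b * n_full ** (-c)) - margin
    # Solve a - b * n**(-c) = target for n.
    return math.ceil((b / (a - target)) ** (1.0 / c))


# Synthetic, illustrative learning-curve points (not data from the study).
train_sizes = [50, 100, 200, 400, 800]
observed_auc = [0.9 - 2.0 * n ** -0.5 for n in train_sizes]

a_hat, b_hat, c_hat = fit_inverse_power_law(train_sizes, observed_auc)
n_required = required_sample_size(a_hat, b_hat, c_hat, n_full=800)
print(n_required)
```

In practice the AUC values at each training size would come from repeated subsampling and cross-validation of the chosen classifier; the sketch only covers the curve-fitting and threshold step.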

Authors

  • Scott Silvey; Amy Olex; Shaojun Tang; Jinze Liu