Widespread use of invalid statistical tests in biomedical machine learning

Journal: bioRxiv
Published Date:

Abstract

Machine learning is accelerating biomedical research. Cross-validation is widely used to compare predictive performance, not only to benchmark algorithms but also to inform scientific applications, such as ranking biomarkers. However, prediction performance estimates across cross-validation folds are not independent. Standard tests for comparing prediction performance (e.g., paired t-test) assume independence and can therefore inflate false positive rates. In a PRISMA-guided meta-analysis of 210 studies (impact factor ≥15, 1 June 2020 to 1 June 2025), we find that 97% ignored fold dependence when comparing prediction performance. This problem is ubiquitous across scientific fields and unaffected by impact factor, rigor-promoting policies, or open science practices. Simulations across 420 scenarios spanning four diverse datasets show that ignoring fold dependence leads to invalid false positive control in most settings. Repeated cross-validation further compounds this problem, with false positive rates rising toward 100% as the number of repetitions grows. Existing fold-dependence-aware tests rely on strong assumptions because the variance of fold-level statistics and the between-fold correlation cannot be disentangled under standard cross-validation. We therefore propose the SHARP (Split-HAlf RePeated) test, a simple modification to standard cross-validation that enables direct estimation of variance and correlation. Benchmarked against 12 tests, SHARP provides the best overall balance of false-positive control, statistical power, and confidence-interval calibration across simulation schemes. We conclude by providing best practices and reporting guidelines for valid model comparison inference in biomedical machine learning and beyond.
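
To make the fold-dependence issue concrete, the sketch below contrasts a naive paired t-statistic on fold-wise score differences with the Nadeau-Bengio corrected resampled t-statistic, a widely known fold-dependence-aware alternative. It is not the SHARP test, whose procedure the abstract does not specify, and the abstract does not say whether this correction is among the tests it benchmarks. The function name `paired_t_statistics`, its parameters, and the example accuracies are hypothetical and chosen only for illustration. The correction was derived for repeated random subsampling; applying it to k-fold cross-validation is a common heuristic and is assumed here.

```python
import math
import statistics


def paired_t_statistics(scores_a, scores_b, n_train, n_test):
    """Return (naive_t, corrected_t) for fold-wise score differences.

    naive_t treats the k fold differences as independent samples.
    corrected_t applies the Nadeau-Bengio variance correction, which
    scales the variance by (1/k + n_test/n_train) instead of 1/k.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both models need one score per fold")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    if k < 2:
        raise ValueError("need at least two folds")
    mean_diff = statistics.fmean(diffs)
    var_diff = statistics.variance(diffs)  # sample variance (ddof = 1)
    if var_diff == 0:
        raise ValueError("fold differences have zero variance")
    naive_t = mean_diff / math.sqrt(var_diff / k)
    corrected_t = mean_diff / math.sqrt(var_diff * (1.0 / k + n_test / n_train))
    return naive_t, corrected_t


# Hypothetical accuracies from 5-fold cross-validation on 100 samples
# (80 training and 20 test samples per fold); the values are made up.
model_a = [0.82, 0.79, 0.85, 0.80, 0.84]
model_b = [0.80, 0.78, 0.81, 0.79, 0.80]
naive_t, corrected_t = paired_t_statistics(model_a, model_b, n_train=80, n_test=20)
print(f"naive t = {naive_t:.2f}, corrected t = {corrected_t:.2f}")
```

With k = 5 and n_test/n_train = 0.25, the corrected statistic is smaller than the naive one by a factor of sqrt(0.45/0.2) = 1.5, so a difference that looks significant under the independence assumption may not remain so once fold dependence is accounted for. Both statistics are compared against a t distribution with k - 1 degrees of freedom.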

Authors

  • Zeng, T.
  • Li, H.
  • Zhang, S.
  • Tan, Y. Q.
  • Tian, F.
  • Orban, C.
  • An, L.
  • Che, W.
  • Cheng, J.
  • Chong, J. S. X.
  • Dehestani, N.
  • Dong, Z.
  • Li, X.
  • Li, Z.
  • Lim, M. J. R.
  • Lin, Y.
  • Ling, Q.
  • Ling, Z.
  • Low, X. Z.
  • Mansour L., S.
  • Ng, K. K.
  • Nguyen, T. T.
  • Ooi, L. Q. R.
  • Pande, S.
  • Qian, X.
  • Ruan, J.
  • Wang, Z.
  • Xie, Y.
  • Zhang, C.
  • Zhang, Y.
  • Patil, K.
  • Parkes, L.
  • Dhamala, E.
  • Chopra, S.
  • Zalesky, A.
  • Holmes, A.
  • Eickhoff, S.
  • Zhou, J. H.
  • Renaud, O.
  • Dosenbach, N.
  • Kording, K. P.
  • Bzdok, D.
  • Nichols, T.
  • Yeo, B. T. T.

Categories