Simpler predictive models provide higher accuracy for ovarian cancer detection

Journal: bioRxiv
Published Date:

Abstract

Ovarian cancer remains a danger to women’s health, and accurate screening tests would likely increase survival. Two established protein biomarkers, CA125 and HE4, have been shown to work well in isolation, but achieve even higher accuracy when combined using logistic regression (LR). We show here that this LR-based combination of protein concentrations achieves high accuracy when distinguishing healthy samples from cancer samples (AUC = 0.99) and benign masses from cancer (AUC = 0.86). This approach exhibits superior performance on an external validation cohort compared to a more complex method, which was published with the dataset we use here and tested on the same data. While our method only uses proteins, the more complex method also uses features derived from cell-free DNA (cfDNA). We show that many of that method’s cfDNA features are affected by confounding technical variation, which impacts the previously reported results. Our results are in line with the principle that simpler machine learning models will tend to exhibit better generalizability on new data.
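
As a rough illustration of the approach described in the abstract, the following minimal sketch combines two marker values with logistic regression and scores the result by AUC on a held-out cohort. It is not the authors' implementation. The data are synthetic draws standing in for log-scale CA125 and HE4 concentrations, the class means are arbitrary, and every function name (fit_logistic_regression, roc_auc, make_synthetic_cohort) is hypothetical. The model is fitted by plain batch gradient descent so that the example needs only the Python standard library.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic_regression(X, y, lr=0.5, epochs=300):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(X, w, b):
    # Predicted probability of the positive (cancer) class for each sample.
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


def roc_auc(scores, labels):
    """AUC as the chance a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def make_synthetic_cohort(n_per_class, rng):
    """SYNTHETIC marker values (assumed log scale); not real CA125/HE4 measurements."""
    X, y = [], []
    # Arbitrary class means chosen only so that the two groups overlap partially.
    for label, (mu_ca125, mu_he4) in ((0, (0.0, 0.0)), (1, (1.5, 1.0))):
        for _ in range(n_per_class):
            X.append([rng.gauss(mu_ca125, 1.0), rng.gauss(mu_he4, 1.0)])
            y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = make_synthetic_cohort(100, rng)
X_test, y_test = make_synthetic_cohort(100, rng)  # stands in for an external cohort

w, b = fit_logistic_regression(X_train, y_train)
auc_combined = roc_auc(predict_proba(X_test, w, b), y_test)
auc_ca125_only = roc_auc([x[0] for x in X_test], y_test)
print(f"combined AUC: {auc_combined:.3f}, CA125-only AUC: {auc_ca125_only:.3f}")
```

Fitting on one cohort and scoring on a separately drawn one mirrors the external-validation setup the abstract refers to. With only two inputs and three fitted parameters, the model has little capacity to absorb cohort-specific technical variation, which is the intuition behind the generalizability claim.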

Authors

  • Derrick E. Wood; Joseph Roy; Bari J. Ballew