A new pipeline for cross-validation fold-aware machine learning prediction of clinical outcomes addresses hidden data leakage in omics-based 'predictors'.
Journal: bioRxiv
Published Date: Mar 16, 2026
Abstract
Machine learning approaches are increasingly applied to high-dimensional biological data in which features are often dataset-dependent. In many omics workflows, features are computed using information derived from the entire dataset, such as correlations between variables, clustering structures, or enrichment scores. We refer to these as global dataset features: features whose computation depends on properties of the full dataset, including the number of samples, the relationships between samples, or global statistical summaries. When such features are computed before the data are split, information from the test samples leaks into training, and standard validation strategies yield overly optimistic performance estimates that may not hold on independent datasets. This issue is particularly relevant in omics analyses, where features are commonly derived from correlation-based methods, enrichment analyses, or model-driven transformations that implicitly use information from the full dataset. To address this challenge, we present pipeML, a flexible and modular machine learning framework designed to support leakage-free model training through custom cross-validation fold construction. pipeML enables users to recompute global dataset features independently within each cross-validation fold, ensuring strict separation between training and test data while preserving compatibility with a wide range of machine learning algorithms for both classification and survival tasks. The framework integrates feature selection, repeated and stratified cross-validation, model stacking, and comprehensive performance evaluation, and it supports advanced validation schemes such as leave-one-dataset-out analysis. Using multiple real-world biological datasets, we demonstrate that pipeML enables leakage-free model evaluation when global dataset features are used. We argue that overestimating model performance in the cross-validation setting can lead to overly optimistic expectations for validation on independent datasets.
By explicitly addressing data leakage and offering a transparent, modular workflow, pipeML provides a robust solution for developing and validating machine learning models in complex biological settings. The pipeML R package, together with a tutorial and documentation page, is available at https://github.com/VeraPancaldiLab/pipeML. The code to reproduce the analysis and figures is available on GitHub at https://github.com/VeraPancaldiLab/pipeML_paper.
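
To make the idea of recomputing a global dataset feature within each fold concrete, here is a minimal sketch in Python. It is not pipeML's API: pipeML is an R package, and every name here (`make_folds`, `fit_scaler`, `apply_scaler`, `fold_aware_features`) is hypothetical. As a stand-in for a global dataset feature it uses a z-score, whose mean and standard deviation are global statistical summaries. The assumption illustrated is that the summary is learned from the training samples of each fold only and then applied unchanged to the held-out samples.

```python
import random
import statistics


def make_folds(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_scaler(values):
    """Learn the dataset-level summary (mean, standard deviation) from the given samples only."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values) or 1.0  # fall back to 1.0 if the variance is zero
    return mean, sd


def apply_scaler(values, mean, sd):
    """Transform values with a summary that was learned elsewhere."""
    return [(v - mean) / sd for v in values]


def fold_aware_features(values, folds):
    """Yield (train_idx, test_idx, z_train, z_test), re-fitting the summary in every fold."""
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(values)) if i not in held_out]
        # The summary sees training samples only; held-out samples never influence it.
        mean, sd = fit_scaler([values[i] for i in train_idx])
        z_train = apply_scaler([values[i] for i in train_idx], mean, sd)
        z_test = apply_scaler([values[i] for i in test_idx], mean, sd)
        yield train_idx, test_idx, z_train, z_test


if __name__ == "__main__":
    values = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]  # toy measurements, not real data
    for train_idx, test_idx, z_train, z_test in fold_aware_features(values, make_folds(len(values), 3)):
        print(test_idx, [round(z, 2) for z in z_test])
```

One way to check such a loop for leakage is to perturb the held-out values of a fold and confirm that the training features for that fold do not change. The same pattern would presumably extend to a leave-one-dataset-out scheme by defining each fold as the samples of one dataset instead of a random split.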