Non-temporal tree-based models outperform temporal deep learning models in the prediction of chemotherapy-induced side effects from longitudinal laboratory data

Journal: medRxiv

Abstract

The increasing availability of electronic health records (EHRs) provides opportunities to apply machine learning (ML) methods in support of clinical decision-making. The temporal nature of laboratory values in EHR data makes them particularly suitable for temporal deep learning (DL) architectures that model patient trajectories over time. Despite this potential, the application of temporal DL models to longitudinal laboratory data has largely been limited to intensive care unit (ICU) settings and coarse outcome prediction tasks such as mortality and readmission. How well these models perform in the sparse, irregular, and highly imbalanced data settings that are typical of clinical care outside the ICU has not been fully assessed. To close this knowledge gap, we focused on the clinically important yet underexplored tasks of predicting the chemotherapy-related complications aplasia and neutropenic fever before clinical onset, using longitudinal laboratory data extracted from EHRs in two independent datasets. Based on these datasets and targets, we systematically evaluated 13 ML models, including 7 state-of-the-art temporal DL models and 4 non-temporal tree-based baselines. Across all combinations of datasets and targets, non-temporal tree-based models, particularly CatBoost, consistently outperformed the temporal DL models. These findings suggest that state-of-the-art temporal DL models still struggle with the class imbalance, sparsity, irregularity, and asynchronicity that are typical of routinely collected laboratory data beyond the ICU, and that further research is needed to overcome these challenges.

Authors

  • Farnaz Rahimi; Christel Sirocchi; Julian Matschinske; Markus Metzler; Jakob Zierk; David B. Blumenthal
