Investigating the Data Addition Dilemma in Longitudinal TBI MRI
Journal: medRxiv
Published Date: Jan 1, 2025
Abstract
Clinical machine learning (CML) for brain MRI often assumes that more data guarantees better performance, yet added samples can reduce accuracy when they come from a different distribution, a phenomenon known as the Data Addition Dilemma. We present a systematic study of this issue in longitudinal traumatic brain injury (TBI) MRI, where acute baseline scans (S1) and follow-up scans (S2) differ substantially. Using a 14-subject, 28-scan cohort, we quantify the combined effects of intra-subject session shifts and inter-subject variability on severity classification. We evaluate four training schemes: (1) an intra-session upper bound (S1→S1), (2) cross-session out-of-distribution (OOD) testing (S1→S2), (3) pooled training (S1+S2→S1,S2), and (4) LOSO-IPA, which adds one unlabeled S2 scan per patient. With a lightweight logistic-regression model on PCA features, we show that naive pooling can degrade accuracy, that pooled training trades baseline performance for modest robustness gains, and that LOSO-IPA recovers accuracy close to the intra-session limit. We recommend per-subject follow-up anchoring and diagonal CORAL alignment to mitigate session effects. These results clarify when additional data help or hinder CML workflows and provide a minimally invasive strategy for reliable longitudinal TBI severity assessment.
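
To make the alignment step concrete, the following is a minimal sketch of one common reading of "diagonal CORAL": matching each feature's mean and variance in one session to those of another, rather than aligning full covariance matrices. The abstract does not specify the exact formulation, so the function name `diagonal_coral`, the `eps` stabilizer, the choice to align S2 toward S1, and the synthetic feature arrays are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def diagonal_coral(source, target, eps=1e-8):
    """Rescale `source` so each feature matches the mean and variance of `target`.

    Assumed interpretation of diagonal CORAL: only per-feature (diagonal)
    second-order statistics are aligned; cross-feature covariances are ignored.
    """
    src_mean = source.mean(axis=0)
    tgt_mean = target.mean(axis=0)
    src_std = source.std(axis=0) + eps  # eps avoids division by zero
    tgt_std = target.std(axis=0) + eps
    # Standardize with source statistics, then re-color with target statistics.
    return (source - src_mean) / src_std * tgt_std + tgt_mean


# Synthetic stand-ins for PCA features (hypothetical values, not study data):
# 14 subjects per session, 5 components.
rng = np.random.default_rng(0)
s1_features = rng.normal(loc=0.0, scale=1.0, size=(14, 5))  # baseline session
s2_features = rng.normal(loc=2.0, scale=3.0, size=(14, 5))  # shifted follow-up

# Align follow-up features to baseline statistics before applying a
# classifier trained on S1.
s2_aligned = diagonal_coral(s2_features, s1_features)
```

With only a handful of scans per session, a diagonal alignment estimates far fewer parameters than a full covariance transform, which is one plausible reason to prefer it in a cohort of this size.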