Cadence: A Benchmark Evaluation of the Narrative Velocity Framework for Next Clinical Event Prediction in MIMIC-IV
Journal: bioRxiv
Published Date: May 11, 2026
Abstract
Objective: How structured clinical features and cluster-semantic embeddings interact under self-distillation regularisation in electronic health record (EHR) prediction models remains uncharacterised. Existing approaches treat these sources separately: gradient-boosted trees exploit tabular features, while sequence models process text. We introduce the Narrative Velocity (NV) framework and evaluate this interaction in a 7-model benchmark.
Materials and Methods: Cadence is a ~5.86M-parameter residual multilayer perceptron (MLP) that combines structured EHR features with frozen PubMedBERT embeddings of cluster-label strings, trained under born-again self-distillation from a prior Cadence checkpoint (seed-42 teacher). Cadence is benchmarked against six comparators on MIMIC-IV v3.1 with dual-sex TRIPOD+AI reporting (5 student seeds for Cadence; 2 to 3 seeds for baselines).
Results: At full-cohort scale, Cadence achieves 38.04 ± 0.04% male and 35.66 ± 0.04% female top-1 accuracy, exceeding the strongest non-neural baseline (XGBoost-2420, trained on the identical 2,420-dimensional input) by +1.35 percentage points (pp) male and +0.82 pp female (paired t-test on shared seeds 42 to 44: t(2) = 69.06, p = 2.10 × 10^-4 male; t(2) = 25.32, p = 1.56 × 10^-3 female). On time-to-next-event regression, Cadence lowers mean absolute error (MAE) by 7.68 d male and 7.30 d female versus XGBoost-2420; FT-Transformer attains the lowest absolute MAE at full scale (27.58 d male, 36.63 d female), revealing a classification-regression trade-off across model families. A controlled 2×2 random-vector ablation isolates the interaction between self-distillation and embeddings at +0.49 pp top-1 accuracy (95% CI [0.35, 0.64] pp; bootstrap, n = 10,000 resamples) under a matched-dimensionality null. A 3-teacher-seed validation (multi_teacher_02) confirms that the interaction is robust to teacher-seed identity (per-seed values +0.525, +0.509, +0.507 pp; mean +0.513 ± 0.010 pp).
Cadence achieves the best Brier score among the evaluated models (0.774 male / 0.798 female), but its raw probabilities are systematically miscalibrated (expected calibration error, ECE, of 0.077 versus 0.010 for XGBoost-884); after a single scalar temperature-scaling step (T* ≈ 0.81), ECE drops to ≈ 0.028 while the Brier score remains the best. On a small external OCR-extracted BWH cohort (n = 1,120 patients, 39,120 events), Cadence ranked 3rd of 7 models. Because three sources of error are confounded in this cohort (institutional shift, OCR noise, centroid mapping), we report this result as a generalisation probe rather than a definitive external validation. At the longer h30 evaluation horizon, Cadence's MAE advantage reverses (47.35 d versus 45.06 d for XGBoost), reflecting the absence of a matched-horizon self-distillation teacher.
Discussion: The 2×2 random-vector ablation confirms that the self-distillation gain on PubMedBERT embeddings (+0.78 pp) exceeds that on matched-dimensionality random vectors (+0.29 pp) by +0.49 pp, isolating the interaction to semantic content rather than feature dimensionality. The factorial decomposition (+0.49 to +0.51 pp interaction) and the sequential pipeline-level decomposition (Supplementary Table S3) are complementary triangulations under different reference frames and are not directly additive.
Conclusion: This 7-model benchmark establishes a dual-sex, dual-metric, cross-institutional reference for next clinical event prediction under the TRIPOD+AI reporting framework. These results characterise discrimination and calibration on a single retrospective cohort; prospective evaluation, decision-curve analysis, and harm-benefit assessment are required before clinical deployment.
Keywords: clinical event prediction, electronic health records, MIMIC-IV, Narrative Velocity, residual MLP, PubMedBERT, knowledge distillation, TRIPOD+AI
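Two of the quantities reported above can be checked with simple arithmetic, and the following minimal sketch shows one way to do so. It is not the authors' analysis code: the function names `interaction_pp` and `two_sided_p_df2` are illustrative, and the p-value routine assumes a two-sided test, which the abstract does not state explicitly. The sketch uses the closed-form Student's t distribution for exactly 2 degrees of freedom (three paired seeds), so it needs only the standard library.

```python
import math


def interaction_pp(gain_semantic_pp: float, gain_random_pp: float) -> float:
    """Difference-in-differences contrast for a 2x2 factorial design.

    Each argument is the self-distillation gain (in percentage points)
    within one embedding condition; the interaction is their difference.
    """
    return gain_semantic_pp - gain_random_pp


def two_sided_p_df2(t_stat: float) -> float:
    """Two-sided p-value for Student's t with exactly 2 degrees of freedom.

    For df = 2 the CDF is F(t) = 1/2 + t / (2 * sqrt(t^2 + 2)), hence
    p = 1 - |t| / s with s = sqrt(t^2 + 2). The algebraically equal form
    2 / (s * (s + |t|)) is used because it avoids cancellation at large |t|.
    """
    t_abs = abs(t_stat)
    s = math.sqrt(t_abs * t_abs + 2.0)
    return 2.0 / (s * (s + t_abs))


# Gains reported in the Discussion: +0.78 pp (PubMedBERT), +0.29 pp (random).
interaction = interaction_pp(0.78, 0.29)

# t statistics reported in the Results for seeds 42 to 44 (df = 3 - 1 = 2).
p_male = two_sided_p_df2(69.06)
p_female = two_sided_p_df2(25.32)

print(f"interaction = {interaction:+.2f} pp")
print(f"p (male)    = {p_male:.3e}")
print(f"p (female)  = {p_female:.3e}")
```

Under these assumptions the contrast reproduces the +0.49 pp interaction, and both p-values agree with the reported 2.10 × 10^-4 and 1.56 × 10^-3 at three significant figures. With only three paired seeds, the t statistic is very sensitive to the variance of the per-seed differences, so the p-values are better read as a consistency check than as strong evidence on their own.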