Hierarchical integration of multimodal clinical data to predict epilepsy surgery outcome
Journal: medRxiv
Published Date: May 6, 2026
Abstract
Background: Integrating multimodal data into medical artificial intelligence (AI) tools and evaluating whether they outperform human experts remains a critical challenge. Epilepsy surgery offers a unique paradigm for this evaluation, as it provides an expert-independent measure (Engel score) of post-surgical outcome. Currently, evaluation for epilepsy surgery relies on visual interpretation and human synthesis of multimodal data. While clinical evaluations are individualized and account for complex anatomical variability, integrating these diverse, high-dimensional modalities to generate a probability of surgical success remains challenging. Here, we leverage this objective outcome score to investigate the feasibility of a data-driven, phenotype-based model and compare it against the current clinical gold standard.

Methods: The evaluation was performed on an epilepsy-type-controlled cohort of 57 patients from six tertiary epilepsy surgery centers who underwent resective or ablative surgery in the mesiotemporal lobe. The multimodal data comprised patient demographics, semiology, invasive electrophysiological monitoring, and neuroimaging. We first estimated how human experts perceive the likelihood of surgical success. Subsequently, we developed a data-driven model integrating these modalities to predict surgery outcomes. The model's performance was compared to the current clinical gold standard (three independent human experts) and to published outcome calculators. Finally, modality-level phenotypes were derived from the model's predictions.

Results: Predictions by human experts correlated poorly with post-surgical outcomes, and published outcome calculators did not perform better than the experts (DeLong's p = 0.367). Our model incorporating multimodal data achieved an area under the receiver operating characteristic curve (AUROC) of 0.801. It performed significantly better than the best human expert (DeLong's p = 0.043) and achieved a higher AUROC than the best published surgical outcome calculator (0.801 vs. 0.694).

Conclusions: We demonstrate, as a proof of concept, that data-driven multimodal phenotypes can inform personalized surgery planning in epilepsy. Furthermore, we provide a framework for integrating multimodal data and benchmarking medical AI performance against human experts.
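
For readers unfamiliar with the statistics reported in the Results, the following is a minimal Python sketch of how an AUROC can be computed and how two AUROCs measured on the same patients can be compared with DeLong's test. It is not the authors' code: the function names (`placement_values`, `auroc`, `delong_test`) are hypothetical, labels are assumed to be coded 1 for a favorable outcome and 0 otherwise, and the toy scores are made-up numbers that do not come from the study cohort.

```python
import math


def placement_values(pos, neg):
    """Per-positive and per-negative placement values used by DeLong's method."""
    def psi(x, y):
        # 1 if the positive case outranks the negative case, 0.5 on a tie.
        return 1.0 if x > y else 0.5 if x == y else 0.0

    v10 = [sum(psi(x, y) for y in neg) / len(neg) for x in pos]
    v01 = [sum(psi(x, y) for x in pos) / len(pos) for y in neg]
    return v10, v01


def split_by_label(labels, scores):
    """Separate scores of positive (label 1) and negative (label 0) cases."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    return pos, neg


def auroc(labels, scores):
    """AUROC as the probability that a positive case outranks a negative one."""
    pos, neg = split_by_label(labels, scores)
    v10, _ = placement_values(pos, neg)
    return sum(v10) / len(v10)


def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def delong_test(labels, scores_a, scores_b):
    """Two-sided DeLong test for two correlated AUROCs on the same cases.

    Returns (auc_a, auc_b, z, p). Needs at least two positive and two
    negative cases. If the estimated variance is zero, z is NaN and p is 1.0.
    """
    pos_a, neg_a = split_by_label(labels, scores_a)
    pos_b, neg_b = split_by_label(labels, scores_b)
    v10_a, v01_a = placement_values(pos_a, neg_a)
    v10_b, v01_b = placement_values(pos_b, neg_b)

    auc_a = sum(v10_a) / len(v10_a)
    auc_b = sum(v10_b) / len(v10_b)

    # Variance of the AUROC difference from paired placement-value differences.
    d10 = [a - b for a, b in zip(v10_a, v10_b)]
    d01 = [a - b for a, b in zip(v01_a, v01_b)]
    var = sample_variance(d10) / len(d10) + sample_variance(d01) / len(d01)

    if var == 0.0:
        return auc_a, auc_b, float("nan"), 1.0
    z = (auc_a - auc_b) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return auc_a, auc_b, z, p


# Toy illustration only; these values are invented, not study data.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_model = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
toy_expert = [0.6, 0.3, 0.5, 0.7, 0.4, 0.2]
auc_model, auc_expert, z_stat, p_value = delong_test(toy_labels, toy_model, toy_expert)
```

The paired form of the test matters here because the model, the experts, and the outcome calculators are all scored on the same patients, so their AUROC estimates are correlated rather than independent.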