A comparative evaluation of handling missing data points and modalities in electronic health records.
Journal: International Journal of Medical Informatics
Published Date: Jan 21, 2026
Abstract
BACKGROUND: Healthcare data, typically available as electronic health records (EHR), provide rich insights for predictive modelling. A common challenge in using EHR data is missing information, which may be missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). Missingness is typically handled through imputation, which may be statistical or learning-based. However, imputation risks altering the original data distribution, and even small changes in healthcare data can degrade clinical accuracy and decision-making. Alternative approaches are therefore required.
OBJECTIVE: This study examines machine learning strategies that handle missing data directly within the model. The goal is to assess how well models preserve data structure and performance across different missingness patterns and rates in single-modal versus multimodal datasets.
METHODS: We evaluate multiple machine learning architectures across three datasets. Two experimental setups are designed: one introduces missing data points at the feature level in time-series records, and the other masks complete or partial modalities in a multimodal dataset. Synthetic missingness is applied using established mechanisms and rates, and all experiments are repeated across five random seeds. Results are benchmarked against imputation-based baselines to assess differences in data distribution and model performance.
RESULTS: Direct modelling approaches preserved the underlying data structure better than imputation, which introduced distributional shifts. Embedding visualisations showed clearer label-based clustering in non-imputed settings. Models were more sensitive to missing text than to missing measurements, underlining the contextual importance of clinical notes.
CONCLUSION: We provide a comparative analysis of modelling strategies for handling missingness and demonstrate that direct modelling approaches maintain clinical patterns more effectively than imputation. This emphasises the importance of integrating missingness handling into the modelling pipeline and selecting models based on missingness type and modality to ensure reliability in clinical applications.
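To make the three missingness mechanisms concrete, the following is a minimal sketch and not the study's actual procedure, which the abstract does not specify. It assumes a toy two-column dataset, a hypothetical rank-based rule for MAR and MNAR, and plain mean imputation as the baseline; the function names `missing_mask` and `mean_impute` are illustrative only.

```python
import numpy as np


def missing_mask(X, rate, mechanism, rng):
    """Boolean mask (True = missing) for column 0 of X at roughly `rate`.

    Illustrative assumption: for MAR and MNAR the missingness probability
    grows linearly with the rank of a driver column, so `rate` must be
    at most 0.5 to keep every probability in [0, 1].
    """
    n = X.shape[0]
    if mechanism == "MCAR":
        # Missingness is independent of every value in the data.
        p = np.full(n, rate)
    elif mechanism in ("MAR", "MNAR"):
        # MAR: driven by the always-observed column 1.
        # MNAR: driven by the value that goes missing (column 0).
        driver = X[:, 1] if mechanism == "MAR" else X[:, 0]
        ranks = np.argsort(np.argsort(driver))  # ranks 0 .. n-1
        p = 2.0 * rate * ranks / (n - 1)        # mean probability == rate
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return rng.random(n) < p


def mean_impute(x, mask):
    """Replace masked entries with the mean of the observed entries."""
    return np.where(mask, x[~mask].mean(), x)


rng = np.random.default_rng(0)
cov = [[1.0, 0.6], [0.6, 1.0]]  # toy correlated features, not EHR data
X = rng.multivariate_normal([0.0, 0.0], cov, size=5000)

for mechanism in ("MCAR", "MAR", "MNAR"):
    mask = missing_mask(X, rate=0.3, mechanism=mechanism, rng=rng)
    x_imp = mean_impute(X[:, 0], mask)
    # Columns: mechanism, realised missing rate,
    # variance of observed values, variance after mean imputation.
    print(mechanism, round(mask.mean(), 3),
          round(X[~mask, 0].var(), 3), round(x_imp.var(), 3))
```

Under these assumptions, the variance after mean imputation is always lower than the variance of the observed values, because every imputed point sits exactly at the observed mean. This is one simple form of the distributional shift the abstract attributes to imputation. Under the MNAR rule the observed mean is also biased, since larger values are more likely to be masked.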