Representation Learning of Lab Values via Masked AutoEncoder
Journal:
arXiv
Published Date:
Jan 5, 2025
Abstract
Accurate imputation of missing laboratory values in electronic health records
(EHRs) is critical to enable robust clinical predictions and reduce biases in
AI systems in healthcare. Existing methods, such as variational autoencoders
(VAEs) and decision tree-based approaches such as XGBoost, struggle to model
the complex temporal and contextual dependencies in EHR data, mainly in
underrepresented groups. In this work, we propose Lab-MAE, a novel
transformer-based masked autoencoder framework that leverages self-supervised
learning for the imputation of continuous sequential lab values. Lab-MAE
introduces a structured encoding scheme that jointly models laboratory test
values and their corresponding timestamps, enabling explicit capturing temporal
dependencies. Empirical evaluation on the MIMIC-IV dataset demonstrates that
Lab-MAE significantly outperforms the state-of-the-art baselines such as
XGBoost across multiple metrics, including root mean square error (RMSE),
R-squared (R2), and Wasserstein distance (WD). Notably, Lab-MAE achieves
equitable performance across demographic groups of patients, advancing fairness
in clinical predictions. We further investigate the role of follow-up
laboratory values as potential shortcut features, revealing Lab-MAE's
robustness in scenarios where such data is unavailable. The findings suggest
that our transformer-based architecture, adapted to the characteristics of the
EHR data, offers a foundation model for more accurate and fair clinical
imputation models. In addition, we measure and compare the carbon footprint of
Lab-MAE with the baseline XGBoost model, highlighting its environmental
requirements.