Foundation Models for Clinical Records at Health System Scale
Journal:
arXiv
Published Date:
Jul 1, 2025
Abstract
Large-scale pretraining has transformed modeling of language and other data
types, but its potential remains underexplored in healthcare with structured
electronic health records (EHRs). We present a novel generative pretraining
strategy for sequential EHR data using next-visit event prediction. Our model
learns to autoregressively generate various tokenized clinical events for the
next visit based on patient history and inherently handles the joint prediction
of heterogeneous data types. Additionally, we introduce regularization on
predicting repeated events and highlight a key pitfall in EHR-based foundation
model evaluations: repeated event tokens can inflate performance metrics when
new onsets are not distinguished from subsequent occurrences. Our model is
evaluated via zero-shot prediction for forecasting dementia and knee
osteoarthritis incidence within 2 and 5 years, and the model performance rivals
a fully fine-tuned masked pretrained Transformer baseline, demonstrating that
our approach captures complex clinical dependencies without requiring costly
task-specific fine-tuning.