Intercept Cancer: Cancer Pre-Screening with Large Scale Healthcare Foundation Models
Journal:
arXiv
Published Date:
May 30, 2025
Abstract
Cancer screening, leading to early detection, saves lives. Unfortunately,
existing screening techniques require expensive and intrusive medical
procedures, not globally available, resulting in too many lost would-be-saved
lives. We present CATCH-FM, CATch Cancer early with Healthcare Foundation
Models, a cancer pre-screening methodology that identifies high-risk patients
for further screening solely based on their historical medical records. With
millions of electronic healthcare records (EHR), we establish the scaling law
of EHR foundation models pretrained on medical code sequences, pretrain
compute-optimal foundation models of up to 2.4 billion parameters, and finetune
them on clinician-curated cancer risk prediction cohorts. In our retrospective
evaluation comprising of thirty thousand patients, CATCH-FM achieved strong
efficacy (60% sensitivity) with low risk (99% specificity and Negative
Predictive Value), outperforming feature-based tree models as well as general
and medical large language models by large margins. Despite significant
demographic, healthcare system, and EHR coding differences, CATCH-FM achieves
state-of-the-art pancreatic cancer risk prediction on the EHRSHOT few-shot
leaderboard, outperforming EHR foundation models pretrained using on-site
patient data. Our analysis demonstrates the robustness of CATCH-FM in various
patient distributions, the benefits of operating in the ICD code space, and its
ability to capture non-trivial cancer risk factors. Our code will be
open-sourced.