A Pan-Organ Vision-Language Model for Generalizable 3D CT Representations

Journal: medRxiv
Published Date:

Abstract

Vision-language foundation models (VLMs) for computed tomography (CT) are emerging tools capable of learning generalizable representations from large-scale clinical imaging data. Yet it remains unclear to what extent these models encode biologically meaningful information relevant to real-world clinical variation. We introduce Percival, a CT-native VLM trained on more than 400,000 CT-report pairs from the Penn Medicine BioBank using a dual-encoder symmetric contrastive framework, with the objective of characterizing the biological associations embedded through contrastive pretraining. Across more than 20,000 held-out participants, Percival’s latent space shows strong alignment with clinical attributes, body-size measures, and multiple laboratory biomarkers. Phenome-wide analyses further reveal broad correspondence between latent features and disease phenotypes, including conditions not typically evaluated by CT; survival analyses demonstrate that the embeddings capture longitudinal risk patterns. Together, these findings reveal that CT-VLMs uncover a rich latent structure aligned with physiological measurements and disease phenotypes spanning the disease-prevalence spectrum.
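
For readers unfamiliar with the term, a dual-encoder symmetric contrastive objective can be illustrated with a short sketch. This is not Percival's implementation, which the abstract does not detail. It is a generic CLIP-style loss written in plain Python, in which the function names, the toy embeddings, and the temperature of 0.07 are all assumptions made for illustration. Each CT embedding is scored against every report embedding in the batch, and the loss is averaged over the image-to-text and text-to-image directions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric loss; pair k of image_emb matches pair k of text_emb.

    The temperature default is an illustrative assumption, not a value
    taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)

    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text direction: for each image, the matching text is on the diagonal.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image direction: same computation on the transposed logits.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return 0.5 * (i2t + t2i)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for CT and report encoder outputs.
    ct = [[1.0, 0.0], [0.0, 1.0]]
    matched_reports = [[1.0, 0.0], [0.0, 1.0]]
    swapped_reports = [[0.0, 1.0], [1.0, 0.0]]
    print(symmetric_contrastive_loss(ct, matched_reports))
    print(symmetric_contrastive_loss(ct, swapped_reports))
```

In this toy case the correctly paired batch yields a loss near zero, while the swapped batch yields a much larger one. That gap is the training signal that pulls each scan's embedding toward its own report and away from the others.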

Authors

  • Cameron A. Beeche; Joonghyun Kim; Hamed Tavolinejad; Bingxin Zhao; Jessie Dong; Rakesh Sharma; Jeffrey Duda; James Gee; Farouk Dako; Anurag Verma; Colleen Morse; Bojian Hou; Li Shen; Hersh Sagreiya; Christos Davatzikos; Scott Damrauer; Rohan Shad; Marylyn D. Ritchie; Daniel Rader; Qi Long; Eric Eaton; Tianlong Chen; Charles E. Kahn; Julio Chirinos; Walter R. Witschey; Penn Medicine BioBank