GENPHIRE: Enhancing Disease Risk Prediction Using Large Language Model

Journal: medRxiv
Published Date:

Abstract

Estimating an individual’s liability to a disease is a fundamental problem in genome research. By exploiting findings from genome-wide association studies (GWASs), many powerful polygenic risk scores (PRSs) have been developed to predict disease risk based on an individual’s genetic profile. Despite much success, the performance of PRS models is hindered by their inability to capture complex, nonlinear effects and interactions among variants. In this study, we introduce GENPHIRE, or Genetic–Phenotypic Representation, a novel machine learning framework designed for disease risk prediction. The central idea in GENPHIRE is to translate an individual’s genotype profile into a “sentence” consisting of basic clinical information together with an ordered list of the top phenotypes for which the individual is found to have an elevated number of risk alleles. After translation, the sentence is converted to an embedding vector by a pre-trained large language model (LLM), and this vector is used to assess the individual’s disease risk. We tested GENPHIRE using UK Biobank data across a broad range of diseases and found that it outperforms state-of-the-art PRS models more than 80% of the time. Our results demonstrate that LLM-derived embeddings can be leveraged for disease risk prediction when an individual’s genotype profile is effectively represented. Our findings highlight a promising alternative strategy that complements existing PRS approaches.
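
The translation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Individual` class, the `genotype_to_sentence` function, the per-phenotype score values, the `top_k` cutoff, and the sentence wording are all hypothetical, chosen only to show how a genotype-derived ranking of phenotypes might be turned into text. The embedding step is not shown, since the abstract does not name the LLM used.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Individual:
    """Hypothetical container for one person's basic clinical information."""
    age: int
    sex: str
    # Assumed input: for each phenotype, how elevated the person's
    # risk-allele count is relative to some reference (higher = more elevated).
    risk_allele_scores: Dict[str, float]


def genotype_to_sentence(person: Individual, top_k: int = 3) -> str:
    """Build a text 'sentence' from clinical info and top-ranked phenotypes.

    Phenotypes are ordered from most to least elevated; only phenotypes
    with a positive score are kept (an assumption made for this sketch).
    """
    ranked = sorted(
        person.risk_allele_scores.items(),
        key=lambda item: item[1],
        reverse=True,
    )
    top = [name for name, score in ranked[:top_k] if score > 0]
    prefix = f"A {person.age}-year-old {person.sex}"
    if not top:
        return f"{prefix} with no phenotypes showing elevated risk-allele counts."
    return f"{prefix} with elevated genetic risk for: {', '.join(top)}."


# Illustrative values only; not drawn from any real data set.
example = Individual(
    age=58,
    sex="male",
    risk_allele_scores={
        "type 2 diabetes": 2.1,
        "hypertension": 0.4,
        "asthma": -0.3,
        "coronary artery disease": 1.5,
    },
)
sentence = genotype_to_sentence(example)
print(sentence)
```

In a full pipeline, the resulting sentence would be passed to a pre-trained LLM to obtain an embedding vector, which a downstream classifier could then use to estimate disease risk.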

Authors

  • Danwei Yao; Chang Liu; Shifan Yan; Jiayi Zhang; Yan V. Sun; Zhaohui S. Qin