Large language models identify causal genes in complex trait GWAS

Journal: medRxiv
Published Date:

Abstract

Identifying causal genes at genome-wide association study (GWAS) loci remains a major challenge. Literature evidence for disease-gene co-occurrence, whether gathered through automated approaches or human expert annotation, is one way of nominating causal genes at GWAS loci. However, current automated approaches are limited in accuracy and generalizability, and expert annotation does not scale to hundreds of thousands of significant findings. Here, we demonstrate that large language models (LLMs) can accurately prioritize likely causal genes at GWAS loci. We rigorously evaluated several widely available general-purpose LLMs using a benchmark of high-confidence causal gene annotations, including a novel set of 26 previously unpublished GWAS. Our results show that LLMs outperform current state-of-the-art methods and substantially augment the performance of those methods. These findings establish LLMs as a powerful, efficient, and scalable approach to causal gene discovery.

Authors

  • Suyash S. Shringarpure; Wei Wang; Sotiris Karagounis; Xin Wang; Anna C. Reisetter; Adam Auton; Aly A. Khan