Harnessing genotype and phenotype data for population-scale variant classification using large language models and Bayesian inference.

Journal: Human Genetics

Published Date: Apr 23, 2025

Abstract

Variants of Uncertain Significance (VUS) in genetic testing for hereditary diseases burden patients and clinicians, yet clinical data that could reduce VUS are underutilized due to a lack of scalable strategies. We assessed whether a machine learning approach using genotype and phenotype data could improve variant classification and reduce VUS. In this cohort study of a multi-step machine learning approach, patient data from test requisition forms were used to distinguish patients with molecular diagnoses from controls ("patient score"). A generative Bayesian model then used patient scores and variant classifications to infer variant pathogenicity ("variant score"). The study included 3.5 million patients referred for clinical genetic testing across various conditions. Primary outcomes were model- and gene-level discrimination, classification performance, probabilistic calibration, and concordance with orthogonal pathogenicity measures. Integration into a semi-quantitative classification framework was based on posterior pathogenicity probabilities matching PPV ≥ 0.99/NPV ≥ 0.95 thresholds, followed by expert review. We generated 1,334 clinical variant models (CVMs); 595 showed high performance in both machine learning steps (AUROC_patient ≥ 0.8 and AUROC_variant ≥ 0.8) on held-out data. High-confidence predictions from these CVMs provided evidence for 5,362 VUS observed in 200,174 patients, representing 23.4% of all VUS observations in these genes. In 17 frequently tested genes, CVMs reclassified over 1,000 unique VUS, reducing VUS report rates by 9–49% per condition. In conclusion, a scalable machine learning approach using underutilized clinical data improved variant classification and reduced VUS.
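The two-step idea in the abstract (per-patient scores feeding a Bayesian estimate of variant pathogenicity, then PPV/NPV checks) can be illustrated with a toy sketch. This is not the authors' model: the Gaussian score likelihoods, the prior of 0.1, the assumption that carriers' scores are independent, and the names `posterior_pathogenic` and `ppv_npv` are all hypothetical choices made only for illustration.

```python
import math

def posterior_pathogenic(patient_scores, prior=0.1,
                         mean_path=0.7, mean_benign=0.3, sd=0.2):
    """Toy posterior P(pathogenic | carrier patient scores).

    Assumption (not from the paper): scores of patients carrying the
    variant are independent draws from a Gaussian whose mean depends on
    whether the variant is pathogenic or benign.
    """
    def log_normal(x, mu):
        return -0.5 * ((x - mu) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

    log_path = math.log(prior) + sum(log_normal(s, mean_path) for s in patient_scores)
    log_ben = math.log(1 - prior) + sum(log_normal(s, mean_benign) for s in patient_scores)
    # Normalize in log space to avoid underflow with many carriers.
    m = max(log_path, log_ben)
    p, b = math.exp(log_path - m), math.exp(log_ben - m)
    return p / (p + b)

def ppv_npv(calls):
    """PPV and NPV from (predicted_pathogenic, truly_pathogenic) pairs."""
    tp = sum(1 for pred, truth in calls if pred and truth)
    fp = sum(1 for pred, truth in calls if pred and not truth)
    tn = sum(1 for pred, truth in calls if not pred and not truth)
    fn = sum(1 for pred, truth in calls if not pred and truth)
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv

if __name__ == "__main__":
    # Hypothetical scores for three carriers of one variant.
    print(posterior_pathogenic([0.8, 0.9, 0.75]))
    # Hypothetical held-out calls used to check the PPV/NPV targets.
    held_out = ([(True, True)] * 99 + [(True, False)]
                + [(False, False)] * 95 + [(False, True)] * 5)
    ppv, npv = ppv_npv(held_out)
    print(ppv >= 0.99, npv >= 0.95)
```

In a sketch like this, posterior cutoffs would be tuned on held-out variants with known classifications until the resulting calls meet the PPV ≥ 0.99 and NPV ≥ 0.95 targets named in the abstract; how the study actually performs that calibration is not described here.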

Authors

Toby R Manders
Labcorp Genetics Inc, 1400 16th Street, San Francisco, CA, 94103, USA. toby.manders@labcorp.com.

Christopher A Tan
Labcorp Genetics Inc, 1400 16th Street, San Francisco, CA, 94103, USA.

Yuya Kobayashi
Invitae Corporation, San Francisco, California, USA.

Alexander Wahl
Labcorp Genetics Inc, 1400 16th Street, San Francisco, CA, 94103, USA.

Carlos Araya
Invitae Corporation, 1400 16th Street, San Francisco, CA, 94103, USA.

Alexandre Colavin
Invitae Corporation, San Francisco, California, USA.

Flavia M Facio
Invitae Corporation, San Francisco, California, USA.

Hillery Metz
Invitae Corporation, San Francisco, California, USA.

Jason Reuter
Labcorp Genetics Inc, 1400 16th Street, San Francisco, CA, 94103, USA.

Laure Frésard
Labcorp Genetics Inc, 1400 16th Street, San Francisco, CA, 94103, USA.

Samskruthi R Padigepati
Labcorp Genetics Inc, 1400 16th Street, San Francisco, CA, 94103, USA.

David A Stafford
Labcorp Genetics Inc, 1400 16th Street, San Francisco, CA, 94103, USA.

Robert L Nussbaum
Invitae Corporation, San Francisco, California, USA.

Keith Nykamp
Invitae Corporation, San Francisco, California, USA.

Keywords

Bayes Theorem; Genetic Testing; Genetic Variation; Genotype; Humans; Language; Large Language Models; Machine Learning; Phenotype

External Resources

PubMed (PMID: 40266329)
