Expanding biobank pharmacogenomics through machine learning calls of structural variation.
Journal: Genetics
Published Date: Jul 9, 2025
Abstract
Biobanks linking genetic data with clinical health records provide exciting opportunities for pharmacogenomic (PGx) research on genetic variation and drug response. Designed as central and multiuse resources, biobanks can facilitate diverse PGx research efforts, including the study of drug efficacy and adverse effects. Specialized PGx alleles and phenotypes are critical for such studies and can be conveniently called from existing array-based genotypes routinely collected in most biobanks. We describe a central callset of PGx alleles and phenotypes in over 80,000 participants of the Michigan Genomics Initiative (MGI) biobank, created using the PyPGx software on Trans-Omics for Precision Medicine-imputed genotypes. The array-based PGx allele calls demonstrate concordance (>92%) with a set of PCR-validated alleles collected during clinical care, but do not identify PGx alleles dependent on structural variation, including the clinically important CYP2D6*5 deletion. To address this, we developed a support vector machine trained on genotype array single nucleotide variant probe intensities to classify CYP2D6*5 carriers. This method had >99% accuracy and reclassified ∼7% of African American and ∼4% of White MGI participants to lower activity metabolizer phenotypes, predicting higher risks of adverse drug reactions. We demonstrate that central PGx callsets created with existing tools and genetic data can be augmented by customized calls for challenging alleles based on structural variants to broaden the research potential and clinical utility of biobanks. These PGx callsets can be created in biobanks with existing array-based genotype data and highlight the utility of advanced computational methods in PGx allele identification.
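As an illustration of the classification step described in the abstract, the following is a minimal sketch of a linear support vector machine trained on probe intensities to separate deletion carriers from non-carriers. It is not the authors' model. The abstract does not state the kernel, features, or hyperparameters, so everything here is assumed: the data are synthetic, the premise that carriers show lower normalized intensity across probes in the gene region is a simplification, and the function names (`simulate_intensities`, `train_linear_svm`, `predict`, `accuracy`) are hypothetical. A simple stochastic subgradient method on the regularized hinge loss stands in for a full SVM solver.

```python
import random


def simulate_intensities(n_samples, n_probes=5, carrier_shift=-0.5, noise=0.15, seed=0):
    """Return (X, y): synthetic normalized probe intensities and labels.

    y is +1 for a simulated deletion carrier and -1 otherwise. The shift and
    noise values are invented for illustration, not taken from the study.
    """
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        carrier = (i % 2 == 0)  # alternate labels to keep the toy set balanced
        mean = carrier_shift if carrier else 0.0
        X.append([rng.gauss(mean, noise) for _ in range(n_probes)])
        y.append(1 if carrier else -1)
    return X, y


def decision_value(w, b, x):
    """Signed score of the linear classifier: w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.001, eta=0.05, epochs=20, seed=0):
    """Fit a linear SVM by stochastic subgradient descent on the
    L2-regularized hinge loss with a fixed step size (assumed values)."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            # Hinge loss is active when the sample lies inside the margin.
            violated = y[i] * decision_value(w, b, X[i]) < 1
            # Shrink the weights (gradient of the L2 penalty).
            w = [(1 - eta * lam) * wj for wj in w]
            if violated:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b


def predict(w, b, x):
    """Return +1 (carrier) or -1 (non-carrier)."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def accuracy(w, b, X, y):
    """Fraction of samples whose predicted label matches y."""
    correct = sum(1 for xi, yi in zip(X, y) if predict(w, b, xi) == yi)
    return correct / len(y)


# Train on one synthetic cohort and evaluate on a second, independent one.
X_train, y_train = simulate_intensities(200, seed=1)
X_test, y_test = simulate_intensities(100, seed=2)
w, b = train_linear_svm(X_train, y_train)
print("held-out accuracy on synthetic data:", accuracy(w, b, X_test, y_test))
```

The accuracy printed by this sketch reflects only how cleanly the synthetic classes separate and says nothing about the >99% figure reported in the abstract. With real array data, the intensities would first need normalization across batches, and labels would have to come from an orthogonal assay such as the PCR-validated alleles mentioned in the abstract.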