Natural Language Processing to Build a Multicenter Computable Phenotype Library for Adults with Congenital Heart Disease
Journal: medRxiv
Published Date: Jan 1, 2025
Abstract
Our objective was to build classifiers for multiple phenotypes that categorize a cohort of adults with congenital heart disease (ACHD) and that can be used to populate variables in a biobank. A dataset of 1492 ACHD patients, with expert-created labels for eight phenotypes, was created and used to train classifiers with three different architectures. A larger unlabeled dataset of 15869 patients was used to pre-train the classifiers, and a 20% subset of the unlabeled dataset was used to validate the classifier predictions. On held-out labeled data, F1 scores for the eight target phenotypes ranged from 0.66 to 1. The six phenotypes with the best classification performance were then validated on unlabeled data, where positive predictive value ranged from 81.5% to 100%. We were thus able to classify six of the eight phenotypes with satisfactory performance. Challenging phenotypes included cyanosis and New York Heart Association functional class. Both vary over time, and for the latter there is limited agreement between human observers. Different phenotypes benefited from different model architectures to some degree, but the differences were small enough that uniformity of deployment may be a more important factor in choosing which models to deploy. We saw no benefit from joint training, but some phenotypes benefited from a multiclass model. Human-curated data can be used to train NLP-based ACHD phenotype classifiers with excellent test characteristics, acceptable for application in quality improvement efforts and for populating ACHD registry data.
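For readers less familiar with the two metrics reported above, the following is a minimal sketch of how F1 score and positive predictive value (PPV) are computed for a single binary phenotype. It is not the study's evaluation code. The function names and the toy label lists are hypothetical and serve only to illustrate the definitions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, fn


def ppv_recall_f1(tp, fp, fn):
    """Return (PPV, recall, F1); each is 0.0 when its denominator is zero."""
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * ppv * recall / (ppv + recall) if (ppv + recall) else 0.0
    return ppv, recall, f1


# Hypothetical toy labels (1 = phenotype present), not data from the study.
expert_labels = [1, 1, 1, 0, 0, 1]
model_labels = [1, 1, 0, 1, 0, 1]

tp, fp, fn = confusion_counts(expert_labels, model_labels)
ppv, recall, f1 = ppv_recall_f1(tp, fp, fn)
print(f"PPV={ppv:.2f} recall={recall:.2f} F1={f1:.2f}")
```

PPV needs only the predicted positives to be reviewed, which may be why it is a convenient metric when predictions on unlabeled records are checked. F1 also requires knowing the false negatives, so it needs fully labeled data.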