Recurrent Neural Networks to Automatically Identify Rare Disease Epidemiologic Studies from PubMed.

Journal: AMIA Joint Summits on Translational Science proceedings
PMID:

Abstract

Rare diseases affect between 25 and 30 million people in the United States, and understanding their epidemiology is critical to focusing research efforts. However, little is known about the prevalence of many rare diseases. Because automated tools are lacking, epidemiologic data are currently identified and collected through manual curation. To accelerate this process systematically, we developed a novel predictive model to programmatically identify epidemiologic studies on rare diseases from PubMed. Specifically, we trained a long short-term memory (LSTM) recurrent neural network to predict whether a PubMed abstract represents an epidemiologic study. The model performed well on our validation set (precision = 0.846, recall = 0.937, AUC = 0.967) and obtained satisfactory results on the test set. It thus shows promise for accelerating epidemiologic data curation in rare diseases and could be extended to other study types and other disease domains.
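
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how precision, recall, and AUC can be computed for a binary abstract classifier. It is not the authors' evaluation code: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and the numbers it produces are unrelated to the results in the abstract.

```python
def precision_recall(labels, scores, threshold=0.5):
    """Precision and recall for binary labels at a score threshold.

    The 0.5 default threshold is an illustrative assumption.
    """
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = epidemiologic study, 0 = other abstract.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1, 0.7, 0.3]

precision, recall = precision_recall(labels, scores)
auc = roc_auc(labels, scores)
print(f"precision={precision:.3f} recall={recall:.3f} AUC={auc:.3f}")
```

Precision and recall depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why a model can report a high AUC alongside a lower precision.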

Authors

  • Jennifer N John
    Stanford University, Stanford, CA.
  • Eric Sid
    Office of Rare Disease Research, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Bethesda, MD, 20892, USA.
  • Qian Zhu
    Institute for Prevention and Control of AIDS and STD, Henan Center for Disease Control and Prevention, Zhengzhou 450016, Henan, China.