Using Machine Learning and Natural Language Processing to Review and Classify the Medical Literature on Cancer Susceptibility Genes.

Journal: JCO clinical cancer informatics

Abstract

PURPOSE: The medical literature relevant to germline genetics is growing exponentially. Clinicians need tools that help to monitor and prioritize the literature to understand the clinical implications of pathogenic genetic variants. We developed and evaluated two machine learning models to classify abstracts as relevant to the penetrance (risk of cancer for germline mutation carriers) or prevalence of germline genetic mutations.

Authors

  • Yujia Bao
    The Massachusetts Institute of Technology, Cambridge, MA, USA.
  • Zhengyi Deng
    Massachusetts General Hospital, Boston, MA.
  • Yan Wang
    College of Animal Science and Technology, Beijing University of Agriculture, Beijing, China.
  • Heeyoon Kim
    Massachusetts Institute of Technology, Boston, MA.
  • Victor Diego Armengol
    Massachusetts General Hospital, Boston, MA.
  • Francisco Acevedo
    Massachusetts General Hospital, Boston, MA.
  • Nofal Ouardaoui
    Harvard T.H. Chan School of Public Health, Boston, MA.
  • Cathy Wang
    Harvard T.H. Chan School of Public Health, Boston, MA.
  • Giovanni Parmigiani
    Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA 02215; gp@jimmy.harvard.edu.
  • Regina Barzilay
    Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA. Email: regina@csail.mit.edu.
  • Danielle Braun
    Harvard T.H. Chan School of Public Health, Boston, MA.
  • Kevin S Hughes
    Division of Surgical Oncology, MGH, Boston, USA.