Deep neural networks and distant supervision for geographic location mention extraction.

Journal: Bioinformatics (Oxford, England)
PMID:

Abstract

MOTIVATION: To model virus spread, virus phylogeographers rely on viral DNA sequences and the locations of the infected hosts, both drawn from public sequence databases such as GenBank. However, the locations in GenBank records are often given only at the country or state level, so phylogeographers may need to scan the journal articles associated with the records to identify more localized geographic areas. To automate this process, we present a named entity recognizer (NER) for detecting locations in biomedical literature. We built the NER using a deep feedforward neural network that determines whether a given token is a toponym. To overcome the limited human-annotated data available for training, we use distant supervision techniques to generate additional samples to train our NER.
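
As an illustration of the general idea of distant supervision for token-level toponym detection, the sketch below labels tokens automatically by matching them against a small list of known place names and then builds fixed-size context windows that a feedforward classifier could take as input. This is a minimal sketch under stated assumptions, not the authors' pipeline: the gazetteer contents, the matching strategy, the window size, and all function names (`tokenize`, `distant_labels`, `windowed_samples`) are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical mini-gazetteer (assumption): lowercase token tuples of place names.
# A real system would draw on a much larger resource.
GAZETTEER = {("new", "york"), ("hanoi",), ("guangdong",)}
MAX_LEN = max(len(entry) for entry in GAZETTEER)


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def distant_labels(tokens):
    """Label each token 1 (toponym) or 0 by greedy longest match against the gazetteer."""
    lowered = [t.lower() for t in tokens]
    labels = [0] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first so "New York" beats a shorter match.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in GAZETTEER:
                matched = n
                break
        if matched:
            for j in range(i, i + matched):
                labels[j] = 1
            i += matched
        else:
            i += 1
    return labels


def windowed_samples(tokens, labels, size=2, pad="<PAD>"):
    """Pair each token's context window (size tokens on each side) with its label."""
    padded = [pad] * size + tokens + [pad] * size
    return [(padded[k:k + 2 * size + 1], labels[k]) for k in range(len(tokens))]


if __name__ == "__main__":
    toks = tokenize("Samples were collected in New York and Hanoi.")
    for window, label in windowed_samples(toks, distant_labels(toks)):
        print(label, window)
```

Labels produced this way are noisy: a gazetteer match may not be a location in context, and unlisted places are missed. That noise is the usual trade-off of distant supervision against the cost of manual annotation.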

Authors

  • Arjun Magge
    Health Language Processing Center, Institute for Biomedical Informatics at the Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA.
  • Davy Weissenbacher
    Health Language Processing Center, Institute for Biomedical Informatics at the Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA.
  • Abeed Sarker
    Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA.
  • Matthew Scotch
    Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ, USA.
  • Graciela Gonzalez-Hernandez
    Health Language Processing Center, Institute for Biomedical Informatics at the Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA.