PDF text classification to leverage information extraction from publication reports.

Journal: Journal of Biomedical Informatics
Published Date:

Abstract

OBJECTIVES: Data extraction from original study reports is a time-consuming, error-prone process in systematic review development. Information extraction (IE) systems have the potential to assist humans in the extraction task; however, the majority of IE systems were not designed to work on Portable Document Format (PDF) documents, an important and common extraction source for systematic reviews. In a PDF document, narrative content is often mixed with publication metadata or semi-structured text, which adds challenges to the underlying natural language processing algorithms. Our goal is to categorize PDF texts for strategic use by IE systems.

Authors

  • Duy Duc An Bui
  • Guilherme Del Fiol
    Department of Biomedical Informatics, University of Utah School of Medicine, Salt Lake City, UT, United States.
  • Siddhartha Jonnalagadda
    Department of Preventive Medicine-Health and Biomedical Informatics, Northwestern University, Chicago, IL, USA.