Automating Data Extraction from PDF Sleep Reports Using Data Mining Techniques.

Journal: Studies in Health Technology and Informatics
Published Date:

Abstract

This work introduces a web application for extracting, processing, and visualizing data from sleep study reports. Using Optical Character Recognition (OCR) and Natural Language Processing (NLP), the pipeline extracts over 75 key data points from four types of sleep reports. The web application offers an intuitive interface for viewing the details of individual reports and aggregating data across multiple reports. On a test set of 40 reports, the pipeline extracted the targeted information with 100% accuracy, even in cases with missing data or formatting inconsistencies. The tool streamlines the analysis of obstructive sleep apnea (OSA) reports, reducing the need for technical expertise and enabling healthcare providers and researchers to use sleep study data efficiently. Future work aims to expand the dataset to support more complex analyses and imputation techniques.
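The extraction step described in the abstract can be illustrated with a minimal sketch. It assumes the OCR stage has already produced plain text and uses regular expressions as a stand-in for the NLP component. The field names, label patterns, and sample text are hypothetical and not taken from the paper. Fields that cannot be found are returned as None, which is one simple way to tolerate missing data.

```python
import re
from typing import Dict, Optional

# Hypothetical field patterns (assumption: labels and formats are illustrative,
# not the ones used by the authors' pipeline).
FIELD_PATTERNS = {
    "ahi": re.compile(r"\bAHI\s*[:=]?\s*(\d+(?:[.,]\d+)?)", re.IGNORECASE),
    "min_spo2": re.compile(
        r"\bMin(?:imum)?\s+SpO2\s*[:=]?\s*(\d+(?:[.,]\d+)?)", re.IGNORECASE
    ),
    "total_sleep_time_min": re.compile(
        r"\bTotal\s+sleep\s+time\s*[:=]?\s*(\d+(?:[.,]\d+)?)", re.IGNORECASE
    ),
}


def parse_number(raw: str) -> float:
    """Convert a numeric string to float, accepting a decimal comma."""
    return float(raw.replace(",", "."))


def extract_fields(text: str) -> Dict[str, Optional[float]]:
    """Return one value per known field; None marks a field not found."""
    result: Dict[str, Optional[float]] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = parse_number(match.group(1)) if match else None
    return result


# Hypothetical OCR output with inconsistent separators and one missing field.
sample_text = """
Total Sleep Time: 412,5 min
AHI = 17.3 /h
"""

record = extract_fields(sample_text)
print(record)
```

Keeping the patterns in a single dictionary means that supporting another report type or data point would only require adding an entry, and the explicit None values make gaps visible when reports are later aggregated.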

Authors

  • Fábio Teixeira
    University of Porto, Portugal.
  • João Costa
    University of Porto, Portugal.
  • Pedro Amorim
    University of Porto, Portugal.
  • Nuno Guimarães
    University of Porto, Portugal.
  • Daniela Ferreira-Santos
    CINTESIS - Centre for Health Technology and Services Research, Portugal.