Automating Data Extraction from PDF Sleep Reports Using Data Mining Techniques.

Journal: Studies in health technology and informatics

Published Date: May 15, 2025

Abstract

This work introduces a web application for extracting, processing, and visualizing data from sleep study reports. Using Optical Character Recognition (OCR) and Natural Language Processing (NLP), the pipeline extracts over 75 key data points from four types of sleep reports. The web application offers an intuitive interface for viewing the details of individual reports and aggregating data across multiple reports. The pipeline achieved 100% accuracy in extracting the targeted information from a test set of 40 reports, even in cases with missing data or formatting inconsistencies. The tool streamlines the analysis of obstructive sleep apnea (OSA) reports, reducing the need for technical expertise and enabling healthcare providers and researchers to use sleep study data efficiently. Future work aims to expand the dataset to support more complex analyses and imputation techniques.
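The abstract does not describe how the extraction step is implemented. As a rough illustration only, the sketch below shows one common way to pull labelled numeric values out of text that has already been produced by OCR, returning None for fields that are missing. The field names, label spellings, and regular expressions are hypothetical and are not taken from the paper.

```python
import re
from typing import Dict, Optional

# Hypothetical field labels and patterns, chosen for illustration only.
# Each pattern captures one number that may use "." or "," as the decimal mark.
NUMBER = r"(\d+(?:[.,]\d+)?)"
FIELD_PATTERNS = {
    "ahi": re.compile(r"\bAHI\b\s*[:=]?\s*" + NUMBER, re.IGNORECASE),
    "min_spo2": re.compile(r"\bMin(?:imum)?\s+SpO2\s*[:=]?\s*" + NUMBER, re.IGNORECASE),
    "total_sleep_time_min": re.compile(r"\bTotal\s+Sleep\s+Time\s*[:=]?\s*" + NUMBER, re.IGNORECASE),
}


def extract_fields(ocr_text: str) -> Dict[str, Optional[float]]:
    """Return one value per known field; fields not found in the text map to None."""
    result: Dict[str, Optional[float]] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        if match is None:
            # Missing data is recorded explicitly rather than raising an error.
            result[name] = None
        else:
            # Normalise a decimal comma to a decimal point before converting.
            result[name] = float(match.group(1).replace(",", "."))
    return result


# Made-up sample text standing in for OCR output; it omits the SpO2 field.
sample = "Total Sleep Time: 412,5 min\nAHI = 17.3 /h\n"
fields = extract_fields(sample)
print(fields)
```

Recording missing fields as None keeps every report in the same shape, which makes it easier to aggregate values across multiple reports later.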

Authors

Fábio Teixeira
University of Porto, Portugal.

João Costa
University of Porto, Portugal.

Pedro Amorim
University of Porto, Portugal.

Nuno Guimarães
University of Porto, Portugal.

Daniela Ferreira-Santos
CINTESIS - Centre for Health Technology and Services Research, Portugal.

Keywords

Data Mining; Electronic Health Records; Humans; Natural Language Processing; Polysomnography

External Resources

PubMed (40380606)
