Unsupervised Extraction of Body-Text from Clinical PDF Documents.

Journal: Studies in health technology and informatics

Published Date: Aug 22, 2024

Abstract

Automatic extraction of body-text from clinical PDF documents is necessary to enhance downstream NLP tasks but remains a challenge. This study presents an unsupervised algorithm designed to extract body-text by leveraging a large volume of data. Using DBSCAN clustering over aggregated pages, our method extracts and organizes text blocks using their content and coordinates. Evaluation results demonstrate precision scores ranging from 0.82 to 0.98, recall scores from 0.62 to 0.94, and F1-scores from 0.71 to 0.96 across various medical specialty sources. Future work includes dynamic parameter adjustments for improved accuracy and the use of larger datasets.
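The abstract does not give implementation details, so the following is only a minimal sketch of one plausible reading of "DBSCAN clustering over aggregated pages", not the authors' method. It assumes that text blocks recurring at nearly the same coordinates with the same content across many pages (for example letterheads and footers) are non-body text, and that everything else is body-text. The block fields (`x`, `y`, `text`), the helper names `dbscan` and `extract_body`, and the parameters `eps`, `min_samples` and `repeat_ratio` are all hypothetical. A small pure-Python DBSCAN is used to keep the example self-contained; in practice a library implementation would normally be preferred.

```python
from collections import Counter


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN over 2-D points; returns one label per point, -1 for noise."""
    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1              # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


def extract_body(blocks, eps=5.0, min_samples=3, repeat_ratio=0.8):
    """Drop blocks that repeat in place across pages (assumed boilerplate).

    All blocks from all pages are pooled and clustered on their coordinates.
    A cluster is treated as boilerplate when one text accounts for at least
    `repeat_ratio` of its members; this criterion is an assumption.
    """
    labels = dbscan([(b["x"], b["y"]) for b in blocks], eps, min_samples)
    members = {}
    for label, block in zip(labels, blocks):
        if label != -1:
            members.setdefault(label, []).append(block)
    boilerplate = set()
    for label, group in members.items():
        top_count = Counter(b["text"] for b in group).most_common(1)[0][1]
        if top_count / len(group) >= repeat_ratio:
            boilerplate.add(label)
    return [b for label, b in zip(labels, blocks) if label not in boilerplate]


# Toy data invented for illustration: three pages sharing a header and a footer.
pages = ["Patient admitted for chest pain.", "ECG showed no abnormality.",
         "Discharged with follow-up."]
blocks = []
for n, body in enumerate(pages):
    blocks.append({"x": 50 + n % 2, "y": 20, "text": "Hospital letterhead"})
    blocks.append({"x": 50, "y": 100, "text": body})
    blocks.append({"x": 50, "y": 800, "text": "Confidential"})

for block in extract_body(blocks):
    print(block["text"])
```

In this sketch the coordinates decide which blocks are grouped and the content decides whether a group is repeated boilerplate, so body paragraphs that merely start at the same position on every page are kept.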

Authors

Adel Bensahla
Division of Medical Information Sciences, Geneva University Hospitals, Geneva, Switzerland.

Jamil Zaghir
Division of Medical Information Sciences, University Hospitals of Geneva.

Christophe Gaudet-Blavignac
Division of Medical Information Sciences, Geneva University Hospitals and University of Geneva.

Christian Lovis
Division of Medical Information Sciences, Geneva University Hospitals and University of Geneva.

Keywords

Algorithms; Data Mining; Electronic Health Records; Humans; Natural Language Processing; Unsupervised Machine Learning

External Resources

PubMed ID: 39176711
