Optimizing Data Extraction: Harnessing RAG and LLMs for German Medical Documents.

Journal: Studies in health technology and informatics
PMID:

Abstract

In medical data analysis, converting unstructured text documents into a structured format suitable for further use is a significant challenge. This study introduces an automated, locally deployed, privacy-preserving pipeline that uses open-source Large Language Models (LLMs) with a Retrieval-Augmented Generation (RAG) architecture to convert German-language medical documents containing sensitive health-related information into a structured format. In tests on a proprietary dataset of 800 unstructured original medical reports, the pipeline achieved a data extraction accuracy of up to 90% compared with data extracted manually by physicians and medical students. This highlights the pipeline's potential as a valuable tool for efficiently extracting relevant data from unstructured sources.
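
The abstract does not describe the retriever, the model, or the prompts, so the following is only a minimal sketch of the general retrieve-then-extract pattern it names, not the authors' implementation. It assumes a word-overlap retriever as a stand-in for embedding search, a caller-supplied `llm` callable standing in for a locally hosted open-source model, and an exact-match field accuracy measured against manually extracted reference values. All function names, field names, and parameters are hypothetical.

```python
import json
import re
from typing import Callable, Dict, List, Set


def split_into_chunks(text: str, max_words: int = 40) -> List[str]:
    """Split a report into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(chunks: List[str], query: str, top_k: int = 2) -> List[str]:
    """Rank chunks by word overlap with the query.

    Assumption: a real pipeline would likely use embedding similarity here.
    """
    query_tokens = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(query_tokens & tokenize(c)), reverse=True)
    return ranked[:top_k]


def extract_fields(
    report: str,
    fields: Dict[str, str],
    llm: Callable[[str], str],
) -> Dict[str, str]:
    """For each target field, retrieve context and ask the model for a JSON value.

    `fields` maps a hypothetical output field name to a retrieval query.
    `llm` is any callable that takes a prompt and returns a JSON string.
    """
    chunks = split_into_chunks(report)
    result: Dict[str, str] = {}
    for name, query in fields.items():
        context = "\n".join(retrieve(chunks, query))
        prompt = (
            f"Context:\n{context}\n\n"
            f'Return JSON of the form {{"{name}": "<value>"}} for: {query}'
        )
        result[name] = json.loads(llm(prompt)).get(name, "")
    return result


def field_accuracy(predicted: Dict[str, str], reference: Dict[str, str]) -> float:
    """Share of reference fields whose predicted value matches (case-insensitive)."""
    matches = sum(
        predicted.get(key, "").strip().lower() == value.strip().lower()
        for key, value in reference.items()
    )
    return matches / len(reference)
```

In such a setup, `extract_fields` would be called once per report with a wrapper around the locally deployed model, and `field_accuracy` would compare its output with the values extracted manually by physicians and medical students. Exact string matching is a simplifying assumption; the abstract does not state how agreement was scored.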

Authors

  • Yingding Wang
    Department of Pediatrics, Dr. von Hauner Children's Hospital, University Hospital, LMU Munich, Munich, Germany.
  • Simon Leutner
    Medical Technology and IT (MIT), University Hospital, LMU Munich, Munich, Germany.
  • Michael Ingrisch
    Department of Radiology, Ludwig-Maximilians-University Munich, Munich, Germany.
  • Christoph Klein
    Department of Pediatrics, Dr. von Hauner Children's Hospital, University Hospital, Ludwig-Maximilians-Universität München, Munich, Germany.
  • Ludwig Christian Hinske
    Institute for Digital Medicine, University Hospital Augsburg, Augsburg, Germany.
  • Katharina Danhauser
    Department of Pediatrics, Dr. von Hauner Children's Hospital, University Hospital, LMU Munich, Munich, Germany.