Automation of Trainable Datasets Generation for Medical-Specific Language Model: Using MIMIC-IV Discharge Notes.

Journal: Studies in health technology and informatics
PMID:

Abstract

This study introduces a novel approach for automatically generating instruction datasets for fine-tuning medical-specialized language models from MIMIC-IV discharge records. The resulting large-scale text dataset consists of instructions, cropped discharge notes as inputs, and outputs, stored in JSONL format. The dataset was generated in three main stages: instruction generation and output generation using seed tasks provided by medical experts, followed by filtering of invalid data. The generated dataset comprised 51,385 sets, with a mean ROUGE score between seed tasks of 0.185. Evaluation of the generated dataset was promising, with high validity rates as judged by both GPT-3.5 and a human annotator (88.0% and 88.5%, respectively). The study highlights the potential of automating dataset creation for NLP tasks in the medical domain.
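
The record layout and filtering step described above might look roughly like the following Python sketch. It is an illustration under stated assumptions, not the authors' pipeline: the field names ("instruction", "input", "output"), the filter rule, the helper names, and the choice of ROUGE-L F1 over whitespace tokens are all hypothetical, since the abstract does not specify the ROUGE variant or the filtering criteria. The sample texts are invented placeholders, not MIMIC-IV content.

```python
import json


def make_record(instruction, note_excerpt, output):
    # Assumed layout: one JSON object per line with three string fields.
    return {"instruction": instruction, "input": note_excerpt, "output": output}


def is_valid(record):
    # Hypothetical filter rule: every field must be a non-empty string.
    return all(
        isinstance(record.get(key), str) and record[key].strip()
        for key in ("instruction", "input", "output")
    )


def to_jsonl(records):
    # Serialize only the records that pass the filter, one per line.
    return "\n".join(
        json.dumps(r, ensure_ascii=False) for r in records if is_valid(r)
    )


def lcs_length(a, b):
    # Longest common subsequence length via row-by-row dynamic programming.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    # ROUGE-L F1 on lowercased whitespace tokens (an assumed variant).
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Placeholder records; the second one is dropped by the filter.
records = [
    make_record(
        "Summarize the hospital course.",
        "Placeholder discharge note excerpt.",
        "Placeholder summary.",
    ),
    make_record("", "Placeholder excerpt.", "Placeholder output."),
]
jsonl_text = to_jsonl(records)
```

A low pairwise score from a function like `rouge_l_f1` indicates that two instructions share little wording, which is one way to read a mean ROUGE of 0.185 as a sign of diversity.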

Authors

  • Youngrong Lee
    Department of Medical Informatics, College of Medicine, The Catholic University of Korea, Republic of Korea.
  • Chansik Kim
    Department of Medical Informatics, College of Medicine, The Catholic University of Korea, Republic of Korea.
  • Taehoon Ko
    Office of Hospital Information, Seoul National University Hospital, 101 Daehak-ro, Jongno-gu, Seoul, 03080, South Korea.