Automation of Trainable Datasets Generation for Medical-Specific Language Model: Using MIMIC-IV Discharge Notes.

Journal: Studies in health technology and informatics

PMID: 39176825

Abstract

This study introduces a novel approach for automatically generating instruction datasets for fine-tuning medical-specialized language models from MIMIC-IV discharge records. The study created a large-scale text dataset in JSONL format, in which each entry comprises an instruction, a cropped discharge note as input, and an output. The dataset was generated in three main stages: generating instructions and generating outputs using seed tasks provided by medical experts, followed by filtering of invalid data. The generated dataset consisted of 51,385 sets, with a mean ROUGE between seed tasks of 0.185. Evaluation of the generated dataset was promising, with high validity rates determined by both GPT-3.5 and a human annotator (88.0% and 88.5%, respectively). The study highlights the potential of automating dataset creation for NLP tasks in the medical domain.
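The record layout and the overlap measure mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The field names `instruction`, `input`, and `output` follow the abstract's wording, but the exact schema is an assumption. The abstract does not say which ROUGE variant was used, so ROUGE-L F1 over whitespace tokens is assumed here. The `is_valid` filter is a hypothetical stand-in for the unspecified invalid-data filtering stage, and the sample records are synthetic placeholders, not MIMIC-IV text.

```python
import json


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 on lowercased whitespace tokens (assumed variant)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def mean_pairwise_rouge(texts):
    """Average ROUGE-L F1 over all unordered pairs of texts."""
    pairs = [(a, b) for i, a in enumerate(texts) for b in texts[i + 1:]]
    if not pairs:
        return 0.0
    return sum(rouge_l_f1(a, b) for a, b in pairs) / len(pairs)


def is_valid(record):
    """Hypothetical filter: all three fields must be non-empty strings."""
    return all(
        isinstance(record.get(key), str) and record[key].strip()
        for key in ("instruction", "input", "output")
    )


def to_jsonl(records):
    """Serialize the valid records, one JSON object per line."""
    return "\n".join(
        json.dumps(rec, ensure_ascii=False) for rec in records if is_valid(rec)
    )


# Synthetic placeholder records, for illustration only.
records = [
    {"instruction": "List the discharge medications.",
     "input": "<cropped discharge note>",
     "output": "<generated answer>"},
    {"instruction": "Summarize the hospital course.",
     "input": "<cropped discharge note>",
     "output": ""},  # empty output, dropped by the filter
]

if __name__ == "__main__":
    print(to_jsonl(records))
    print(mean_pairwise_rouge([rec["instruction"] for rec in records]))
```

A low mean pairwise score, such as the 0.185 reported between seed tasks, indicates that the texts share little wording. The sketch computes the same kind of average, though not necessarily with the same ROUGE variant or tokenization as the study.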

Authors

Youngrong Lee

Department of Medical Informatics, College of Medicine, The Catholic University of Korea, Republic of Korea.

Chansik Kim

Department of Medical Informatics, College of Medicine, The Catholic University of Korea, Republic of Korea.

Taehoon Ko

Office of Hospital Information, Seoul National University Hospital, 101 Daehak-ro, Jongno-gu, Seoul, 03080, South Korea.

Keywords

Electronic Health Records; Humans; Natural Language Processing; Patient Discharge; Patient Discharge Summaries

