Large Language Models Can be Good Medical Annotators: A Case Study of Drug Change Detection in Japanese EHRs.
Journal:
Studies in Health Technology and Informatics
Published Date:
Aug 7, 2025
Abstract
In this study, we combined automatically generated labels from large language models (LLMs) with a small number of manual annotations to classify adverse event-related treatment discontinuations in Japanese electronic health records (EHRs). We fine-tuned JMedRoBERTa and T5 on 6,156 LLM-labeled records together with 200 manually labeled samples and evaluated them on a 100-record test set; T5 achieved a precision of 0.83, albeit with a recall of only 0.25. When trained solely on the 200 human-labeled samples, which contained very few positive cases, the model failed to detect any adverse events, so precision and recall could not be reliably measured (reported as N/A). These findings underscore both the potential of large-scale LLM-driven labeling and the need to improve recall and label quality before use in practical clinical scenarios.
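For context, the sketch below shows one way such a silver-plus-gold training setup could be wired up with the Hugging Face transformers and datasets libraries. It is not the authors' code: the checkpoint path, example records, column names, and hyperparameters are illustrative assumptions, and the abstract does not specify how the LLM-labeled and human-labeled records were mixed or weighted.

```python
# Minimal sketch: fine-tune a Japanese encoder on a mix of LLM-generated
# ("silver") labels and a small set of human ("gold") labels for binary
# classification of adverse event-related drug discontinuations.
# All identifiers and values below are illustrative assumptions.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder path; substitute the actual JMedRoBERTa checkpoint used in the study.
MODEL_ID = "path/to/jmedroberta"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

# Silver labels from an LLM prompt (~6,156 records in the paper) and gold labels
# from manual annotation (~200 records); placeholder examples shown here.
llm_labeled = [{"text": "副作用のため投薬を中止した。", "label": 1}]
human_labeled = [{"text": "処方内容に変更はない。", "label": 0}]

records = llm_labeled + human_labeled
train_ds = Dataset.from_dict(
    {"text": [r["text"] for r in records], "label": [r["label"] for r in records]}
)

def tokenize(batch):
    # Truncate long clinical notes to the encoder's maximum input length.
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = train_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="drug-change-clf",    # illustrative output path
    per_device_train_batch_size=16,  # illustrative hyperparameters
    num_train_epochs=3,
    learning_rate=2e-5,
)

# With a tokenizer supplied, Trainer pads batches dynamically by default.
trainer = Trainer(model=model, args=args, train_dataset=train_ds, tokenizer=tokenizer)
trainer.train()
```

A T5 model would instead be framed as text-to-text label generation rather than using a classification head, and in practice the small gold set might be upweighted or held out for validation; the abstract leaves these choices unspecified.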