Leveraging Large Language Models for Synthetic Data Generation to Enhance Adverse Drug Event Detection in Tweets.

Journal: Studies in health technology and informatics
Published Date:

Abstract

Adverse drug event (ADE) detection in social media texts poses significant challenges due to the informal nature of the text and the limited availability of annotations. The scarcity of ADE named entity recognition (NER) datasets for social media hinders the development of robust ADE detection models for this type of text. In this paper, we leverage the generative capabilities of large language models (LLMs) to create synthetic data that addresses this gap. Specifically, we generated 17,000 tweets with ADE annotations and pre-trained NER models on this synthetic data. Our evaluation on an out-of-sample collection of 915 manually annotated tweets shows that these models outperform state-of-the-art lexicon-based and massively pre-trained open NER models. We also show that fine-tuning our synthetically pre-trained models on human-annotated data surpasses the current state of the art in ADE detection on tweets. These findings suggest that synthetic data generated by LLMs can enhance ADE detection performance, offering a promising response to the scarcity of annotated ADE datasets. The synthetic dataset is available at https://huggingface.co/datasets/anthonyyazdaniml/synthetic-ner-ade-tweets-v1.
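
To make the notion of "tweets with ADE annotations" concrete, the following is a minimal sketch of how character-level ADE spans could be turned into token-level BIO tags, a common input format for NER training. It is an illustration under stated assumptions, not the authors' preprocessing: the abstract does not describe the dataset schema or the tagging scheme, so the function name `spans_to_bio`, the whitespace tokenization, the `(start, end)` span format, and the example tweet are all hypothetical.

```python
import re


def spans_to_bio(text, spans, label="ADE"):
    """Convert character spans into BIO tags over whitespace tokens.

    Assumptions (not taken from the paper): `spans` is a list of
    (start, end) character offsets with `end` exclusive, spans do not
    overlap each other, and tokens are split on whitespace only.
    """
    tokens, tags = [], []
    previous_span = None
    for match in re.finditer(r"\S+", text):
        # Find the first annotated span that overlaps this token, if any.
        current_span = None
        for span in spans:
            if match.start() < span[1] and match.end() > span[0]:
                current_span = span
                break

        if current_span is None:
            tag = "O"
        elif current_span == previous_span:
            tag = "I-" + label  # continuation of the same entity
        else:
            tag = "B-" + label  # first token of a new entity

        tokens.append(match.group())
        tags.append(tag)
        previous_span = current_span
    return tokens, tags


# Hypothetical example tweet, invented for illustration only.
tweet = "this med gave me a bad headache all day"
start = tweet.index("bad headache")
tokens, tags = spans_to_bio(tweet, [(start, start + len("bad headache"))])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

In this sketch, a token is tagged as part of an entity whenever it overlaps an annotated span, which keeps partially covered tokens (for example, a span ending before trailing punctuation) inside the entity. A subword tokenizer would need an additional alignment step that is not shown here.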

Authors

  • Anthony Yazdani
    Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland.
  • Hossein Rouhizadeh
    Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland. hossein.rouhizadeh@unige.ch.
  • Alban Bornet
    Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland.
  • Douglas Teodoro
    Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland.