Can Generative LLMs Help Classify Imbalanced Real-World Data? Exploring Rare Diseases on Social Media.
Journal: Studies in Health Technology and Informatics
Published Date: Aug 7, 2025
Abstract
Developmental and Epileptic Encephalopathies (DEEs) are rare, severe conditions often discussed by families on social media, offering valuable insights into their experiences. Identifying these messages amidst unrelated content is crucial but challenging due to data imbalance. This study evaluates different uses of generative large language models (LLMs) for binary classification of DEE-related experiences within social media posts. Using CamemBERT as a baseline, we compared two strategies: zero-shot prompt-based classification and synthetic data generation for minority class augmentation. While zero-shot prompting underperformed, the addition of 2% synthetic data improved all metrics (macro/positive F1, precision and recall). Higher proportions of synthetic data led to decreased precision. These findings underscore the potential of hybrid approaches combining fine-tuning and domain-specific synthetic data for addressing data imbalance in rare disease contexts. Further validation across models and datasets is needed.
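To illustrate why the abstract reports macro and positive-class F1 rather than accuracy, here is a minimal sketch in plain Python. The toy labels, the 95/5 class split, and the helper names `prf` and `macro_f1` are assumptions made for illustration; they are not the study's data or code.

```python
from typing import Sequence, Tuple


def prf(y_true: Sequence[int], y_pred: Sequence[int], label: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), treating `label` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], labels: Sequence[int] = (0, 1)) -> float:
    """Unweighted mean of per-class F1, so the rare class counts as much as the common one."""
    return sum(prf(y_true, y_pred, c)[2] for c in labels) / len(labels)


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical imbalanced test set: 95 unrelated posts (0), 5 relevant posts (1).
y_true = [0] * 95 + [1] * 5

# Classifier A ignores the minority class entirely.
always_negative = [0] * 100

# Classifier B raises 2 false alarms and finds 4 of the 5 relevant posts.
finds_most = [0] * 93 + [1] * 2 + [1] * 4 + [0] * 1

for name, y_pred in [("always_negative", always_negative), ("finds_most", finds_most)]:
    p, r, f = prf(y_true, y_pred, label=1)
    print(
        f"{name}: accuracy={accuracy(y_true, y_pred):.2f} "
        f"positive P/R/F1={p:.2f}/{r:.2f}/{f:.2f} "
        f"macro F1={macro_f1(y_true, y_pred):.2f}"
    )
```

On this toy set the two classifiers differ only slightly in accuracy, while positive-class F1 and macro F1 separate them clearly. That gap is why those metrics are the more informative ones under heavy class imbalance.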