MenstLLaMA: A Specialized Large Language Model for Menstrual Health Education in India
Journal:
medRxiv
Published Date:
Jan 1, 2025
Abstract
The quality and accessibility of menstrual health education in developing nations, including India, remain inadequate due to challenges such as poverty, social stigma, and gender inequality. While community-driven initiatives aim to raise awareness, artificial intelligence (AI) offers a scalable solution for disseminating accurate information. However, existing general-purpose large language models (LLMs) are ill-suited for this task, suffering from low accuracy, cultural insensitivity, and overly complex responses. To address these limitations, we developed MenstLLaMA, a specialized LLM tailored to the Indian context and designed to deliver accurate, culturally sensitive menstrual health education in an empathetic, supportive, and accessible manner, and we assessed its effectiveness against existing general-purpose models. We curated MENST, a novel domain-specific dataset comprising 23,820 question-answer pairs aggregated from medical websites, government portals, and health education resources. This dataset was systematically annotated with metadata capturing age groups, regions, topics, and socio-cultural contexts. MenstLLaMA was developed by fine-tuning Meta-LLaMA-3-8B-Instruct using Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA) to achieve domain alignment with reduced computational overhead. We benchmarked MenstLLaMA against nine state-of-the-art general-purpose LLMs, including GPT-4o, Claude-3, Gemini 1.5 Pro, and Mistral.
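To illustrate why LoRA reduces computational overhead, the following minimal sketch shows the low-rank update in plain Python. It is not the authors' training code: the rank of 16, the scaling factor, and the helper names (`lora_trainable_params`, `lora_forward`) are assumptions chosen for illustration, and the 4096 x 4096 shape corresponds to a single attention projection in an 8B LLaMA-3 model.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def full_params(d_out: int, d_in: int) -> int:
    """Weights updated when a d_out x d_in matrix is fine-tuned directly."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Weights updated by LoRA: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def matvec(m: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector,
                 alpha: float, rank: int) -> Vector:
    """Compute h = W x + (alpha / rank) * B (A x).

    W stays frozen; only A and B would be trained.
    """
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [h + scale * u for h, u in zip(base, update)]


if __name__ == "__main__":
    # Hypothetical rank; the abstract does not report the value used.
    d, r = 4096, 16
    print(full_params(d, d))               # 16777216
    print(lora_trainable_params(d, d, r))  # 131072
```

Under these assumed settings, the adapter trains 131,072 weights for that matrix instead of 16,777,216, which is under 1% of the original count.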
The evaluation followed a multi-layered framework: (1) automatic evaluation using BLEU, METEOR, ROUGE-L, and BERTScore metrics; (2) expert evaluation by clinical experts (N=18), who rated responses to 200 expert-curated queries; (3) medical practitioner interaction with an interactive chatbot (ISHA) for qualitative assessment across Relevance, Understandability, Preciseness, Correctness, and Context Sensitivity; and (4) a user study in which volunteer participants (N=200) evaluated MenstLLaMA in randomized 15–20 minute sessions, rating their satisfaction across seven qualitative metrics. MenstLLaMA achieved the highest BLEU (0.059) and BERTScore (0.911), outperforming GPT-4o (BLEU: 0.052, BERTScore: 0.896) and Claude-3 (BERTScore: 0.888). Clinical experts preferred MenstLLaMA’s responses over gold-standard answers in several culturally sensitive cases. In the evaluation by medical practitioners, ISHA, the chat interface of MenstLLaMA, scored 3.5/5 in Relevance, 3.6/5 in Understandability, 3.1/5 in Preciseness, 3.5/5 in Correctness, and 4.0/5 in Context Sensitivity. User evaluations indicated strong ratings for Understandability (4.7/5), Relevance (4.3/5), Preciseness (4.28/5), Correctness (4.1/5), Tone (4.6/5), Flow (4.2/5), and Context Sensitivity (3.9/5). MenstLLaMA demonstrates exceptional accuracy, empathy, and user satisfaction in menstrual health education, bridging critical gaps left by general-purpose LLMs. Its potential for integration into broader health education platforms positions it as a transformative tool for menstrual well-being. Future research may explore its long-term impact on public perception and menstrual hygiene practices, as well as expanding demographic representation, enhancing context sensitivity, and integrating multi-modal and voice-based interactions for broader accessibility.
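As a concrete illustration of the automatic evaluation layer, the sketch below computes a simplified ROUGE-L score from the longest common subsequence (LCS) of tokens. It is not the evaluation pipeline used in the study: whitespace tokenization, lowercasing, the balanced F1 form, and the example sentences are all assumptions made for illustration, and published ROUGE-L implementations may tokenize and weight recall differently.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up sentences for illustration; not taken from the MENST dataset.
    reference = "a typical menstrual cycle lasts about 28 days"
    candidate = "a menstrual cycle typically lasts around 28 days"
    print(rouge_l_f1(reference, candidate))
```

For this pair, six of the eight tokens on each side form a common subsequence, so precision, recall, and F1 all equal 0.75. Overlap metrics of this kind reward shared wording rather than clinical correctness, which is consistent with the framework pairing them with expert and user ratings.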