Evaluation of AI language models in answering pregnancy-related questions assessed by obstetrics specialists.

Journal: Scientific Reports
Published Date:

Abstract

This study aimed to compare the performance of three large language models (ChatGPT-3.5, Gemini, and ChatGPT-4.0) in generating responses to ten frequently asked pregnancy-related questions, as evaluated by obstetrics and gynecology specialists. Seventy-five specialists independently rated 30 anonymized AI-generated responses (ten per model) using a 5-point Likert scale across four domains: accuracy, reliability, patient-friendliness, and comprehensibility. All questions were standardized and presented verbatim to each model using identical zero-shot prompts. Data were analyzed using the Kruskal-Wallis test with Bonferroni-adjusted Mann-Whitney U post-hoc comparisons. Inter-rater consistency was assessed using Cronbach's alpha. Spearman correlation was used to examine associations between clinical experience and evaluation patterns. ChatGPT-4.0 demonstrated the highest overall performance, particularly in accuracy (median 4.35; mean ± SD: 4.30 ± 0.48) and patient-friendliness (median 4.40; mean ± SD: 4.35 ± 0.47). Gemini performed comparably to ChatGPT-4.0 in comprehensibility (median 3.70; mean ± SD: 3.68 ± 0.54), while ChatGPT-3.5 consistently received the lowest scores. Significant differences were observed among the three models for accuracy, reliability, and patient-friendliness (all p < 0.001), but not for comprehensibility (p = 0.521). A modest positive correlation was found between clinical experience and reliability ratings (r = 0.261, p = 0.0238). Among the evaluated models, ChatGPT-4.0 provided the most clinically aligned and patient-friendly responses to common pregnancy questions. While AI tools may offer valuable support for patient education, expert oversight remains essential to ensure accuracy and safety. Further research should explore the real-world impact of these tools on patient comprehension, behavior, and clinical outcomes.
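
Two of the statistical steps named in the abstract, Cronbach's alpha for inter-rater consistency and the Bonferroni adjustment applied to the post-hoc comparisons, can be illustrated with a minimal sketch. This is not the authors' analysis code. The function names, the data layout (one row per rated response, one column per rater), and the example ratings are all assumptions made for illustration only.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a ratings matrix.

    Assumed layout (hypothetical, not taken from the study):
    one row per rated response, one column per rater, so that
    raters play the role of "items" in the classical formula.
    """
    n_rows = len(scores)
    k = len(scores[0])  # number of raters
    if n_rows < 2 or k < 2:
        raise ValueError("need at least 2 responses and 2 raters")
    # Sum of the sample variances of each rater's column of scores.
    rater_variance_sum = sum(variance(col) for col in zip(*scores))
    # Sample variance of the per-response total score.
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - rater_variance_sum / total_variance)


def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


if __name__ == "__main__":
    # Made-up Likert ratings (4 responses x 3 raters), for illustration only.
    example_ratings = [
        [4, 5, 4],
        [3, 3, 4],
        [5, 5, 5],
        [2, 3, 2],
    ]
    print("alpha:", round(cronbach_alpha(example_ratings), 3))

    # Made-up raw p-values for the three pairwise model comparisons.
    print("adjusted p:", bonferroni([0.01, 0.04, 0.5]))
```

With three models there are three pairwise comparisons, so each raw p-value from a Mann-Whitney U test would be multiplied by 3 under this adjustment, assuming all pairs were tested.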
