Comparing ChatGPT-4 and specialist recommendations in urogynecology case management: a blinded assessment.
Journal:
European journal of obstetrics, gynecology, and reproductive biology
Published Date:
Mar 28, 2026
Abstract
BACKGROUND: Large language models (LLMs) such as ChatGPT-4 are increasingly used in clinical medicine, yet their performance in subspecialties that require personalized decision-making, including urogynecology, remains insufficiently understood. OBJECTIVE: To compare the clinical acceptability of management recommendations generated by ChatGPT-4 with those of board-certified urogynecologists using standardized clinical case scenarios. STUDY DESIGN: Twelve outpatient urogynecology vignettes were developed by senior subspecialists. Three fellowship-trained urogynecologists and ChatGPT-4 independently generated management recommendations. Twenty senior urogynecology evaluators, blinded to authorship, assessed each response using a 10-point scale. The primary analysis compared median scores for human versus AI-generated responses. Secondary analyses evaluated score distributions, geographic differences, and response patterns in guideline-based versus clinical reasoning scenarios. RESULTS: A total of 960 evaluations were completed. Median scores did not differ between human (5.8; 95% CI, 4.4-9.0) and AI-generated responses (6.1; 95% CI, 4.0-9.4; p = 0.39). Evaluator-level analysis similarly showed no difference. GPT-4 outperformed human responders in all six guideline-based cases (average advantage + 99 points), while human specialists generally outperformed GPT-4 in clinical reasoning cases. Score distributions differed significantly: human responses showed a near-normal distribution, whereas GPT-4 responses displayed a U-shaped pattern (D = 0.262, p < 0.001). International evaluators rated GPT-4 higher than Israeli evaluators (p = 0.031). CONCLUSION: ChatGPT-4 generated management recommendations that were rated similarly to those of board-certified urogynecologists. However, AI responses demonstrated greater variability and a more polarized distribution of ratings.
These findings suggest that large language models may function as adjunct decision-support tools in structured scenarios, while complex clinical reasoning still requires specialist oversight.
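The distribution comparison reported above (D = 0.262) reads like a two-sample Kolmogorov-Smirnov statistic, although the abstract does not name the test. The sketch below is an illustration under that assumption: it computes D as the largest gap between two empirical cumulative distributions. The function name `ks_statistic` and both rating lists are hypothetical and are not data from the study. It also checks the evaluation count implied by the design (12 vignettes, 4 responses each, 20 evaluators).

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Return the two-sample KS statistic D: the maximum absolute
    difference between the empirical CDFs of the two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The empirical CDFs only change at observed values, so it is
    # enough to compare them at every distinct observation.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)  # share of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical ratings on a 10-point scale (NOT the study's data).
clustered = [4, 5, 5, 5, 6, 6, 6, 7, 7, 8]     # ratings bunched mid-scale
polarized = [1, 1, 1, 2, 2, 9, 9, 10, 10, 10]  # ratings at both extremes

d_value = ks_statistic(clustered, polarized)

# Design arithmetic from the abstract: 12 vignettes x (3 specialists
# + 1 ChatGPT-4 response) x 20 blinded evaluators.
expected_evaluations = 12 * (3 + 1) * 20
```

A larger D means the two rating distributions diverge more at some point on the scale. Two samples can share a similar median and still produce a large D, which is the pattern the abstract describes. A p-value would additionally require the sampling distribution of D, which this sketch does not compute.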