Comparative analysis of ChatGPT 3.5 and ChatGPT 4 obstetric and gynecological knowledge.

Journal: Scientific Reports
Published Date:

Abstract

Generative Pre-trained Transformer (GPT) is one of the most ubiquitous large language models (LLMs), employing artificial intelligence (AI) to generate human-like language. Although the use of ChatGPT has been evaluated in various medical specialties, sufficient evidence in the field of obstetrics and gynecology is still lacking. The aim of our study was to analyze the knowledge of the two latest generations of ChatGPT (ChatGPT-3.5 and ChatGPT-4) in the area of obstetrics and gynecology, and thereby to assess their potential applicability in clinical practice. We submitted 352 single-best-answer questions from the Polish Specialty Certificate Examinations in Obstetrics and Gynecology to ChatGPT-3.5 and ChatGPT-4, in both Polish and English. The models' accuracy was evaluated, and performance was analyzed with respect to question difficulty and language. Statistical analyses were conducted using the Mann-Whitney U test and the chi-square test. The results indicate that both LLMs demonstrate satisfactory knowledge in the analyzed specialty. Nonetheless, ChatGPT-4 significantly outperformed its predecessor in answer accuracy. For both models, answer correctness was associated with the question difficulty index. In addition, based on our analysis, ChatGPT should be queried in English for optimal performance.
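The abstract's methods can be illustrated with a minimal sketch of the chi-square comparison it describes. This is not the authors' code, and the correct-answer counts below are assumed placeholders, not the study's actual results; only the 352-question total comes from the abstract.

```python
# Illustrative sketch: chi-square test of independence comparing the answer
# accuracy of two models on the same question set. Counts are hypothetical.

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table [[a, b], [c, d]]
    (closed form, no Yates continuity correction)."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

TOTAL = 352                       # questions per model (from the abstract)
correct_35, correct_4 = 230, 290  # hypothetical correct-answer counts

stat = chi_square_2x2(correct_35, TOTAL - correct_35,
                      correct_4, TOTAL - correct_4)

# Critical value for df = 1 at alpha = 0.05 is 3.841.
print(f"chi2 = {stat:.2f}, significant = {stat > 3.841}")
```

In practice such a comparison would be run with a statistics package (e.g. `scipy.stats.chi2_contingency`), which also returns a p-value and supports continuity correction for 2x2 tables.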

Authors

  • Franciszek Ługowski
    1st Department of Obstetrics and Gynecology, Medical University of Warsaw, Warsaw, Poland. franciszeklugowski@gmail.com.
  • Julia Babińska
    1st Department of Obstetrics and Gynecology, Medical University of Warsaw, Warsaw, Poland.
  • Artur Ludwin
    1st Department of Obstetrics and Gynecology, Medical University of Warsaw, Warsaw, Poland.
  • Paweł Jan Stanirowski
    1st Department of Obstetrics and Gynecology, Medical University of Warsaw, Warsaw, Poland.