Accuracy and quality of ChatGPT-4o and Google Gemini performance on image-based neurosurgery board questions.

Journal: Neurosurgical Review
PMID:

Abstract

Large language models (LLMs) have shown the capability to effectively answer medical board examination questions; however, their ability to answer image-based questions has not been examined. This study evaluated the performance of two LLMs (GPT-4o and Google Gemini) on an image-based question bank designed for neurosurgery board examination preparation. The accuracy of the LLMs was tested using 379 image-based questions from The Comprehensive Neurosurgery Board Preparation Book: Illustrated Questions and Answers and Neurosurgery Practice Questions and Answers. Each LLM was asked to answer every question independently and to provide an explanation for its chosen answer. The problem-solving order of each question and the quality of the LLM responses were evaluated by senior neurological surgery residents who have passed the American Board of Neurological Surgery (ABNS) primary examination: first-order questions assess anatomy, second-order questions require diagnostic reasoning, and third-order questions test deeper clinical knowledge by inferring diagnoses and related facts, evaluating the model's ability to recall and apply medical concepts. Chi-squared tests and independent-samples t-tests were conducted to measure performance differences between the LLMs. On the image-based question bank, GPT-4o and Gemini achieved accuracies of 51.45% (95% CI: 46.43-56.44%) and 39.58% (95% CI: 34.78-44.58%), respectively. GPT-4o significantly outperformed Gemini overall (P = 0.0013), particularly in pathology/histology (P = 0.036) and radiology (P = 0.014). GPT-4o also performed better on second-order questions (56.52% vs. 41.85%, P = 0.0067) and received a higher average response quality rating (2.77 vs. 2.31, P = 0.000002). In summary, on this 379-question image-based bank, GPT-4o not only achieved higher accuracy (51.45%) but also provided higher-quality responses than Gemini.
In comparison to previous studies of LLM performance on board-style questions, performance on image-based questions was lower, indicating that LLMs may still struggle with machine vision and medical image interpretation tasks.
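The abstract's confidence intervals and overall P value can be reproduced from the reported percentages. A minimal stdlib-only Python sketch, assuming correct-answer counts of 195/379 (GPT-4o) and 150/379 (Gemini) inferred from the stated 51.45% and 39.58%, a Wilson score interval for the CIs, and a Yates-corrected two-proportion z-test (equivalent to the 2x2 chi-squared test the authors report) for the overall comparison:

```python
# Reproducing the abstract's statistics from the stated counts.
# Assumption: 195/379 correct for GPT-4o and 150/379 for Gemini,
# inferred from the reported 51.45% and 39.58% accuracies.
from math import sqrt, erfc

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def two_prop_p(k1, k2, n, yates=True):
    """Two-sided P for a two-proportion z-test with equal group sizes,
    with Yates continuity correction (equivalent to a 2x2 chi-squared test)."""
    p1, p2, pooled = k1 / n, k2 / n, (k1 + k2) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    diff = abs(p1 - p2)
    if yates:
        diff = max(diff - 1 / n, 0.0)
    return erfc(diff / se / sqrt(2))

print(wilson_ci(195, 379))        # GPT-4o: ~ (0.4643, 0.5644)
print(wilson_ci(150, 379))        # Gemini: ~ (0.3478, 0.4458)
print(two_prop_p(195, 150, 379))  # ~ 0.0013
```

Under these assumptions the Wilson intervals match the reported 46.43-56.44% and 34.78-44.58% bounds, and the continuity-corrected test matches the reported P = 0.0013, suggesting this is the CI method and test configuration the authors used.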

Authors

  • Suyash Sau
    University of Rochester School of Medicine and Dentistry, Rochester, NY, USA. suyash_sau@urmc.rochester.edu.
  • Derek D George
    Department of Neurosurgery, University of Rochester Medical Center, Rochester, NY, USA.
  • Rohin Singh
    Department of Neurosurgery, University of Rochester Medical Center, Rochester, NY, USA.
  • Gurkirat S Kohli
    Department of Neurosurgery, University of Rochester Medical Center, Rochester, NY, USA.
  • Adam Li
    Department of Neurosurgery, University of Rochester Medical Center, Rochester, NY, USA.
  • Muhammad I Jalal
    University of Rochester School of Medicine and Dentistry, Rochester, NY, USA.
  • Aman Singh
    University of Rochester School of Medicine and Dentistry, Rochester, NY, USA.
  • Taylor J Furst
    Department of Neurosurgery, University of Rochester Medical Center, Rochester, NY, USA.
  • Redi Rahmani
    Department of Neurosurgery, University of Rochester Medical Center, Rochester, NY, USA.
  • G Edward Vates
    Department of Neurosurgery, University of Rochester Medical Center, Rochester, NY, USA.
  • Jonathan Stone
    Department of Neurosurgery, University of Rochester Medical Center, Rochester, NY, USA.