Claude, ChatGPT, Copilot, and Gemini performance versus students in different topics of neuroscience.
Journal:
Advances in Physiology Education
PMID:
39824512
Abstract
Despite extensive studies on large language models and their ability to answer questions from various licensing exams, there has been limited focus on employing chatbots for specific subjects within the medical curriculum, specifically medical neuroscience. This research compared the performance of Claude 3.5 Sonnet (Anthropic), GPT-3.5 and GPT-4-1106 (OpenAI), the free version of Copilot (Microsoft), and Gemini 1.5 Flash (Google) against students on multiple-choice questions (MCQs) from a medical neuroscience course database to evaluate chatbot reliability. Five successive attempts by each chatbot to answer 200 United States Medical Licensing Examination (USMLE)-style questions were evaluated for accuracy, relevance, and comprehensiveness. The MCQs were grouped into 12 categories/topics. The results indicated that, at their current level of development, the selected AI-driven chatbots could, on average, accurately answer 67.2% of the MCQs from the medical neuroscience course, 7.4% below the students' average. However, Claude and GPT-4 outperformed the other chatbots, with 83% and 81.7% correct answers, respectively, both above the average student result. They were followed by Copilot (59.5%), GPT-3.5 (58.3%), and Gemini (53.6%). Across categories, Neurocytology, Embryology, and Diencephalon were the three strongest topics, with average results of 78.1-86.7%, while the weakest results were for Brain stem, Special senses, and Cerebellum, with 54.4-57.7% correct answers. Our study suggests that Claude and GPT-4 are currently two of the most evolved chatbots, exhibiting proficiency in answering neuroscience MCQs that surpasses that of the average medical student. This marks a significant milestone in how AI can supplement and enhance educational tools and techniques. This research evaluates the effectiveness of different AI-driven large language models (Claude, ChatGPT, Copilot, and Gemini) compared with medical students in answering neuroscience questions. The study offers insight into the specific areas of neuroscience in which these chatbots may excel or fall short, providing a comprehensive analysis of their current capability to process and interact with certain topics of the basic medical sciences curriculum.
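As a rough illustration of how the per-chatbot and per-topic accuracies reported above could be tabulated (five attempts per chatbot on 200 USMLE-style MCQs across 12 topics), the Python sketch below averages graded responses by model and by topic. The data layout, field names, and example rows are illustrative assumptions, not the authors' actual analysis pipeline.

    # Minimal sketch, assuming one graded record per chatbot answer:
    # (chatbot, attempt_number, topic, is_correct). Example rows are hypothetical.
    from collections import defaultdict

    responses = [
        ("Claude 3.5 Sonnet", 1, "Neurocytology", True),
        ("Gemini 1.5 Flash", 1, "Brain stem", False),
        # ... in the study: 5 attempts x 200 MCQs per chatbot, 12 topics
    ]

    def accuracy_by(rows, group):
        """Percent of correct answers grouped by 'chatbot' or 'topic'."""
        idx = {"chatbot": 0, "topic": 2}[group]
        correct, total = defaultdict(int), defaultdict(int)
        for row in rows:
            key = row[idx]
            total[key] += 1
            correct[key] += int(row[3])
        return {k: round(100 * correct[k] / total[k], 1) for k in total}

    print(accuracy_by(responses, "chatbot"))  # e.g. {'Claude 3.5 Sonnet': 100.0, ...}
    print(accuracy_by(responses, "topic"))    # e.g. {'Neurocytology': 100.0, ...}

Applied to the full set of graded responses, this kind of grouping yields the per-model percentages (e.g., 83% for Claude) and per-topic ranges (e.g., 78.1-86.7% for the strongest topics) of the sort quoted in the abstract.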