Performance of leading large language models in May 2025 in Membership of the Royal College of General Practitioners-style examination questions: a cross-sectional analysis
Journal:
arXiv
Published Date:
Jun 3, 2025
Abstract
Background: Large language models (LLMs) have demonstrated substantial
potential to support clinical practice. Other than GPT-4 and its
predecessors, few LLMs, especially those of the leading and more powerful
reasoning model class, have been tested on medical specialty examination
questions, including in the domain of primary care. This paper aimed to test
the capabilities of leading LLMs as of May 2025 (o3, Claude Opus 4, Grok3, and
Gemini 2.5 Pro) in primary care education, specifically in answering Membership
of the Royal College of General Practitioners (MRCGP) style examination questions.
Methods: o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro were tasked with answering
100 randomly chosen multiple-choice questions from the Royal College of General
Practitioners GP SelfTest on 25 May 2025. Questions included textual
information, laboratory results, and clinical images. Each model was prompted
to answer as a GP in the UK and was provided with full question information.
Each question was attempted once by each model. Responses were scored against
correct answers provided by GP SelfTest.
Results: The total scores of o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro were
99.0%, 95.0%, 95.0%, and 95.0%, respectively. The average peer score for the
same questions was 73.0%.
Discussion: All models performed remarkably well, and all substantially
exceeded the average performance of GPs and GP registrars who had answered the
same questions. o3 demonstrated the best performance, while the performances of
the other leading models were comparable with each other and were not
substantially lower than that of o3. These findings strengthen the case for
using LLMs, particularly reasoning models and especially those that have been
specifically trained on primary care clinical data, to support the delivery of
primary care.
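
As an illustration of the scoring step described in Methods (one attempt per question, marked against an answer key), the following is a minimal sketch in Python. The function name `percent_correct`, the question identifiers, and the toy answers are all hypothetical and are not taken from the study or from GP SelfTest; the abstract does not describe the authors' actual tooling.

```python
from typing import Mapping


def percent_correct(responses: Mapping[str, str], answer_key: Mapping[str, str]) -> float:
    """Return the percentage of questions in answer_key that were answered correctly.

    A missing response counts as incorrect. Comparison ignores case and
    surrounding whitespace (an assumption made for this sketch).
    """
    if not answer_key:
        raise ValueError("answer_key must not be empty")
    correct = sum(
        1
        for question_id, key in answer_key.items()
        if responses.get(question_id, "").strip().upper() == key.strip().upper()
    )
    return 100.0 * correct / len(answer_key)


# Hypothetical toy data for illustration only (not study data).
answer_key = {"q1": "A", "q2": "C", "q3": "B", "q4": "E"}
responses = {"q1": "A", "q2": "c", "q3": "D", "q4": "E"}

score = percent_correct(responses, answer_key)
print(f"{score:.1f}%")  # 3 of 4 correct -> 75.0%
```

With 100 questions and a single attempt each, as in the study design, a percentage score computed this way equals the number of questions answered correctly.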