Reasoning-optimised large language models reach near-expert accuracy on board-style orthopaedic exams: A multi-model comparison on 702 multiple-choice questions.

Journal: Knee Surgery, Sports Traumatology, Arthroscopy: Official Journal of the ESSKA

Published Date: Dec 17, 2025

Abstract

PURPOSE: To compare the accuracy, calibration, reproducibility and operating cost of seven large language models (LLMs), including four newer models capable of advanced reasoning over complex medical information, on text-only orthopaedic multiple-choice questions (MCQs), and to quantify gains over GPT-4.
METHODS: A total of 702 unique, non-image MCQs were extracted from Orthobullets, drawn from the AAOS Self-Assessment Examinations, Self-Assessment-Based Questions and Orthopaedic In-Training Examination-Based Questions banks. Each question was submitted to four reasoning-optimised LLMs: OpenAI o3, Anthropic Claude Sonnet 4, Claude Opus 4 (with and without 'Extended Thinking') and Google Gemini 2.5 Pro. OpenAI's GPT-4 and GPT-4o and the open-weight Gemma 3 27B served as comparators. The primary outcome was overall accuracy. The secondary outcomes were topic- and difficulty-stratified accuracy, calibration (expected calibration error [ECE] and Brier score), reproducibility (flip rate on a retested subset of questions), latency, token use and cost. Statistical tests included paired McNemar, Cochran Q, ordinal logistic regression and Fleiss κ (Bonferroni-adjusted α = 0.05).
RESULTS: GPT-4 achieved 69.7% accuracy (95% CI = 66.2-72.9). All four reasoning-optimised models scored ≥14 percentage points higher (p < 3.3 × 10⁻¹⁵); OpenAI o3 led with 93.6% (95% CI = 91.5-95.2), a 34% relative gain in accuracy over GPT-4. Accuracy tended to decline with question difficulty, yet the reasoning advantage persisted in every difficulty stratum. Claude Opus 4 showed the best calibration (ECE = 0.023), while GPT-4 exhibited overconfidence (ECE = 0.215). All models except Gemma 3 27B exhibited non-zero flip rates. Median query time ranged from 0.9 s (Gemma 3 27B) to 15.9 s (Gemini 2.5 Pro), and cost ranged from 0 to 29.9 USD per 1000 queries.
CONCLUSIONS: Reasoning-optimised LLMs now answer text-based orthopaedic exam questions with high accuracy and substantially better confidence calibration than earlier models. However, persistent stochasticity and large latency-cost disparities may limit clinical deployment.
LEVEL OF EVIDENCE: N/A.
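
The accuracy intervals and calibration metrics named in the abstract can be illustrated with a minimal Python sketch. Several things here are assumptions rather than details from the study: the abstract does not say which confidence-interval method was used (a Wilson score interval is shown because it is a common choice for proportions), the count of 489 correct answers is back-calculated from 69.7% of 702, ECE is computed with ten equal-width confidence bins, and the toy confidence values are invented purely for illustration.

```python
import math
from typing import List, Sequence, Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed CI method)."""
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, outcomes)) / len(outcomes)


def expected_calibration_error(
    confidences: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Size-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((c, y))
    total = len(outcomes)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # 489/702 is an inferred count consistent with the reported 69.7% for GPT-4.
    low, high = wilson_interval(489, 702)
    print(f"Wilson 95% CI: {low:.3f} to {high:.3f}")

    # Invented toy data: stated confidence per answer and whether it was correct.
    toy_conf = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
    toy_correct = [1, 1, 1, 0, 1, 0]
    print(f"Brier score: {brier_score(toy_conf, toy_correct):.3f}")
    print(f"ECE: {expected_calibration_error(toy_conf, toy_correct):.3f}")
```

Under these assumptions the Wilson interval for 489 of 702 comes out close to the reported 66.2-72.9, which suggests, but does not confirm, that a similar method was used. Lower ECE and Brier values indicate that stated confidence tracks actual correctness more closely.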

Authors

Pedro Diniz

Department of Bioengineering and iBB - Institute for Bioengineering and Biosciences, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal.
Takuji Yokoe

Orthopaedic Department, Centro Hospitalar Póvoa de Varzim, Vila do Conde, Portugal.
Felix C Öttl

Balgrist Universitätsklinik, Zürich, Switzerland.
Hélder Pereira

Orthopaedic Department, Centro Hospitalar Póvoa de Varzim, Vila do Conde, Portugal.
Rui Henriques

INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal.
Kristian Samuelsson

Department of Orthopaedics, Institute of Clinical Sciences, The Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden.

External Resources

PubMed ID (PMID): 41404998
