Reasoning-optimised large language models reach near-expert accuracy on board-style orthopaedic exams: A multi-model comparison on 702 multiple-choice questions.
Journal:
Knee surgery, sports traumatology, arthroscopy : official journal of the ESSKA
Published Date:
Dec 17, 2025
Abstract
PURPOSE: The purpose of this study was to compare the accuracy, calibration, reproducibility and operating cost of seven large language models (LLMs), including four newer models capable of advanced reasoning, on text-only orthopaedic multiple-choice questions (MCQs) and to quantify gains over GPT-4.
METHODS: From Orthobullets, 702 unique, non-image MCQs (drawn from the AAOS Self-Assessment Examinations, Self-Assessment-Based Questions and Orthopaedic In-Training Examination-Based Questions banks) were extracted. Each question was submitted to the following LLMs: OpenAI o3, Anthropic Claude Sonnet 4, Claude Opus 4 (with/without 'Extended Thinking') and Google Gemini 2.5 Pro. Additionally, OpenAI's GPT-4, GPT-4o and the open-weight Gemma 3 27B served as comparators. The primary outcome was overall accuracy. The secondary outcomes were topic- and difficulty-stratified accuracy, calibration (expected calibration error [ECE] and Brier score), reproducibility (flip rate on a retest question subset), latency, token use and cost. Statistical tests included paired McNemar, Cochran Q, ordinal logistic regression and Fleiss κ (Bonferroni-adjusted α = 0.05).
RESULTS: GPT-4 achieved 69.7% accuracy (95% CI = 66.2-72.9). All four reasoning-optimised models scored ≥14 percentage points higher (p < 3.3 × 10⁻¹⁵); OpenAI o3 led with 93.6% (95% CI = 91.5-95.2), which represents a relative error reduction of about 79% compared with GPT-4 (error rate 6.4% versus 30.3%). Accuracy tended to decline with question difficulty, yet the reasoning advantage persisted in every difficulty stratum. Claude Opus 4 showed the best calibration (ECE = 0.023), while GPT-4 exhibited overconfidence (ECE = 0.215). All models except Gemma 3 27B exhibited non-zero flip rates. Median query time ranged from 0.9 s (Gemma 3 27B) to 15.9 s (Gemini 2.5 Pro). Cost ranged from 0 to 29.9 USD per 1000 queries.
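The headline figures and the calibration metrics named above can be illustrated with a short Python sketch. It is not the authors' analysis code. The correct-answer counts (489 and 657 of 702) are back-calculated from the reported percentages and are an assumption. The abstract does not name the interval method; a Wilson score interval is used here because, under those assumed counts, it appears to reproduce the reported bounds. The ECE binning scheme (ten equal-width bins), the Brier score taken on the confidence in the chosen answer, and the four-item toy data are likewise hypothetical choices for illustration.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline error rate removed by the new model."""
    baseline_err = 1 - baseline_acc
    new_err = 1 - new_acc
    return (baseline_err - new_err) / baseline_err


def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence and correctness (0 or 1)."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)


def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Weighted mean |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in last bin
        bins[idx].append((c, o))
    total = len(outcomes)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(o for _, o in items) / len(items)
        ece += len(items) / total * abs(accuracy - mean_conf)
    return ece


# Assumed counts, inferred from 69.7% and 93.6% of 702 questions.
gpt4_ci = wilson_interval(489, 702)
o3_ci = wilson_interval(657, 702)
rer = relative_error_reduction(0.697, 0.936)

# Hypothetical toy data: stated confidence and whether the answer was correct.
toy_conf = [0.95, 0.95, 0.65, 0.65]
toy_correct = [1, 1, 1, 0]
toy_ece = expected_calibration_error(toy_conf, toy_correct)
toy_brier = brier_score(toy_conf, toy_correct)

print(f"GPT-4 95% CI: {gpt4_ci[0]:.3f} to {gpt4_ci[1]:.3f}")
print(f"o3 95% CI: {o3_ci[0]:.3f} to {o3_ci[1]:.3f}")
print(f"Relative error reduction, o3 vs GPT-4: {rer:.3f}")
print(f"Toy ECE: {toy_ece:.3f}, toy Brier: {toy_brier:.4f}")
```

An ECE near zero means stated confidence tracks observed accuracy; a large ECE combined with confidence above accuracy is the overconfidence pattern described for GPT-4.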
CONCLUSIONS: Reasoning-optimised LLMs now answer text-based orthopaedic exam questions with high accuracy and substantially better confidence calibration than earlier models. However, persistent stochasticity and large latency-cost disparities may limit clinical deployment. LEVEL OF EVIDENCE: N/A.