Reasoning-optimised large language models reach near-expert accuracy on board-style orthopaedic exams: A multi-model comparison on 702 multiple-choice questions.

Journal: Knee surgery, sports traumatology, arthroscopy : official journal of the ESSKA
Published Date:

Abstract

PURPOSE: The purpose of this study was to compare the accuracy, calibration, reproducibility and operating cost of seven large language models (LLMs), including four newer models that use advanced reasoning techniques to analyse complex medical information, on text-only orthopaedic multiple-choice questions (MCQs) and to quantify gains over GPT-4.
METHODS: A total of 702 unique, non-image MCQs were extracted from Orthobullets (drawn from the AAOS Self-Assessment Examinations, Self-Assessment-Based Questions and Orthopaedic In-Training Examination-Based Questions banks). Each question was submitted to the following LLMs: OpenAI o3, Anthropic Claude Sonnet 4, Claude Opus 4 (with and without 'Extended Thinking') and Google Gemini 2.5 Pro. OpenAI's GPT-4 and GPT-4o and the open-weight Gemma 3 27B served as comparators. The primary outcome was overall accuracy. The secondary outcomes were topic- and difficulty-stratified accuracy, calibration (expected calibration error [ECE] and Brier score), reproducibility (flip rate on a retest question subset), latency, token use and cost. Statistical tests included paired McNemar tests, Cochran's Q, ordinal logistic regression and Fleiss' κ (Bonferroni-adjusted α = 0.05).
RESULTS: GPT-4 achieved 69.7% accuracy (95% CI = 66.2-72.9). All four reasoning-optimised models scored ≥14 percentage points higher (p < 3.3 × 10⁻¹⁵); OpenAI o3 led with 93.6% (95% CI = 91.5-95.2), which represents a 79% relative error reduction versus GPT-4 (error rate 6.4% vs. 30.3%). Accuracy tended to decline with question difficulty, yet the reasoning advantage persisted in every difficulty stratum. Claude Opus 4 showed the best calibration (ECE = 0.023), while GPT-4 exhibited overconfidence (ECE = 0.215). All models except Gemma 3 27B exhibited non-zero flip rates. Median query time ranged from 0.9 s (Gemma 3 27B) to 15.9 s (Gemini 2.5 Pro), and cost ranged from 0 to 29.9 USD per 1000 queries.
CONCLUSIONS: Reasoning-optimised LLMs now answer text-based orthopaedic exam questions with high accuracy and substantially better confidence calibration than earlier models. However, persistent stochasticity and large latency-cost disparities may limit clinical deployment. LEVEL OF EVIDENCE: N/A.
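
For readers unfamiliar with the calibration and reproducibility metrics named in the abstract, the following is a minimal Python sketch of how ECE, the Brier score and a flip rate are commonly computed. It is illustrative only and is not the authors' code. The ten equal-width confidence bins, the definition of flip rate as the share of retested questions whose answer changed, the function names and the toy data are all assumptions; the abstract does not specify these details.

```python
from typing import List, Sequence, Tuple


def brier_score(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    total = sum((c - float(y)) ** 2 for c, y in zip(confidences, correct))
    return total / len(confidences)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """ECE with equal-width bins (assumed binning, not taken from the study)."""
    n = len(confidences)
    bins: List[List[Tuple[float, float]]] = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 joins the top bin
        bins[idx].append((c, float(y)))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        # Weight each bin's |accuracy - confidence| gap by its share of questions.
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


def flip_rate(first_run: Sequence[str], second_run: Sequence[str]) -> float:
    """Share of retested questions whose selected answer changed (assumed definition)."""
    flips = sum(a != b for a, b in zip(first_run, second_run))
    return flips / len(first_run)


# Hypothetical toy data: six questions with self-reported confidence.
toy_conf = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
toy_correct = [True, True, True, False, True, False]

print(f"ECE   = {expected_calibration_error(toy_conf, toy_correct):.4f}")
print(f"Brier = {brier_score(toy_conf, toy_correct):.4f}")
print(f"Flip  = {flip_rate(['A', 'B', 'C', 'D'], ['A', 'B', 'D', 'D']):.2f}")
```

In this toy case the model reports higher confidence than its accuracy supports in both occupied bins, which is the pattern of overconfidence the abstract reports for GPT-4. A lower ECE or Brier score indicates better calibration, and a flip rate of 0 means identical answers on retest.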

Authors

  • Pedro Diniz
    Department of Bioengineering and iBB - Institute for Bioengineering and Biosciences, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal.
  • Takuji Yokoe
    Orthopaedic Department, Centro Hospitalar Póvoa de Varzim - Vila do Conde, Portugal.
  • Felix C Öttl
    Balgrist Universitätsklinik, Zürich, Switzerland.
  • Hélder Pereira
    Orthopaedic Department, Centro Hospitalar Póvoa de Varzim - Vila do Conde, Portugal.
  • Rui Henriques
    INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal.
  • Kristian Samuelsson
    Department of Orthopaedics, Institute of Clinical Sciences, The Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden.
