Can AI grade like a professor? Comparing artificial intelligence and faculty scoring of medical student short-answer clinical reasoning exams.
Journal:
Advances in Health Sciences Education: Theory and Practice
Published Date:
Aug 6, 2025
Abstract
Many medical schools rely primarily on multiple-choice questions (MCQs) in pre-clinical assessments because of their efficiency and consistency. However, while MCQs are easy to grade, they often fall short in evaluating higher-order reasoning and in revealing students' thought processes. Despite these limitations, MCQs remain popular because alternative assessments require more time and resources to grade. This study explored whether OpenAI's GPT-4o Large Language Model (LLM) could effectively grade narrative short-answer questions (SAQs) in case-based learning (CBL) exams compared with faculty graders. The primary outcome was equivalence of LLM grading, assessed using a bootstrapping procedure to calculate 95% confidence intervals (CIs) for mean score differences; equivalence was defined as the entire 95% CI falling within a ±5% margin. Secondary outcomes included grading precision, subgroup analysis by Bloom's taxonomy, and the correlation between question complexity and LLM performance. Analysis of 1,450 responses showed that LLM scores were equivalent to faculty scores overall (mean difference: -0.55%; 95% CI: -1.53% to +0.45%). Equivalence was also demonstrated for Remembering, Applying, and Analyzing questions; however, discrepancies were observed for Understanding and Evaluating questions. AI grading demonstrated high precision (ICC = 0.993, 95% CI: 0.992-0.994). Greater differences between LLM and faculty scores were found for more difficult questions (R² = 0.6199, p < 0.0001). LLM grading could serve as a tool for preliminary scoring of student assessments, enhancing SAQ grading efficiency and improving the quality of undergraduate medical education examinations. The secondary outcome findings underscore the need to use these tools in combination with, not as a replacement for, faculty involvement in the grading process.
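The bootstrap equivalence test described in the abstract can be sketched in a few lines of Python. The sketch below is illustrative only: the function name `bootstrap_equivalence`, the simulated score arrays, and the resampling count are assumptions, not the authors' code. It shows the general technique of resampling paired score differences, taking the 2.5th and 97.5th percentiles as a 95% CI, and declaring equivalence when the whole CI falls inside the ±5% margin.

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_equivalence(faculty, llm, margin=5.0, n_boot=10_000):
    """Percentile-bootstrap 95% CI for the mean paired score
    difference (LLM - faculty, in percentage points). Equivalence
    holds when the entire CI lies strictly inside +/- margin."""
    diffs = np.asarray(llm, dtype=float) - np.asarray(faculty, dtype=float)
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    boot_means = np.array([
        rng.choice(diffs, size=n, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return diffs.mean(), (lo, hi), (lo > -margin) and (hi < margin)

# Hypothetical data: 1,450 paired percentage scores (not the study's data).
faculty = rng.normal(80, 10, 1450).clip(0, 100)
llm = (faculty + rng.normal(-0.5, 5, 1450)).clip(0, 100)

mean_diff, (lo, hi), equivalent = bootstrap_equivalence(faculty, llm)
print(f"mean diff = {mean_diff:+.2f}%, 95% CI = ({lo:+.2f}%, {hi:+.2f}%), "
      f"equivalent within ±5%: {equivalent}")
```

Note the design choice this mirrors: equivalence testing asks whether the CI is contained in the margin, which is stricter than merely failing to find a significant difference, so a wide CI from noisy grading would correctly fail to demonstrate equivalence.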