Can AI grade like a professor? Comparing artificial intelligence and faculty scoring of medical student short-answer clinical reasoning exams.
Journal:
Advances in Health Sciences Education: Theory and Practice
Published Date:
Aug 6, 2025
Abstract
Many medical schools rely primarily on multiple-choice questions (MCQs) in pre-clinical assessments because of their efficiency and consistency. However, while MCQs are easy to grade, they often fall short in evaluating higher-order reasoning and in revealing students' thought processes. Despite these limitations, MCQs remain popular because alternative assessments require more time and resources to grade. This study explored whether OpenAI's GPT-4o Large Language Model (LLM) could effectively grade narrative short-answer questions (SAQs) in case-based learning (CBL) exams compared with faculty graders. The primary outcome was equivalence of LLM grading, assessed using a bootstrapping procedure to calculate 95% confidence intervals (CIs) for mean score differences; equivalence was defined as the entire 95% CI falling within a ±5% margin. Secondary outcomes included grading precision, subgroup analysis by Bloom's taxonomy, and the correlation between question complexity and LLM performance. Analysis of 1,450 responses showed that LLM scores were equivalent to faculty scores overall (mean difference: -0.55%; 95% CI: -1.53% to +0.45%). Equivalence was also demonstrated for Remembering, Applying, and Analyzing questions; however, discrepancies were observed for Understanding and Evaluating questions. AI grading demonstrated high precision (ICC = 0.993, 95% CI: 0.992-0.994). Greater differences between LLM and faculty scores were found for more difficult questions (R² = 0.6199, p < 0.0001). LLM grading could serve as a tool for preliminary scoring of student assessments, enhancing SAQ grading efficiency and improving the quality of undergraduate medical education examinations. The secondary outcome findings underscore the need to use these tools in combination with, not as a replacement for, faculty involvement in the grading process.
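The bootstrap equivalence test described in the abstract can be sketched in a few lines of Python. The sketch below is illustrative only: the function name `bootstrap_equivalence`, the simulated score arrays, and the resampling count are assumptions, not the authors' code. It shows the general technique of resampling paired score differences, taking the 2.5th and 97.5th percentiles as a 95% CI, and declaring equivalence when the whole CI falls inside the ±5% margin.

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_equivalence(faculty, llm, margin=5.0, n_boot=10_000):
    """Percentile-bootstrap 95% CI for the mean paired score
    difference (LLM - faculty, in percentage points). Equivalence
    holds when the entire CI lies strictly inside +/- margin."""
    diffs = np.asarray(llm, dtype=float) - np.asarray(faculty, dtype=float)
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    boot_means = np.array([
        rng.choice(diffs, size=n, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return diffs.mean(), (lo, hi), (lo > -margin) and (hi < margin)

# Hypothetical data: 1,450 paired percentage scores (not the study's data).
faculty = rng.normal(80, 10, 1450).clip(0, 100)
llm = (faculty + rng.normal(-0.5, 5, 1450)).clip(0, 100)

mean_diff, (lo, hi), equivalent = bootstrap_equivalence(faculty, llm)
print(f"mean diff = {mean_diff:+.2f}%, 95% CI = ({lo:+.2f}%, {hi:+.2f}%), "
      f"equivalent within ±5%: {equivalent}")
```

Note the design choice this mirrors: equivalence testing asks whether the CI is contained in the margin, which is stricter than merely failing to find a significant difference, so a wide CI from noisy grading would correctly fail to demonstrate equivalence.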