Can large language models generate exam questions comparable to humans? A systematic review and meta-analysis study in medical education.

Journal: Medical Teacher
Published Date:

Abstract

INTRODUCTION: Recently developed artificial intelligence (AI) tools, particularly large language models (LLMs) such as GPT, can streamline the development of multiple-choice questions (MCQs); however, concerns remain about psychometric quality, fairness across learner groups, and potential bias.

METHODS: Following PRISMA 2020, we searched PubMed, Google Scholar, Web of Science, and the Cochrane Library for studies comparing AI-generated and human-authored MCQs in health education. We extracted difficulty and discrimination outcomes and assessed risk of bias in R Studio using the original Cochrane Risk of Bias tool for randomised trials and ROBINS-I for non-randomised studies. Random-effects meta-analyses were performed, with subgroup and sensitivity analyses to explore heterogeneity.

RESULTS: Twelve studies were included in the systematic review (two randomised trials and ten non-randomised studies); eleven contributed learner-derived difficulty data and twelve contributed discrimination data to the meta-analyses. There was no evidence of a difference between AI-generated and human-authored items for difficulty (SMD 0.05, 95% CI -0.30 to 0.40; p = 0.77; I² = 82%) or discrimination (SMD -0.10, 95% CI -0.35 to 0.15; p = 0.42; I² = 70%). Subgroup analyses by AI model and educational domain did not fully explain the substantial between-study heterogeneity; however, human-generated questions were more difficult in medical licensing examinations. Sensitivity analysis showed that excluding one study altered the main results, indicating a significant benefit for AI-generated items on discrimination. The evidence base is limited, with little attention to equity in the literature.

DISCUSSION: To our knowledge, this is the first review comparing AI-generated with human-generated MCQs. Our findings highlight that AI-generated MCQs demonstrate psychometric performance comparable to human-authored items overall but are presently best positioned to supplement expert item development in lower-stakes contexts as research evolves. We propose a precautionary approach with staged, auditable safety checkpoints, including provenance, continuous psychometric surveillance, and human gatekeeping, to support equitable, culturally safe, high-quality outputs.
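
For readers unfamiliar with the pooling step, the following is a minimal sketch of how a random-effects pooled SMD, its 95% CI, and I² can be computed from per-study effect sizes. It assumes the DerSimonian-Laird estimator, which the abstract does not name, and the function name `random_effects_pool` and all input values are hypothetical illustrations, not data from the included studies.

```python
import math


def random_effects_pool(effects, variances):
    """Pool study-level SMDs with a DerSimonian-Laird random-effects model.

    Assumption: the review's actual estimator is not stated in the abstract;
    DerSimonian-Laird is used here only as a common illustrative choice.
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching variances")

    # Inverse-variance (fixed-effect) weights and mean.
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    fixed_mean = sum(w * y for w, y in zip(weights, effects)) / sum_w

    # Cochran's Q and the between-study variance tau^2.
    q = sum(w * (y - fixed_mean) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c)

    # Random-effects weights add tau^2 to each within-study variance.
    re_weights = [1.0 / (v + tau2) for v in variances]
    sum_re = sum(re_weights)
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum_re
    se = math.sqrt(1.0 / sum_re)

    # I^2: share of total variability attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    return {
        "smd": pooled,
        "ci_low": pooled - 1.96 * se,
        "ci_high": pooled + 1.96 * se,
        "tau2": tau2,
        "i2": i2,
    }


if __name__ == "__main__":
    # Hypothetical SMDs and variances, for illustration only.
    result = random_effects_pool([0.6, -0.4, 0.1, -0.2], [0.02, 0.03, 0.025, 0.04])
    print(result)
```

A confidence interval that spans zero alongside a high I², as in the reported results, indicates that the pooled estimate shows no clear difference while the individual studies disagree considerably with one another.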
