Can large language models generate exam questions comparable to humans? A systematic review and meta-analysis study in medical education.

Journal: Medical Teacher

Published Date: Jun 21, 2026

Abstract

INTRODUCTION: Recently developed artificial intelligence (AI) tools, particularly large language models (LLMs) such as GPT, can streamline the development of multiple-choice questions (MCQs); however, concerns remain about psychometric quality, fairness across learner groups, and potential bias.

METHODS: Following PRISMA 2020, we searched PubMed, Google Scholar, Web of Science, and the Cochrane Library for studies comparing AI-generated with human-authored MCQs in health education. We extracted difficulty and discrimination outcomes and assessed risk of bias using the original Cochrane Risk of Bias tool for randomised trials and ROBINS-I for non-randomised studies. Random-effects meta-analyses were performed in R Studio, with subgroup and sensitivity analyses to explore heterogeneity.

RESULTS: Twelve studies were included in the systematic review (two randomised trials and ten non-randomised studies); eleven contributed learner-derived difficulty data and twelve contributed discrimination data to the meta-analyses. There was no evidence of a difference between AI-generated and human-authored items for difficulty (SMD 0.05, 95% CI -0.30 to 0.40; p = 0.77; I² = 82%) or discrimination (SMD -0.10, 95% CI -0.35 to 0.15; p = 0.42; I² = 70%). Subgroup analyses by AI model and educational domain did not fully explain the substantial between-study heterogeneity. However, in the medical licensing examination subgroup, human-authored questions were more difficult. In sensitivity analysis, excluding one study altered the main result for discrimination, yielding a significant benefit for AI-generated items. The evidence base is limited, and the literature pays little attention to equity.

DISCUSSION: To our knowledge, this is the first review comparing AI-generated with human-authored MCQs. Our findings highlight that AI-generated MCQs demonstrate psychometric performance comparable to human-authored items overall but, while research evolves, are presently best positioned to supplement expert item development in lower-stakes contexts. We propose a precautionary approach with staged, auditable safety checkpoints, including provenance, continuous psychometric surveillance, and human gatekeeping, to support equitable, culturally safe, high-quality outputs.
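For readers less familiar with the pooling behind the reported SMDs and I² values, the following is a minimal sketch of a random-effects meta-analysis with a leave-one-out sensitivity check. The abstract does not say which estimator the authors used, so the DerSimonian-Laird estimator, the function names, and the toy inputs here are assumptions for illustration only; none of the numbers are data from the review.

```python
import math


def random_effects_pool(effects, variances):
    """Pool per-study effect sizes (e.g. SMDs) with a DerSimonian-Laird model.

    effects   -- list of per-study effect sizes
    variances -- list of the corresponding sampling variances
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # inverse-variance (fixed-effect) weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)

    # Cochran's Q measures dispersion of study effects around the fixed estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)

    # Between-study variance (tau^2) and I^2, both truncated at zero.
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # Random-effects weights add tau^2 to each study's sampling variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    return {
        "pooled": pooled,
        "ci": (pooled - 1.96 * se, pooled + 1.96 * se),  # normal-approximation 95% CI
        "tau2": tau2,
        "i2": i2,
    }


def leave_one_out(effects, variances):
    """Re-pool after dropping each study in turn (a simple sensitivity analysis)."""
    results = []
    for i in range(len(effects)):
        rest_effects = effects[:i] + effects[i + 1:]
        rest_variances = variances[:i] + variances[i + 1:]
        results.append(random_effects_pool(rest_effects, rest_variances))
    return results


if __name__ == "__main__":
    # Hypothetical toy inputs, NOT taken from the review.
    toy_effects = [-0.5, 0.0, 0.5]
    toy_variances = [0.01, 0.01, 0.01]
    print(random_effects_pool(toy_effects, toy_variances))
    for i, result in enumerate(leave_one_out(toy_effects, toy_variances)):
        print("without study", i, "->", round(result["pooled"], 3))
```

A leave-one-out loop of this kind is one common way to detect the situation the abstract describes, where removing a single study changes the pooled conclusion.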
