Quality assurance and validity of AI-generated single best answer questions.

Journal: BMC Medical Education
PMID:

Abstract

BACKGROUND: Recent advancements in generative artificial intelligence (AI) have opened new avenues in educational methodologies, particularly in medical education. This study seeks to assess whether generative AI might be useful in addressing the depletion of assessment question banks, a challenge intensified during the COVID-19 era by the prevalence of open-book examinations, and to augment the pool of formative assessment opportunities available to students. While many recent publications have sought to ascertain whether AI can achieve a passing standard in existing examinations, this study investigates the potential for AI to generate the exam itself.

METHODS: This research utilized a commercially available AI large language model (LLM), OpenAI GPT-4, to generate 220 single best answer (SBA) questions, adhering to Medical Schools Council Assessment Alliance guidelines and a selection of Learning Outcomes (LOs) of the Scottish Graduate-Entry Medicine (ScotGEM) program. All questions were assessed by an expert panel for accuracy and quality. A total of 50 AI-generated and 50 human-authored questions were used to create two 50-item formative SBA examinations for Year 1 and Year 2 ScotGEM students. Each exam, delivered via the Speedwell eSystem, comprised 25 AI-generated and 25 human-authored questions presented in random order. Students completed the online, closed-book exams on personal devices under exam conditions reflecting those of summative examinations. The performance of both AI-generated and human-authored questions was evaluated, focusing on facility and discrimination index as key metrics.

RESULTS: The screening process revealed that 69% of AI-generated SBAs were fit for inclusion in the examinations with little or no modification required. Modifications, when necessary, predominantly addressed issues such as the inclusion of "all of the above" options, use of American English spellings, and non-alphabetized answer choices. The remaining 31% of questions were rejected due to factual inaccuracies and non-alignment with students' learning. For the questions included in an examination, post hoc statistical analysis indicated no significant difference in performance between the AI-authored and human-authored questions in terms of facility and discrimination index.
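For readers unfamiliar with the item-analysis metrics cited above, the sketch below illustrates how facility (the proportion of candidates answering an item correctly) and a discrimination index might be computed from binary item scores. This is a minimal illustration only: the upper/lower-group formulation, the group_fraction value, and the example score matrix are assumptions made here, as the abstract does not specify which variant of the discrimination index the authors applied.

    # Minimal sketch of classical item analysis: facility and discrimination index.
    # Assumes a binary scoring matrix (1 = correct, 0 = incorrect); the upper/lower
    # group method shown is one common definition of the discrimination index and
    # is an assumption, not necessarily the authors' exact procedure.

    import numpy as np

    def item_facility(scores: np.ndarray) -> np.ndarray:
        """Facility index: proportion of students answering each item correctly."""
        return scores.mean(axis=0)

    def discrimination_index(scores: np.ndarray, group_fraction: float = 0.27) -> np.ndarray:
        """Discrimination index via upper/lower group comparison.

        D = (proportion correct in upper group) - (proportion correct in lower group),
        where groups are the top and bottom group_fraction of students by total score.
        """
        n_students = scores.shape[0]
        n_group = max(1, int(round(n_students * group_fraction)))
        order = np.argsort(scores.sum(axis=1))   # students ranked by total score
        lower = scores[order[:n_group]]          # lowest-scoring group
        upper = scores[order[-n_group:]]         # highest-scoring group
        return upper.mean(axis=0) - lower.mean(axis=0)

    # Hypothetical example: 6 students x 4 items (illustrative data only)
    scores = np.array([
        [1, 0, 1, 0],
        [1, 1, 1, 0],
        [0, 0, 1, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 0],
        [1, 1, 1, 1],
    ])
    print("Facility:", item_facility(scores))
    print("Discrimination:", discrimination_index(scores))

In this convention, facility values near 0.5 indicate items of moderate difficulty, and higher discrimination values indicate items that better separate stronger from weaker candidates.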

Authors

  • Ayla Ahmed
    University of St Andrews, St Andrews, UK.
  • Ellen Kerr
    University of St Andrews, St Andrews, UK.
  • Andrew O'Malley
    University of St Andrews, St Andrews, UK. aso2@st-andrews.ac.uk.