Large Language Model Clinical Vignettes and Multiple-Choice Questions for Postgraduate Medical Education.

Journal: Academic Medicine: Journal of the Association of American Medical Colleges

Abstract

Problem: Clinical vignette-based multiple-choice questions (MCQs) have been used to assess postgraduate medical trainees but require substantial time and effort to develop. Large language models, a type of artificial intelligence (AI), can potentially expedite this task. This report describes prompt engineering techniques used with ChatGPT-4 to generate clinical vignettes and MCQs for obstetrics-gynecology residents and evaluates whether residents and attending physicians can differentiate between human- and AI-generated content.

Approach: The authors generated MCQs using a structured prompt engineering approach, incorporating authoritative source documents and an iterative prompt chaining technique to refine output quality. Fifty human-generated and 50 AI-generated MCQs were randomly arranged into 10 quizzes (10 questions each). The AI-generated MCQs were developed in August 2024, and the surveys were conducted in September 2024. Obstetrics-gynecology residents and attending physician faculty members at Northwell Health or the Donald and Barbara Zucker School of Medicine at Hofstra/Northwell completed an online survey, answering each MCQ and indicating whether they believed it was human written, AI written, or whether they were uncertain.

Outcomes: Thirty-three participants (16 residents, 17 attendings) completed the survey (80.5% response rate). Respondents correctly identified MCQ authorship a median (interquartile range [IQR]) of 39.1% (30.0%-50.0%) of the time, indicating difficulty in distinguishing human- from AI-generated questions. The median (IQR) correct answer selection rate was 62.3% (50.0%-75.0%) for human-generated MCQs and 64.4% (50.0%-83.3%) for AI-generated MCQs (P = .74). The difficulty index (0.69 vs 0.66, P = .83) and discrimination index (0.42 vs 0.38, P = .90) showed no significant differences, supporting the feasibility of large language model-generated MCQs in medical education.

Next Steps: Future studies should explore the optimal balance between AI-generated content and expert review, identifying strategies to maximize efficiency without compromising accuracy. The authors will develop practice exams and assess their predictive validity by comparing scores with standardized exam results.
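
The abstract does not reproduce the authors' actual prompts, so the following is only a minimal sketch of what an iterative prompt chaining workflow like the one described in the Approach could look like, using the OpenAI Python SDK. The model name, source file, and prompt wording are illustrative assumptions, not the study's protocol.

    # Minimal sketch of iterative prompt chaining for MCQ generation.
    # All prompts, the model name, and the source file are hypothetical.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def chat(messages):
        """Send the running conversation and return the model's reply text."""
        resp = client.chat.completions.create(model="gpt-4", messages=messages)
        return resp.choices[0].message.content

    # Authoritative source document grounding the vignette (hypothetical file).
    source_text = open("practice_bulletin.txt").read()

    messages = [
        {"role": "system", "content": "You write board-style obstetrics-gynecology "
                                      "clinical vignette MCQs."},
        {"role": "user", "content": "Using only the source below, draft a clinical "
                                    "vignette with one best-answer MCQ (options A-E) "
                                    "and mark the correct answer.\n\n" + source_text},
    ]
    draft = chat(messages)

    # Chain step: feed the draft back with a targeted refinement instruction,
    # repeating until the output meets the reviewers' quality criteria.
    messages += [
        {"role": "assistant", "content": draft},
        {"role": "user", "content": "Revise the question: make all distractors "
                                    "plausible and justify the correct answer with "
                                    "the relevant source passage."},
    ]
    refined = chat(messages)
    print(refined)

The key design point of chaining is that each refinement call carries the full prior exchange, so later instructions operate on the earlier draft rather than generating from scratch; the abstract's emphasis on expert review suggests a human would still vet the final output.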
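The difficulty and discrimination indexes reported in the Outcomes are not defined in the abstract; assuming the standard classical test theory definitions (which the abstract does not confirm), they would be computed as:

    p = \frac{n_{\text{correct}}}{N},
    \qquad
    D = \frac{U_{\text{correct}} - L_{\text{correct}}}{n_g}

where N is the total number of examinees, n_correct is the number answering the item correctly, and U_correct and L_correct are the numbers correct within the upper and lower scoring groups of n_g examinees each. Under these conventions, the reported p values of 0.66-0.69 would indicate moderately easy items, and D values of 0.38-0.42 would indicate acceptable discrimination.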

Authors

  • Frank I Jackson
  • Nathan A Keller
  • Insaf Kouba
    Northwell, New Hyde Park, New York, United States of America.
  • Wassil Kouba
  • Luis A Bracero
  • Matthew J Blitz
    Northwell, New Hyde Park, New York, United States of America.
