Development and Validation of CPX-MATE: An End-to-End Medical Education Platform Integrating Voice-Based Virtual Patient Simulation and Automated Real-time Evaluation

Journal: medRxiv
Published Date:

Abstract

Background: The Objective Structured Clinical Examination (OSCE; known as the Clinical Performance Examination [CPX] in South Korea) is a high-stakes assessment of clinical performance, communication, and reasoning during time-limited patient encounters. As AI-enabled virtual standardized patient (VSP) simulation and automated scoring are introduced for OSCE-like training, prospective evidence is needed on how such systems perform and are perceived when embedded in real educational workflows.

Methods: We developed CPX with Medical students' Assistant for Training and Evaluation (CPX-MATE), a web-based platform integrating two components: (1) CPX with Virtual Standardized Patient (CPX-VSP), which provides real-time voice dialogue with a VSP using speech-to-speech (STS) models, and (2) CPX with Real-Time Evaluator (CPX-RTE), which provides automated transcription, checklist-based scoring, and feedback from encounter audio using a speech-to-text model and a large language model. During an emergency medicine clerkship (November 2025 to January 2026), 60 senior medical students completed two 12-min CPX encounters (VSP with acute pancreatitis; HSP with ureteral stone) with immediate CPX-RTE feedback. For CPX-VSP, students were assigned to either a full-capacity or a resource-limited STS configuration (n=30 each). Dialogue fidelity was evaluated by turn-by-turn analysis of student-VSP exchanges, classifying responses into clinically meaningful error types (tangential, oversharing, role-breaking, off-script). CPX-RTE performance was assessed by agreement (Gwet's AC1) with professor real-time ratings and resident video-based ratings on a 45-item checklist. Students were surveyed on the usability of CPX-VSP and CPX-RTE and on overall usability using the System Usability Scale (SUS), and mean per-session costs for CPX-VSP and CPX-RTE were calculated.

Results: Across 3,282 dialogue turns, overall error rates were 1.77% versus 9.43% for the full-capacity versus resource-limited STS configurations (p<0.001), driven by fewer tangential and oversharing responses; no off-script errors were observed. The mean per-session cost was $0.12 for the resource-limited configuration and $0.78 for the full-capacity configuration. CPX-RTE showed high agreement with human ratings (AC1=0.916 vs professor; 0.916 vs resident), with agreement varying slightly across four sections, and high usability across all domains (mean scores, 4.65-4.92), at a per-session cost of $0.17. CPX-MATE demonstrated good overall usability (median SUS score [IQR], 77.5 [70.0-85.0]).

Conclusions: Embedded within a prospective clinical clerkship, CPX-MATE demonstrated operational fidelity and human-level checklist agreement as an end-to-end, voice-based AI-assisted OSCE platform. This real-world deployment supports its scalable integration as a complementary assessment tool while highlighting the importance of systematic validation and context-aware implementation in medical education.
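
For readers unfamiliar with the agreement statistic named in the abstract, the following is a minimal sketch of Gwet's AC1 for two raters scoring the same checklist items. It is not the authors' analysis code: the function name `gwet_ac1`, the binary 0/1 coding, and the sample ratings are hypothetical, and the sketch assumes complete paired ratings on nominal categories. It uses the standard two-rater form, AC1 = (pa - pe) / (1 - pe), where pa is the observed agreement and pe = (1 / (Q - 1)) * sum over categories of pi_k * (1 - pi_k), with pi_k the mean proportion of ratings in category k across both raters.

```python
from collections import Counter


def gwet_ac1(ratings_a, ratings_b, categories=(0, 1)):
    """Gwet's AC1 for two raters on nominal categories (illustrative sketch)."""
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("Both raters must score the same non-empty item set.")
    q = len(categories)
    if q < 2:
        raise ValueError("At least two categories are required.")

    n = len(ratings_a)

    # Observed agreement: share of items on which the two raters match.
    pa = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Mean proportion of ratings in each category across both raters.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    pi = [(count_a[k] + count_b[k]) / (2 * n) for k in categories]

    # Chance agreement under Gwet's model.
    pe = sum(p * (1 - p) for p in pi) / (q - 1)

    return (pa - pe) / (1 - pe)


# Hypothetical ratings for ten checklist items (1 = performed, 0 = not performed).
automated = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1]
human = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
ac1 = gwet_ac1(automated, human)
```

Unlike Cohen's kappa, the chance-agreement term in AC1 stays small when almost all items fall in one category, which is a common reason to prefer it for checklists where most items are scored as performed.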

Authors

  • Song, J. W.
  • Kim, M.
  • Hong, C.
  • Kim, Y. S.
  • Cho, J.
  • Kim, J. H.
  • Myung, J.
  • Choi, A.
  • Yoon, H.
  • Lee, S. G. W.
  • You, S. C.
  • Park, C.