Automated Assessment of OSCE Physical Exams using Multimodal AI

Journal: medRxiv

Published Date: Jan 18, 2026

Abstract

Background: The assessment of physical examination skills in medical education is resource-intensive and prone to inter-rater variability. While artificial intelligence (AI) has successfully automated the grading of clinical notes and transcripts, evaluating the physical techniques themselves (what students do rather than what they say) remains an unsolved challenge. We evaluated whether a multimodal AI system could assess physical examination skills with expert-level reliability.

Methods: In this retrospective ablation study, we analyzed 300 video-recorded encounters from six Objective Structured Clinical Examination (OSCE) stations (cardiovascular, respiratory, gastrointestinal, musculoskeletal, and neurological). We compared the performance of a multimodal AI model (Gemini 2.5 Pro) across single- and multi-camera configurations and isolated input modalities (video, audio-only, transcript-only, visual-only) against standard human grading. The primary outcome was agreement with a physician-adjudicated ground-truth reference standard, measured by quadratic weighted Cohen's kappa (κ).

Results: The AI system using a synchronized 3-camera native video configuration achieved significantly higher reliability (κ = 0.830; 95% CI, 0.773-0.880) than the standard human evaluators (κ = 0.732; 95% CI, 0.687-0.776). Performance followed a strict hierarchy: native video > audio-only > transcript-only > visual-only. Notably, visual-only models failed (κ approximately 0.20) despite high detection accuracy, revealing a "visual paradox" where models could identify when an action occurred but not how well it was performed without audio cues.

Conclusions: A properly configured multimodal AI system can grade physical examination skills with reliability exceeding that of trained human evaluators. Success requires native processing of synchronized audio-visual streams; transcript-based or visual-only approaches are insufficient for high-stakes assessment. These findings suggest that AI can provide scalable, objective, and valid assessment of clinical skills, overcoming the limitations of traditional human grading.
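
For readers unfamiliar with the primary outcome metric, the following is a minimal sketch of how quadratic weighted Cohen's kappa can be computed for two sets of ordinal scores. It is not the authors' analysis code: the function name, the assumption that scores are integers from 0 to num_categories - 1, and the example scores are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, num_categories):
    """Quadratic weighted Cohen's kappa for two lists of ordinal scores.

    Assumes (for illustration) that scores are integers in
    0..num_categories-1 and that both lists rate the same items in order.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rater lists must be non-empty and of equal length")
    if num_categories < 2:
        raise ValueError("need at least two score categories")

    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B
    max_dist = (num_categories - 1) ** 2

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(num_categories):
        for j in range(num_categories):
            # Quadratic weight: penalty grows with the squared score gap.
            weight = (i - j) ** 2 / max_dist
            observed_disagreement += weight * observed[(i, j)] / n
            # Expected joint proportion if the two raters were independent.
            expected_disagreement += weight * hist_a[i] * hist_b[j] / (n * n)

    if expected_disagreement == 0.0:
        # Both raters gave one identical constant score; kappa is undefined.
        return float("nan")
    return 1.0 - observed_disagreement / expected_disagreement


if __name__ == "__main__":
    # Hypothetical scores on a 0-2 scale, not data from the study.
    reference = [2, 1, 0, 2, 1, 1, 0, 2]
    candidate = [2, 1, 1, 2, 0, 1, 0, 2]
    print(quadratic_weighted_kappa(reference, candidate, num_categories=3))
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and the quadratic weights penalize large score gaps more heavily than adjacent-category disagreements.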

Authors

Kang S.; Holcomb M.; Shakur A. H.; Hein D.; Ngo H.-T.; Schuler H.; Jarrett P.; Dalton T.; Jamieson A. R.
