An LLM-Based Comparison of Ambient AI Scribes for Clinical Documentation

Journal: medRxiv
Published Date:

Abstract

Ambient AI scribes have become an increasingly promising option for automating clinical documentation, with dozens of enterprise solutions available. It remains uncertain whether models with domain-specific tuning outperform naïve models used “out of the box.” This study evaluated five commercial AI scribes, a custom solution built on the base GPT-o1 model without fine-tuning, and an experienced human scribe in a series of simulated clinical encounters. The notes generated by each scribe were scored by large language models (LLMs) using a rubric assessing completeness, organization, accuracy, complexity handling, conciseness, and adaptability. Our naïve solution achieved scores comparable to those of industry-leading solutions across all rubric dimensions. These findings suggest that domain-specific training may add limited value over base foundation models in ambient AI medical scribes.

Authors

  • Jaison Jain; James Kaan; Suraj Jain; Austin Young; Camilo Martinez; William Kartsonis; Carlos Ortiz; Ryan Cheng; Erik Jaklitsch; Srinivas Cherukuri; Aleksandra Qilleri; Apostolos Tassiopoulos