Safety and Utility of an Agentic Large Language Model-Based Hospital Course Summarizer: A Prospective Real-World Pilot Study
Journal: medRxiv
Published Date: Feb 6, 2026
Abstract
Importance: High-quality discharge summaries are essential for safe care transitions but contribute substantially to clinician documentation burden and burnout. While retrospective studies suggest large language models (LLMs) can generate clinical summaries comparable in quality to physician-authored summaries, prospective evidence on safety, utility, and effects on clinician well-being in real-world settings is limited.
Objective: To evaluate the safety, utilization, and impact on clinician burden of MedAgentBrief, an LLM-based agentic workflow for generating hospital course summaries, during prospective clinical deployment.
Design, Setting, and Participants: Single-arm prospective pilot study of 11 attending hospitalist physicians on one inpatient unit from August 1 to October 11, 2025, with baseline comparisons from April 9 to July 31, 2025.
Intervention: MedAgentBrief, a custom agentic AI workflow using Gemini 2.5 Pro, generated draft hospital course summaries nightly from the admission history and physical and daily progress notes. Drafts were securely emailed to physicians each day for review and optional use.
Main Outcomes and Measures: The primary outcome was physician-reported potential for and severity of harm from unedited summaries, measured with the AHRQ Common Format Harm Scale. Secondary outcomes included utilization rate; error types (omissions, inaccuracies, hallucinations); time spent creating discharge summaries (EHR logs); and changes in cognitive burden (NASA Task Load Index [NASA-TLX]) and burnout (Stanford Professional Fulfillment Index [PFI] Work Exhaustion Scale).
Results: The system generated 1,274 summaries. Among 384 discharges, physicians used AI content in 219 cases (57%). Feedback was provided on 100 summaries (40.2%); omissions were noted in 25% and inaccuracies in 20%, while hallucinations were rare (2%). Physicians rated 88% of unedited summaries as having no harm potential and 1% as likely to cause moderate harm; no severe harm was reported. Burnout scores decreased significantly (1.75 vs 1.20; P = .03). Time savings were heterogeneous: 71% of physicians had reductions in median documentation time, with reductions of up to 2.9 minutes.
Conclusions and Relevance: During prospective deployment, an LLM-based agentic workflow produced hospital course summaries that were frequently used, with mild to minimal harm risk identified. The intervention was associated with a significant reduction in physician burnout, supporting the viability of AI summarization to reduce documentation burden.
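As a reader's aid, the reported utilization rate (219 of 384 discharges) can be reproduced with a few lines of Python. The sketch below also attaches a 95% Wilson score interval to convey the uncertainty of a pilot-sized sample. The interval is an illustrative assumption: the abstract does not report confidence intervals or state which method, if any, the authors used, and the function name `proportion_with_wilson_ci` is hypothetical.

```python
import math


def proportion_with_wilson_ci(successes, total, z=1.96):
    """Return (proportion, lower, upper) using the Wilson score interval.

    Illustrative only: the abstract reports the point estimate (57%),
    not an interval, so the choice of Wilson's method is an assumption.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return p, centre - half_width, centre + half_width


# Counts taken from the Results: AI content used in 219 of 384 discharges.
rate, low, high = proportion_with_wilson_ci(219, 384)
print(f"Utilization: {rate:.1%} (95% CI {low:.1%} to {high:.1%})")
```

The Wilson interval is chosen here over the simpler normal approximation because it behaves better for small samples and for proportions near 0 or 1, which would matter if the same helper were applied to the rarer error categories.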