Agentic Artificial Intelligence for the Automated Generation of Accurate Summary Podcasts of Radiology Research Papers.

Journal: Korean journal of radiology
Published Date:

Abstract

OBJECTIVE: To evaluate whether a custom agentic artificial intelligence (AI) pipeline, compared with a generic commercial tool (Google NotebookLM [NBLM]), can overcome the limitations of general-purpose large language model tools in generating podcast-style summaries of radiology research articles.
MATERIALS AND METHODS: Twenty-two PDF-format original research articles published in the April 2025 issue of Radiology were processed using our Programmable, Phoneme-Aware PDF-to-Podcast Pipeline (P5) and NBLM to generate 44 audio episodes. P5 utilizes a multi-agent workflow for script generation, quality assurance, pronunciation enhancement, and audio synthesis. Four radiologists from a pool of 25 (7 generalists and 18 specialists) were randomly assigned to evaluate each blinded audio episode, yielding 176 total evaluations. The primary outcomes were the number of hallucinations (factual errors) per episode and the percentage of hallucination-free episodes. Secondary outcomes included the number of inappropriate statements, mispronunciations, and flow disruptions; the composite quality score (Quality Assessment of Educational Podcasts [QAEP]); the key results coverage score; and overall listener preference. Data were analyzed using generalized linear mixed models.
RESULTS: The P5 method produced significantly fewer hallucinations per episode than NBLM (mean, 0.32 vs. 0.93; P < 0.001) and a higher proportion of hallucination-free episodes (71.6% [63/88] vs. 56.8% [50/88]; P = 0.013), consistently across generalists and specialists. Compared with NBLM, P5 also demonstrated significantly fewer mispronunciations (mean, 0.11 vs. 1.62; P < 0.001) and flow disruptions (mean, 0.19 vs. 1.06; P < 0.001) per episode, as well as higher mean QAEP composite scores (4.62 vs. 4.28; P < 0.001), consistently across generalists and specialists. Raters preferred P5 over NBLM in 72.7% (64/88) of comparisons (P = 0.003).
CONCLUSION: Our custom agentic AI pipeline generated podcast-style summaries of radiology research articles with significantly higher quality and greater listener preference than the generic commercial tool.
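
The counts reported in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the variable and function names are not from the article, and it assumes that the denominator of 88 corresponds to 22 articles × 4 raters per method, which is consistent with the reported totals but is not stated explicitly.

```python
# Counts taken from the abstract; all names here are illustrative.
ARTICLES = 22
METHODS = 2              # P5 and NBLM
RATERS_PER_EPISODE = 4

episodes = ARTICLES * METHODS                    # audio episodes generated
evaluations = episodes * RATERS_PER_EPISODE      # total rater evaluations
# Assumption: evaluations split evenly between the two methods.
evaluations_per_method = evaluations // METHODS


def percent(numerator: int, denominator: int) -> float:
    """Return a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


hallucination_free_p5 = percent(63, evaluations_per_method)
hallucination_free_nblm = percent(50, evaluations_per_method)
preferred_p5 = percent(64, evaluations_per_method)

print(episodes, evaluations, evaluations_per_method)
print(hallucination_free_p5, hallucination_free_nblm, preferred_p5)
```

Under that assumption, the computed totals and percentages match the reported 44 episodes, 176 evaluations, and 71.6%, 56.8%, and 72.7%.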
