Evaluation of the Mayo endoscopic score in ulcerative colitis using a multimodal large language model: a human-blinded accuracy study.

Journal: Intestinal Research

Abstract

BACKGROUND/AIMS: Some large language models (LLMs), such as Chat Generative Pre-trained Transformer 4-omni (ChatGPT-4o), can now process images through vision transformer patching. This study assessed the ability of ChatGPT-4o to assign the Mayo endoscopic score (MES).

METHODS: A selection set of high-quality endoscopic frames was used to compare 4 input models and select the best-performing one, which was then confirmed in an extended set of 304 frames. Concordance with the evaluation by expert endoscopists was assessed.

RESULTS: Only one of the pre-tested models demonstrated significant concordance (κ = 0.232; 95% confidence interval [CI], 0.167 to 0.296; P = 0.003; ρ = 0.36, P = 0.011), with a mean bias of -0.26 ± 1.192 (95% CI, -2.596 to 2.076). This was confirmed in the extended set (κ = 0.260; 95% CI, 0.195 to 0.324; P < 0.001; ρ = 0.288, P < 0.001). The absolute concordance for the selected model was 44% in the selection set and 45.3% in the extended set. For the identification of moderately-to-severely active disease, sensitivity was 73% (95% CI, 60% to 82%), specificity 60% (95% CI, 54% to 66%), positive predictive value 32% (95% CI, 25% to 40%), and negative predictive value 90% (95% CI, 84% to 93%).

CONCLUSIONS: ChatGPT-4o shows modest potential for evaluating the MES in endoscopic frames, but further refinement is required.
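The agreement and accuracy statistics reported above can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The function names are hypothetical, and the 2x2 counts are illustrative values chosen only so that the rounded rates land near the reported percentages; they are not data from the study. The last helper shows that the interval quoted alongside the mean bias coincides with mean ± 1.96 × SD, which is the form of Bland-Altman limits of agreement.

```python
from typing import Dict, Sequence, Tuple


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Sensitivity, specificity, PPV and NPV from 2x2 counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among all diseased
        "specificity": tn / (tn + fp),  # true negatives among all non-diseased
        "ppv": tp / (tp + fp),          # true positives among all positive calls
        "npv": tn / (tn + fn),          # true negatives among all negative calls
    }


def cohen_kappa(matrix: Sequence[Sequence[int]]) -> float:
    """Unweighted Cohen's kappa for a k x k table (rows: rater A, cols: rater B)."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    return (observed - expected) / (1 - expected)


def limits_of_agreement(mean_bias: float, sd: float, z: float = 1.96) -> Tuple[float, float]:
    """Mean bias +/- z * SD of the paired differences."""
    return mean_bias - z * sd, mean_bias + z * sd


# Illustrative counts only (NOT from the study): 304 frames in total,
# expert reference = moderately-to-severely active disease yes/no.
tp, fp, fn, tn = 45, 97, 17, 145

for name, value in binary_metrics(tp, fp, fn, tn).items():
    print(f"{name}: {value:.2f}")

# Interval implied by the reported mean bias and SD.
low, high = limits_of_agreement(-0.26, 1.192)
print(f"mean bias +/- 1.96 SD: {low:.3f} to {high:.3f}")
```

With these illustrative counts the negative predictive value is high while the positive predictive value is low, the pattern expected when active disease is relatively uncommon in the frame set and specificity is moderate.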
