Evaluating Large Language Models for Multimodal Simulated Ophthalmic Decision-Making in Diabetic Retinopathy and Glaucoma Screening
Journal:
arXiv
Published Date:
Jul 2, 2025
Abstract
Large language models (LLMs) can simulate clinical reasoning based on natural
language prompts, but their utility in ophthalmology is largely unexplored.
This study evaluated GPT-4's ability to interpret structured textual
descriptions of retinal fundus photographs and simulate clinical decisions for
diabetic retinopathy (DR) and glaucoma screening, including the impact of
adding real or synthetic clinical metadata. We conducted a retrospective
diagnostic validation study using 300 annotated fundus images. GPT-4 received
structured prompts describing each image, with or without patient metadata. The
model was tasked with assigning an ICDR severity score, recommending DR
referral, and estimating the cup-to-disc ratio for glaucoma referral.
Performance was evaluated using accuracy, macro and weighted F1 scores, and
Cohen's kappa. McNemar's test and change rate analysis were used to assess the
influence of metadata. GPT-4 showed moderate performance for ICDR
classification (accuracy 67.5%, macro F1 0.33, weighted F1 0.67, kappa 0.25),
driven mainly by correct identification of normal cases. Performance improved
in the binary DR referral task (accuracy 82.3%, F1 0.54, kappa 0.44). For
glaucoma referral, performance was poor across all settings (accuracy ~78%, F1
<0.04, kappa <0.03). Metadata inclusion did not significantly affect outcomes
(McNemar p > 0.05), and predictions remained consistent across conditions.
GPT-4 can simulate basic ophthalmic decision-making from structured prompts but
lacks precision for complex tasks. While not suitable for clinical use, LLMs
may assist in education, documentation, or image annotation workflows in
ophthalmology.