Multimodal Performance of GPT-4 in Complex Ophthalmology Cases.
Journal:
Journal of Personalized Medicine
Published Date:
Apr 21, 2025
Abstract
Background: The integration of multimodal capabilities into GPT-4 represents a transformative leap for artificial intelligence in ophthalmology, yet its utility in scenarios requiring advanced reasoning remains underexplored. This study evaluates GPT-4's multimodal performance on open-ended diagnostic and next-step reasoning tasks in complex ophthalmology cases, comparing it against human expertise. Methods: GPT-4 was assessed across three study arms: (1) text-based case details with figure descriptions, (2) cases with text and accompanying ophthalmic figures, and (3) cases with figures only (no figure descriptions). We compared GPT-4's diagnostic and next-step accuracy across arms and benchmarked its performance against three board-certified ophthalmologists. Results: GPT-4 achieved 38.4% (95% CI [33.9%, 43.1%]) diagnostic accuracy and 57.8% (95% CI [52.8%, 62.2%]) next-step accuracy when prompted with figures without descriptions. Diagnostic accuracy declined significantly compared to text-only prompts (p = 0.007), though next-step performance was similar (p = 0.140). Adding figure descriptions restored diagnostic accuracy (49.3%) to near parity with text-only prompts (p = 0.684). Using figures without descriptions, GPT-4's diagnostic accuracy was comparable to that of two ophthalmologists (p = 0.30, p = 0.41) but fell short of the highest-performing ophthalmologist (p = 0.0004). For next-step accuracy, GPT-4 was similar to one ophthalmologist (p = 0.22) but underperformed relative to the other two (p = 0.0015, p = 0.0017). Conclusions: GPT-4's diagnostic performance diminishes when it relies solely on ophthalmic images without textual context, highlighting limitations in its current multimodal capabilities. Despite this, GPT-4 performed comparably to at least one ophthalmologist on both diagnostic and next-step reasoning tasks, underscoring its potential as an assistive tool. Future research should refine multimodal prompts and explore iterative or sequential prompting strategies to optimize AI-driven interpretation of complex ophthalmic datasets.