DeepSeek-R1 Outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in Bilingual Complex Ophthalmology Reasoning
Journal:
arXiv
Published Date:
Feb 25, 2025
Abstract
Purpose: To evaluate the accuracy and reasoning ability of DeepSeek-R1 and
three other recently released large language models (LLMs) in bilingual complex
ophthalmology cases. Methods: A total of 130 multiple-choice questions (MCQs)
related to diagnosis (n = 39) and management (n = 91) were collected from the
Chinese ophthalmology senior professional title examination and categorized
into six topics. These MCQs were translated into English using DeepSeek-R1. The
responses of DeepSeek-R1, Gemini 2.0 Pro, OpenAI o1 and o3-mini were generated
under default configurations between February 15 and February 20, 2025.
Accuracy was calculated as the proportion of correctly answered questions, with
omissions and extra answers considered incorrect. Reasoning ability was
evaluated through analyzing reasoning logic and the causes of reasoning error.
Results: DeepSeek-R1 demonstrated the highest overall accuracy, achieving 0.862
in Chinese MCQs and 0.808 in English MCQs. Gemini 2.0 Pro, OpenAI o1, and
OpenAI o3-mini attained accuracies of 0.715, 0.685, and 0.692 in Chinese MCQs
(all P<0.001 compared with DeepSeek-R1), and 0.746 (P=0.115), 0.723 (P=0.027),
and 0.577 (P<0.001) in English MCQs, respectively. DeepSeek-R1 achieved the
highest accuracy across five topics in both Chinese and English MCQs. It also
outperformed the other three models on management questions posed in Chinese (all P<0.05). Reasoning
ability analysis showed that the four LLMs shared similar reasoning logic.
Ignoring key positive history, ignoring key positive signs, misinterpreting
medical data, and being overly aggressive were the most common causes of
reasoning errors. Conclusion: DeepSeek-R1 outperformed three other
state-of-the-art LLMs in bilingual complex ophthalmology reasoning tasks.
While its clinical applicability remains challenging, it shows promise for
supporting diagnosis and clinical decision-making.
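
The scoring rule described in the Methods (accuracy as the proportion of correctly answered questions, with omissions and extra answers counted as incorrect) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the encoding of answers as strings of option letters, and the sample data are all hypothetical, chosen only to show exact-match scoring.

```python
def is_correct(response: str, key: str) -> bool:
    """Exact-match scoring: the selected options must equal the answer key.

    A response that omits a keyed option or includes an extra option
    is counted as incorrect. Option order does not matter.
    """
    return set(response) == set(key)


def accuracy(responses: list[str], keys: list[str]) -> float:
    """Proportion of questions answered exactly correctly."""
    if len(responses) != len(keys):
        raise ValueError("responses and keys must have the same length")
    if not keys:
        raise ValueError("at least one question is required")
    correct = sum(is_correct(r, k) for r, k in zip(responses, keys))
    return correct / len(keys)


# Hypothetical illustration data, not taken from the study.
keys = ["A", "BD", "C", "ACD"]
responses = ["A", "B", "CD", "ACD"]  # 2nd omits "D", 3rd adds "D"
print(accuracy(responses, keys))  # 0.5
```

Under this rule a partially correct multi-answer response earns no credit, which is stricter than partial-credit schemes and should be kept in mind when comparing the reported accuracies with other benchmarks.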