Enhanced Artificial Intelligence in Bladder Cancer Management: A Comparative Analysis and Optimization Study of Multiple Large Language Models.

Journal: Journal of Endourology
Published Date:

Abstract

With the rapid advancement of artificial intelligence in health care, large language models (LLMs) show increasing potential in medical applications. However, their performance in specialized oncology remains limited. This study evaluates the performance of several leading LLMs in answering clinical questions about bladder cancer (BLCA) and demonstrates how strategic optimization can overcome these limitations. We developed a set of 100 clinical questions based on established guidelines, covering the epidemiology, diagnosis, treatment, prognosis, and follow-up of BLCA. Six LLMs (Claude-3.5-Sonnet, ChatGPT-4.0, Grok-beta, Gemini-1.5-Pro, Mistral-Large-2, and GPT-3.5-Turbo) were each tested in three independent trials, and their responses were validated against current clinical guidelines and expert consensus. We then applied a two-phase training optimization process to GPT-3.5-Turbo to enhance its performance. In the initial evaluation, Claude-3.5-Sonnet demonstrated the highest accuracy (89.33% ± 1.53%), followed by ChatGPT-4.0 (85.67% ± 1.15%). Grok-beta achieved 84.33% ± 1.53% accuracy, whereas Gemini-1.5-Pro and Mistral-Large-2 performed similarly (82.00% ± 1.00% and 81.00% ± 1.00%, respectively). GPT-3.5-Turbo had the lowest accuracy (74.33% ± 3.06%). After the first phase of training, GPT-3.5-Turbo's accuracy improved to 86.67% ± 1.89%; after the second phase of optimization, the model achieved 100% accuracy. This study not only establishes the comparative performance of various LLMs on BLCA-related queries but also validates the potential for significant improvement through targeted training optimization. The successful enhancement of GPT-3.5-Turbo's performance suggests that strategic model refinement can overcome initial limitations and achieve optimal accuracy in specialized medical applications.
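The accuracy figures above are reported as a mean ± a spread over three trials on the 100-question set. As a minimal sketch of how such a summary can be computed, the Python below converts per-trial correct-answer counts into percentages and summarizes them. The function name `accuracy_summary` and the example counts are hypothetical and are not the study's raw data. The abstract does not say whether the spread is a sample or population standard deviation, so the sample form is used here as an assumption.

```python
from statistics import mean, stdev

N_QUESTIONS = 100  # size of the question set reported in the abstract


def accuracy_summary(correct_counts, n_questions=N_QUESTIONS):
    """Return (mean, sample SD) of percent accuracy across trials.

    Assumption: the spread is a sample standard deviation (n - 1 in the
    denominator); the abstract does not state which form was used.
    """
    accuracies = [100.0 * c / n_questions for c in correct_counts]
    return mean(accuracies), stdev(accuracies)


# Hypothetical correct-answer counts for three trials (illustrative only).
example_counts = [90, 92, 94]
avg, sd = accuracy_summary(example_counts)
print(f"{avg:.2f}% ± {sd:.2f}%")  # prints "92.00% ± 2.00%"
```

With only three trials, the sample and population forms can differ noticeably, so a reported spread is easier to interpret when the form is stated.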

Authors

  • Kun-Peng Li
    Department of Urology, Affiliated Hospital of North Sichuan Medical College, Nanchong, China.
  • Li Wang
    College of Marine Electrical Engineering, Dalian Maritime University, Dalian, China.
  • Shun Wan
    Department of Urology, The Second Hospital of Lanzhou University, Lanzhou, China.
  • Chen-Yang Wang
    Department of Urology, The Second Hospital of Lanzhou University, Lanzhou, China.
  • Si-Yu Chen
    Department of Urology, The Second Hospital of Lanzhou University, Lanzhou, China.
  • Shan-Hui Liu
    Department of Urology, The Second Hospital of Lanzhou University, Lanzhou, China.
  • Li Yang
    Department of Pharmacy, Beijing Tiantan Hospital, Capital Medical University, Beijing, China.