AI-Powered Laryngoscopy: Exploring the Future With Google Gemini.

Journal: The Laryngoscope
Published Date:

Abstract

Foundation models (FMs) are general-purpose artificial intelligence (AI) neural networks trained on massive datasets, including code, text, audio, images, and video, to handle a wide range of tasks, from generating text to analyzing images or composing music. We evaluated Google Gemini 1.5 Pro, which currently has the largest token context window of any multimodal FM and is the best-performing commercial model for video analysis, on its interpretation of laryngoscopy frames and videos sourced from Google Images and YouTube. Gemini recognized the procedure as laryngoscopy in 87/88 frames (98.9%) and 15/15 videos (100%), accurately diagnosed a pathology in 55/88 frames (62.5%) and 3/15 videos (20.0%), identified the lesion side in 58/88 frames (65.9%) and 6/15 videos (40.0%), and narrated two operative video-laryngoscopies without fine-tuning. These findings suggest that Gemini 1.5 Pro has considerable potential for analyzing laryngoscopy and that FMs could serve as clinical decision support tools for complex expert tasks in otolaryngology. LEVEL OF EVIDENCE: 3.
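
The percentages reported above follow directly from the stated counts. As a minimal sketch (not part of the study's methods), the short Python snippet below recomputes each proportion to one decimal place. The dictionary labels and the helper name `percent` are illustrative assumptions; only the counts come from the abstract.

```python
# Counts (successes, total) as stated in the abstract.
# The labels and the helper name are illustrative, not the authors' own.
reported = {
    "procedure recognized (frames)": (87, 88),
    "procedure recognized (videos)": (15, 15),
    "pathology diagnosed (frames)": (55, 88),
    "pathology diagnosed (videos)": (3, 15),
    "lesion side identified (frames)": (58, 88),
    "lesion side identified (videos)": (6, 15),
}


def percent(successes: int, total: int) -> float:
    """Return successes/total as a percentage rounded to one decimal place."""
    return round(100 * successes / total, 1)


for label, (successes, total) in reported.items():
    print(f"{label}: {successes}/{total} = {percent(successes, total):.1f}%")
```

The frame-level and video-level figures are kept as separate entries because they use different denominators (88 frames versus 15 videos) and should not be pooled.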

Authors

  • Sean A Setzen
    Department of Otolaryngology - Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medical College, New York, New York, USA.
  • Katerina Andreadis
    Department of Population Health, NYU Grossman School of Medicine, New York, New York, USA.
  • Olivier Elemento
    Institute for Precision Medicine.
  • Anaïs Rameau
    Department of Otolaryngology - Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medical College, New York, New York, USA.