A Multimodal Approach for Deep-Learning Classification of Vocal Fold Pathologies in Stroboscopy.

Journal: The Laryngoscope
Published Date:

Abstract

OBJECTIVE: To develop and validate a multimodal deep-learning classifier, trained on stroboscopic image, voice, and clinicodemographic data, that differentiates between three vocal fold (VF) states: healthy (HVF), unilateral paralysis (UVFP), and VF lesions, including benign and malignant pathologies.

METHODS: Patients with UVFP (n = 54), VF lesions (n = 42), and HVF (n = 41) were retrospectively identified. Image frames and voice samples were extracted from stroboscopic videos. Clinicodemographic variables were collected from the electronic health record. Patient-level data were independently divided into training (80%) and testing (20%) sets. Visual features were extracted using the DINOv2 vision transformer, and acoustic features were extracted using Librosa. All three feature modalities were combined using a custom multilayer perceptron. Unimodal models using only image or only voice data were trained for comparison. Accuracy and F1 scores were used to evaluate the models.

RESULTS: On a hold-out test set, the multimodal classifier performed better (76.9% accuracy) than the image classifier (61.5% accuracy) and the audio classifier (65.4% accuracy). On an external dataset, the accuracy of the multimodal classifier dropped to 45%, though this remained higher than the 42% and 31% achieved by the video-only and audio-only models, respectively.

CONCLUSIONS: In this proof-of-concept study, we developed a multimodal dataset and classifier for VF pathology, demonstrating the potential of combining stroboscopic frames, voice, and text data. The multimodal classifier achieved higher accuracy than the image-only and audio-only models. Future work should validate these findings on larger datasets.
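The patient-level 80/20 split and the combination of the three feature modalities described in the methods can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `split_patients` and `fuse_features`, the placeholder patient IDs, the three frames per patient, and the use of plain concatenation as the input to the multilayer perceptron are all assumptions made for illustration. Only the cohort size (54 + 42 + 41 = 137 patients) and the 80/20 ratio come from the abstract.

```python
import random


def split_patients(patient_ids, test_fraction=0.2, seed=0):
    """Split unique patient IDs into train and test sets.

    Splitting by patient (not by frame) keeps every frame and voice
    sample from one patient on the same side of the split.
    """
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_test = max(1, round(len(unique_ids) * test_fraction))
    return set(unique_ids[n_test:]), set(unique_ids[:n_test])


def fuse_features(image_vec, audio_vec, clinical_vec):
    """Concatenate per-modality features into one input vector.

    Assumption: simple concatenation before the classifier; the
    abstract does not state how the modalities were combined.
    """
    return list(image_vec) + list(audio_vec) + list(clinical_vec)


# Hypothetical samples: 137 patients, 3 extracted frames each.
samples = [(f"P{p:03d}", frame) for p in range(137) for frame in range(3)]

train_ids, test_ids = split_patients([pid for pid, _ in samples])
train = [s for s in samples if s[0] in train_ids]
test = [s for s in samples if s[0] in test_ids]

# Toy feature vectors standing in for DINOv2, Librosa, and EHR features.
fused = fuse_features([0.1, 0.2], [0.3], [1.0, 0.0])
```

Splitting on patient IDs rather than on individual frames avoids leakage, since frames from one stroboscopic video are highly correlated and would otherwise appear in both the training and the test set.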

Authors

  • Sruthi Surapaneni
    Michigan State University College of Human Medicine, USA; Glimpse Diagnostics LLC, USA.
  • Rachel B Kutler
    Department of Otolaryngology - Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medical College, New York, New York, USA.
  • Sean A Setzen
    Department of Otolaryngology - Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medical College, New York, New York, USA.
  • Yeo Eun Kim
    Department of Otolaryngology-Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medicine, 240 East 59th St, New York, NY, 10022, USA.
  • Peter Yao
    Weill Cornell Medical College, Weill Cornell Medicine.
  • Sana H Siddiqui
    Department of Otolaryngology-Head and Neck Surgery, Weill Cornell Medical College, Sean Parker Institute for the Voice, New York, New York, USA.
  • Michael J Pitman
    The Center for Voice and Swallowing, Department of Otolaryngology-Head and Neck Surgery, Columbia University Irving Medical Center, New York, New York, USA.
  • Lucian Sulica
    Department of Otolaryngology-Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medicine, 240 East 59th St, New York, NY, 10022, USA.
  • Olivier Elemento
    Institute for Precision Medicine.
  • Pegah Khosravi
    Institute for Computational Biomedicine, Weill Cornell Medical College, NY, USA; Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, USA.
  • Anaïs Rameau
    Department of Otolaryngology - Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medical College, New York, New York, USA.
