A Multimodal Approach for Deep-Learning Classification of Vocal Fold Pathologies in Stroboscopy.

Journal: The Laryngoscope

Published Date: Jan 5, 2026

Abstract

OBJECTIVE: To develop and validate a multimodal deep-learning classifier, trained on stroboscopic image, voice, and clinicodemographic data, that differentiates among three vocal fold (VF) states: healthy (HVF), unilateral paralysis (UVFP), and VF lesions, including benign and malignant pathologies.

METHODS: Patients with UVFP (n = 54), VF lesions (n = 42), and HVF (n = 41) were retrospectively identified. Image frames and voice samples were extracted from stroboscopic videos. Clinicodemographic variables were collected from the electronic health record. Patient-level data were independently divided into training (80%) and testing (20%) sets. Visual features were extracted using the DINOv2 vision transformer, and acoustic features were extracted using Librosa. All three feature modalities were combined using a custom multilayer perceptron. Unimodal models using only image or only voice data were trained for comparison. Accuracy and F1 scores were used to evaluate the models.

RESULTS: On a hold-out test set, the multimodal classifier demonstrated stronger performance (76.9% accuracy) than the image classifier (61.5% accuracy) and the audio classifier (65.4% accuracy). On an external dataset, the multimodal classifier's accuracy dropped to 45%, though this was still an improvement over the accuracies of 42% and 31% for the video-only and audio-only modalities, respectively.

CONCLUSIONS: In this proof-of-concept study, we successfully developed a multimodal dataset and classifier for VF pathology, demonstrating the potential of combining stroboscopic frames, voice, and text data. The multimodal classifier achieved higher accuracy than the image-only and audio-only models. Future studies should validate these findings on larger datasets.
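
The evaluation setup described in the abstract (a patient-level 80/20 split, combination of three feature modalities, and accuracy and F1 scoring) can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. The function names, the per-class stratification, the use of plain concatenation before the multilayer perceptron, and macro-averaging of F1 are all assumptions made here for illustration; the abstract does not specify them, and the partition sizes this sketch produces need not match those of the study.

```python
import random
from collections import defaultdict


def patient_level_split(labels_by_patient, test_fraction=0.2, seed=0):
    """Split patient IDs into train/test lists, one class at a time.

    Splitting by patient (not by frame or audio clip) keeps all samples
    from one patient in a single partition. Stratifying by class is an
    assumption of this sketch, not something the abstract states.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for pid, label in sorted(labels_by_patient.items()):
        by_class[label].append(pid)

    train, test = [], []
    for label in sorted(by_class):
        pids = by_class[label]
        rng.shuffle(pids)
        n_test = max(1, round(len(pids) * test_fraction))
        test.extend(pids[:n_test])
        train.extend(pids[n_test:])
    return train, test


def fuse_features(image_vec, audio_vec, clinical_vec):
    """Concatenate per-patient feature vectors into one classifier input.

    Concatenation is a hypothetical fusion step; the abstract only says
    the modalities were combined by a custom multilayer perceptron.
    """
    return list(image_vec) + list(audio_vec) + list(clinical_vec)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro-averaging is assumed here)."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Synthetic cohort with the class sizes reported in the abstract.
    cohort = {}
    for label, n in (("UVFP", 54), ("lesion", 42), ("HVF", 41)):
        for i in range(n):
            cohort[f"{label}-{i:03d}"] = label
    train_ids, test_ids = patient_level_split(cohort)
    print(len(train_ids), len(test_ids))
```

In a fuller pipeline, the vectors passed to `fuse_features` would come from the image encoder, the acoustic feature extractor, and the encoded clinicodemographic variables, and the metrics would be computed on the predictions for the held-out patients only.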

Authors

Sruthi Surapaneni

Michigan State University College of Human Medicine, USA; Glimpse Diagnostics LLC, USA.
Rachel B Kutler

Department of Otolaryngology - Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medical College, New York, New York, USA.
Sean A Setzen

Department of Otolaryngology - Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medical College, New York, New York, USA.
Yeo Eun Kim

Department of Otolaryngology-Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medicine, 240 East 59th St, New York, NY, 10022, USA.
Peter Yao

Weill Cornell Medical College, Weill Cornell Medicine.
Sana H Siddiqui

Department of Otolaryngology-Head and Neck Surgery, Weill Cornell Medical College, Sean Parker Institute for the Voice, New York, New York, USA.
Michael J Pitman

The Center for Voice and Swallowing, Department of Otolaryngology-Head and Neck Surgery, Columbia University Irving Medical Center, New York, New York, USA.
Lucian Sulica

Department of Otolaryngology-Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medicine, 240 East 59th St, New York, NY, 10022, USA.
Olivier Elemento

Institute for Precision Medicine.
Pegah Khosravi

Institute for Computational Biomedicine, Weill Cornell Medical College, NY, USA; Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, USA.
Anaïs Rameau

Department of Otolaryngology - Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medical College, New York, New York, U.S.A.

External Resources

PubMed ID (PMID): 41489089
