Evaluating Few-Shot Prompting for Spectrogram-Based Lung Sound Classification Using a Multimodal Language Model
Journal: medRxiv
Published Date: Jan 1, 2025
Abstract
Traditional deep learning models for lung sound analysis require large labeled datasets, whereas multimodal large language models (LLMs) may offer a flexible, prompt-based alternative. This study evaluated the utility of a general-purpose multimodal LLM, GPT-4o, for classifying lung sounds from mel-spectrograms and assessed whether few-shot prompting improves performance over zero-shot prompting. Using the ICBHI 2017 Respiratory Sound Database, 6,898 annotated respiratory cycles were converted into mel-spectrograms. GPT-4o was prompted to classify each spectrogram in both zero-shot and few-shot settings; few-shot prompts included labeled examples, whereas zero-shot prompts did not. Model outputs were evaluated against ground-truth labels using accuracy, precision, recall, and F1-score. Few-shot prompting improved overall accuracy (0.363 vs. 0.320) and yielded modest gains in precision (0.316 vs. 0.283), recall (0.300 vs. 0.287), and F1-score (0.308 vs. 0.285) across labels. McNemar’s test indicated a statistically significant difference in performance between the two prompting strategies (p < 0.001). Repeatability analysis demonstrated high agreement (κ = 0.76–0.88; agreement: 89–96%), indicating highly consistent model outputs. Overall, GPT-4o showed limited but statistically significant performance gains with few-shot prompting for lung sound classification. Although not yet suitable for clinical use, this prompt-based approach is a potentially scalable strategy for medical audio analysis that requires no task-specific training.
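The paired comparison and repeatability statistics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The label names, the toy predictions, and the function names (`accuracy`, `mcnemar_exact`, `cohen_kappa`) are assumptions made for illustration, and the abstract does not state which variant of McNemar's test was used; the exact binomial form is assumed here.

```python
from collections import Counter
from math import comb


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    Only discordant pairs matter: b counts items where A is right and B is
    wrong, c counts the reverse. Under the null, b ~ Binomial(b + c, 0.5).
    """
    b = sum(a == t and p != t for t, a, p in zip(y_true, pred_a, pred_b))
    c = sum(a != t and p == t for t, a, p in zip(y_true, pred_a, pred_b))
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a difference
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


def cohen_kappa(run_1, run_2):
    """Cohen's kappa between two repeated runs over the same items."""
    n = len(run_1)
    observed = sum(x == y for x, y in zip(run_1, run_2)) / n
    counts_1, counts_2 = Counter(run_1), Counter(run_2)
    # Chance agreement from the marginal label frequencies of each run.
    expected = sum(counts_1[k] * counts_2[k] for k in counts_1) / n ** 2
    if expected == 1.0:
        return 1.0  # both runs use a single identical label
    return (observed - expected) / (1 - expected)


# Toy data (hypothetical, not taken from the study).
truth = ["normal", "crackle", "wheeze", "both",
         "normal", "crackle", "wheeze", "normal"]
zero_shot = ["normal", "normal", "normal", "normal",
             "normal", "normal", "wheeze", "crackle"]
few_shot = ["normal", "crackle", "normal", "both",
            "normal", "crackle", "wheeze", "normal"]
few_shot_repeat = ["normal", "crackle", "wheeze", "both",
                   "normal", "crackle", "wheeze", "normal"]

print("zero-shot accuracy:", accuracy(truth, zero_shot))
print("few-shot accuracy:", accuracy(truth, few_shot))
print("McNemar (b, c, p):", mcnemar_exact(truth, zero_shot, few_shot))
print("repeatability kappa:", cohen_kappa(few_shot, few_shot_repeat))
```

McNemar's test suits this design because both prompting strategies are scored on the same spectrograms, so only the items on which they disagree carry information about a difference. Cohen's kappa complements raw percent agreement by discounting the agreement expected by chance from each run's label frequencies.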