Plant Disease Detection through Multimodal Large Language Models and Convolutional Neural Networks
Journal:
arXiv
Published Date:
Apr 29, 2025
Abstract
Automation in agriculture plays a vital role in addressing challenges related
to crop monitoring and disease management, particularly through early detection
systems. This study investigates the effectiveness of combining multimodal
Large Language Models (LLMs), specifically GPT-4o, with Convolutional Neural
Networks (CNNs) for automated plant disease classification using leaf imagery.
Leveraging the PlantVillage dataset, we systematically evaluate model
performance across zero-shot, few-shot, and progressive fine-tuning scenarios.
A comparative analysis between GPT-4o and the widely used ResNet-50 model was
conducted across three resolutions (100, 150, and 256 pixels) and two plant
species (apple and corn). Results indicate that fine-tuned GPT-4o models
slightly outperformed ResNet-50, reaching up to 98.12% classification accuracy
on apple leaf images versus 96.88% for ResNet-50, with improved generalization
and near-zero training loss. However, zero-shot performance of GPT-4o was
significantly lower, underscoring the need for at least minimal training.
Additional evaluations on
cross-resolution and cross-plant generalization revealed the models'
adaptability and limitations when applied to new domains. The findings
highlight the promise of integrating multimodal LLMs into automated disease
detection pipelines, enhancing the scalability and intelligence of precision
agriculture systems while reducing the dependence on large, labeled datasets
and high-resolution sensor infrastructure.
Keywords:
Large Language Models, Vision Language Models, LLMs and CNNs, Disease Detection
with Vision Language Models, VLMs