Reasoning-Driven Food Energy Estimation via Multimodal Large Language Models.
Journal: Nutrients
PMID: 40218886
Abstract
Image-based food energy estimation is essential for user-friendly food tracking applications, enabling individuals to monitor their dietary intake through smartphones or AR devices. However, existing deep learning approaches struggle to recognize a wide variety of food items due to the labor-intensive nature of data annotation. Multimodal Large Language Models (MLLMs) possess extensive knowledge and human-like reasoning abilities, making them a promising approach for image-based food energy estimation. Nevertheless, their ability to accurately estimate food energy is hindered by limitations in recognizing food size, a critical factor in energy content assessment. To address this challenge, we propose two approaches: fine-tuning and volume-aware reasoning with fine-grained estimation prompting. Experimental results on the Nutrition5k dataset demonstrated the effectiveness of these approaches in improving estimation accuracy. We also validated the effectiveness of applying LoRA (Low-Rank Adaptation) to enhance food energy estimation performance. These findings highlight the potential of MLLMs for image-based dietary assessment and emphasize the importance of integrating volume awareness into food energy estimation models.
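The following is a minimal sketch, not the authors' released code, of how the two ideas in the abstract could be combined: attaching LoRA adapters to an open MLLM and querying it with a volume-aware, fine-grained estimation prompt. The model checkpoint, target module names, LoRA hyperparameters, and prompt wording are all illustrative assumptions.

```python
# Hedged sketch: LoRA fine-tuning setup for an MLLM plus a volume-aware
# estimation prompt. Checkpoint, hyperparameters, and prompt text are
# assumptions, not values reported in the paper.
from transformers import AutoProcessor, LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

MODEL_NAME = "llava-hf/llava-1.5-7b-hf"  # assumption: any open MLLM checkpoint

processor = AutoProcessor.from_pretrained(MODEL_NAME)
model = LlavaForConditionalGeneration.from_pretrained(MODEL_NAME)

# (1) LoRA: train only low-rank update matrices while the base MLLM
# weights stay frozen, which keeps annotation-scarce fine-tuning cheap.
lora_config = LoraConfig(
    r=16,                                  # rank of the update matrices (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed names)
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# (2) Volume-aware reasoning with fine-grained estimation prompting:
# ask the model to reason about each item's portion size (volume/weight)
# before estimating per-item energy and summing to a dish total.
VOLUME_AWARE_PROMPT = (
    "For each food item visible in the image, first estimate its portion "
    "size (volume or weight), then estimate its energy in kcal, and finally "
    "report the total energy of the dish in kcal."
)
```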