Advancing Food Nutrition Estimation via Visual-Ingredient Feature Fusion
Journal:
arXiv
Published Date:
May 13, 2025
Abstract
Nutrition estimation is an important component of promoting healthy eating
and mitigating diet-related health risks. Despite advances in tasks such as
food classification and ingredient recognition, progress in nutrition
estimation is limited due to the lack of datasets with nutritional annotations.
To address this issue, we introduce FastFood, a dataset with 84,446 images
across 908 fast food categories, featuring ingredient and nutritional
annotations. In addition, we propose a new model-agnostic Visual-Ingredient
Feature Fusion (VIF$^2$) method to enhance nutrition estimation by integrating
visual and ingredient features. Ingredient robustness is improved through
synonym replacement and resampling strategies during training. The
ingredient-aware visual feature fusion module combines ingredient features and
visual representation to achieve accurate nutritional prediction. During
testing, ingredient predictions are refined using large multimodal models by
data augmentation and majority voting. Our experiments on both FastFood and
Nutrition5k datasets validate the effectiveness of our proposed method built in
different backbones (e.g., Resnet, InceptionV3 and ViT), which demonstrates the
importance of ingredient information in nutrition estimation.
https://huiyanqi.github.io/fastfood-nutrition-estimation/.