Enhancing Food-Domain Question Answering with a Multimodal Knowledge Graph: Hybrid QA Generation and Diversity Analysis
Journal:
arXiv
Published Date:
Jul 9, 2025
Abstract
We propose a unified food-domain QA framework that combines a large-scale
multimodal knowledge graph (MMKG) with generative AI. Our MMKG links 13,000
recipes, 3,000 ingredients, 140,000 relations, and 14,000 images. We generate
40,000 QA pairs using 40 templates and LLaVA/DeepSeek augmentation. Joint
fine-tuning of Meta LLaMA 3.1-8B and Stable Diffusion 3.5-Large improves
BERTScore by 16.2\%, reduces FID by 37.8\%, and boosts CLIP alignment by
31.1\%. Diagnostic analyses-CLIP-based mismatch detection (35.2\% to 7.3\%) and
LLaVA-driven hallucination checks-ensure factual and visual fidelity. A hybrid
retrieval-generation strategy achieves 94.1\% accurate image reuse and 85\%
adequacy in synthesis. Our results demonstrate that structured knowledge and
multimodal generation together enhance reliability and diversity in food QA.