Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models
Journal: arXiv
Published Date: Dec 19, 2024
Abstract
Vision-language models (VLMs) have shown impressive abilities across a range
of multi-modal tasks. However, existing metrics for evaluating the quality of
text generated by VLMs typically focus on an overall evaluation for a specific
task, such as image captioning. While the overall evaluation is essential for
any task, the criteria prioritized can differ depending on the task, making it
challenging for current metrics to adapt to multi-task scenarios. To address
this limitation, we propose HarmonicEval, a reference-free comprehensive
evaluation metric that aggregates criterion-wise scores to produce the overall
score in a bottom-up manner. Furthermore, we construct the Multi-task
Multi-criteria Human Evaluation (MMHE) dataset, which comprises 18,000 expert
human judgments across four multi-modal tasks. Our experiments demonstrate that
HarmonicEval achieves higher correlations with human judgments than
conventional metrics while providing numerical scores for each criterion.
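
The bottom-up idea described above, in which criterion-wise scores are combined into one overall score, can be illustrated with a minimal sketch. The abstract does not specify the aggregation rule, so the weighted mean used here is an assumption and not necessarily HarmonicEval's actual scheme. The function name `aggregate_scores`, the criterion names, and the weights are all hypothetical.

```python
from typing import Dict, Optional


def aggregate_scores(
    criterion_scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Combine per-criterion scores into one overall score.

    Assumption: a weighted arithmetic mean stands in for the
    aggregation step; the source does not state the actual rule.
    """
    if not criterion_scores:
        raise ValueError("at least one criterion score is required")
    if weights is None:
        # Default to uniform weights when none are supplied.
        weights = {name: 1.0 for name in criterion_scores}
    total_weight = sum(weights[name] for name in criterion_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(
        weights[name] * score for name, score in criterion_scores.items()
    )
    return weighted_sum / total_weight


# Hypothetical criterion names and scores for a single generated text.
scores = {"correctness": 4.0, "completeness": 3.0, "fluency": 5.0}
overall_uniform = aggregate_scores(scores)
overall_weighted = aggregate_scores(
    scores, {"correctness": 1.0, "completeness": 2.0, "fluency": 1.0}
)
```

Because the per-criterion scores remain available alongside the aggregate, a caller can report both, which matches the abstract's point that the metric provides numerical scores for each criterion in addition to the overall score.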