Image Aesthetic Reasoning: A New Benchmark for Medical Image Screening with MLLMs
Journal:
arXiv
Published Date:
May 29, 2025
Abstract
Multimodal Large Language Models (MLLMs) are widely applied across many
domains, such as multimodal understanding and generation. With the development
of diffusion models (DM) and unified MLLMs, the performance of image generation
has improved significantly. However, image screening is rarely studied,
and its performance with MLLMs is unsatisfactory due to the lack of data and
the weak image aesthetic reasoning ability of MLLMs. In this work, we propose a
complete solution to address these problems in terms of data and methodology.
For data, we collect a comprehensive medical image screening dataset with 1500+
samples, each consisting of a medical image, four generated images, and a
multiple-choice answer. The dataset evaluates aesthetic reasoning ability
along four aspects: (1) Appearance Deformation, (2) Principles of
Physical Lighting and Shadow, (3) Placement Layout, and (4) Extension Rationality.
For methodology, we utilize long chains of thought (CoT) and Group Relative
Policy Optimization with Dynamic Proportional Accuracy reward, called DPA-GRPO,
to enhance the image aesthetic reasoning ability of MLLMs. Our experimental
results reveal that even state-of-the-art closed-source MLLMs, such as GPT-4o
and Qwen-VL-Max, exhibit performance akin to random guessing in image aesthetic
reasoning. In contrast, by leveraging the reinforcement learning approach, we
are able to surpass the score of both large-scale models and leading
closed-source models using a much smaller model. We hope our work on medical
image screening will become a standard component of image aesthetic
reasoning in the future.
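
The abstract names DPA-GRPO but does not define the reward, so the following is only a minimal sketch of the general idea under stated assumptions. The group-relative advantage (each sampled response's reward normalized by the mean and standard deviation of its group) is the standard GRPO formulation. The `proportional_accuracy` function is a hypothetical stand-in, not the paper's Dynamic Proportional Accuracy reward: it assumes four options labeled A to D and gives partial credit for each option whose selected or unselected status matches the reference answer.

```python
from statistics import mean, pstdev


def proportional_accuracy(predicted, gold, options="ABCD"):
    """Hypothetical partial-credit reward (an assumption, not the paper's DPA).

    Returns the fraction of options whose selected/unselected status
    in `predicted` agrees with the reference set `gold`.
    """
    agree = sum((o in predicted) == (o in gold) for o in options)
    return agree / len(options)


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantage: normalize rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled answers (illustrative data only).
gold = {"A", "C"}
group = [{"A", "C"}, {"A"}, {"B", "D"}, {"A", "B", "C"}]

rewards = [proportional_accuracy(p, gold) for p in group]
advantages = group_relative_advantages(rewards)
```

Here the fully correct answer receives the largest advantage and the fully wrong one the smallest, while the two partially correct answers fall in between; when every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.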