Exploring Cognitive and Aesthetic Causality for Multimodal Aspect-Based Sentiment Analysis
Journal: arXiv
Published Date: Apr 22, 2025
Abstract
Multimodal aspect-based sentiment classification (MASC) aims to predict the
sentiment polarity toward specific aspect targets (i.e., entities or attributes
explicitly mentioned in text-image pairs). The task has gained attention with
the growth of user-generated multimodal content on social platforms. Despite
extensive efforts and significant progress on MASC, substantial gaps remain in
understanding fine-grained visual content and the cognitive rationales derived
from semantic content and impressions (cognitive interpretations of emotions
evoked by image content). In this study, we present
Chimera, a cognitive and aesthetic sentiment causality understanding framework
that derives fine-grained holistic features of aspects and infers the
fundamental drivers of sentiment expression from both semantic perspectives and
affective-cognitive resonance (the synergistic effect between emotional
responses and cognitive interpretations). Specifically, this framework first
incorporates visual patch features for patch-word alignment. Meanwhile, it
extracts coarse-grained visual features (e.g., overall image representation)
and fine-grained visual regions (e.g., aspect-related regions) and translates
them into corresponding textual descriptions (e.g., facial and aesthetic
descriptions).
Finally, we leverage the sentimental causes and impressions generated by a
large language model (LLM) to enhance the model's awareness of sentimental cues
evoked by semantic content and affective-cognitive resonance. Experimental
results on standard MASC datasets demonstrate the effectiveness of the proposed
model, which also adapts more flexibly to MASC than LLMs such as
GPT-4o. We have publicly released the complete implementation and dataset at
https://github.com/Xillv/Chimera
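
The abstract does not specify how patch-word alignment is computed. As a rough illustration only, the sketch below assumes a common approach: cosine similarity between each word embedding and each image-patch embedding, followed by a softmax over patches. The function names, the toy 2-dimensional features, and the temperature value are all hypothetical and are not taken from the Chimera implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def align_patches_to_words(word_feats, patch_feats, temperature=0.1):
    """For each word, return (soft weights over patches, index of the most similar patch).

    Assumption: word and patch features already live in a shared embedding space.
    """
    alignments = []
    for w in word_feats:
        sims = [cosine(w, p) for p in patch_feats]
        weights = softmax(sims, temperature)
        best = max(range(len(sims)), key=sims.__getitem__)
        alignments.append((weights, best))
    return alignments


# Toy example: two word embeddings and three patch embeddings (hypothetical values).
words = [[1.0, 0.0], [0.0, 1.0]]
patches = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
result = align_patches_to_words(words, patches)
```

In this toy setup each word is matched to the patch pointing in the most similar direction, and the soft weights could be used to pool patch features into a per-word visual context. The released repository should be consulted for the alignment actually used in the paper.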