EmoSEM: Segment and Explain Emotion Stimuli in Visual Art
Journal: arXiv
Published Date: Apr 20, 2025
Abstract
This paper focuses on a key challenge in visual art understanding: given an
art image, a model must pinpoint the pixel regions that trigger a specific human
emotion and generate linguistic explanations for the emotional arousal.
Despite recent advances in art understanding, pixel-level emotion understanding
still faces a dual challenge: first, the subjectivity of emotion makes it
difficult for general segmentation models like SAM to adapt to emotion-oriented
segmentation tasks; and second, the abstract nature of art expression makes it
difficult for captioning models to balance pixel-level semantic understanding
and emotion reasoning. To address these problems, this paper proposes the
Emotion stimuli Segmentation and Explanation Model (EmoSEM) to endow the
segmentation model SAM with emotion comprehension capability. First, to enable
the model to segment effectively under the guidance of emotional intent,
we introduce an emotional prompt with a learnable mask token as the conditional
input for segmentation decoding. Then, we design an emotion projector to
establish the association between emotion and visual features. Next, and more
importantly, to address emotion-visual stimuli alignment, we develop a
lightweight prefix projector, a module that fuses the learned emotional mask
with the corresponding emotion into a unified representation compatible with
the language model. Finally, we input the joint visual, mask, and emotional
tokens into the language model, which outputs the emotional explanations. This
ensures that the generated interpretations remain semantically and emotionally
coherent with the visual stimuli. The method realizes end-to-end
modeling from low-level pixel features to high-level emotion interpretation,
providing the first interpretable fine-grained analysis framework for artistic
emotion computing. Extensive experiments validate the effectiveness of our
model.
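
As an illustration of the token flow described above, the following is a minimal sketch of how a prefix projector might fuse a mask embedding with an emotion embedding and hand the result to a language model alongside the visual tokens. It is not the EmoSEM implementation: the abstract does not specify the fusion operator, the dimensions, or the token order, so the concatenate-then-project fusion, the sizes, the random weights standing in for learned parameters, and every function name below are assumptions made for illustration only.

```python
import random


def linear(in_dim, out_dim, rng):
    """Random (out_dim x in_dim) matrix standing in for a learned linear layer."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply(weights, vec):
    """Matrix-vector product: project vec with the given weights."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def prefix_projector(mask_emb, emotion_emb, weights, n_prefix, lm_dim):
    """Fuse mask and emotion embeddings into n_prefix tokens of width lm_dim.

    Assumption: fusion is concatenation followed by one linear projection.
    """
    fused = apply(weights, mask_emb + emotion_emb)  # list concat, then project
    return [fused[i * lm_dim:(i + 1) * lm_dim] for i in range(n_prefix)]


def build_lm_input(visual_tokens, mask_emb, emotion_emb, weights, n_prefix, lm_dim):
    """Join visual tokens with the fused mask/emotion prefix tokens.

    Assumption: visual tokens come first, prefix tokens second.
    """
    prefix = prefix_projector(mask_emb, emotion_emb, weights, n_prefix, lm_dim)
    return visual_tokens + prefix


# Hypothetical toy sizes, chosen only to keep the example small.
FEAT_DIM, LM_DIM, N_PREFIX, N_VISUAL = 4, 6, 2, 3

rng = random.Random(0)
weights = linear(2 * FEAT_DIM, N_PREFIX * LM_DIM, rng)
mask_emb = [rng.random() for _ in range(FEAT_DIM)]      # learned mask token output
emotion_emb = [rng.random() for _ in range(FEAT_DIM)]   # emotion projector output
visual_tokens = [[rng.random() for _ in range(LM_DIM)] for _ in range(N_VISUAL)]

lm_input = build_lm_input(visual_tokens, mask_emb, emotion_emb,
                          weights, N_PREFIX, LM_DIM)
print(len(lm_input), len(lm_input[0]))  # 5 tokens, each of width 6
```

In this sketch the language model would receive a single sequence in which the mask and emotion information already share the width of the visual tokens, which is one simple way to make the fused representation compatible with the language model as the abstract describes.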