Don't Judge Before You CLIP: A Unified Approach for Perceptual Tasks
Journal: arXiv
Published Date: Mar 17, 2025
Abstract
Visual perceptual tasks aim to predict human judgment of images (e.g.,
emotions evoked by images, image quality assessment). Unlike objective tasks
such as object/scene recognition, perceptual tasks rely on subjective human
assessments, making their data labeling difficult. The scarcity of such
human-annotated data results in small datasets, leading to poor generalization.
Typically, specialized models were designed for each perceptual task, tailored
to its unique characteristics and its own training dataset. We propose a
unified architectural framework for solving multiple different perceptual tasks
leveraging CLIP as a prior. Our approach is based on recent cognitive findings
which indicate that CLIP correlates well with human judgment. While CLIP was
explicitly trained to align images and text, it implicitly also learned human
inclinations. We attribute this to the inclusion of human-written image
captions in CLIP's training data, which contain not only factual image
descriptions, but inevitably also human sentiments and emotions. This makes
CLIP a particularly strong prior for perceptual tasks. Accordingly, we suggest
that minimal adaptation of CLIP suffices for solving a variety of perceptual
tasks. Our simple unified framework employs a lightweight adaptation to
fine-tune CLIP for each task, without requiring any task-specific architectural
changes. We evaluate our approach on three tasks: (i) Image Memorability
Prediction, (ii) No-reference Image Quality Assessment, and (iii) Visual
Emotion Analysis. Our model achieves state-of-the-art results on all three
tasks, while demonstrating improved generalization across different datasets.
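
The abstract does not say which lightweight adaptation is used, so the following is only an illustrative sketch of one common option, a low-rank update added to a frozen weight matrix. The low-rank form, the class name `LowRankAdaptedLinear`, the helper `make_demo_layer`, and all dimensions are assumptions made for illustration; the random matrix merely stands in for a pretrained CLIP weight and nothing here is taken from the paper's implementation.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


class LowRankAdaptedLinear:
    """Frozen linear map plus a trainable low-rank update: y = W x + B (A x).

    Hypothetical illustration only: W stands in for a pretrained weight
    that stays fixed, while the small matrices A and B would be trained.
    """

    def __init__(self, frozen_weight, rank, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(frozen_weight), len(frozen_weight[0])
        self.W = frozen_weight  # frozen: never updated
        # A starts with small random values, B starts at zero, so the
        # adapted layer initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + d for b, d in zip(base, delta)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_params(self):
        return len(self.W) * len(self.W[0])


def make_demo_layer(d_in=64, d_out=64, rank=4, seed=0):
    """Build a demo layer whose random weight stands in for a pretrained one."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    return LowRankAdaptedLinear(weight, rank, seed=seed + 1)


if __name__ == "__main__":
    layer = make_demo_layer()
    print("trainable:", layer.trainable_params(), "frozen:", layer.frozen_params())
```

Under these assumptions, only the two small matrices would be updated for each task, which is one way an adaptation can stay lightweight when labeled data is scarce: the pretrained weight is left untouched and the number of task-specific parameters is a small fraction of the frozen ones.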