Dissecting CLIP: Decomposition with a Schur Complement-based Approach
Journal: arXiv
Published Date: Dec 24, 2024
Abstract
The use of CLIP embeddings to assess the alignment of samples produced by
text-to-image generative models has been extensively explored in the
literature. While the widely adopted CLIPScore, derived from the cosine
similarity of text and image embeddings, effectively measures the relevance of
a generated image, it does not quantify the diversity of images generated by a
text-to-image model. In this work, we extend the application of CLIP embeddings
to quantify and interpret the intrinsic diversity of text-to-image models,
that is, the diversity that persists among images generated from similar text prompts.
To achieve this, we propose a decomposition of the CLIP-based kernel covariance
matrix of image data into text-based and non-text-based components. Using the
Schur complement of the joint image-text kernel covariance matrix, we perform
this decomposition and define the matrix-based entropy of the decomposed
component as the Schur Complement Entropy (SCE) score, a measure of
the intrinsic diversity of a text-to-image model based on data collected with
varying text prompts. Additionally, we demonstrate the use of the Schur
complement-based decomposition to nullify the influence of a given prompt in
the CLIP embedding of an image, enabling focus or defocus of embeddings on
specific objects or properties for downstream tasks. We present several
numerical results that apply our Schur complement-based approach to evaluate
text-to-image models and modify CLIP image embeddings. The codebase is
available at https://github.com/aziksh-ospanov/CLIP-DISSECTION
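
The decomposition described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a linear (cosine) kernel on L2-normalized embeddings so that the kernel covariance reduces to an ordinary second-moment matrix, adds a small ridge term for numerical stability, and normalizes the Schur complement by its trace before computing a von Neumann-style entropy. The function names (`l2_normalize`, `schur_decomposition`, `matrix_entropy`) and the `ridge` parameter are hypothetical and chosen only for illustration; the paper's kernel choice and normalization may differ.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean norm (as used for cosine similarity)."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def schur_decomposition(img_emb, txt_emb, ridge=1e-6):
    """Split the image covariance into a text-explained part and a residual.

    img_emb: (n, d_i) image embeddings; txt_emb: (n, d_t) paired text embeddings.
    Returns (explained, residual) with explained + residual == C_II.
    Assumption: linear kernel, so covariances are plain second-moment matrices.
    """
    img = l2_normalize(np.asarray(img_emb, dtype=float))
    txt = l2_normalize(np.asarray(txt_emb, dtype=float))
    n = img.shape[0]

    c_ii = img.T @ img / n   # image-image block
    c_tt = txt.T @ txt / n   # text-text block
    c_it = img.T @ txt / n   # image-text cross block

    # Text-explained component: C_IT (C_TT + ridge * I)^{-1} C_TI
    reg = c_tt + ridge * np.eye(c_tt.shape[0])
    explained = c_it @ np.linalg.solve(reg, c_it.T)

    # Schur complement of the (regularized) text block in the joint covariance
    residual = c_ii - explained
    return explained, residual


def matrix_entropy(mat, eps=1e-12):
    """Von Neumann-style entropy of a PSD matrix after trace normalization."""
    eig = np.linalg.eigvalsh((mat + mat.T) / 2)  # symmetrize before eigensolve
    eig = np.clip(eig, 0.0, None)                # drop tiny negative round-off
    total = eig.sum()
    if total <= eps:
        return 0.0
    p = eig / total
    p = p[p > eps]
    return float(-(p * np.log(p)).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    txt = rng.normal(size=(200, 8))    # stand-in for CLIP text embeddings
    img = rng.normal(size=(200, 16))   # stand-in for CLIP image embeddings
    explained, residual = schur_decomposition(img, txt)
    print("entropy of Schur complement:", matrix_entropy(residual))
```

In this sketch, a larger entropy of the residual indicates more image variation that the paired text embeddings do not account for linearly. With real data, the random arrays would be replaced by CLIP image and text embeddings of generated images and their prompts.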