Investigating Fine- and Coarse-grained Structural Correspondences Between Deep Neural Networks and Human Object Image Similarity Judgments Using Unsupervised Alignment
Journal:
arXiv
Published Date:
May 22, 2025
Abstract
The learning mechanisms by which humans acquire internal representations of
objects are not fully understood. Deep neural networks (DNNs) have emerged as a
useful tool for investigating this question because they acquire internal
representations similar to those of humans as a byproduct of optimizing their
objective functions. While previous studies have shown that models trained with
various learning paradigms (such as supervised, self-supervised, and CLIP)
acquire human-like representations, it remains unclear whether their similarity
to human representations is primarily at a coarse category level or extends to
finer details. Here, we employ an unsupervised alignment method based on
Gromov-Wasserstein Optimal Transport to compare human and model object
representations at both fine-grained and coarse-grained levels. Unlike
conventional representational similarity analysis, this method estimates an
optimal fine-grained mapping between the representations of individual objects
in humans and in models. We use this unsupervised alignment method to assess
the extent to which the representation of each object in humans is correctly
mapped to the representation of the same object in models. Using human
similarity judgments
of 1,854 objects from the THINGS dataset, we find that models trained with CLIP
consistently achieve strong fine- and coarse-grained matching with human object
representations. In contrast, self-supervised models show limited matching at
both fine- and coarse-grained levels, but still form object clusters that
reflect human coarse category structure. Our results offer new insights into
the role of linguistic information in acquiring precise object representations
and the potential of self-supervised learning to capture coarse categorical
structures.
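
To make the alignment idea concrete, the following is a minimal sketch of entropic Gromov-Wasserstein alignment between two dissimilarity matrices, followed by the two kinds of matching rate the abstract distinguishes. It is an illustration under stated assumptions, not the authors' implementation: the solver (entropic regularization with log-domain Sinkhorn), the function names, the hyperparameters `eps` and `n_outer`, and the toy data are all hypothetical choices made here. The abstract does not specify how the optimization or the evaluation is carried out.

```python
import numpy as np

def _logsumexp(a, axis):
    # Numerically stable log-sum-exp along one axis.
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True)), axis=axis)

def sinkhorn_log(cost, p, q, eps, n_iter=200):
    """Entropic optimal transport plan with marginals p and q (log-domain Sinkhorn)."""
    f, g = np.zeros_like(p), np.zeros_like(q)
    log_p, log_q = np.log(p), np.log(q)
    for _ in range(n_iter):
        f = eps * (log_p - _logsumexp((g[None, :] - cost) / eps, axis=1))
        g = eps * (log_q - _logsumexp((f[:, None] - cost) / eps, axis=0))
    return np.exp((f[:, None] + g[None, :] - cost) / eps)

def entropic_gw(D1, D2, eps=0.05, n_outer=30):
    """Approximate GW plan between dissimilarity matrices D1 (n x n) and D2 (m x m).

    Uses only the within-domain dissimilarities; no item labels are given,
    which is what makes the alignment unsupervised.
    """
    D1, D2 = np.asarray(D1, dtype=float), np.asarray(D2, dtype=float)
    n, m = len(D1), len(D2)
    p, q = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    T = np.outer(p, q)  # uniform initial plan (an assumption of this sketch)
    const = (D1**2 @ p)[:, None] + (D2**2 @ q)[None, :]
    for _ in range(n_outer):
        # Linearized squared-loss GW cost around the current plan (up to a constant factor).
        cost = const - 2.0 * D1 @ T @ D2.T
        T = sinkhorn_log(cost, p, q, eps)
    return T

def top1_matching_rate(T):
    """Fine-grained: fraction of items whose highest-mass match is the same item."""
    return float(np.mean(T.argmax(axis=1) == np.arange(T.shape[0])))

def category_matching_rate(T, labels):
    """Coarse-grained: fraction of items whose highest-mass match shares their category."""
    labels = np.asarray(labels)
    return float(np.mean(labels[T.argmax(axis=1)] == labels))

def pairwise(Z):
    # Euclidean dissimilarity matrix, scaled to a maximum of 1.
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    return D / D.max()

# Toy stand-ins (hypothetical): 4 categories x 5 items, with the "model"
# embedding a noisy copy of the "human" embedding in the same item order.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 5)
centers = 3.0 * rng.normal(size=(4, 5))
X_human = centers[labels] + 0.3 * rng.normal(size=(20, 5))
X_model = X_human + 0.05 * rng.normal(size=X_human.shape)

T = entropic_gw(pairwise(X_human), pairwise(X_model))
print("fine-grained (top-1) matching rate:", top1_matching_rate(T))
print("coarse-grained (category) matching rate:", category_matching_rate(T, labels))
```

Because the Gromov-Wasserstein objective is non-convex, a single run from a uniform initial plan can end in a poor local optimum; in practice one would typically try several values of `eps` and several initializations and keep the plan with the lowest objective. A plan can score high on the category rate while scoring low on the top-1 rate, which is the fine-grained versus coarse-grained distinction the abstract draws.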