Donate or Create? Comparing Data Collection Strategies for Emotion-labeled Multimodal Social Media Posts
Journal: arXiv
Published Date: May 30, 2025
Abstract
Accurate modeling of subjective phenomena such as emotion expression requires
data annotated with authors' intentions. Commonly, such data is collected by
asking study participants either to donate and label genuine content produced
in the real world, or to create content fitting particular labels during the
study.
Asking participants to create content is often simpler to implement and
presents fewer risks to participant privacy than data donation. However, it is
unclear whether and how study-created content differs from genuine content, and
how any differences affect models. We collect study-created and genuine
multimodal social media posts labeled for emotion and compare them on several
dimensions, including model performance. We find that compared to genuine
posts, study-created posts are longer, rely more on their text and less on
their images for emotion expression, and focus more on emotion-prototypical
events. The samples of participants willing to donate posts and those willing
to create posts differ demographically. Study-created data is valuable for
training models that generalize well to genuine data, but realistic estimates
of model effectiveness require genuine data.