HyperFusion: Hierarchical Multimodal Ensemble Learning for Social Media Popularity Prediction
Journal: arXiv
Published Date: Jul 1, 2025
Abstract
Social media popularity prediction plays a crucial role in content
optimization, marketing strategies, and user engagement enhancement across
digital platforms. However, predicting post popularity remains challenging due
to the complex interplay between visual, textual, temporal, and user behavioral
factors. This paper presents HyperFusion, a hierarchical multimodal ensemble
learning framework for social media popularity prediction. Our approach employs
a three-tier fusion architecture that progressively integrates features across
abstraction levels: visual representations from CLIP encoders, textual
embeddings from transformer models, and temporal-spatial metadata with user
characteristics. The framework implements a hierarchical ensemble strategy
combining CatBoost, TabNet, and custom multi-layer perceptrons. To address
limited labeled data, we propose a two-stage training methodology with
pseudo-labeling and iterative refinement. We introduce novel cross-modal
similarity measures and hierarchical clustering features that capture
inter-modal dependencies. On the SMP challenge dataset, HyperFusion achieves
competitive performance: our team placed third in the SMP Challenge 2025
(Image Track). The source code
is available at https://anonymous.4open.science/r/SMPDImage.
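
The abstract does not define the cross-modal similarity measures, so the following is only a minimal sketch of one common choice: the cosine similarity between a post's image embedding and its text embedding, appended to the tabular features as an extra column. The function names, the `img_txt_cosine` column, the dictionary layout, and the toy 3-dimensional vectors are all illustrative assumptions, not taken from the HyperFusion code. In practice the embeddings would come from the CLIP and transformer encoders mentioned above.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 when either vector is all zeros, so the feature stays defined.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def add_cross_modal_feature(post: dict) -> dict:
    """Copy the post's metadata features and add an image-text similarity column.

    The key names used here are hypothetical, chosen only for this sketch.
    """
    features = dict(post["metadata"])
    features["img_txt_cosine"] = cosine_similarity(
        post["image_emb"], post["text_emb"]
    )
    return features


# Toy vectors standing in for real encoder outputs (assumption).
post = {
    "image_emb": [1.0, 0.0, 0.0],
    "text_emb": [1.0, 1.0, 0.0],
    "metadata": {"hour_of_day": 14, "follower_count": 1200},
}
features = add_cross_modal_feature(post)
```

A scalar feature of this kind can be passed to tree-based or tabular models alongside the metadata columns, which is one plausible way for an ensemble to use agreement between the visual and textual content of a post.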