Augmenting sparse behavior data for user identity linkage with self-generated by model and mixup-generated samples.
Journal:
Neural networks : the official journal of the International Neural Network Society
PMID:
40081271
Abstract
The user identity linkage task aims to associate user accounts that belong to the same individual by using user data. The task is relevant in domains such as recommendation systems, where user-generated content (i.e., behavioral data) serves as the key information for identifying users. However, user identity linkage based on behavioral data faces two primary challenges caused by data sparsity: insufficient user behavior data and the presence of low-frequency behavior items. Both issues hinder accurate modeling and exacerbate representation errors. To address these challenges, we propose two data augmentation methods: samples self-generated by the model and mixup-generated samples. Together, these methods are referred to as SGAMDA (Self-generated by Model and Mixup-generated Samples-based Data Augmentation). The self-generated samples method uses Variational Autoencoders to generate new training data by decoding samples drawn in the representation space. The mixup-generated samples method creates new training data by mixing the behavior data of different user groups, thereby alleviating data sparsity. SGAMDA categorizes user behavior data by data volume and by the proportion of low-frequency behaviors, and uses these categories to guide the two data augmentation strategies. We evaluate SGAMDA on the Movies2Books and CDs2Movies datasets for user identity linkage. The results show that SGAMDA significantly improves prediction accuracy by enhancing behavior representations through the proposed data augmentation methods.
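The categorization and mixup steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no thresholds, mixing rule, or data format, so everything here is an assumption. In particular, the function names, the thresholds (`volume_threshold`, `ratio_threshold`, `min_count`), the count-vector encoding, and the use of the standard mixup convex combination with a Beta-distributed coefficient are hypothetical choices made for illustration only.

```python
import random
from collections import Counter

def low_freq_ratio(items, item_counts, min_count):
    """Share of a user's behaviors whose item occurs fewer than min_count times overall."""
    if not items:
        return 0.0
    rare = sum(1 for it in items if item_counts[it] < min_count)
    return rare / len(items)

def categorize(users, volume_threshold, ratio_threshold, min_count):
    """Label each user as (has_little_data, has_many_low_frequency_items).

    All three thresholds are hypothetical; the abstract does not state them.
    """
    item_counts = Counter(it for items in users.values() for it in items)
    labels = {}
    for uid, items in users.items():
        sparse = len(items) < volume_threshold
        rare = low_freq_ratio(items, item_counts, min_count) > ratio_threshold
        labels[uid] = (sparse, rare)
    return labels

def to_vector(items, vocab):
    """Encode a behavior list as a count vector over a fixed item vocabulary."""
    counts = Counter(items)
    return [float(counts[it]) for it in vocab]

def mixup(vec_a, vec_b, lam):
    """Standard mixup: convex combination lam * a + (1 - lam) * b (assumed mixing rule)."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(vec_a, vec_b)]

def sample_lambda(alpha, rng):
    """Draw the mixing coefficient from Beta(alpha, alpha), as in standard mixup."""
    return rng.betavariate(alpha, alpha)

# Toy behavior data (hypothetical item identifiers).
users = {
    "u1": ["m1", "m2", "m3", "m4"],
    "u2": ["m1", "m9"],
    "u3": ["m2", "m3"],
}
labels = categorize(users, volume_threshold=3, ratio_threshold=0.4, min_count=2)

vocab = sorted({it for items in users.values() for it in items})
dense_vec = to_vector(users["u1"], vocab)   # user with more behavior data
sparse_vec = to_vector(users["u2"], vocab)  # user with little behavior data

# Fixed coefficient for a reproducible illustration.
mixed_fixed = mixup(dense_vec, sparse_vec, 0.5)

# Randomly drawn coefficient, as would be used when generating training samples.
rng = random.Random(0)
mixed_random = mixup(dense_vec, sparse_vec, sample_lambda(0.4, rng))
```

In this toy setup, `labels` flags `u2` as both data-sparse and dominated by low-frequency items, so under the assumed policy it would be a candidate for mixing with a data-rich user such as `u1`. How SGAMDA actually assigns each category to the VAE-based or mixup-based strategy is not specified in the abstract.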