ZS-MNET: A zero-shot learning based approach to multimodal named entity typing.
Journal:
Neural Networks: The Official Journal of the International Neural Network Society
PMID:
40010289
Abstract
Named entity typing (NET) on social platforms is an important task that assigns type labels to named entities mentioned in unstructured text. Existing NET methods rely on the text modality alone to classify entity types and ignore the semantic correlations present in multimodal data. Moreover, as multimodal data grows, the type set grows with it, and newly emerging entity types should be recognized without additional training. To address these shortcomings, we introduce a zero-shot-learning-based multimodal NET (ZS-MNET) model that combines the textual and visual modalities to recognize previously unseen named entity types. Unlike traditional zero-shot NET (ZS-NET) models, ZS-MNET exploits both text and image information to bridge the semantic correlation between multimodal data and label information. To obtain fine-grained multimodal representations, we employ transformer-based pre-trained language and vision models, namely BERT and ViT. In addition, we fuse several multimodal representations that focus on fine-grained features to model the semantic correlation between multimodal data and entity types. Experimental results underscore the utility of multimodal data for NET and show that our approach outperforms previous ZS-NET models.
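To make the zero-shot matching idea concrete, the following is a minimal, illustrative sketch, not the authors' implementation: a text mention is encoded with BERT, the accompanying image with ViT, the two are fused, and the fused vector is scored against BERT embeddings of candidate type names, so unseen types can be ranked without retraining. The HuggingFace checkpoints, the linear fusion layer, the prompt wording, and the image path are all assumptions made for the example.

```python
# Sketch of zero-shot multimodal entity typing (assumed design, not the
# paper's exact architecture): match a fused BERT+ViT representation
# against BERT embeddings of unseen type names.
import torch
from PIL import Image
from transformers import BertTokenizer, BertModel, ViTImageProcessor, ViTModel

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
vit_proc = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

def encode_text(text: str) -> torch.Tensor:
    inputs = bert_tok(text, return_tensors="pt")
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state[:, 0]          # BERT [CLS] embedding, shape (1, 768)

def encode_image(image: Image.Image) -> torch.Tensor:
    inputs = vit_proc(images=image, return_tensors="pt")
    with torch.no_grad():
        out = vit(**inputs)
    return out.last_hidden_state[:, 0]          # ViT [CLS] embedding, shape (1, 768)

# Hypothetical fusion: concatenate both modalities, project back to 768 dims.
# In a real system this layer would be trained on the *seen* types only.
fuse = torch.nn.Linear(768 * 2, 768)

def type_scores(sentence: str, entity: str, image: Image.Image, types: list[str]) -> torch.Tensor:
    # Illustrative prompt that marks the entity mention in its sentence.
    mention = encode_text(f"{sentence} The entity is {entity}.")
    visual = encode_image(image)
    fused = fuse(torch.cat([mention, visual], dim=-1))
    # Unseen types are represented by BERT embeddings of their names,
    # which is what lets new types be scored without additional training.
    labels = torch.cat([encode_text(t) for t in types], dim=0)
    return torch.nn.functional.cosine_similarity(fused, labels)

img = Image.open("tweet_photo.jpg")             # placeholder path
scores = type_scores("Messi lifted the trophy.", "Messi",
                     img, ["athlete", "politician", "musician"])
print(scores.argmax().item())                   # index of the highest-scoring type
```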