A Benchmark Dataset and a Framework for Urdu Multimodal Named Entity Recognition
Journal:
arXiv
Published Date:
May 8, 2025
Abstract
The emergence of multimodal content, particularly text and images on social
media, has positioned Multimodal Named Entity Recognition (MNER) as an
increasingly important area of research within Natural Language Processing.
Despite progress in high-resource languages such as English, MNER remains
underexplored for low-resource languages like Urdu. The primary challenges
include the scarcity of annotated multimodal datasets and the lack of
standardized baselines. To address these challenges, we introduce the U-MNER
framework and release the Twitter2015-Urdu dataset, a pioneering resource for
Urdu MNER. Adapted from the widely used Twitter2015 dataset, it is annotated
with Urdu-specific grammar rules. We establish benchmark baselines by
evaluating both text-based and multimodal models on this dataset, providing
comparative analyses to support future research on Urdu MNER. The U-MNER
framework integrates textual and visual context using Urdu-BERT for text
embeddings and ResNet for visual feature extraction, with a Cross-Modal Fusion
Module to align and fuse information. Our model achieves state-of-the-art
performance on the Twitter2015-Urdu dataset, laying the groundwork for further
MNER research in low-resource languages.