Pahari POS-tagged corpus: A large-scale linguistic resource for NLP applications.
Journal:
Data in Brief
Published Date:
Feb 3, 2026
Abstract
This paper presents a Part-of-Speech (POS) tagged dataset for Pahari, an under-resourced Indo-Aryan language spoken in Azad Jammu and Kashmir, Pakistan, as well as in parts of India and Nepal. The lack of linguistic resources for Pahari has hindered the development of Natural Language Processing (NLP) tools and limited computational analysis of the language. This study addresses the gap by creating a POS-tagged dataset, defining a tag set tailored to Pahari, and establishing annotation guidelines. The Pahari POS tag set was designed by drawing on existing tag sets for Urdu, Hindi, Punjabi, and other Indo-Aryan languages, to ensure linguistic compatibility with them. A corpus of 200,000 tokens was collected and manually annotated, achieving an inter-annotator agreement of 92.3% (Cohen's Kappa). The paper describes the key challenges encountered during data collection, preprocessing, and annotation, and details the methods used to address them. The resulting dataset is the first structured linguistic resource developed for NLP in the Pahari language. It lays a foundation for future research in areas such as morphosyntactic analysis, Named Entity Recognition (NER), and the development of machine learning-based NLP applications.
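The agreement figure in the abstract is reported as Cohen's Kappa, which corrects raw agreement between two annotators for the agreement expected by chance. The sketch below shows how such a score could be computed over two parallel tag sequences. It is an illustration only, not the authors' procedure: the function name `cohens_kappa` and the sample tags (`NN`, `VB`, `JJ`) are assumptions and are not taken from the Pahari tag set.

```python
from collections import Counter


def cohens_kappa(tags_a, tags_b):
    """Cohen's Kappa for two annotators who tagged the same tokens.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    own tag distribution.
    """
    if len(tags_a) != len(tags_b):
        raise ValueError("Both annotators must tag the same number of tokens")
    n = len(tags_a)
    if n == 0:
        raise ValueError("Need at least one token")

    # Observed agreement: share of tokens given the same tag by both.
    p_o = sum(a == b for a, b in zip(tags_a, tags_b)) / n

    # Chance agreement: sum over tags of the product of marginal proportions.
    count_a = Counter(tags_a)
    count_b = Counter(tags_b)
    p_e = sum(count_a[t] * count_b[t] for t in count_a) / (n * n)

    # Degenerate case: both annotators used one identical tag throughout.
    if p_e == 1.0:
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical tags for six tokens (not the Pahari tag set).
annotator_a = ["NN", "VB", "NN", "JJ", "NN", "VB"]
annotator_b = ["NN", "VB", "JJ", "JJ", "NN", "NN"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"Cohen's Kappa: {kappa:.3f}")
```

In this toy example the annotators agree on 4 of 6 tokens (raw agreement of about 0.667), but the chance-corrected score is lower (11/23, about 0.478), which is why Kappa is usually reported alongside or instead of simple percentage agreement.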