Arabic Syntactic Diacritics Restoration Using BERT Models.

Journal: Computational intelligence and neuroscience
Published Date:

Abstract

The Arabic syntactic diacritics restoration problem is often solved using long short-term memory (LSTM) networks. Handcrafted features are used to augment these LSTM networks or taggers to improve performance. A transformer-based machine learning technique known as bidirectional encoder representations from transformers (BERT) has become the state-of-the-art method for natural language understanding in recent years. In this paper, we present a novel tagger based on BERT models to restore Arabic syntactic diacritics. We formulated the syntactic diacritics restoration as a token sequence classification task similar to named-entity recognition (NER). Using the Arabic TreeBank (ATB) corpus, the developed BERT tagger achieves a 1.36% absolute case-ending error rate (CEER) over other systems.

Authors

  • Waleed Nazih
    College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Al Kharj, Saudi Arabia.
  • Yasser Hifny
    Faculty of Computers and Artificial Intelligence, Helwan University, Cairo, Egypt.