Paraphrase detection for Urdu language text using fine-tune BiLSTM framework.

Journal: Scientific Reports

Abstract

Automated paraphrase detection is crucial for natural language processing (NLP) applications such as text summarization, plagiarism detection, and question answering. Detecting paraphrases in Urdu text remains challenging because of the language's complex morphology, its distinctive script, and the scarcity of resources such as labelled datasets, pre-trained models, and tailored NLP tools. This research proposes a novel bidirectional long short-term memory (BiLSTM) framework for Urdu paraphrase detection. Our approach combines word embeddings with text preprocessing steps (tokenization, stop-word removal, and label encoding) to handle Urdu's morphological variation. The BiLSTM network reads each input sequence in both the forward and backward directions, using context from both sides to encode the syntactic and semantic patterns of Urdu text. A key contribution of this work is a large-scale Urdu Paraphrased Corpus (UPC) comprising 400,000 candidate duplicate sentence pairs, of which 150,000 were manually identified as paraphrases. This resource facilitates training and evaluating Urdu paraphrase detection models. Experimental evaluations on the UPC dataset show a significant improvement over existing methods: our BiLSTM model achieves 94.14% accuracy, outperforming state-of-the-art methods such as CNN (83.43%) and LSTM (88.09%). The model also attains 95.34% accuracy on the benchmark Quora dataset. We provide insights into the linguistic features and patterns that contribute to the robustness of the framework. Furthermore, we incorporate a comprehensive linguistic rule engine to handle exceptional cases during paraphrase analysis, supporting robust performance across diverse contexts.
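
The preprocessing steps named in the abstract (tokenization, stop-word removal, and label encoding) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the stop-word list, the punctuation set, the reserved ids for padding and unknown tokens, the label names, and every function name below are assumptions chosen for illustration. The integer sequences it produces are the kind of input an embedding layer followed by a BiLSTM would typically consume.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration);
# a real system would use a curated Urdu stop-word resource.
STOP_WORDS = {"کا", "کی", "کے", "ہے", "میں", "اور"}

PAD_ID = 0  # assumed id reserved for padding
UNK_ID = 1  # assumed id reserved for out-of-vocabulary tokens


def tokenize(sentence):
    """Replace common Urdu and Latin punctuation with spaces, then split on whitespace."""
    cleaned = re.sub(r"[۔،؟!.,?]", " ", sentence)
    return cleaned.split()


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def build_vocab(token_lists):
    """Map each distinct token to an integer id, starting after the reserved ids."""
    vocab = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Convert tokens to ids, truncate to max_len, and right-pad with PAD_ID."""
    ids = [vocab.get(t, UNK_ID) for t in tokens][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


def encode_labels(labels):
    """Label encoding: map each class name to an integer, in sorted order."""
    mapping = {name: i for i, name in enumerate(sorted(set(labels)))}
    return [mapping[name] for name in labels], mapping


# Usage on one made-up sentence pair (not taken from the UPC corpus).
pair = ("یہ کتاب اچھی ہے۔", "یہ کتاب عمدہ ہے۔")
tokens_a = remove_stop_words(tokenize(pair[0]))
tokens_b = remove_stop_words(tokenize(pair[1]))
vocab = build_vocab([tokens_a, tokens_b])
ids_a = encode(tokens_a, vocab, max_len=8)
ids_b = encode(tokens_b, vocab, max_len=8)
label_ids, label_map = encode_labels(["paraphrase", "non-paraphrase"])
```

Each sentence of a pair is encoded separately so that a downstream model can compare the two sequences. Padding to a fixed length is a common convenience for batching, not something the abstract specifies.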

Authors

  • Muhammad Ali Aslam
    Department of Computer Science, University of Science and Technology, Bannu, 28100, Pakistan.
  • Khairullah Khan
    Department of Computer Science, University of Science and Technology, Bannu, 28100, Pakistan.
  • Wahab Khan
    Institute of CS & IT, University of Science & Technology, Bannu, Pakistan.
  • Sajid Ullah Khan
    Universiti Malaysia Sarawak, Kota Samarahan, Sarawak, Malaysia.
  • Abdullah Albanyan
    Software Engineering Department, College of Computer Engineering and Sciences, Prince Sattam bin Abdulaziz University, Al Kharj, Saudi Arabia.
  • Shabbab Ali Algamdi
    Department of Software Engineering, College of Computer Science and Engineering, Prince Sattam bin Abdulaziz University, Al Kharj, Saudi Arabia. Electronic address: s.algamdi@psau.edu.sa.