Paraphrase detection for Urdu language text using fine-tune BiLSTM framework.

Journal: Scientific Reports

PMID: 40316633

Abstract

Automated paraphrase detection is crucial for natural language processing (NLP) applications such as text summarization, plagiarism detection, and question-answering systems. Detecting paraphrases in Urdu text remains challenging because of the language's complex morphology, distinctive script, and lack of resources such as labelled datasets, pre-trained models, and tailored NLP tools. This research proposes a novel bidirectional long short-term memory (BiLSTM) framework to address the intricacies of Urdu paraphrase detection. Our approach employs word embeddings and text preprocessing techniques such as tokenization, stop-word removal, and label encoding to handle Urdu's morphological variations. The BiLSTM network processes the input sequentially, leveraging both forward and backward contextual information to encode the complex syntactic and semantic patterns inherent in Urdu text. An essential contribution of this work is the creation of a large-scale Urdu Paraphrased Corpus (UPC) comprising 400,000 potential sentence pair duplicates, with 150,000 pairs manually identified as paraphrases. This resource facilitates training and evaluating Urdu paraphrase detection models. Our findings reveal a significant improvement in paraphrase detection performance compared to existing methods: on the custom UPC dataset, our BiLSTM model achieves 94.14% accuracy, outperforming state-of-the-art methods such as CNN (83.43%) and LSTM (88.09%). On the benchmark Quora dataset, the model attains 95.34% accuracy. We also provide insights into the underlying linguistic features and patterns that contribute to the robustness of our framework. Furthermore, we incorporate a comprehensive linguistic rule engine to handle exceptional cases during paraphrase analysis, ensuring robust performance across diverse contexts.
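
The preprocessing steps named in the abstract (tokenization, stop-word removal, label encoding) could look roughly like the following. This is a minimal sketch under stated assumptions, not the authors' pipeline: the stop-word list, the punctuation set, the label strings, and every function name here are hypothetical, and the abstract does not say how sentences are mapped to embedding indices. The integer ids produced by `encode` are the kind of input an embedding layer in front of a BiLSTM would typically expect.

```python
# Hypothetical stop-word list for illustration; the paper's actual list is not given.
URDU_STOP_WORDS = {"کا", "کی", "کے", "ہے", "میں", "سے", "اور", "کو", "پر"}

# Assumed punctuation set: common Urdu marks plus a few Latin ones.
PUNCTUATION = "۔،؟؛!.,?:;\"'()"
_PUNCT_TABLE = str.maketrans({ch: " " for ch in PUNCTUATION})


def tokenize(sentence):
    """Replace punctuation with spaces, then split on whitespace."""
    return sentence.translate(_PUNCT_TABLE).split()


def remove_stop_words(tokens):
    """Drop tokens that appear in the (assumed) stop-word list."""
    return [t for t in tokens if t not in URDU_STOP_WORDS]


def build_vocab(token_lists):
    """Assign an integer id to each distinct token; 0 and 1 are reserved."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Map tokens to ids, truncate to max_len, and right-pad with <pad>."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


def encode_labels(labels):
    """Label encoding: map each distinct label string to an integer."""
    mapping = {c: i for i, c in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


# Toy sentence pair and labels, invented for this example (not from the UPC).
pair = ("یہ کتاب اچھی ہے۔", "یہ کتاب بہت اچھی ہے۔")
tokens_a, tokens_b = (remove_stop_words(tokenize(s)) for s in pair)
vocab = build_vocab([tokens_a, tokens_b])
ids_a = encode(tokens_a, vocab, max_len=8)
ids_b = encode(tokens_b, vocab, max_len=8)
label_ids, label_map = encode_labels(["paraphrase", "non-paraphrase"])
```

Whitespace splitting is a deliberate simplification here; Urdu word segmentation is harder than this in practice, which is part of the resource gap the abstract describes.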

Authors

Muhammad Ali Aslam

Department of Computer Science, University of Science and Technology, Bannu, 28100, Pakistan.

Khairullah Khan

Department of Computer Science, University of Science and Technology, Bannu, 28100, Pakistan.

Wahab Khan

Institute of CS & IT, University of Science & Technology, Bannu, Pakistan.

Sajid Ullah Khan

Universiti Malaysia Sarawak, Kota Samarahan, Sarawak, Malaysia.

Abdullah Albanyan

Software Engineering Department, College of Computer Engineering and Sciences, Prince Sattam bin Abdulaziz University, Alkharj, Saudi Arabia.

Shabbab Ali Algamdi

Department of Software Engineering, College of Computer Science and Engineering, Prince Sattam bin Abdulaziz University, Al Kharj, Saudi Arabia. Electronic address: s.algamdi@psau.edu.sa.

Keywords

Algorithms; Humans; Language; Natural Language Processing; Semantics
