Optimising window size of semantic of classification model for identification of in-text citations based on context and intent.

Journal: PLoS ONE
PMID:

Abstract

Citations in scientific literature act as channels for the sharing, transfer, and development of scientific knowledge. However, not all citations hold the same significance. Numerous taxonomies and machine learning models have been developed to analyze citations, but they often overlook the internal context of these citations. Moreover, selecting appropriate word embedding and classification models is crucial for achieving superior results. Word embeddings offer n-dimensional distributed representations of text, striving to capture the nuanced meanings of words. Deep learning-based word embedding techniques have garnered significant attention and found application in various Natural Language Processing (NLP) tasks, including text classification, sentiment analysis, and citation analysis. Current state-of-the-art techniques often use small datasets with fixed window sizes, resulting in the loss of contextual meaning. This study leverages two benchmark datasets encompassing a substantial volume of in-text citations to guide the selection of an optimal word embedding window size and classification approach. A comparative analysis of various window sizes for in-text citations is conducted to identify crucial citations effectively. Additionally, Word2Vec embedding is employed in conjunction with deep learning models, namely Convolutional Neural Networks (CNNs), Gated Recurrent Units (GRUs), and Long Short-Term Memory (LSTM) networks, and with machine learning models, namely Support Vector Machines (SVM), Decision Trees, and Naive Bayes. The evaluation employs precision, recall, F1-score, and accuracy metrics for each combination of window sizes. The findings reveal that, particularly for lengthy in-text citations, larger citation windows are more adept at capturing the semantic essence of the references. Within the scope of this study, a window size of 10 achieves superior accuracy and precision with both machine learning and deep learning models.
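
To make the role of the window size concrete, the following minimal sketch enumerates the (target, context) word pairs that a Word2Vec-style model would see for a given maximum window. It is an illustration only, not the authors' pipeline: the function name `context_pairs` and the example sentence are hypothetical, and real Word2Vec implementations may additionally subsample or randomly shrink the window during training.

```python
from typing import List, Tuple


def context_pairs(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Return (target, context) pairs for every token, taking up to
    `window` neighbours on each side (hypothetical helper)."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                # left edge, clipped at sentence start
        hi = min(len(tokens), i + window + 1)  # right edge, clipped at sentence end
        for j in range(lo, hi):
            if j != i:                         # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Hypothetical citation sentence, invented for illustration only.
sentence = "we extend the method proposed in [1] to classify important citations".split()

for w in (2, 5, 10):
    n_pairs = len(context_pairs(sentence, w))
    print(f"window={w}: {n_pairs} (target, context) pairs")
```

With a small window, the citation marker is paired only with its immediate neighbours; with a larger one, it is also paired with more distant words in the sentence, which is the intuition behind the finding that larger windows suit lengthy in-text citations.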

Authors

  • Arshad Iqbal
    School of Computing Sciences, Pak-Austria Fachhochschule: Institute of Applied Sciences and Technology, Mang, Haripur, Khyber Pakhtunkhwa, Pakistan.
  • Abdul Shahid
    National College of Ireland, Dublin, Ireland.
  • Muhammad Roman
    Institute of Computing, Kohat University of Science and Technology, Kohat, Pakistan.
  • Muhammad Tanvir Afzal
  • Umair Ul Hassan
    JE Cairnes School of Business and Economics, University of Galway, Galway, Ireland.