An NLP-based method to mine gene and function relationships from published articles.

Journal: Scientific reports
PMID:

Abstract

Understanding how genes function within biological systems is paramount for scientific advancement and medical progress. However, the evolving landscape of this research and the complexity of biological processes make this task challenging. We introduce PATHAK, a natural language processing (NLP)-based method that mines relationships between genes and their functions from published scientific articles. PATHAK uses a pre-trained Transformer language model to generate sentence embeddings from a large dataset of scientific documents, which enables the identification of meaningful associations between genes and their potential functional annotations. Our approach is adaptable and applicable across diverse scientific domains. Applying PATHAK to over 17,000 research articles focused on Arabidopsis thaliana, we assigned approximately 1493 Gene Ontology (GO) terms to 10,976 genes by analyzing article sentences, comparing their embeddings to GO term embeddings, and mapping potential matches. The model demonstrates moderate-to-high predictive accuracy, capturing ~57% overlap of GO terms (6258 out of 10,976) between predicted and known annotations on TAIR, including 1271 and 161 exact matches and 4826 partially related terms. This method promises to significantly advance our understanding of gene function and may accelerate discoveries in plant development, growth, and stress responses, both in plants and in other systems.
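
The matching step described in the abstract (embed an article sentence, embed each GO term, keep the closest terms) can be illustrated with a minimal sketch. This is not the PATHAK implementation: the `embed` function below is a toy bag-of-words stand-in for the pre-trained Transformer encoder, and the function names, the gene name in the example sentence, the similarity threshold of 0.3, and the three-term GO dictionary are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a Transformer sentence encoder (assumption):
    returns a bag-of-words count vector instead of a dense embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_go_terms(sentence, go_terms, threshold=0.3):
    """Return (GO id, score) pairs whose label is similar enough to the
    sentence, best match first. The threshold is a hypothetical value."""
    sent_vec = embed(sentence)
    scored = [(go_id, cosine(sent_vec, embed(label)))
              for go_id, label in go_terms.items()]
    hits = [(go_id, score) for go_id, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Small illustrative GO dictionary; a real run would load the full ontology.
GO_TERMS = {
    "GO:0009414": "response to water deprivation",
    "GO:0009658": "chloroplast organization",
    "GO:0006952": "defense response",
}

# Hypothetical article sentence mentioning a placeholder gene name.
sentence = "Loss of GENE1 impairs the response to water deprivation in seedlings"
matches = match_go_terms(sentence, GO_TERMS)
for go_id, score in matches:
    print(go_id, round(score, 3))
```

In this sketch only the term whose label overlaps strongly with the sentence passes the threshold. With dense Transformer embeddings the same cosine comparison would also capture paraphrases that share no words with the GO label, which is presumably what makes partially related matches possible.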

Authors

  • Nilesh Kumar
    1 Department of Biology, and.
  • M Shahid Mukhtar
    1 Department of Biology, and.