A Transformer based method for the Cap Analysis of Gene Expression and Gene Expression Tag associated 5’ cap site prediction in RNA

Journal: bioRxiv
Published Date:

Abstract

5’ RNA capping is one of the major post-transcriptional modifications supporting the mobility and stability of RNA molecules. Measuring the 5’ caps of RNAs can help quantify expression levels of mRNAs and lncRNAs. One of the most successful RNA-seq methods that uses capping as a tool to quantify transcriptional expression is Cap Analysis of Gene Expression (CAGE). Computational prediction of capping can therefore serve as a precursor to the prediction of transcriptional expression. Unfortunately, hardly any computational technique has focused purely on predicting 5’ capping. We have developed a transformer-based method for computational prediction of capping from DNA sequences. Our Llama- and ReLoRA-based pre-training model and our Llama- and LoRA-based fine-tuning model predict 5’ cap sites. We used leave-one-chromosome-out cross-validation for our model. After fine-tuning for sequence classification on the human genome hg19 (mouse genome mm9), the average accuracy and F1-score are 79.12% (78.09%) and 78.11% (76.17%), respectively. We identified attention-peak-based motifs with an aggregate Wilcoxon rank-sum p-value of 1.075e-10 between the attention peak region and the entire context window for the predicted positive motifs; an aggregate p-value of 7.17e-18 for the predicted negative motifs; and an aggregate p-value of 6.70e-08 between the attention peaks of the predicted positive and the predicted negative motifs. Our Llama-based approach aims to create a sequence-based framework to identify 5’ capping sites corresponding to CAGE peaks. Our analysis reveals statistically significant motifs in the regions of peak attention scores, some of which show biological relevance because their resident sites match known TF motifs.
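
The leave-one-chromosome-out evaluation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the record layout `(chromosome, sequence, label)` and the function names `leave_one_chromosome_out` and `accuracy_and_f1` are assumptions introduced here for illustration only. Each fold holds out every sequence from one chromosome for testing and trains on the rest; accuracy and F1-score would then be averaged over the folds.

```python
from collections import defaultdict


def leave_one_chromosome_out(records):
    """Yield (held_out_chrom, train, test) splits, one per chromosome.

    `records` is an iterable of (chromosome, sequence, label) tuples;
    this layout is a hypothetical choice made for the sketch.
    """
    by_chrom = defaultdict(list)
    for rec in records:
        by_chrom[rec[0]].append(rec)
    for held_out in sorted(by_chrom):
        test = by_chrom[held_out]
        # Training data is every record NOT on the held-out chromosome.
        train = [r for chrom, recs in by_chrom.items()
                 if chrom != held_out for r in recs]
        yield held_out, train, test


def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true) if y_true else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1
```

Splitting by chromosome rather than at random keeps nearby, potentially overlapping sequence windows from appearing in both the training and test sets of the same fold.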

Authors

  • Dibya Kanti Haldar; Avik Pramanick; Chandrama Mukherjee; Pralay Mitra