Deciphering genomic codes using advanced natural language processing techniques: a scoping review.

Journal: Journal of the American Medical Informatics Association : JAMIA
PMID:

Abstract

OBJECTIVES: The vast and complex nature of human genomic sequencing data presents challenges for effective analysis. This review investigates the application of natural language processing (NLP) techniques, particularly large language models (LLMs) and transformer architectures, to deciphering genomic codes, focusing on tokenization, transformer models, and regulatory annotation prediction. The goal is to assess data and model accessibility in the most recent literature and thereby gain a better understanding of the existing capabilities and constraints of these tools in processing genomic sequencing data.

Authors

  • Shuyan Cheng
    Department of Population Health Science, Weill Cornell Medical College, New York, NY 10065, USA.
  • Yishu Wei
    Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States.
  • Yiliang Zhou
    Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States.
  • Zihan Xu
    Shenzhen Sixcarbon Technology, Shenzhen 518106, China.
  • Drew N Wright
    Samuel J. Wood Library & C.V. Starr Biomedical Information Center, Weill Cornell Medical College, New York, New York, USA.
  • Jinze Liu
  • Yifan Peng
    Department of Population Health Sciences, Weill Cornell Medicine, New York, USA.